Biomedical English Corpus – ELRC-SHARE

66 Last view: 2025-05-22

1 Last update: 2021-11-30

29 Last download: 2025-03-20

Biomedical English Corpus

The Biomedical English Corpus is a 601-million-token English corpus for the biomedical domain. It comprises the English counterpart of the MeSpEn corpus (https://temu.bsc.es/mespen/), as well as the UFAL corpus (https://ufal.mff.cuni.cz/ufal_medical_corpus). We also included the English part of the EMEA (https://opus.nlpl.eu/EMEA.php) and BARR2 (https://temu.bsc.es/BARR2/datasets.html) datasets. In total, the corpus contains 601,611,211 tokens, 39,127,771 sentences and 36,543,449 documents.
Documents are separated by single newlines.
The corpus has been developed in the framework of the CEF project MT4ALL (http://ixa2.si.ehu.eus/mt4all/project).
We license the actual packaging of this data under a CC0 1.0 Universal License.
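Because the corpus is plain text with documents separated by single newlines, it can be streamed one document at a time without loading the whole file into memory. The following is a minimal sketch, not the tooling used to build the corpus: the helper names `iter_documents` and `corpus_stats` are hypothetical, and whitespace splitting is only an assumed approximation of tokenization, so its counts need not match the published token figure.

```python
import io
from typing import Iterable, Iterator, TextIO, Tuple


def iter_documents(stream: TextIO) -> Iterator[str]:
    """Yield one document per non-empty line of a plain-text stream."""
    for line in stream:
        doc = line.rstrip("\n")
        # Skip blank lines so stray empty lines are not counted as documents.
        if doc.strip():
            yield doc


def corpus_stats(docs: Iterable[str]) -> Tuple[int, int]:
    """Return (document count, whitespace-token count).

    Whitespace splitting is an assumption made for illustration only;
    the tokenizer behind the published statistics is not stated.
    """
    n_docs = 0
    n_tokens = 0
    for doc in docs:
        n_docs += 1
        n_tokens += len(doc.split())
    return n_docs, n_tokens


if __name__ == "__main__":
    # Tiny in-memory stand-in for the corpus file: two documents, one per line.
    sample = io.StringIO(
        "Aspirin reduces fever.\n"
        "The patient recovered. No relapse was seen.\n"
    )
    print(corpus_stats(iter_documents(sample)))
```

To run this on the downloaded corpus, the in-memory sample would be replaced by a file handle opened in text mode, for example `open(path, encoding="utf-8")`. Both the path and the UTF-8 encoding are assumptions, since the record does not state them.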

Distribution

Availability: Available

Contact Person

Ona de Gibert Bonet

text

Monolingual text corpus

Languages

English (en)

Linguality

Linguality type: Monolingual

Text Format

Plain Text

Size

601,611,211 Tokens

Resource Creation

Funding Project

Unsupervised MT for Low-resourced language pairs (MT4All)

Funding Type: EU Funds

Funding Country: European Union (EU)

Metadata

Created: 26/11/2021

Last Updated: 26/11/2021

Metadata Language: English (en)
