Biomedical English Corpus
The Biomedical English Corpus is a 601-million-token English corpus for the biomedical domain. It consists of the English side of the MeSpEn corpus (https://temu.bsc.es/mespen/) and the UFAL corpus (https://ufal.mff.cuni.cz/ufal_medical_corpus), together with the English parts of the EMEA (https://opus.nlpl.eu/EMEA.php) and BARR2 (https://temu.bsc.es/BARR2/datasets.html) datasets. In total, it contains 601,611,211 tokens, 39,127,771 sentences and 36,543,449 documents.
Documents are separated by a single newline, so each line holds one document.
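As a minimal sketch of how this layout could be read, the Python snippet below treats each non-empty line as one document and counts documents and whitespace-separated tokens. The function names `iter_documents` and `corpus_stats` are hypothetical, the sample text is made up rather than taken from the corpus, and whitespace splitting is an assumption that will not necessarily reproduce the token count reported above.

```python
import io
from typing import Iterable, Iterator, TextIO, Tuple


def iter_documents(stream: TextIO) -> Iterator[str]:
    """Yield one document per non-empty line of the stream."""
    for line in stream:
        doc = line.rstrip("\r\n")
        if doc.strip():  # skip blank lines, should any occur
            yield doc


def corpus_stats(docs: Iterable[str]) -> Tuple[int, int]:
    """Return (document count, whitespace-token count)."""
    n_docs = 0
    n_tokens = 0
    for doc in docs:
        n_docs += 1
        # Assumption: naive whitespace tokenization, for illustration only.
        n_tokens += len(doc.split())
    return n_docs, n_tokens


# Made-up two-document sample standing in for the real corpus file.
sample = io.StringIO(
    "Aspirin reduces fever.\n"
    "The patient was discharged after two days.\n"
)
print(corpus_stats(iter_documents(sample)))  # (2, 10)
```

For the real corpus, the same functions could be applied to a file opened with `open(path, encoding="utf-8")`, which streams line by line and avoids loading all documents into memory; the UTF-8 encoding is likewise an assumption.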
The corpus was developed within the framework of the CEF project MT4ALL (http://ixa2.si.ehu.eus/mt4all/project).
We license the actual packaging of this data under a CC0 1.0 Universal License.