Biomedical English Corpus

The Biomedical English Corpus is a 601-million-token corpus of English text for the biomedical domain. It consists of the English counterpart of the MeSpEn corpus (https://temu.bsc.es/mespen/) and the UFAL corpus (https://ufal.mff.cuni.cz/ufal_medical_corpus), together with the English part of the EMEA (https://opus.nlpl.eu/EMEA.php) and BARR2 (https://temu.bsc.es/BARR2/datasets.html) datasets. In total, it contains 601,611,211 tokens, 39,127,771 sentences and 36,543,449 documents.
Documents are separated by single newlines, so each line of the corpus holds one document.
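Since each document sits on its own line, the corpus can be streamed without loading it into memory. The following is a minimal sketch rather than an official loader: the file name in the usage note is hypothetical, UTF-8 encoding is assumed, and tokens are counted by whitespace splitting, which may not match the tokenizer used to produce the token counts reported for this corpus.

```python
from typing import Iterator, Tuple


def iter_documents(path: str, encoding: str = "utf-8") -> Iterator[str]:
    """Yield one document per line, skipping lines that are empty or blank."""
    with open(path, encoding=encoding) as handle:
        for line in handle:
            document = line.rstrip("\n")
            if document.strip():
                yield document


def corpus_stats(path: str) -> Tuple[int, int]:
    """Return (document count, whitespace-token count) for a corpus file.

    Whitespace splitting is an assumption made for illustration only.
    """
    documents = 0
    tokens = 0
    for document in iter_documents(path):
        documents += 1
        tokens += len(document.split())
    return documents, tokens
```

For example, corpus_stats("biomedical_en.txt") would return the two counts for a local copy of the corpus saved under that (hypothetical) name.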
The corpus has been developed in the framework of the CEF project MT4ALL (http://ixa2.si.ehu.eus/mt4all/project).
We license the actual packaging of this data under a CC0 1.0 Universal License.