Biomedical English Corpus

The Biomedical English Corpus is a 601-million-token corpus of English text for the biomedical domain. It combines the English portions of the MeSpEn corpus, the UFAL corpus, the EMEA dataset, and the BARR2 dataset. In total, it contains 601,611,211 tokens, 39,127,771 sentences, and 36,543,449 documents.
Documents are separated by single newlines.
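
Since documents are newline-separated, the corpus can be streamed one document per line without loading it into memory. The following Python sketch illustrates this under several assumptions: the file path passed on the command line is hypothetical, empty lines are skipped rather than counted as documents, and tokens are counted by whitespace splitting, which may not match the tokenization behind the figures reported above.

```python
import sys
from typing import Iterable, Iterator, Tuple


def iter_documents(lines: Iterable[str]) -> Iterator[str]:
    """Yield one document per line, with the line ending removed.

    Assumption: blank lines are not documents and are skipped.
    """
    for line in lines:
        doc = line.rstrip("\r\n")
        if doc.strip():
            yield doc


def corpus_stats(lines: Iterable[str]) -> Tuple[int, int]:
    """Return (document count, whitespace-token count) in a single pass."""
    n_docs = 0
    n_tokens = 0
    for doc in iter_documents(lines):
        n_docs += 1
        # Assumption: whitespace tokenization, used here only for illustration.
        n_tokens += len(doc.split())
    return n_docs, n_tokens


if __name__ == "__main__":
    # Hypothetical usage: python corpus_stats.py path/to/corpus.txt
    if len(sys.argv) > 1:
        with open(sys.argv[1], encoding="utf-8") as handle:
            docs, tokens = corpus_stats(handle)
            print(f"documents: {docs}, whitespace tokens: {tokens}")
```

Because the file handle is consumed lazily, memory use stays roughly constant regardless of corpus size.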
The corpus was developed in the framework of the CEF project MT4ALL.
We license the actual packaging of this data under a CC0 1.0 Universal License.