Multilingual corpus made out of PDF documents from the European Medicines Agency (EMEA), https://www.ema.europa.eu, (February 2020).
This dataset was generated from public content available from the European Medicines Agency (https://www.ema.europa.eu/) in February 2020.
The dataset contains 24 EN-X TMX files, where X is a CEF language (17,617,914 translation units (TUs) in total). New methods for text extraction from PDF, sentence splitting, sentence alignment, and parallel corpus filtering were applied. The following list holds the number of TUs per EN-X language pair:
DSI Relevance: eHealth
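As a rough illustration of how the EN-X TMX files described above could be consumed, the following Python sketch streams through a TMX document and collects the segments of each translation unit by language. It relies only on the standard TMX elements (`tu`, `tuv`, `seg`) and the `xml:lang` attribute; the exact layout of the files in this corpus has not been checked here, so the sample document, the function names `iter_translation_units` and `count_translation_units`, and the EN-FR pair are assumptions made for the example.

```python
import io
import xml.etree.ElementTree as ET

# ElementTree exposes the standard xml:lang attribute under this qualified name.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Tiny hand-written TMX document, for illustration only (not taken from the corpus).
SAMPLE_TMX = """<?xml version="1.0" encoding="UTF-8"?>
<tmx version="1.4">
  <header creationtool="example" creationtoolversion="1.0" segtype="sentence"
          o-tmf="none" adminlang="en" srclang="en" datatype="plaintext"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Keep out of the reach of children.</seg></tuv>
      <tuv xml:lang="fr"><seg>Tenir hors de la portée des enfants.</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="en"><seg>Do not freeze.</seg></tuv>
      <tuv xml:lang="fr"><seg>Ne pas congeler.</seg></tuv>
    </tu>
  </body>
</tmx>
""".encode("utf-8")


def iter_translation_units(source):
    """Yield one {language: text} dict per <tu> element.

    `source` is a file path or a binary file object. The file is parsed
    incrementally so that large TMX files do not have to fit in memory.
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "tu":
            continue
        segments = {}
        for tuv in elem.iter("tuv"):
            # Older TMX versions use a plain "lang" attribute instead of xml:lang.
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            seg = tuv.find("seg")
            if lang and seg is not None:
                segments[lang] = "".join(seg.itertext()).strip()
        elem.clear()  # release the parsed subtree before moving on
        yield segments


def count_translation_units(source):
    """Return the number of <tu> elements found in a TMX file."""
    return sum(1 for _ in iter_translation_units(source))


if __name__ == "__main__":
    for unit in iter_translation_units(io.BytesIO(SAMPLE_TMX)):
        print(unit["en"], "|", unit["fr"])
    print(count_translation_units(io.BytesIO(SAMPLE_TMX)))
```

To apply the same functions to a downloaded file, pass its path instead of the in-memory sample, for example `count_translation_units("en-fr.tmx")`, where `en-fr.tmx` is a hypothetical file name.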
People who downloaded this resource also downloaded the following:
- Multilingual corpus from the Publications Office of the EU on the medical domain
- Multilingual corpus from the European Vaccination Information Portal
- COVID-19 ANTIBIOTIC dataset. Multilingual (CEF languages)
- Bilingual corpus made out of PDF documents from the European Medicines Agency, (EMEA), https://www.ema.europa.eu, (February 2020) (EN-IS).