Multilingual corpus built from PDF documents published by the European Medicines Agency (EMEA), https://www.ema.europa.eu (February 2020), provided in Moses format.

This dataset was generated from public content available from the European Medicines Agency (https://www.ema.europa.eu/) in February 2020.

The dataset contains 24 EN-X Moses pair files, where X is a CEF language (17617914 translation units, TUs, in total). New methods for text extraction from PDF, sentence splitting, sentence alignment, and parallel corpus filtering have been applied. The following list holds the number of TUs per EN-X language:
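As an illustration of how a Moses-format pair can be consumed, the sketch below assumes the usual Moses layout: two plain-text UTF-8 files, one sentence per line, where line i of the English file is aligned with line i of the other-language file. The function names and the file names in the usage comment are hypothetical and are not taken from the dataset itself.

```python
from itertools import zip_longest
from typing import Iterator, Tuple

_MISSING = object()  # sentinel marking that one file ended before the other


def read_moses_pair(src_path: str, tgt_path: str) -> Iterator[Tuple[str, str]]:
    """Yield (source, target) translation units from two line-aligned files.

    Assumes the common Moses layout: one sentence per line, same line
    count in both files. Raises ValueError if the line counts differ.
    """
    with open(src_path, encoding="utf-8") as src, \
            open(tgt_path, encoding="utf-8") as tgt:
        for src_line, tgt_line in zip_longest(src, tgt, fillvalue=_MISSING):
            if src_line is _MISSING or tgt_line is _MISSING:
                raise ValueError("source and target files differ in length")
            yield src_line.rstrip("\n"), tgt_line.rstrip("\n")


def count_tus(src_path: str, tgt_path: str) -> int:
    """Count the translation units in one EN-X pair."""
    return sum(1 for _ in read_moses_pair(src_path, tgt_path))


# Hypothetical usage (file names are placeholders, not the real ones):
#   for en, de in read_moses_pair("corpus.en-de.en", "corpus.en-de.de"):
#       print(en, "|||", de)
```

Reading the two files in lockstep keeps memory use constant, which matters for pairs of this size, and the length check catches a truncated or misaligned download early.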

DSI Relevance: eHealth