ELRC3.0 Multilingual corpus made out of PDF documents from the European Medicines Agency (EMEA), https://www.ema.europa.eu, (February 2020).

This dataset was generated from public content available through the European Medicines Agency (https://www.ema.europa.eu/) in February 2020.

The dataset contains 300 X-Y TMX files, where X and Y are CEF languages (180312670 translation units (TUs) in total). New methods for text extraction from PDF, sentence splitting, sentence alignment, and parallel corpus filtering have been applied. The following list holds the number of TUs per language pair:
bg-
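
As a rough illustration of how a per-pair TU count could be checked, the sketch below streams through one TMX file and counts its `<tu>` elements, collecting the `xml:lang` codes it sees along the way. It assumes the files follow the usual TMX 1.4 layout without an XML namespace on the element names; the function name `count_tus` and the in-memory sample document are hypothetical and are not part of the dataset or its tooling.

```python
import xml.etree.ElementTree as ET
from io import BytesIO

# xml:lang lives in the reserved XML namespace, so ElementTree exposes it under this key.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def count_tus(source):
    """Return (number of <tu> elements, set of xml:lang codes) for one TMX file.

    `source` may be a file path or a binary file object. The file is parsed
    incrementally so that large corpora do not have to fit in memory.
    Assumption: element names carry no XML namespace (plain "tu", "tuv").
    """
    tu_count = 0
    languages = set()
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "tuv":
            lang = elem.get(XML_LANG)
            if lang is not None:
                languages.add(lang.lower())
        elif elem.tag == "tu":
            tu_count += 1
            elem.clear()  # drop the finished unit's children to limit memory use
    return tu_count, languages


# Tiny hypothetical TMX document with placeholder segments, for demonstration only.
SAMPLE_TMX = b"""<?xml version="1.0" encoding="UTF-8"?>
<tmx version="1.4">
  <header creationtool="example" creationtoolversion="1" segtype="sentence"
          o-tmf="none" adminlang="en" srclang="bg" datatype="plaintext"/>
  <body>
    <tu>
      <tuv xml:lang="bg"><seg>placeholder source 1</seg></tuv>
      <tuv xml:lang="en"><seg>placeholder target 1</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="bg"><seg>placeholder source 2</seg></tuv>
      <tuv xml:lang="en"><seg>placeholder target 2</seg></tuv>
    </tu>
  </body>
</tmx>
"""

count, langs = count_tus(BytesIO(SAMPLE_TMX))
print(count, sorted(langs))
```

Passing a file path instead of the `BytesIO` object works the same way, so the function could be applied to each X-Y file in turn and the counts summed to compare against the stated total.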

DSI Relevance: eHealth