Multilingual corpus built from PDF documents published by the European Medicines Agency (EMEA, https://www.ema.europa.eu), February 2020.

This dataset was generated from public content available through the European Medicines Agency (https://www.ema.europa.eu/) in February 2020.

The dataset contains 24 EN-X TMX files, where X is a CEF language, with 17617914 translation units (TUs) in total. New methods for text extraction from PDF, sentence splitting, sentence alignment, and parallel corpus filtering have been applied. The following list holds the number of TUs per EN-X language pair:
bg-en 772699
cs-en 779082
da-en 775675
de-en 760573
el-en 781987
es-en 777371
et-en 769067
fi-en 753743
fr-en 773622
hr-en 650029
hu-en 772358
is-en 542623
it-en 778598
lt-en 764030
lv-en 783489
mt-en 410809
nl-en 762433
no-en 581379
pl-en 762903
pt-en 775623
ro-en 783741
sk-en 780097
sl-en 766138
sv-en 759845
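
Since the data is distributed as TMX, the translation units in one EN-X file could be read along the following lines. This is a minimal sketch, not the tooling used to build the corpus: it assumes the files follow the usual TMX layout (tu elements containing tuv elements, each with a language attribute and one seg child), and the function name iter_translation_units and the two sample sentences are made up for illustration.

```python
import io
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def iter_translation_units(source):
    """Yield one dict per <tu>, mapping language code to segment text.

    `source` is a file path or a binary file object. The file is parsed
    incrementally, so large TMX files need not fit in memory as a tree.
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "tu":
            continue
        unit = {}
        for tuv in elem.iter("tuv"):
            # Prefer xml:lang; fall back to a plain "lang" attribute.
            lang = tuv.get(XML_LANG) or tuv.get("lang")
            seg = tuv.find("seg")
            if lang and seg is not None:
                # itertext() also collects text nested in inline markup.
                unit[lang.lower()] = "".join(seg.itertext()).strip()
        yield unit
        elem.clear()  # release the children of the finished <tu>


# Invented two-unit sample standing in for a real de-en file (assumption).
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<tmx version="1.4">
  <header srclang="en"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Take one tablet daily.</seg></tuv>
      <tuv xml:lang="de"><seg>Täglich eine Tablette einnehmen.</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="en"><seg>Keep out of the reach of children.</seg></tuv>
      <tuv xml:lang="de"><seg>Für Kinder unzugänglich aufbewahren.</seg></tuv>
    </tu>
  </body>
</tmx>
""".encode("utf-8")

units = list(iter_translation_units(io.BytesIO(SAMPLE)))
for unit in units:
    print(unit["en"], "|", unit["de"])
print("TUs:", len(units))
```

Passing a file path instead of the in-memory sample works the same way, and counting the yielded units gives a figure that can be compared against the per-pair totals listed above.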

DSI Relevance: eHealth