Multilingual corpus built from PDF documents published by the European Medicines Agency (EMEA) at https://www.ema.europa.eu (February 2020), provided in Moses format.

Attribution details: This dataset has been generated out of public content available through European Medicines Agency: https://www.ema.europa.eu/, in February 2020

The dataset contains 24 EN-X Moses file pairs, where X is a CEF language (17,617,914 translation units (TUs) in total). New methods for PDF text extraction, sentence splitting, sentence alignment, and parallel corpus filtering have been applied. The following list gives the number of TUs per EN-X language pair:
bg-en 772699
cs-en 779082
da-en 775675
de-en 760573
el-en 781987
es-en 777371
et-en 769067
fi-en 753743
fr-en 773622
hr-en 650029
hu-en 772358
is-en 542623
it-en 778598
lt-en 764030
lv-en 783489
mt-en 410809
nl-en 762433
no-en 581379
pl-en 762903
pt-en 775623
ro-en 783741
sk-en 780097
sl-en 766138
sv-en 759845
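
As a quick consistency check, the per-pair counts above can be summed and compared with the stated total. The sketch below is illustrative only: the function name parse_counts and the variable names are hypothetical, and the figures are copied from the list above.

```python
# Per-pair TU counts, copied verbatim from the listing in this record.
TU_COUNTS_TEXT = """\
bg-en 772699
cs-en 779082
da-en 775675
de-en 760573
el-en 781987
es-en 777371
et-en 769067
fi-en 753743
fr-en 773622
hr-en 650029
hu-en 772358
is-en 542623
it-en 778598
lt-en 764030
lv-en 783489
mt-en 410809
nl-en 762433
no-en 581379
pl-en 762903
pt-en 775623
ro-en 783741
sk-en 780097
sl-en 766138
sv-en 759845
"""


def parse_counts(text):
    """Parse 'xx-en N' lines into a dict mapping language pair to TU count."""
    counts = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        pair, number = line.split()
        counts[pair] = int(number)
    return counts


counts = parse_counts(TU_COUNTS_TEXT)
total = sum(counts.values())

# The record states 24 language pairs and 17,617,914 TUs in total.
assert len(counts) == 24
assert total == 17617914

smallest = min(counts, key=counts.get)
largest = max(counts, key=counts.get)
print(f"{len(counts)} pairs, {total:,} TUs; smallest: {smallest}, largest: {largest}")
```

The per-pair figures add up to the stated total; the Maltese pair is the smallest and the Romanian pair the largest.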

DSI Relevance: eHealth

Distribution

Availability: Available

Licences

CC-BY-4.0

Conditions: Attribution

Distribution Details

Attribution Details: This dataset has been generated out of public content available through European Medicines Agency: https://www.ema.europa.eu/, in February 2020

Contact Person

Prokopis Prokopidis

text

Multilingual text corpus

Languages

Swedish (sv)

Romanian; Moldavian; Moldovan (ro)

Slovak (sk)

Slovenian (sl)

Spanish; Castilian (es)

English (en)

Dutch; Flemish (nl)

Danish (da)

Finnish (fi)

Estonian (et)

Bulgarian (bg)

Croatian (hr)

Czech (cs)

German (de)

French (fr)

Icelandic (is)

Hungarian (hu)

Latvian (lv)

Italian (it)

Maltese (mt)

Lithuanian (lt)

Norwegian Bokmål (nb)

Modern Greek (1453-) (el)

Portuguese (pt)

Polish (pl)

Linguality

Linguality type: Multilingual

Multi-linguality type: Parallel

Text Format

Size

17,617,914 Translation Units

Character encoding

UTF-8
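
Moses format conventionally means two plain-text files per language pair, with one sentence per line and line i of one file aligned to line i of the other. A minimal sketch of reading such a pair as UTF-8 follows, assuming that layout; the function name read_moses_pair and the file names in the usage note are hypothetical, since the actual file names in the distribution are not listed in this record.

```python
from itertools import zip_longest
from typing import Iterator, Tuple


def read_moses_pair(src_path, tgt_path) -> Iterator[Tuple[str, str]]:
    """Yield (source, target) sentence pairs from two line-aligned UTF-8 files.

    Raises ValueError if the two files do not have the same number of lines,
    because a length mismatch means the alignment cannot be trusted.
    """
    with open(src_path, encoding="utf-8") as src, \
         open(tgt_path, encoding="utf-8") as tgt:
        for line_no, (s, t) in enumerate(zip_longest(src, tgt), start=1):
            if s is None or t is None:
                raise ValueError(f"files differ in length at line {line_no}")
            # Strip only the trailing newline; keep other whitespace intact.
            yield s.rstrip("\n"), t.rstrip("\n")
```

With hypothetical file names, usage might look like `for en, de in read_moses_pair("corpus.en", "corpus.de"): ...`. Streaming the files line by line avoids loading a whole language pair into memory, which matters at several hundred thousand TUs per pair.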

Domains

SOCIAL QUESTIONS Health (Eurovoc 2841)

Resource Creation

Created using ELRC Services

Funding Project

European Language Resource Coordination 3.0 (ELRC3.0 - SMART 2019/1083 LC-01325001)

URL: http://www.lr-coordi...

Funding Type: EU Funds

Funder: European Commission

Funding Country: European Union (EU)

Metadata

Created: 06/11/2019

Last Updated: 23/04/2020

Metadata Language: English (en)

Version

Version: 1.0

Relations

Relation Type: Is Converted Version Of
