ELRC3.0 Multilingual corpus made out of PDF documents from the European Medicines Agency (EMEA), https://www.ema.europa.eu, (February 2020).
This dataset was generated from public content available from the European Medicines Agency (https://www.ema.europa.eu/) in February 2020.
The dataset contains 300 X-Y TMX files, where X and Y are CEF languages, with 180312670 translation units (TUs) in total. New methods for PDF text extraction, sentence splitting, sentence alignment, and parallel corpus filtering were applied. The following list gives the number of TUs per language pair:
bg-cs 688269
bg-da 680627
bg-de 671594
bg-el 686015
bg-es 691343
bg-et 660011
bg-fi 639777
bg-fr 685017
bg-hr 585893
bg-hu 661458
bg-is 422560
bg-it 693014
bg-lt 643368
bg-lv 685087
bg-mt 176930
bg-nb 507090
bg-nl 677089
bg-pl 661682
bg-pt 693698
bg-ro 712420
bg-sk 696492
bg-sl 669698
bg-sv 674691
cs-da 688471
cs-de 678731
cs-el 684186
cs-es 694531
cs-et 675192
cs-fi 655396
cs-fr 689273
cs-hr 583522
cs-hu 671978
cs-is 397968
cs-it 696151
cs-lt 660199
cs-lv 699358
cs-mt 177094
cs-nb 503658
cs-nl 684297
cs-pl 677638
cs-pt 698405
cs-ro 703225
cs-sk 716877
cs-sl 695006
cs-sv 680890
da-de 684272
da-el 679773
da-es 699194
da-et 667032
da-fi 661362
da-fr 692225
da-hr 567439
da-hu 665543
da-is 413380
da-it 698557
da-lt 638068
da-lv 679312
da-mt 178242
da-nb 527156
da-nl 695739
da-pl 649178
da-pt 700981
da-ro 694742
da-sk 678881
da-sl 668616
da-sv 700183
de-el 665313
de-es 681147
de-et 659663
de-fi 642893
de-fr 679446
de-hr 555682
de-hu 663558
de-is 391919
de-it 684120
de-lt 629881
de-lv 672889
de-mt 174310
de-nb 498117
de-nl 685510
de-pl 645228
de-pt 682553
de-ro 683435
de-sk 668101
de-sl 660206
de-sv 673598
el-es 702662
el-et 659331
el-fi 644226
el-fr 695326
el-hr 569723
el-hu 655679
el-is 412244
el-it 703988
el-lt 627192
el-lv 679866
el-mt 165931
el-nb 501392
el-nl 678435
el-pl 662627
el-pt 702512
el-ro 705311
el-sk 675665
el-sl 665568
el-sv 676700
en-bg 772699
en-cs 779082
en-da 775675
en-de 760573
en-el 781987
en-es 777371
en-et 769067
en-fi 753743
en-fr 773622
en-hr 650029
en-hu 772358
en-is 542623
en-it 778598
en-lt 764030
en-lv 783489
en-mt 410809
en-nb 581379
en-nl 762433
en-pl 762903
en-pt 775623
en-ro 783741
en-sk 780097
en-sl 766138
en-sv 759845
es-et 671438
es-fi 653518
es-fr 721529
es-hr 577036
es-hu 669217
es-is 409859
es-it 728161
es-lt 643449
es-lv 689442
es-mt 192855
es-nb 514227
es-nl 696265
es-pl 666301
es-pt 724413
es-ro 723721
es-sk 687738
es-sl 679832
es-sv 690789
et-fi 660751
et-fr 664365
et-hr 555108
et-hu 659073
et-is 386555
et-it 673229
et-lt 640116
et-lv 687985
et-mt 158544
et-nb 487250
et-nl 664910
et-pl 642047
et-pt 672169
et-ro 674086
et-sk 667171
et-sl 663678
et-sv 663562
fi-fr 647676
fi-hr 536734
fi-hu 642357
fi-is 380633
fi-it 654485
fi-lt 616363
fi-lv 654867
fi-mt 153451
fi-nb 474817
fi-nl 651602
fi-pl 624505
fi-pt 655541
fi-ro 652027
fi-sk 645486
fi-sl 641082
fi-sv 659350
fr-hr 569302
fr-hu 665400
fr-is 398323
fr-it 720068
fr-lt 634489
fr-lv 680170
fr-mt 184268
fr-nb 506533
fr-nl 691478
fr-pl 660690
fr-pt 721454
fr-ro 718928
fr-sk 679085
fr-sl 668598
fr-sv 685867
hr-hu 552352
hr-is 396421
hr-it 576767
hr-lt 538001
hr-lv 575442
hr-mt 149037
hr-nb 483378
hr-nl 562870
hr-pl 553385
hr-pt 579221
hr-ro 596907
hr-sk 582908
hr-sl 575995
hr-sv 563854
hu-is 373683
hu-it 668495
hu-lt 630885
hu-lv 672660
hu-mt 164030
hu-nb 479632
hu-nl 666674
hu-pl 641805
hu-pt 670583
hu-ro 675916
hu-sk 668238
hu-sl 656939
hu-sv 659745
is-it 408075
is-lt 352101
is-lv 396522
is-mt 92903
is-nb 414318
is-nl 405806
is-pl 379689
is-pt 412911
is-ro 423214
is-sk 402552
is-sl 388732
is-sv 415388
it-lt 644532
it-lv 690719
it-mt 198933
it-nb 512862
it-nl 693862
it-pl 670172
it-pt 728702
it-ro 721778
it-sk 687692
it-sl 677781
it-sv 691649
lt-lv 676715
lt-mt 156245
lt-nb 457909
lt-nl 633919
lt-pl 634803
lt-pt 645002
lt-ro 652860
lt-sk 656953
lt-sl 645308
lt-sv 630975
lv-mt 172483
lv-nb 493543
lv-nl 678047
lv-pl 669278
lv-pt 689998
lv-ro 697375
lv-sk 692463
lv-sl 682249
lv-sv 675848
mt-nb 127013
mt-nl 182312
mt-pl 162080
mt-pt 195230
mt-ro 194784
mt-sk 174438
mt-sl 165415
mt-sv 180255
nb-pl 469283
nb-pt 516820
nb-ro 523810
nb-sk 495403
nb-sl 486298
nb-sv 516531
nl-nb 509464
nl-pl 651840
nl-pt 697389
nl-ro 694335
nl-sk 676722
nl-sl 667170
nl-sv 688304
pl-pt 667974
pl-ro 676046
pl-sk 666615
pl-sl 662953
pl-sv 647503
pt-ro 727356
pt-sk 692041
pt-sl 678798
pt-sv 694776
ro-sk 701384
ro-sl 685909
ro-sv 691584
sk-sl 688466
sk-sv 673352
sl-sv 662705
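Since each language pair is distributed as a TMX file, the translation units can be read with a standard XML parser. The following is a minimal sketch, assuming the files follow the usual TMX layout (one `<tu>` element per translation unit, each holding one `<tuv xml:lang="...">` with a `<seg>` per language). The function name `iter_translation_units` and the sample sentences are hypothetical and are not taken from the corpus; the actual files may carry additional attributes or properties that this sketch ignores.

```python
import io
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def iter_translation_units(source):
    """Yield one {language code: segment text} dict per <tu> element.

    `source` may be a file path or a binary file-like object.
    The file is streamed, so large TMX files need not fit in memory.
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "tu":
            continue
        unit = {}
        for tuv in elem.iter("tuv"):
            # Fall back to a plain "lang" attribute (an assumption for older files).
            lang = tuv.get(XML_LANG) or tuv.get("lang")
            seg = tuv.find("seg")
            if lang is not None and seg is not None:
                unit[lang.lower()] = "".join(seg.itertext()).strip()
        yield unit
        elem.clear()  # drop the processed <tu> subtree to limit memory use


# Made-up two-language sample standing in for a real X-Y TMX file.
SAMPLE_TMX = """<?xml version="1.0" encoding="UTF-8"?>
<tmx version="1.4">
  <header/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Store in a refrigerator.</seg></tuv>
      <tuv xml:lang="de"><seg>Im Kühlschrank lagern.</seg></tuv>
    </tu>
  </body>
</tmx>""".encode("utf-8")

for unit in iter_translation_units(io.BytesIO(SAMPLE_TMX)):
    print(unit)
```

To process a downloaded file, pass its path instead of the in-memory sample. Counting the yielded units gives a figure that can be compared with the TU count listed for that language pair.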
DSI Relevance: eHealth