SciPar: A collection of parallel corpora from scientific abstracts (v. 2021) in TMX format.
Collection of 31 bilingual TMX files, mainly for EN-X language pairs, where X is BG, CS, DE, EL, ES, ET, FI, FR, HR, HU, IS, IT, LT, LV, MK, NB, NN, PL, PT, RU, SK, SL, SQ, or SV. It also contains small collections for a few other language combinations. It was generated by processing abstracts of Bachelor's, Master's, and PhD theses available in academic repositories and archives. The total number of translation units (TUs) is 9172462, distributed across language pairs as follows:
de-es 268
de-fr 281
de-ru 198
en-bg 2301
en-cs 1064384
en-de 890184
en-el 742986
en-es 354459
en-et 83478
en-fi 457341
en-fr 1123121
en-hr 806580
en-hu 27421
en-is 110830
en-it 31279
en-lt 177436
en-lv 347472
en-mk 4940
en-nb 56055
en-nn 2380
en-pl 862075
en-pt 974167
en-ru 3063
en-sk 60467
en-sl 300016
en-sq 7779
en-sv 670815
es-fr 4915
es-ru 728
fr-ru 1333
mk-sq 3710
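
Since the files are in TMX format, the per-pair TU counts above could in principle be checked by counting `<tu>` elements. The following is a minimal sketch, assuming the files follow the usual TMX layout (`<tu>` elements containing `<tuv xml:lang="...">` elements, each with a `<seg>`). The function names `count_tus` and `iter_pairs` and the sample content are illustrative assumptions, not part of the resource.

```python
import io
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Tiny in-memory sample standing in for a real file (content is made up).
SAMPLE_TMX = """<?xml version="1.0" encoding="UTF-8"?>
<tmx version="1.4">
  <header creationtool="example" creationtoolversion="1" segtype="sentence"
          o-tmf="none" adminlang="en" srclang="en" datatype="plaintext"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>This thesis studies parallel corpora.</seg></tuv>
      <tuv xml:lang="de"><seg>Diese Arbeit untersucht Parallelkorpora.</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="en"><seg>Results are discussed.</seg></tuv>
      <tuv xml:lang="de"><seg>Die Ergebnisse werden diskutiert.</seg></tuv>
    </tu>
  </body>
</tmx>
"""


def count_tus(source):
    """Count <tu> elements, streaming so large files are not loaded at once."""
    count = 0
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "tu":
            count += 1
            elem.clear()  # drop the unit's children once it has been counted
    return count


def iter_pairs(source, src_lang="en", tgt_lang="de"):
    """Yield (source_text, target_text) for units that contain both languages."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "tu":
            continue
        segments = {}
        for tuv in elem.findall("tuv"):
            lang = (tuv.get(XML_LANG) or "").lower()
            seg = tuv.find("seg")
            if seg is not None:
                segments[lang] = "".join(seg.itertext()).strip()
        if src_lang in segments and tgt_lang in segments:
            yield segments[src_lang], segments[tgt_lang]
        elem.clear()


if __name__ == "__main__":
    print(count_tus(io.BytesIO(SAMPLE_TMX.encode("utf-8"))))
    for src, tgt in iter_pairs(io.BytesIO(SAMPLE_TMX.encode("utf-8"))):
        print(src, "|", tgt)
```

Both functions accept a file path as well as a file object, so a downloaded file could be passed directly (for example `count_tus("en-de.tmx")`, where the file name is hypothetical). The language codes stored in the actual files may differ in case or carry region suffixes, in which case the comparison in `iter_pairs` would need adjusting.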