SciPar: A collection of parallel corpora from scientific abstracts (v. 2021) in MOSES format.

A collection of 31 pairs of MOSES-like files, mainly for EN-X language pairs, where X is BG, CS, DE, EL, ES, ET, FI, FR, HR, HU, IS, IT, LT, LV, MK, NB, NN, PL, PT, RU, SK, SL, SQ, SV. It also contains small collections for a few more language combinations. It was generated by processing abstracts of Bachelor's, Master's and PhD theses available at academic repositories and archives. The total number of translation units (TUs) is 9172462. The number of TUs per language pair is listed below.
de-es 268
de-fr 281
de-ru 198
en-bg 2301
en-cs 1064384
en-de 890184
en-el 742986
en-es 354459
en-et 83478
en-fi 457341
en-fr 1123121
en-hr 806580
en-hu 27421
en-is 110830
en-it 31279
en-lt 177436
en-lv 347472
en-mk 4940
en-nb 56055
en-nn 2380
en-pl 862075
en-pt 974167
en-ru 3063
en-sk 60467
en-sl 300016
en-sq 7779
en-sv 670815
es-fr 4915
es-ru 728
fr-ru 1333
mk-sq 3710
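
As an illustration of how such a pair of files might be read, here is a minimal Python sketch. It assumes the usual MOSES convention of two plain-text files aligned line by line (line N of one file is the translation of line N of the other) and assumes UTF-8 encoding. The function names and the file names in the usage comment are hypothetical and are not taken from the distribution.

```python
from itertools import zip_longest

# Sentinel used to detect that one file ended before the other.
_MISSING = object()


def read_moses_pair(src_path, tgt_path):
    """Yield (source, target) segment pairs from two line-aligned text files.

    Assumption: both files are UTF-8 and hold one segment per line.
    """
    with open(src_path, encoding="utf-8") as src, \
         open(tgt_path, encoding="utf-8") as tgt:
        pairs = zip_longest(src, tgt, fillvalue=_MISSING)
        for line_no, (src_line, tgt_line) in enumerate(pairs, start=1):
            if src_line is _MISSING or tgt_line is _MISSING:
                raise ValueError(f"files differ in length at line {line_no}")
            yield src_line.rstrip("\n"), tgt_line.rstrip("\n")


def count_tus(src_path, tgt_path):
    """Count the translation units (aligned line pairs) in one file pair."""
    return sum(1 for _ in read_moses_pair(src_path, tgt_path))


# Hypothetical usage; the actual file names in the archive may differ:
# print(count_tus("scipar.en-de.en", "scipar.en-de.de"))
```

Reading lazily with a generator keeps memory use flat for the larger pairs, and the length check catches a misaligned pair instead of silently truncating it.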