Romance-Croatian Parallel Corpus


RomCro, the Parallel Corpus of Romance Languages and Croatian, was constructed at the Department of Romance Languages and Literatures of the Faculty of Humanities and Social Sciences, University of Zagreb. The corpus brings together five Romance languages (French, Italian, Portuguese, Romanian, Spanish) and Croatian. It consists of literary texts from the 20th and 21st centuries, in which each original sentence is aligned with its translation equivalents in the remaining five languages. The original order of the sentences is scrambled.
The corpus size is 15.9 million words (15,861,605), distributed by language as follows: French 2.9 Mw (2,971,267), Italian 2.5 Mw (2,585,828), Portuguese 2.5 Mw (2,551,968), Romanian 2.6 Mw (2,647,311), Spanish 2.7 Mw (2,700,742), and Croatian 2.4 Mw (2,404,489). The total number of translation units (TUs) is 142,470. To make the corpus usable for different purposes, we provide it in two formats: TMX and TSV. In both formats, the order of languages is Spanish (es), French (fr), Italian (it), Portuguese (pt), Romanian (ro), Croatian (hr). The TMX and TSV files also contain notes on the original language, the writer, and the title of the text each segment comes from.
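As an illustration of how the TSV format might be used, the following Python sketch extracts sentence pairs for one language pair. It rests on several assumptions that are not stated in this description: that the six language columns come first in each row in the order given above, that the file has no header row, and that the metadata notes occupy further columns, which the sketch ignores. The function name `read_pairs` and the two sample rows are invented for the example and are not taken from the corpus.

```python
import csv
import io

# Language order as stated in the corpus description.
LANGS = ["es", "fr", "it", "pt", "ro", "hr"]


def read_pairs(handle, src, tgt):
    """Yield (source, target) sentence pairs from a tab-separated stream.

    Assumes the first six columns hold the languages in LANGS order;
    any further columns (e.g. metadata notes) are ignored.
    """
    i, j = LANGS.index(src), LANGS.index(tgt)
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < len(LANGS):
            continue  # skip short or malformed rows
        s, t = row[i].strip(), row[j].strip()
        if s and t:  # keep only rows where both sides are non-empty
            yield s, t


# Toy data standing in for the real file (hypothetical content).
sample = io.StringIO(
    "Hola.\tBonjour.\tCiao.\tOlá.\tBună.\tBok.\n"
    "Gracias.\tMerci.\tGrazie.\tObrigado.\tMulțumesc.\tHvala.\n"
)
pairs = list(read_pairs(sample, "fr", "hr"))
```

With the actual file, the same function could be given a handle opened with `open(path, encoding="utf-8", newline="")`. The column layout should be checked against the distributed file before relying on it.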

