MARCELL Croatian-English Parallel Corpus of Legislative Texts

MARCELL Croatian-English Parallel Corpus of Legislative Texts contains the full body of Croatian legislative documents that have been translated into English (1,563 documents) and a set of Croatia’s international treaties (253 documents), totaling 1,816 documents. The corpus contains 14,379,657 tokens in Croatian and 17,673,788 in English. It has been split into paragraphs and sentences and aligned at the segment level, and each of the 396,984 translation units (TUs) was manually checked for alignment. The file format is TMX (v1.4); the header stores additional metadata on document type, year of production, assigned EUROVOC descriptor or descriptors, and domain.
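Since the corpus is distributed as TMX, the aligned segments and the header metadata can be read with a standard XML parser. The following is a minimal sketch in Python, not the corpus's official tooling. It assumes the language codes are `hr` and `en` (possibly with a region suffix) and that the header metadata is stored in TMX `<prop>` elements; the `type` names used in the embedded sample (`year`, `domain`) and the sample sentence pair are invented for illustration and are not taken from the corpus.

```python
import xml.etree.ElementTree as ET

# Qualified name of the xml:lang attribute as ElementTree exposes it.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Tiny invented TMX sample; the prop types and the sentence pair are
# hypothetical stand-ins, not data from the MARCELL corpus.
SAMPLE_TMX = """<tmx version="1.4">
  <header creationtool="example" creationtoolversion="1.0" segtype="sentence"
          o-tmf="none" adminlang="en" srclang="hr" datatype="plaintext">
    <prop type="year">2013</prop>
    <prop type="domain">law</prop>
  </header>
  <body>
    <tu>
      <tuv xml:lang="hr"><seg>Ovaj Zakon stupa na snagu osmoga dana od dana objave.</seg></tuv>
      <tuv xml:lang="en"><seg>This Act shall enter into force on the eighth day after its publication.</seg></tuv>
    </tu>
  </body>
</tmx>"""


def read_header_props(root):
    """Collect <prop> values from the TMX header, grouped by their type attribute."""
    header = root.find("header")
    props = {}
    if header is None:
        return props
    for prop in header.findall("prop"):
        props.setdefault(prop.get("type"), []).append((prop.text or "").strip())
    return props


def iter_translation_units(root, src="hr", tgt="en"):
    """Yield (source, target) segment pairs for every TU that has both languages."""
    for tu in root.iter("tu"):
        segments = {}
        for tuv in tu.findall("tuv"):
            # Normalise codes such as "hr-HR" or "EN" to a bare lowercase prefix.
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower().split("-")[0]
            seg = tuv.find("seg")
            if seg is not None:
                # itertext() also picks up text inside inline markup within <seg>.
                segments[lang] = "".join(seg.itertext()).strip()
        if src in segments and tgt in segments:
            yield segments[src], segments[tgt]


root = ET.fromstring(SAMPLE_TMX)
print(read_header_props(root))
for hr_text, en_text in iter_translation_units(root):
    print(hr_text, "|||", en_text)
```

For the real files, `ET.parse(path).getroot()` would replace `ET.fromstring(SAMPLE_TMX)`; given the corpus size, a streaming approach with `ET.iterparse` may be preferable to loading a whole file into memory.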