Catalan > English CatGov Corpus

The Catalan > English CatGov Corpus is a Catalan > English parallel corpus of 37,116 segments. It was compiled from human English translations of Catalan source documents provided by the Generalitat de Catalunya; the documents were automatically cleaned and aligned.

The corpus consists of 77 documents, with an average of 482 sentences per document. We also provide enriched metadata for each segment: the alignment score obtained by Vecalign, the domain the segment belongs to, and the source filename, so that the corpus can also be used at the document level.
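
As a rough illustration of document-level use, the sketch below regroups segments by their source filename and optionally filters them on the alignment score. The field names (`ca`, `en`, `score`, `domain`, `filename`) and the toy records are assumptions made for this example, not the actual schema of the corpus. Whether a higher or lower score means a better alignment is not stated here, so the filter is passed in as a predicate rather than hard-coded.

```python
from collections import defaultdict

# Toy records standing in for corpus rows. All field names and values
# here are hypothetical, chosen only for illustration.
segments = [
    {"ca": "Bon dia.", "en": "Good morning.", "score": 0.10,
     "domain": "general", "filename": "doc_a.txt"},
    {"ca": "Gràcies.", "en": "Thank you.", "score": 0.80,
     "domain": "general", "filename": "doc_a.txt"},
    {"ca": "Benvinguts.", "en": "Welcome.", "score": 0.20,
     "domain": "tourism", "filename": "doc_b.txt"},
]


def group_by_document(records, keep=None):
    """Group (Catalan, English) pairs by source filename, in input order.

    `keep` is an optional predicate over a record; records for which it
    returns False are dropped before grouping.
    """
    documents = defaultdict(list)
    for record in records:
        if keep is not None and not keep(record):
            continue
        documents[record["filename"]].append((record["ca"], record["en"]))
    return dict(documents)


# Rebuild documents from all segments.
all_docs = group_by_document(segments)

# Rebuild documents keeping only segments that pass an example threshold.
# The direction and value of the threshold are assumptions.
filtered = group_by_document(segments, keep=lambda r: r["score"] < 0.5)
```

Keeping the filter as a predicate means the same helper can also select by domain, for example `keep=lambda r: r["domain"] == "tourism"`.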

We provide a test split to be used as a benchmark.

Funded by the Generalitat de Catalunya, Departament de Polítiques Digitals i Administració Pública (Projecte AINA).