Catalan > Spanish CatGov Corpus

The Catalan > Spanish CatGov Corpus is a parallel corpus of 63,773 segments. It was compiled from Catalan source documents and their human translations into Spanish, both provided by the Generalitat de Catalunya. The data were automatically cleaned and aligned.

The corpus consists of 346 documents with an average of 132 sentences per document. Each segment also carries enriched metadata: the alignment score obtained by Vecalign, the domain the segment belongs to, and the source filename, so that the corpus can be used at document level.
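
As a rough illustration of document-level use, the sketch below groups segments by their source filename and summarizes each document. The record layout and the field names (`ca`, `es`, `score`, `domain`, `filename`) are assumptions made for this example, as are the sample values; check the distributed files for the actual column names before adapting it.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: field names and values are illustrative assumptions,
# not the actual schema or contents of the corpus.
SAMPLE_RECORDS = [
    {"ca": "Bon dia.", "es": "Buenos días.",
     "score": 0.25, "domain": "health", "filename": "doc_a.txt"},
    {"ca": "Moltes gràcies.", "es": "Muchas gracias.",
     "score": 0.75, "domain": "health", "filename": "doc_a.txt"},
    {"ca": "Article primer.", "es": "Artículo primero.",
     "score": 0.5, "domain": "legal", "filename": "doc_b.txt"},
]


def group_by_document(records):
    """Group segments by source filename, keeping their original order."""
    documents = defaultdict(list)
    for record in records:
        documents[record["filename"]].append(record)
    return dict(documents)


def summarize(documents):
    """Return segment count, mean alignment score and domains per document."""
    return {
        name: {
            "segments": len(segments),
            "mean_score": mean(s["score"] for s in segments),
            "domains": sorted({s["domain"] for s in segments}),
        }
        for name, segments in documents.items()
    }


documents = group_by_document(SAMPLE_RECORDS)
summary = summarize(documents)

for name, stats in summary.items():
    print(name, stats["segments"], stats["mean_score"], stats["domains"])
```

The sketch only aggregates the alignment scores and does not filter on them, because whether a higher or lower value indicates a better alignment depends on how the scores were exported.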

We provide a test split to be used as a benchmark.

Funded by the Generalitat de Catalunya, Departament de Polítiques Digitals i Administració Pública (Projecte AINA).