European Commission Corpus

The English > Catalan European Commission Corpus is a parallel corpus of 46,048 segments. It was compiled from English source documents and their human translations into Catalan, provided by the Representation of the European Commission in Barcelona. The source files belong to six domains: CIE (Science and Technology), ECO (Economy), EDU (Education), ENV (Environment), INS (Institutional), and SOC (Social Issues).
The corpus consists of 1,071 documents with an average of 43 sentences per document. We also provide enriched metadata for each segment: the alignment score obtained with Vecalign, the domain the segment belongs to, and the source filename, so that the corpus can also be used at document level.
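
As a rough illustration of document-level use, the sketch below groups segments by their source filename and counts segments per domain. The records are toy data invented for this example, and the field names (`en`, `ca`, `score`, `domain`, `filename`) are assumptions rather than the released schema, so adapt them to the actual files.

```python
from collections import Counter, defaultdict

# Toy records invented for illustration. The field names ("en", "ca",
# "score", "domain", "filename") are assumptions, not the released schema.
segments = [
    {"en": "The water is clean.", "ca": "L'aigua és neta.",
     "score": 0.12, "domain": "ENV", "filename": "doc_a.txt"},
    {"en": "Air quality improved.", "ca": "La qualitat de l'aire ha millorat.",
     "score": 0.20, "domain": "ENV", "filename": "doc_a.txt"},
    {"en": "Schools reopen.", "ca": "Les escoles tornen a obrir.",
     "score": 0.15, "domain": "EDU", "filename": "doc_b.txt"},
]


def group_by_document(records):
    """Collect segments under their source filename, preserving input order."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record["filename"]].append(record)
    return dict(grouped)


documents = group_by_document(segments)

# Number of segments per domain code (CIE, ECO, EDU, ENV, INS, SOC).
domain_counts = Counter(record["domain"] for record in segments)

# Rebuild the English side of each document from its segments.
for filename, doc_segments in documents.items():
    english_text = " ".join(s["en"] for s in doc_segments)
    print(filename, len(doc_segments), english_text)
```

Grouping on the filename keeps segments in the order they appear in the input, which matters if the released files store segments in document order; that ordering is also an assumption to check against the data.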

We provide a test split to be used as a benchmark.

Funded by the Generalitat de Catalunya, Departament de Polítiques Digitals i Administració Pública (Projecte AINA).