GEnCaTA: a parallel Catalan-English corpus

GEnCaTA is a Catalan↔English parallel corpus of 38,595 segments. It was compiled from parallel data obtained by crawling the gencat.cat domain and its subdomains, which belong to the Catalan Government, in both English and Catalan.

The file urls.txt lists the source URL of each aligned sentence.
The file scores.txt contains the scores assigned by vecalign.
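
As a rough illustration of how the two metadata files might be used together, here is a minimal Python sketch. It assumes, without confirmation from the corpus itself, that urls.txt and scores.txt each hold one entry per line, that line N in both files refers to the same aligned segment, and that each line of scores.txt is a single number. The function names read_lines and load_metadata are hypothetical helpers, not part of the corpus distribution.

```python
from pathlib import Path


def read_lines(path):
    """Return the lines of a UTF-8 text file without trailing newlines."""
    return Path(path).read_text(encoding="utf-8").splitlines()


def load_metadata(urls_path="urls.txt", scores_path="scores.txt"):
    """Pair each URL line with its score.

    Assumption: both files are line-aligned and every line of the
    scores file parses as a single float.
    """
    urls = read_lines(urls_path)
    scores = [float(line) for line in read_lines(scores_path)]
    if len(urls) != len(scores):
        # A mismatch means the line-alignment assumption does not hold.
        raise ValueError(
            f"{len(urls)} URL lines but {len(scores)} score lines"
        )
    return list(zip(urls, scores))
```

If the assumptions hold, load_metadata() run in the corpus directory would return one (url, score) pair per segment. How the scores should be interpreted (for example, whether lower or higher is better) is not stated here, so check the vecalign documentation before filtering on them.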