CEF Data Marketplace multilingual benchmark for the evaluation of cleaning and clustering tools

Five parallel corpora (En-Cs, En-De, En-It, En-Lv, De-It) manually annotated by professional translators. Each translation unit (TU) in the datasets is annotated to indicate whether (i) it is clean, i.e. the translation is correct and fully equivalent to its source text, and (ii) it belongs to the Legal domain. The resulting gold standards were used to evaluate the Cleaning and Clustering services offered by the CEF Data Marketplace platform.
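
As an illustration of how such a gold standard could be used, the following minimal Python sketch scores the output of a cleaning tool against the manual "clean" annotation. The `TranslationUnit` record, its field names, the toy sentences, and the choice of precision, recall and F1 as metrics are all assumptions made for this example; they do not describe the actual file format of the corpora or the evaluation procedure used on the platform.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TranslationUnit:
    """Hypothetical in-memory view of one annotated TU (field names are assumed)."""
    source: str
    target: str
    is_clean: bool  # gold annotation (i): translation is correct and equivalent
    is_legal: bool  # gold annotation (ii): TU belongs to the Legal domain


def precision_recall_f1(gold, predicted):
    """Score binary predictions against gold labels; True is the positive class."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    tp = sum(1 for g, p in zip(gold, predicted) if g and p)
    fp = sum(1 for g, p in zip(gold, predicted) if not g and p)
    fn = sum(1 for g, p in zip(gold, predicted) if g and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Toy En-It data invented for this sketch, not taken from the corpora.
units = [
    TranslationUnit("The contract is void.", "Il contratto è nullo.", True, True),
    TranslationUnit("The contract is void.", "Buongiorno a tutti.", False, False),
    TranslationUnit("See Article 5.", "Si veda l'articolo 5.", True, True),
    TranslationUnit("Click here.", "Click here.", False, False),
]

# Hypothetical decisions of a cleaning tool: True means "kept as clean".
tool_says_clean = [True, True, False, False]

gold_clean = [tu.is_clean for tu in units]
precision, recall, f1 = precision_recall_f1(gold_clean, tool_says_clean)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

The same function could be applied to the Legal-domain annotation by passing `[tu.is_legal for tu in units]` as the gold labels, assuming the clustering output is first reduced to a binary legal / non-legal decision per TU.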