CEF Data Marketplace multilingual benchmark for the evaluation of cleaning and clustering tools

CEF-DM Multilingual Benchmark

Five parallel corpora (En-Cs, En-De, En-It, En-Lv, De-It) manually annotated by professional translators. Each translation unit (TU) in the datasets is annotated for whether (i) it is clean, i.e. the translation is correct and fully equivalent to its source text, and (ii) it belongs to the Legal domain. The resulting gold standards were used to evaluate the Cleaning and Clustering services offered by the CEF Data Marketplace platform.
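As an illustration of how gold annotations of this kind could be used, the sketch below scores a cleaning tool's keep/discard decisions against the "clean" label. This record does not describe the file layout or the metrics actually used in the evaluation, so the `TranslationUnit` fields, the toy sentences, the tool output, and the choice of precision/recall/F1 are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TranslationUnit:
    """One annotated TU; field names are hypothetical, not the dataset's schema."""
    source: str
    target: str
    is_clean: bool   # annotation (i): translation correct and fully equivalent
    is_legal: bool   # annotation (ii): TU belongs to the Legal domain


def precision_recall_f1(gold: list[bool], predicted: list[bool]) -> tuple[float, float, float]:
    """Score binary predictions against gold labels (True is the positive class)."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    tp = sum(g and p for g, p in zip(gold, predicted))
    fp = sum((not g) and p for g, p in zip(gold, predicted))
    fn = sum(g and (not p) for g, p in zip(gold, predicted))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy En-It examples invented for illustration; they are not taken from the corpus.
gold_units = [
    TranslationUnit("The contract is void.", "Il contratto è nullo.", True, True),
    TranslationUnit("Good morning.", "Buongiorno.", True, False),
    TranslationUnit("The court ruled.", "Il gatto dorme.", False, True),
    TranslationUnit("See page 3.", "Vedi pagina 3.", True, False),
]
# Hypothetical output of a cleaning tool: True means "kept as clean".
tool_says_clean = [True, True, True, False]

p, r, f = precision_recall_f1([tu.is_clean for tu in gold_units], tool_says_clean)
print(f"precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

The same function could be applied to the `is_legal` flag to score a domain clustering tool, under the same assumptions. In the toy data the tool keeps one bad TU and discards one good TU, so precision, recall, and F1 all come out at 2/3.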

Distribution

Availability: Available

Licences

CC-BY-4.0

Conditions: Attribution

Distribution Details

Distribution Medium: Data Downloadable

Contact Persons

Luisa Bentivogli

Marco Turchi

text

Bilingual text corpus

Languages

Latvian (lv) (2,500 Sentences)

Czech (cs) (2,500 Sentences)

English (en) (2,500 Sentences)

Italian (it) (2,500 Segments)

German (de) (2,500 Sentences)

Linguality

Linguality type: Bilingual

Multi-linguality type: Parallel (parallel corpora for 5 language directions: En-Cs, En-De, En-It, En-Lv, De-It)

Text Format

Plain Text

Size

2,500 Sentences

Resource Creation

Resource Creator

Fondazione Bruno Kessler

TAUS

Translated

Funding Project

CEF Data Marketplace (CEF-Data-Marketplace)

URL: https://ec.europa.eu...

Funding Type: EU Funds

Metadata

Created: 30/10/2020

Last Updated: 30/10/2020

Metadata Language: English (en)

Metadata Creator

Luisa Bentivogli

Version

Version: 1.0
