Multilingual corpus in HEALTH (COVID-19) domain part_1a (v.1.05) in TMX format.
This dataset has been generated from public content available through several websites of national agencies (https://www.ecdc.europa.eu/en/COVID-19/national-sources) and selected broadcast websites such as Global Voices, Voxeurop, voltairenet, etc.
The dataset contains 327 X-Y TMX files, where X and Y belong to the set {CEF languages plus IS and NO} (3,044,961 translation units (TUs) in total). Acquisition of data (from multilingual/bilingual websites), normalization, cleaning, deduplication and identification of parallel documents were performed with the ILSP-FC tool. Multilingual embeddings (LASER) were used to align segments. Segment pairs were then merged and filtered.
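As a rough illustration of how the TUs in one of these TMX files could be read, the following minimal sketch uses only the Python standard library. It assumes the usual TMX layout (`tu` elements containing `tuv` elements with an `xml:lang` attribute and a `seg` child); the function name `read_tmx` and the inline sample are hypothetical and are not part of the dataset or of the ILSP-FC tool.

```python
import io
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def read_tmx(source):
    """Yield one dict per translation unit: {language code: segment text}.

    `source` is a file path or a binary file object (hypothetical helper).
    """
    # iterparse streams the file, so large TMX files need not fit in memory.
    for _, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "tu":
            continue
        unit = {}
        for tuv in elem.iter("tuv"):
            # Older TMX versions use a plain "lang" attribute instead.
            lang = tuv.get(XML_LANG) or tuv.get("lang")
            seg = tuv.find("seg")
            if lang and seg is not None:
                # itertext() also collects text nested in inline markup.
                unit[lang.lower()] = "".join(seg.itertext()).strip()
        yield unit
        elem.clear()  # release the processed <tu> subtree


# Tiny made-up sample standing in for a real X-Y file.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<tmx version="1.4"><header/><body>
<tu>
  <tuv xml:lang="en"><seg>Wash your hands.</seg></tuv>
  <tuv xml:lang="fr"><seg>Lavez-vous les mains.</seg></tuv>
</tu>
</body></tmx>"""

for unit in read_tmx(io.BytesIO(SAMPLE)):
    print(unit)
```

For a downloaded file, the same function can be called with its path instead of the in-memory sample. Streaming with `iterparse` is chosen here because a single language pair may contain a large number of TUs.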
DSI Relevance: eHealth
People who looked at this resource also viewed the following:
- Web-acquired data related to culture (Part I). Multilingual (BG, CS, DA, DE, EL, EN, ET, FI, FR, HR, IS, IT, LT, LV, MK, MT, RU, SK, SV) collection of files in Moses format.
- Multilingual corpus in HEALTH (COVID-19) domain part_1a (v.1.0) in TSV/MOSES-like format.
- COVID-19 Voltaire dataset v1. Multilingual (EN, AR, CS, DE, EL, ES, FA, FR, IT, NB, NL, NN, PL, PT, RO, RU, TR)
- Multilingual corpus in HEALTH (COVID-19) domain part_1b (v.1.0) in TSV/MOSES-like format.
People who downloaded this resource also downloaded the following:
- Multilingual corpus in HEALTH (COVID-19) domain part_1a (v.1.0) in TMX format.
- Compilation of Croatian-Dutch; Flemish parallel corpora resources used for training of NTEU Machine Translation engines. Tier 3.
- COVID-19-related multilingual corpus from EU press Corner 2020 v.0.9 in TMX format
- Multilingual corpus from the European Vaccination Information Portal