Multilingual corpus in HEALTH (COVID-19) domain part_1a (v.1.05) in TMX format.
This dataset has been generated from public content available through several websites of national agencies (https://www.ecdc.europa.eu/en/COVID-19/national-sources) and selected broadcast websites such as Global Voices, Voxeurop, and voltairenet.
The dataset contains 327 X-Y TMX files, where X and Y belong to the set of CEF languages plus IS and NO (3,044,961 TUs in total). Data acquisition (from multilingual and bilingual websites), normalization, cleaning, deduplication, and identification of parallel documents were done with the ILSP-FC tool. Multilingual embeddings (LASER) were used to align segments. Merging and filtering of segment pairs have also been applied.
DSI Relevance: eHealth
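For readers who want to inspect the TMX files described above, the following is a minimal Python sketch of how translation units (TUs) could be read from one X-Y file using only the standard library. It assumes the files follow the usual TMX layout (`tu` elements containing `tuv` elements with an `xml:lang` attribute and a `seg` child); the function name `read_tmx_pairs` and the inline sample are hypothetical and are not taken from the dataset.

```python
import io
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def read_tmx_pairs(source, src_lang, tgt_lang):
    """Yield (source_text, target_text) for each TU holding both languages.

    `source` may be a file path or a binary file object. This helper is a
    hypothetical illustration, not part of the dataset or of ILSP-FC.
    """
    src_lang, tgt_lang = src_lang.lower(), tgt_lang.lower()
    # iterparse streams the file, so large TMX files need not fit in memory.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "tu":
            continue
        segs = {}
        for tuv in elem.findall("tuv"):
            # Older TMX versions use a plain "lang" attribute instead.
            lang = tuv.get(XML_LANG) or tuv.get("lang") or ""
            seg = tuv.find("seg")
            if seg is not None:
                segs[lang.lower()] = "".join(seg.itertext()).strip()
        if src_lang in segs and tgt_lang in segs:
            yield segs[src_lang], segs[tgt_lang]
        elem.clear()  # release the processed TU


# Hypothetical two-TU sample; the second TU lacks a French segment.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<tmx version="1.4">
  <header srclang="en" datatype="plaintext" segtype="sentence"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Wash your hands.</seg></tuv>
      <tuv xml:lang="fr"><seg>Lavez-vous les mains.</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="en"><seg>Stay at home.</seg></tuv>
    </tu>
  </body>
</tmx>"""

pairs = list(read_tmx_pairs(io.BytesIO(SAMPLE), "en", "fr"))
```

Streaming with `iterparse` is chosen here because the corpus holds about three million TUs in total; whether the actual files carry additional per-TU properties (for example alignment scores) is not stated in this description, so the sketch ignores them.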
People who looked at this resource also viewed the following:
- Multilingual corpus in HEALTH (COVID-19) domain part_1a (v.1.05) in TSV/MOSES-like format.
- COVID-19 Government of Canada dataset v2. Multilingual (EN, FR, DE, ES, EL, IT, PL, PT, RO, KO, RU, ZH, UK, VI, TA, TL)
- Web-acquired data related to culture (Part I). Multilingual (BG, CS, DA, DE, EL, EN, ET, FI, FR, HR, IS, IT, LT, LV, MK, MT, RU, SK, SV) collection of files in Moses format.
- Multilingual corpus in HEALTH (COVID-19) domain part_1a (v.1.0) in TSV/MOSES-like format.
People who downloaded this resource also downloaded the following:
- COVID-19 - HEALTH Wikipedia dataset. Multilingual (52 EN-X language pairs)
- COVID-19-related multilingual corpus from EU press Corner 2020 v.0.9 in TMX format
- Multilingual corpus from the European Vaccination Information Portal
- COVID-19 OSHA-EUROPA dataset v1. Multilingual (CEF languages plus IS and NB but not Irish)