Multilingual corpus in HEALTH (COVID-19) domain part_1a (v.1.0) in TSV/MOSES-like format.
This dataset was generated from public content available on the websites of several national agencies (https://www.ecdc.europa.eu/en/COVID-19/national-sources) and on selected broadcast websites such as Global Voices, Voxeurop and voltairenet.
The dataset contains 327 X-Y TSV/MOSES-like (pairs of) files, where X and Y belong to the set {CEF languages plus IS and NO} (3905604 TUs in total). Data acquisition (from multilingual and bilingual websites), normalization, cleaning, deduplication and identification of parallel documents were performed with the ILSP-FC tool. Multilingual embeddings (LASER) were used to align segments. Segment pairs were then merged and filtered.
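As an illustration of how such files might be consumed, the sketch below reads translation units (TUs) from both layouts. It rests on assumptions that this description does not confirm: that a TSV file holds the source segment in the first column and the target segment in the second, and that a MOSES-like pair consists of two line-aligned plain-text files. The function names read_tsv and read_moses_pair are hypothetical and are not part of the dataset or of ILSP-FC.

```python
import io
from itertools import zip_longest


def read_tsv(lines):
    """Return (source, target) pairs from tab-separated lines.

    Assumption: column 1 is the source segment, column 2 the target.
    Lines with fewer than two columns are skipped.
    """
    pairs = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2:
            pairs.append((fields[0], fields[1]))
    return pairs


def read_moses_pair(src_lines, tgt_lines):
    """Return (source, target) pairs from two line-aligned inputs.

    Assumption: line n of the source file corresponds to line n of the
    target file. A length mismatch raises ValueError instead of
    silently truncating the longer file.
    """
    pairs = []
    for n, (src, tgt) in enumerate(zip_longest(src_lines, tgt_lines), start=1):
        if src is None or tgt is None:
            raise ValueError(f"inputs differ in length at line {n}")
        pairs.append((src.rstrip("\n"), tgt.rstrip("\n")))
    return pairs


# Small in-memory demonstration; real use would pass open file objects,
# e.g. open(path, encoding="utf-8").
tsv_demo = io.StringIO("Wash your hands.\tLavez-vous les mains.\n")
print(read_tsv(tsv_demo))
```

Both functions accept any iterable of lines, so the same code works on open files and on in-memory test data.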
DSI Relevance: eHealth
People who looked at this resource also viewed the following:
- FSPO Ombudsman's Digest of Decisions Volume 2
- COVID-19 Line 1177 of Sweden dataset v1. Multilingual (EN, BG, DE, ES, FI, FR, PL, RO, RU, SV, TR)
- COVID-19 OSHA-EUROPA dataset v1. Multilingual (CEF languages plus IS and NB but not Irish)
- Multilingual corpus in HEALTH (COVID-19) domain part_1a (v.1.05) in TMX format.
People who downloaded this resource also downloaded the following:
- Web-acquired data related to health/covid-19 (Part I). Multilingual (BG, CS, DA, DE, EL, EN, ET, ES, FI, FR, GA, HR, HU, IS, IT, LT, LV, MK, MT, NL, NB, NN, NO, PL, PT, RO, SK, SL, SQ, SV) collection of files in TMX format.
- Multilingual corpus from the Publications Office of the EU on the medical domain
- COVID-19 OSHA-EUROPA dataset v1. Multilingual (CEF languages plus IS and NB but not Irish)
- Web-acquired data related to health/covid-19 (Part I). Multilingual (BG, CS, DA, DE, EL, EN, ET, ES, FI, FR, GA, HR, HU, IS, IT, LT, LV, MK, MT, NL, NB, NN, NO, PL, PT, RO, SK, SL, SQ, SV) collection of files in Moses-like format.