Multilingual corpus in HEALTH (COVID-19) domain part_1b (v.1.0) in TMX format.
This dataset was generated from public content available on several websites of national agencies (https://www.ecdc.europa.eu/en/COVID-19/national-sources) and on selected broadcast websites (Global Voices, Voxeurop, voltairenet, etc.).
The dataset contains 134 X-Y TMX files in which at least one of X and Y lies outside the set of CEF languages plus IS and NO (222,310 TUs in total). Data acquisition (from multilingual and bilingual websites), normalization, cleaning, deduplication and identification of parallel documents were carried out with the ILSP-FC tool. Multilingual embeddings (LASER) were used to align segments. Merging and filtering of segment pairs were also applied.
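Since the files are distributed in TMX, a minimal sketch of reading segment pairs from one of them may be useful. It uses only the Python standard library and relies on the generic TMX layout (`tu` elements holding `tuv` elements with an `xml:lang` attribute and a `seg` child). The function name `read_tmx_pairs`, the in-memory sample and the exact-match handling of language codes are assumptions made for illustration; they are not taken from the dataset or from the ILSP-FC tool.

```python
import io
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def read_tmx_pairs(source, src_lang, tgt_lang):
    """Yield (source_text, target_text) for each TU that holds both languages.

    `source` is a file path or a binary file object. Language codes are
    compared in lower case and must match exactly (an assumption; real
    files may use regional variants such as "en-GB").
    """
    for _, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "tu":
            continue
        segments = {}
        for tuv in elem.iter("tuv"):
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            seg = tuv.find("seg")
            if seg is not None:
                segments[lang] = "".join(seg.itertext()).strip()
        if src_lang in segments and tgt_lang in segments:
            yield segments[src_lang], segments[tgt_lang]
        elem.clear()  # free the processed TU to keep memory use low


# Hypothetical two-TU sample standing in for a real X-Y file.
SAMPLE = """<tmx version="1.4">
  <header srclang="en" datatype="plaintext" segtype="sentence"
          adminlang="en" o-tmf="none" creationtool="example"
          creationtoolversion="0"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Wash your hands.</seg></tuv>
      <tuv xml:lang="fr"><seg>Lavez-vous les mains.</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="en"><seg>Stay at home.</seg></tuv>
    </tu>
  </body>
</tmx>"""

pairs = list(read_tmx_pairs(io.BytesIO(SAMPLE.encode("utf-8")), "en", "fr"))
for src, tgt in pairs:
    print(src, "|", tgt)
```

Passing a file path instead of the `io.BytesIO` object reads a TMX file from disk in the same way. Streaming with `iterparse` avoids loading a whole file into memory.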
DSI Relevance: eHealth