hrenWaC 2.0 Croatian-English Parallel Corpus

Last update: 2022-11-17

Resource name in Croatian: hrenWaC 2.0 Hrvatsko-engleski paralelni korpus

Attribution details: hrenWaC 2.0 Croatian-English Parallel Corpus by Nikola Ljubešić available for use of DGT for eTranslation development with permission from corpus author.

The hrenWaC 2.0 Croatian-English Parallel Corpus contains general-domain documents totaling 1,554,912 sentence pairs. The texts were crawled from .hr, the top-level domain for Croatia. The corpus was built with Spidextor (https://github.com/abumatran/spidextor); the segment-level accuracy of the extracted bitext is around 80%. A manual content and alignment check was performed on a sample. The corpus consists of 6,228 TMX files. The data are contributed exclusively for use by DGT for eTranslation development.
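Since the corpus is distributed as TMX files, the following is a minimal sketch of how sentence pairs might be read from one such file. It assumes standard TMX structure (`tu` units containing `tuv` variants with a `seg` each) and the language codes `hr` and `en`; the function name `read_tmx_pairs` and the inline sample are hypothetical and are not taken from the corpus itself.

```python
import io
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under this qualified name.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def read_tmx_pairs(source, src_lang="hr", tgt_lang="en"):
    """Yield (source, target) segment pairs from a TMX file path or binary file object.

    The language codes are assumptions; check the files for the codes actually used.
    """
    for _, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "tu":
            continue
        segs = {}
        for tuv in elem.iter("tuv"):
            # TMX 1.4 uses xml:lang; older files may use a plain lang attribute.
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            seg = tuv.find("seg")
            if seg is not None:
                # Reduce codes such as "hr-HR" to "hr" and flatten inline markup.
                segs[lang.split("-")[0]] = "".join(seg.itertext()).strip()
        if src_lang in segs and tgt_lang in segs:
            yield segs[src_lang], segs[tgt_lang]
        elem.clear()  # free memory for already processed translation units


# Hypothetical two-segment sample standing in for one of the TMX files.
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<tmx version="1.4">
  <header/>
  <body>
    <tu>
      <tuv xml:lang="hr"><seg>Dobar dan.</seg></tuv>
      <tuv xml:lang="en"><seg>Good day.</seg></tuv>
    </tu>
  </body>
</tmx>""".encode("utf-8")

pairs = list(read_tmx_pairs(io.BytesIO(SAMPLE)))
print(pairs)  # [('Dobar dan.', 'Good day.')]
```

Because the reported segment-level accuracy is around 80%, pairs read this way would likely benefit from additional filtering before downstream use.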

Description (translated from Croatian): The general-domain hrenWaC 2.0 Croatian-English Parallel Corpus contains 1,554,912 sentence pairs. The corpus consists of texts automatically collected from the Croatian top-level domain .hr. The corpus was developed with the Spidextor tool (https://github.com/abumatran/spidextor), with an accuracy of 80% for bitext extraction at the segment level. A manual check of content and alignment was performed on a sample. It contains 6,228 files in TMX format. The data are used exclusively by DGT for the development of the eTranslation system.

Distribution

Availability: Available

Licences

Non-standard / Other Licence / Terms

Distribution Details

Attribution Details: hrenWaC 2.0 Croatian-English Parallel Corpus by Nikola Ljubešić available for use of DGT for eTranslation development with permission from corpus author.

IPR Holders

Nikola Ljubešić