hrenWaC 2.0 Croatian-English Parallel Corpus

hrenWaC 2.0 Hrvatsko-engleski paralelni korpus

hrenWaC 2.0 Croatian-English Parallel Corpus by Nikola Ljubešić, made available to DGT for eTranslation development with the permission of the corpus author.

The hrenWaC 2.0 Croatian-English Parallel Corpus contains general-domain documents totaling 1,554,912 sentence pairs. The texts were crawled from .hr, the top-level domain for Croatia. The corpus was built with Spidextor (https://github.com/abumatran/spidextor); the segment-level accuracy of the extracted bitext is around 80%. A manual content and alignment check was performed on a sample. The corpus consists of 6,228 TMX files. The data are contributed exclusively for use by DGT for eTranslation development.
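Since the corpus is distributed as TMX files, the following is a minimal sketch of how the sentence pairs might be read and counted with the Python standard library. It assumes the files follow the usual TMX layout (`tu` / `tuv` / `seg` elements) and that the language codes are `hr` and `en`; neither detail is stated in this description, and the `count_pairs` helper and the sample data are purely illustrative.

```python
import io
import xml.etree.ElementTree as ET
from pathlib import Path

# Fully qualified name of the xml:lang attribute as ElementTree exposes it.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def read_tmx_pairs(source, src_lang="hr", tgt_lang="en"):
    """Yield (source, target) segment pairs from a TMX path or file object.

    Assumption: language codes are "hr" and "en" (region suffixes ignored).
    """
    tree = ET.parse(source)
    for tu in tree.getroot().iter("tu"):
        segs = {}
        for tuv in tu.iter("tuv"):
            # TMX 1.4 uses xml:lang; older files may use a plain "lang".
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            seg = tuv.find("seg")
            if seg is not None:
                # itertext() also collects text nested in inline markup.
                segs[lang.split("-")[0]] = "".join(seg.itertext()).strip()
        # Skip translation units that lack either language.
        if src_lang in segs and tgt_lang in segs:
            yield segs[src_lang], segs[tgt_lang]


def count_pairs(directory):
    """Count sentence pairs across all *.tmx files in a (hypothetical) directory."""
    return sum(
        sum(1 for _ in read_tmx_pairs(path))
        for path in sorted(Path(directory).glob("*.tmx"))
    )


# Tiny made-up sample, not taken from the corpus.
SAMPLE_TMX = b"""<?xml version="1.0" encoding="UTF-8"?>
<tmx version="1.4">
  <header srclang="hr"/>
  <body>
    <tu>
      <tuv xml:lang="hr"><seg>Dobar dan.</seg></tuv>
      <tuv xml:lang="en"><seg>Good day.</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="hr"><seg>Bez prijevoda.</seg></tuv>
    </tu>
  </body>
</tmx>
"""

if __name__ == "__main__":
    for hr, en in read_tmx_pairs(io.BytesIO(SAMPLE_TMX)):
        print(hr, "|", en)
```

Because the reported segment-level accuracy is around 80%, a downstream user may want to apply additional filtering to the pairs this yields; the sketch performs none.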

The general-domain hrenWaC 2.0 Croatian-English Parallel Corpus contains 1,554,912 sentence pairs. The corpus consists of texts automatically collected from the Croatian top-level domain .hr. It was developed with the Spidextor tool (https://github.com/abumatran/spidextor), with an accuracy of 80% for bitext extraction at the segment level. A manual check of content and alignment was carried out on a sample. It contains 6,228 files in TMX format. The data are used exclusively by DGT for the development of the eTranslation system.