Parallel corpus (Polish - English) from the website of the Polish Investment and Trade Agency (Processed)
This parallel corpus (Polish - English) was created for the European Language Resources Coordination Action (ELRC) (http://lr-coordination.eu/) by ELRC Consortium partner ILSP/R.C. "Athena" (https://www.athena-innovation.gr/) from the website of the Polish Investment and Trade Agency (https://www.paih.gov.pl).
Parallel (pl-en) corpus of 14736 translation units in the "BUSINESS AND COMPETITION" and "ECONOMICS" domains.
Creation mode details: The ILSP Focused Crawler was used to acquire bilingual data from multilingual websites and to normalize, clean, (near-)deduplicate, and identify parallel documents. The Maligna sentence aligner was used to extract segment alignments from the crawled parallel documents. As a post-processing step, the alignments were merged into one TMX file. The following filters were applied: TMX files generated from document pairs identified by non-aupdih methods were discarded; TMX files with a zeroToOne_alignments/total_alignments ratio larger than 0.16 were discarded; alignments of non-[1:1] type(s) were discarded; alignments with a TUV (after normalization) that has fewer than 3 tokens were discarded/annotated; alignments with an l1/l2 TUV length ratio smaller than 0.6 or larger than 1.6 were discarded/annotated; alignments in which different digits appear in each TUV were kept and annotated; alignments with identical TUVs (after normalization) were removed; alignments with only non-letters in at least one of their TUVs were removed; exact duplicate alignments were discarded. The mean value of the aligner's scores is 6.009806921990764 and the standard deviation is 0.9243181998035448. The mean value of the length ratios (in characters) is 1.0146147524328113 and the standard deviation is 0.19039534371430675. There are 14736 TUs with no annotation, containing 270941 words and 15428 lexical types in en and 222325 words and 31073 lexical types in pl.
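The alignment-level filters listed above can be illustrated with a minimal Python sketch. This is not the ILSP Focused Crawler's code. The function names, the annotation labels, the whitespace-and-lowercase normalization, the character-based length ratio, and the digit comparison by multiset are all assumptions made for illustration. Only the thresholds (3 tokens, 0.6 and 1.6, 0.16) come from the description.

```python
import re

# Thresholds taken from the corpus description.
MIN_TOKENS = 3
MIN_RATIO, MAX_RATIO = 0.6, 1.6
MAX_ZERO_TO_ONE = 0.16


def normalize(text):
    # Assumed normalization: collapse whitespace and lowercase.
    return " ".join(text.split()).lower()


def keep_file(zero_to_one_alignments, total_alignments):
    # File-level filter: discard a TMX file whose share of 0:1
    # alignments is larger than 0.16 (empty files are discarded too).
    if total_alignments == 0:
        return False
    return zero_to_one_alignments / total_alignments <= MAX_ZERO_TO_ONE


def annotate_pair(l1, l2):
    # Return the (hypothetical) annotation labels that apply to one
    # 1:1 alignment; an empty list means the pair passes every check.
    notes = []
    n1, n2 = normalize(l1), normalize(l2)
    if min(len(n1.split()), len(n2.split())) < MIN_TOKENS:
        notes.append("short")
    # Length ratio measured in characters (an assumption).
    ratio = len(n1) / len(n2) if n2 else float("inf")
    if ratio < MIN_RATIO or ratio > MAX_RATIO:
        notes.append("length-ratio")
    # Digits compared as multisets of single characters (an assumption).
    if sorted(re.findall(r"\d", l1)) != sorted(re.findall(r"\d", l2)):
        notes.append("digits")
    if n1 == n2:
        notes.append("identical")
    if not any(c.isalpha() for c in l1) or not any(c.isalpha() for c in l2):
        notes.append("non-letters")
    return notes
```

A caller could drop every pair for which `annotate_pair` returns a non-empty list, which would match the statement that all 14736 remaining TUs carry no annotation.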
European Language Resource Coordination LOT3 (ELRC Data - Tools and Resources for CEF Automated Translation-LOT3 (SMART 2015/1091-30-CE-0816766/00-92))