Creation mode details: The ILSP Focused Crawler was used to acquire bilingual data from multilingual websites and to normalize, clean, and deduplicate the data and identify parallel documents. The Maligna sentence aligner was used to extract segment alignments from the crawled parallel documents. As a post-processing step, the alignments were merged into one TMX file. The following filters were applied: TMX files generated from document pairs that were identified by non-aupdih methods were discarded. TMX files with a zeroToOne_alignments/total_alignments ratio larger than 0.15 were discarded. Alignments of a type other than [1:1] were discarded. Alignments with a TUV that (after normalization) has fewer than 0 tokens were discarded. Alignments with a TUV length ratio less than 0.6 or more than 1.6 were discarded. Alignments in which different digits appear in each TUV were discarded. Alignments with identical TUVs were discarded. Duplicate alignments were discarded.
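The filters listed above can be illustrated with a minimal Python sketch. It is not the ILSP Focused Crawler implementation: the names Alignment, keep_tmx, keep_alignment and filter_alignments are hypothetical, and several details are assumptions, namely that the length ratio is source characters divided by target characters, that tokens are whitespace-separated, and that "different digits" means the sorted digit characters of the two TUVs differ. The filter on document-pair identification methods is not shown.

```python
import re
from dataclasses import dataclass

# Thresholds taken from the description above.
MAX_ZERO_TO_ONE_RATIO = 0.15
MIN_TOKENS = 0
MIN_LENGTH_RATIO, MAX_LENGTH_RATIO = 0.6, 1.6


@dataclass(frozen=True)
class Alignment:
    """Hypothetical container for one aligned segment pair (two TUVs)."""
    source: str
    target: str
    kind: str = "1:1"


def keep_tmx(zero_to_one_alignments: int, total_alignments: int) -> bool:
    """Document-level filter: drop files with too many 0:1 alignments."""
    if total_alignments == 0:
        return False  # assumption: an empty file is not worth keeping
    return zero_to_one_alignments / total_alignments <= MAX_ZERO_TO_ONE_RATIO


def digits(text: str) -> list:
    """Sorted digit characters of a segment (assumed comparison basis)."""
    return sorted(re.findall(r"\d", text))


def keep_alignment(a: Alignment) -> bool:
    """Segment-level filters, applied in the order they are listed."""
    if a.kind != "1:1":
        return False
    src, tgt = a.source.strip(), a.target.strip()
    if len(src.split()) < MIN_TOKENS or len(tgt.split()) < MIN_TOKENS:
        return False
    if not src or not tgt:
        return False  # guard added for this sketch; avoids division by zero
    ratio = len(src) / len(tgt)  # assumption: character-length ratio
    if ratio < MIN_LENGTH_RATIO or ratio > MAX_LENGTH_RATIO:
        return False
    if digits(src) != digits(tgt):
        return False
    if src == tgt:
        return False
    return True


def filter_alignments(alignments):
    """Apply the segment filters and drop duplicate pairs, keeping order."""
    seen = set()
    kept = []
    for a in alignments:
        key = (a.source.strip(), a.target.strip())
        if keep_alignment(a) and key not in seen:
            seen.add(key)
            kept.append(a)
    return kept
```

With the stated minimum of 0 tokens, the token-count check never rejects a pair in this sketch; it is kept only to mirror the listed filter.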
European Language Resource Coordination LOT3 (ELRC Data - Tools and Resources for CEF Automated Translation-LOT3 (SMART 2015/1091-30-CE-0816766/00-92))