Creation mode details: Bilingual Bulgarian-English corpus of administrative documents on the Refund of Value Added Tax from the Bulgarian National Revenue Agency. It was offered as a collection of documents by the Bulgarian National Revenue Agency. Modules of the ILSP Focused Crawler were used for normalization, cleaning, (near) de-duplication and identification of parallel documents. The Maligna sentence aligner was used to extract segment alignments from the crawled parallel documents. As a post-processing step, the alignments were merged into one TMX file. The following filters were applied: TMX files generated from document pairs identified by non-aupidh methods were discarded; TMX files with a zeroToOne_alignments/total_alignments ratio larger than 0.16 were discarded; alignments of non-[1:1] type(s) were discarded; alignments with a TUV (after normalization) of fewer than 1 token were annotated; alignments with an l1/l2 TUV length ratio smaller than 0.6 or larger than 1.6 were annotated; alignments in which different digits appear in each TUV were kept and annotated; alignments with identical TUVs (after normalization) were annotated; alignments with only non-letters in at least one of their TUVs were annotated; duplicate alignments were kept and annotated. The mean value of the aligner's scores is 5.714609036504669 and the std value is 1.8063256236105307. The mean value of the length ratios (in terms of characters) is 1.0040012545201242 and the std value is 0.26545877788005745. There are 832 TUs with no annotation, containing 13336 words and 2604 lexical types in bul and 15010 words and 2031 lexical types in eng. The mean value of aligner's scores is 6.336834960545485, the std value is 1.53829791384023
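Illustration of the filters above: a minimal Python sketch of how the per-alignment annotation rules and the file-level zeroToOne threshold might be checked. This is not the ILSP Focused Crawler code. The function names, the annotation labels, the whitespace-only normalization, the character-based length ratio and the reading of "different digits" as differing digit multisets are all assumptions made for the example.

```python
import re
from typing import List


def annotate_pair(l1: str, l2: str,
                  min_ratio: float = 0.6, max_ratio: float = 1.6) -> List[str]:
    """Return hypothetical annotation labels for one 1:1 alignment."""
    labels = []
    # Assumed normalization: collapse runs of whitespace only.
    a, b = " ".join(l1.split()), " ".join(l2.split())

    # A TUV with fewer than 1 token (i.e. empty after normalization).
    if not a.split() or not b.split():
        labels.append("short")

    # l1/l2 length ratio outside [0.6, 1.6]; length assumed to be in characters.
    if a and b:
        ratio = len(a) / len(b)
        if ratio < min_ratio or ratio > max_ratio:
            labels.append("length-ratio")

    # Different digits in each TUV (assumed: compare sorted digit lists).
    if sorted(re.findall(r"\d", a)) != sorted(re.findall(r"\d", b)):
        labels.append("digits")

    # Identical TUVs after normalization.
    if a == b:
        labels.append("identical")

    # Only non-letters in at least one TUV.
    if not any(ch.isalpha() for ch in a) or not any(ch.isalpha() for ch in b):
        labels.append("non-letters")

    return labels


def keep_file(zero_to_one: int, total: int, max_ratio: float = 0.16) -> bool:
    """Keep a TMX file only if zeroToOne/total alignments is at most 0.16."""
    return total > 0 and zero_to_one / total <= max_ratio
```

Under these assumptions, a TU counts as having "no annotation" when `annotate_pair` returns an empty list. Duplicate detection is left out because it needs the whole set of alignments rather than a single pair.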
Creation mode: Mixed
Original Sources
447
Resource Creation
Created using ELRC Services
Funding Project
European Language Resource Coordination LOT3 (ELRC Data - Tools and Resources for CEF Automated Translation-LOT3 (SMART 2015/1091-30-CE-0816766/00-92))