Bilingual Bulgarian-English corpus from the National Revenue Agency (BG) (Processed)

134 Last view: 2025-09-30

2 Last update: 2019-01-22

49 Last download: 2025-04-07

http://www.nap.bg/en/

Bilingual Bulgarian-English corpus of administrative documents on the Refund of Value Added Tax from the Bulgarian National Revenue Agency.

DSI Relevance: OnlineDisputeResolution

Distribution

Availability: Available

Licences

Public Domain

The resource is free of all known legal restrictions.

Distribution Details

IPR Holders

National Revenue Agency (BG)

Contact Person

Annie Rusinova

text

Bilingual text corpus

Languages

English (en)

Bulgarian (bg)

Linguality

Linguality type: Bilingual

Multi-linguality type: Parallel

Text Format

TMX

Size

1,292 Translation Units

Character encoding

UTF-8
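Since the corpus is distributed as a UTF-8 TMX file, its translation units can be read with a standard XML parser. The following is a minimal sketch, not part of the resource itself: the function name `parse_tmx`, the sample sentence pair, and the language codes `en`/`bg` used in the `xml:lang` attributes are assumptions for illustration, and the actual file may use different codes (for example `eng`/`bul`).

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def parse_tmx(xml_bytes):
    """Return one dict per <tu>, mapping a language code to its segment text."""
    root = ET.fromstring(xml_bytes)
    units = []
    for tu in root.iter("tu"):
        pair = {}
        for tuv in tu.findall("tuv"):
            # Older TMX versions use a plain "lang" attribute instead of xml:lang.
            lang = tuv.get(XML_LANG) or tuv.get("lang", "")
            seg = tuv.find("seg")
            text = "".join(seg.itertext()).strip() if seg is not None else ""
            pair[lang.lower()] = text
        units.append(pair)
    return units


# Made-up sample for illustration only; it is not taken from the corpus.
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<tmx version="1.4">
  <header srclang="en" datatype="plaintext" segtype="sentence"
          adminlang="en" creationtool="example" creationtoolversion="0"
          o-tmf="example"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Refund of value added tax</seg></tuv>
      <tuv xml:lang="bg"><seg>Възстановяване на ДДС</seg></tuv>
    </tu>
  </body>
</tmx>"""

units = parse_tmx(SAMPLE.encode("utf-8"))
print(len(units), units[0]["en"])
```

For a file on disk, reading it in binary mode and passing the bytes to `parse_tmx` lets the parser honour the encoding declared in the XML header.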

Domains

FINANCE Taxation (Eurovoc 2446)

Eurovoc Classification

Text type: Administrative Texts

Text genre: Official

Creation

Creation mode details: Bilingual Bulgarian-English corpus of administrative documents on the Refund of Value Added Tax from the Bulgarian National Revenue Agency. It was offered as a collection of documents by the Bulgarian National Revenue Agency. Modules of the ILSP Focused Crawler were used for the normalization, cleaning, (near) de-duplication and identification of parallel documents. The Maligna sentence aligner was used for extracting segment alignments from crawled parallel documents. As a post-processing step, alignments were merged into one TMX file.

The following filters were applied:
TMX files generated from document pairs which have been identified by non-aupidh methods were discarded;
TMX files with a zeroToOne_alignments/total_alignments ratio larger than 0.16 were discarded;
Alignments of non-[1:1] type(s) were discarded;
Alignments with a TUV (after normalization) that has fewer than 1 token were annotated;
Alignments with an l1/l2 TUV length ratio smaller than 0.6 or larger than 1.6 were annotated;
Alignments in which different digits appear in each TUV were kept and annotated;
Alignments with identical TUVs (after normalization) were annotated;
Alignments with only non-letters in at least one of their TUVs were annotated;
Duplicate alignments were kept and annotated.

The mean value of aligner's scores is 5.714609036504669, the std value is 1.8063256236105307. The mean value of length (in terms of characters) ratios is 1.0040012545201242 and the std value is 0.26545877788005745.

There are 832 TUs with no annotation, containing 13336 words and 2604 lexical types in bul and 15010 words and 2031 lexical types in eng. The mean value of aligner's scores is 6.336834960545485, the std value is 1.53829791384023
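The filters described above can be illustrated with a short sketch. This is not the ILSP Focused Crawler implementation: the function names, the annotation labels, the whitespace-only normalization, the character-based length ratio and the digit comparison are all assumptions made for illustration. Only the thresholds (0.16, 0.6, 1.6) come from the description.

```python
import re


def normalize(text):
    # Assumption: normalization is approximated by collapsing whitespace.
    return " ".join(text.split())


def keep_file(zero_to_one, total, max_ratio=0.16):
    """File-level filter: discard when the 0:1 alignment ratio exceeds 0.16."""
    return total > 0 and zero_to_one / total <= max_ratio


def annotate_pair(l1, l2, min_ratio=0.6, max_ratio=1.6):
    """Return hypothetical annotation labels for one 1:1 alignment."""
    a, b = normalize(l1), normalize(l2)
    notes = []
    if not a.split() or not b.split():
        notes.append("short")  # a TUV with fewer than 1 token
    if a and b:
        ratio = len(a) / len(b)  # assumption: lengths measured in characters
        if ratio < min_ratio or ratio > max_ratio:
            notes.append("length-ratio")
    # Assumption: "different digits" means the digit characters do not match.
    if sorted(re.findall(r"\d", a)) != sorted(re.findall(r"\d", b)):
        notes.append("digits")
    if a == b:
        notes.append("identical")
    if not any(c.isalpha() for c in a) or not any(c.isalpha() for c in b):
        notes.append("non-letters")
    return notes


def annotate_all(pairs):
    """Annotate every pair; repeated pairs are kept and marked as duplicates."""
    seen = set()
    result = []
    for l1, l2 in pairs:
        key = (normalize(l1), normalize(l2))
        notes = annotate_pair(l1, l2)
        if key in seen:
            notes.append("duplicate")
        seen.add(key)
        result.append((l1, l2, notes))
    return result


if __name__ == "__main__":
    for row in annotate_all([("tax refund", "tax return"), ("12", "13")]):
        print(row)
```

Note that, as in the description, most of these checks annotate alignments rather than remove them, so a consumer of the TMX file can decide which annotated units to drop.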

Creation mode: Mixed

Original Sources

Resource Creation

Created using ELRC Services

Funding Project

European Language Resource Coordination LOT3 (ELRC Data - Tools and Resources for CEF Automated Translation-LOT3 (SMART 2015/1091-30-CE-0816766/00-92))

URL: http://www.lr-coordi...

Funding Type: Service Contract

Funder: European Commission

Funding Country: European Union (EU)

Project duration: 13/12/2016 - 12/02/2020

Metadata

Created: 03/10/2017

Last Updated: 03/10/2017

Metadata Language: English (en)

Metadata Creator

Vassilis Papavassiliou