50
Last view: 2024-11-20
3
Last update: 2020-02-19
12
Last download: 2023-10-20
Dutch Government Website
Parallel (en-nl) corpus of 6532 translation units.
Distribution
Availability:
Available
Licences
CC0-1.0
Distribution Details
Contact Person
Fraser Bowen
Deutsches Forschungszentrum für Künstliche Intelligenz
DFKI
Germany
http://www.dfki.de
text
Bilingual text corpus
Languages
Dutch; Flemish (nl)
Language Script:
Latin
English (en)
Language Script:
Latin
Linguality
Linguality type:
Bilingual
Multi-linguality type:
Parallel
Text Format
TMX
Size
6,532 Translation Units
Character encoding
UTF-8
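The record describes the corpus as a UTF-8 TMX file of English-Dutch translation units. A minimal sketch of reading such a file with Python's standard library follows. The sample document, the function name `read_units`, and the assumption that the language codes in the `xml:lang` attributes start with `en` and `nl` are illustrative and not taken from the resource itself.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Tiny hand-written TMX sample (hypothetical content, not from the corpus).
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<tmx version="1.4">
  <header creationtool="example" creationtoolversion="0" segtype="sentence"
          o-tmf="none" adminlang="en" srclang="en" datatype="plaintext"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Welcome to the website.</seg></tuv>
      <tuv xml:lang="nl"><seg>Welkom op de website.</seg></tuv>
    </tu>
  </body>
</tmx>"""


def read_units(root, src_lang="en", tgt_lang="nl"):
    """Return (source, target) text pairs for every <tu> that has both languages."""
    pairs = []
    for tu in root.iter("tu"):
        texts = {}
        for tuv in tu.iter("tuv"):
            # Reduce codes such as "en-GB" to their primary subtag "en".
            lang = tuv.get(XML_LANG, "").lower().split("-")[0]
            seg = tuv.find("seg")
            if seg is not None:
                # itertext() also collects text inside inline markup elements.
                texts[lang] = "".join(seg.itertext()).strip()
        if src_lang in texts and tgt_lang in texts:
            pairs.append((texts[src_lang], texts[tgt_lang]))
    return pairs


# Parse from bytes so the encoding declaration in the sample is honoured.
pairs = read_units(ET.fromstring(SAMPLE.encode("utf-8")))
```

For a file on disk, `ET.parse(path).getroot()` can be passed to `read_units` in place of the parsed sample.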
Annotation
Alignment
StandOff:
False
Segmentation level:
Sentence
Standard practices conformance:
TMX
Annotation Mode:
Automatic
Annotation Tools:
ILSP-FC alignment and TMX filtering module
Creation
Creation mode details:
The ILSP Focused Crawler was used to acquire bilingual data from multilingual websites and to normalize, clean, and deduplicate the data and identify parallel documents. The Maligna sentence aligner was used to extract segment alignments from the crawled parallel documents. As a post-processing step, the alignments were merged into one TMX file. The following filters were applied: TMX files generated from document pairs that were identified by non-aupdih methods were discarded; TMX files with a zeroToOne_alignments/total_alignments ratio larger than 0.15 were discarded; alignments of a type other than [1:1] were discarded; alignments with a TUV (after normalization) that has fewer than 0 tokens were discarded; alignments with a TUV length ratio less than 0.6 or more than 1.6 were discarded; alignments in which different digits appear in each TUV were discarded; alignments with identical TUVs were discarded; duplicate alignments were discarded.
Creation mode:
Automatic
Creation Tools
http://nlp.ilsp.gr/r...
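The filters listed under the creation mode details can be approximated as follows. This is a sketch under stated assumptions, not the ILSP-FC implementation: the `Alignment` class and both function names are hypothetical, tokens are counted by whitespace splitting, the length ratio is taken as source characters over target characters, "zero-to-one" alignments are assumed to cover both 0:1 and 1:0, and digits are compared as sorted lists of digit characters. The thresholds (0.15, 0 tokens, 0.6 and 1.6) are the ones stated in the record. The filter on the document pair detection method is not modelled.

```python
import re
from dataclasses import dataclass


@dataclass
class Alignment:
    kind: str    # alignment type such as "1:1", "0:1" or "2:1" (hypothetical field)
    source: str  # source-language TUV text
    target: str  # target-language TUV text


def keep_file(alignments, max_zero_to_one=0.15):
    """File-level filter: reject a file whose zero-to-one share exceeds the threshold."""
    if not alignments:
        return False
    zero_to_one = sum(1 for a in alignments if a.kind in ("0:1", "1:0"))
    return zero_to_one / len(alignments) <= max_zero_to_one


def filter_alignments(alignments, min_tokens=0, min_ratio=0.6, max_ratio=1.6):
    """Alignment-level filters, applied in the order given in the record."""
    seen = set()
    kept = []
    for a in alignments:
        src, tgt = a.source.strip(), a.target.strip()
        if a.kind != "1:1":
            continue  # only 1:1 alignments are kept
        if len(src.split()) < min_tokens or len(tgt.split()) < min_tokens:
            continue  # too few tokens in either TUV
        # Guard against an empty target: an infinite ratio fails the range check.
        ratio = len(src) / len(tgt) if tgt else float("inf")
        if ratio < min_ratio or ratio > max_ratio:
            continue  # lengths too dissimilar
        if sorted(re.findall(r"\d", src)) != sorted(re.findall(r"\d", tgt)):
            continue  # the two TUVs contain different digits
        if src == tgt:
            continue  # identical TUVs carry no translation
        if (src, tgt) in seen:
            continue  # duplicate alignment
        seen.add((src, tgt))
        kept.append(Alignment(a.kind, src, tgt))
    return kept


# Hypothetical example data, one case per filter.
examples = [
    Alignment("1:1", "The ministry has 3 offices.", "Het ministerie heeft 3 kantoren."),
    Alignment("1:1", "The ministry has 3 offices.", "Het ministerie heeft 3 kantoren."),
    Alignment("2:1", "First part. Second part.", "Eerste en tweede deel."),
    Alignment("1:1", "Contact", "Contact"),
    Alignment("1:1", "Open 9 to 5.", "Open van 9 tot 6."),
    Alignment("1:1", "Yes", "Ja, dat is inderdaad helemaal correct."),
]
kept = filter_alignments(examples)
```

With this example data only the first alignment is kept; each of the others is removed by a different filter.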
Resource Creation
Resource Creator
Deutsches Forschungszentrum für Künstliche Intelligenz
http://www.dfki.de
Multilinguale Technologien
Deutsches Forschungszentrum für Künstliche Intelligenz
DFKI
Germany (DE)
Creation ended:
16/05/2016
Created using ELRC Services
Funding Project
European Language Resource Coordination LOT3
(ELRC Data - Tools and Resources for CEF Automated Translation-LOT3 (SMART 2015/1091-30-CE-0816766/00-92))
URL:
http://www.lr-coordi...
Funding Type:
Service Contract
Funder:
European Commission
Funding Country:
European Union (EU)
Project duration:
13/12/2016 - 12/02/2020
Metadata
Created:
17/05/2016
Metadata Language:
English (en)