Dutch eJustice Websites (nl-en)

1 Last update: 2020-02-11

Parallel (en-nl) corpus of 23849 translation units.

DSI Relevance: eJustice

Distribution

Availability: Available

Licences

Under Review

Distribution Details

Contact Person

Fraser Bowen

text

Bilingual text corpus

Languages

Dutch; Flemish (nl)

English (en)

Linguality

Linguality type: Bilingual

Multi-linguality type: Parallel

Text Format

TMX

Size

23,849 Translation Units

Character encoding

UTF-8
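As an illustration of how a sentence-aligned, UTF-8 TMX file such as this one might be consumed, the following is a minimal sketch using only the Python standard library. The function name `read_tmx_pairs` and the inline sample are hypothetical and are not part of the resource; the sketch assumes each translation unit carries one `tuv` per language, labelled with `xml:lang`.

```python
import io
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def read_tmx_pairs(source, src_lang="en", tgt_lang="nl"):
    """Yield (source, target) segment pairs from a TMX path or binary file object.

    Hypothetical helper: language codes are compared on their primary
    subtag only, so "en-GB" is treated as "en".
    """
    for _, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "tu":
            continue
        segs = {}
        for tuv in elem.iter("tuv"):
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower().split("-")[0]
            seg = tuv.find("seg")
            if seg is not None:
                # itertext() also collects text nested inside inline markup.
                segs[lang] = "".join(seg.itertext()).strip()
        if src_lang in segs and tgt_lang in segs:
            yield segs[src_lang], segs[tgt_lang]
        elem.clear()  # release memory for translation units already processed


# Made-up sample data, for demonstration only.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<tmx version="1.4">
  <header creationtool="sample" creationtoolversion="1" segtype="sentence"
          o-tmf="none" adminlang="en" srclang="en" datatype="plaintext"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>The court is open.</seg></tuv>
      <tuv xml:lang="nl"><seg>De rechtbank is open.</seg></tuv>
    </tu>
  </body>
</tmx>"""

pairs = list(read_tmx_pairs(io.BytesIO(SAMPLE)))
```

Streaming with `iterparse` keeps memory use flat, which may matter for files with tens of thousands of translation units.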

Annotation

Alignment

StandOff: False

Segmentation level: Sentence

Standard practices conformance: TMX

Annotation Mode: Automatic

Annotation Tools:

ILSP-FC alignment and TMX filtering module

Creation

Creation mode details: The ILSP Focused Crawler was used to acquire bilingual data from multilingual websites and to normalize, clean, (near-)de-duplicate, and identify parallel documents. The Maligna sentence aligner was used to extract segment alignments from the crawled parallel documents. As a post-processing step, the alignments were merged into one TMX file.

The following filters were applied:

TMX files generated from document pairs that were identified by non-aupdih methods were discarded.

TMX files with a zeroToOne_alignments/total_alignments ratio larger than 0.16 were discarded.

Alignments of non-[1:1] type(s) were discarded.

Alignments with a TUV (after normalization) that has fewer than 2 tokens were discarded/annotated.

Alignments with an l1/l2 TUV length ratio smaller than 0.6 or larger than 1.6 were discarded/annotated.

Alignments in which different digits appear in each TUV were discarded/annotated.

Alignments with identical TUVs (after normalization) were removed.

Alignments with only non-letters in at least one of their TUVs were removed.

Duplicate alignments were discarded.

There are 77334 TUs with no annotation, containing 1714869 words and 34505 lexical types in en and 1715488 words and 55759 lexical types in nl.
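The alignment-level filters described above can be sketched in Python as follows. This is not the ILSP-FC implementation. It is a minimal illustration under stated assumptions: normalization is taken to mean lowercasing and whitespace collapsing, TUV length is measured in characters, digits are compared as multisets, and pairs that the original pipeline "discarded/annotated" are simply discarded. All function names and the sample pairs are hypothetical.

```python
import re


def normalize(text):
    # Assumption: lowercase and collapse whitespace; the real normalization may differ.
    return " ".join(text.lower().split())


def digits(text):
    # Sorted list of digit characters, so comparison ignores their order.
    return sorted(re.findall(r"\d", text))


def has_letter(text):
    return any(ch.isalpha() for ch in text)


def keep_pair(l1, l2, min_tokens=2, min_ratio=0.6, max_ratio=1.6):
    """Return True if an (l1, l2) segment pair passes every per-pair filter."""
    n1, n2 = normalize(l1), normalize(l2)
    if not n1 or not n2:
        return False
    if len(n1.split()) < min_tokens or len(n2.split()) < min_tokens:
        return False  # a TUV with fewer than 2 tokens
    ratio = len(n1) / len(n2)  # assumption: length in characters
    if ratio < min_ratio or ratio > max_ratio:
        return False  # l1/l2 length ratio outside [0.6, 1.6]
    if digits(n1) != digits(n2):
        return False  # different digits in each TUV
    if n1 == n2:
        return False  # identical TUVs after normalization
    if not has_letter(n1) or not has_letter(n2):
        return False  # only non-letters in at least one TUV
    return True


def filter_pairs(pairs):
    """Apply the per-pair filters and drop duplicate alignments."""
    seen = set()
    kept = []
    for l1, l2 in pairs:
        key = (normalize(l1), normalize(l2))
        if key in seen or not keep_pair(l1, l2):
            continue
        seen.add(key)
        kept.append((l1, l2))
    return kept


# Made-up sample pairs, for demonstration only.
SAMPLE_PAIRS = [
    ("The court is open on Monday.", "De rechtbank is op maandag open."),
    ("Article 12 applies.", "Artikel 13 is van toepassing."),  # digits differ
    ("Yes", "Ja"),  # too few tokens
    ("The court is open on Monday.", "De rechtbank is op maandag open."),  # duplicate
    ("See the annex.", "Zie de bijlage voor alle verdere informatie en details."),  # ratio
]

kept = filter_pairs(SAMPLE_PAIRS)
```

The document-level filters (pair-detection method and the zeroToOne_alignments/total_alignments threshold of 0.16) operate on whole TMX files and are not covered by this sketch.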

Creation mode: Automatic

Creation Tools

http://nlp.ilsp.gr/r...

Resource Creation

Resource Creator

Deutsches Forschungszentrum für Künstliche Intelligenz

Creation ended: 07/10/2016

Created using ELRC Services

Funding Project

European Language Resource Coordination LOT3 (ELRC Data - Tools and Resources for CEF Automated Translation-LOT3 (SMART 2015/1091-30-CE-0816766/00-92))

URL: http://www.lr-coordi...

Funding Type: Service Contract

Funder: European Commission

Funding Country: European Union (EU)

Project duration: 13/12/2016 - 12/02/2020

Metadata

Created: 17/10/2016

Last Updated: 17/10/2016

Metadata Language: English (en)

Metadata Creator

Fraser Bowen
