The Education Act Parallel Corpus

The Education Act states that all textbooks in subjects other than Norwegian must be made available in both written languages (Norwegian Bokmål and Norwegian Nynorsk). This corpus consists of translation units from these textbooks, extracted from the national repository at the National Library of Norway.

The corpus has gone through extensive (pre)processing. Document alignment is based on the edit distance between several metadata fields. The books in each language are then tokenized and sentence-aligned with the sentence aligner Hunalign. The final translation units are then cleaned in the Iconic Translation Machines pipeline and used to train a baseline translation engine.
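
To illustrate the document alignment step, here is a minimal Python sketch of pairing books by edit distance over metadata. It is not the code used to build the corpus. The field names (title, author, publisher), the length normalization, the greedy matching strategy and the max_distance threshold are all assumptions made for the example, since the actual metadata fields and matching rules are not specified here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BookRecord:
    # Hypothetical metadata fields; the real repository schema may differ.
    book_id: str
    title: str
    author: str
    publisher: str


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def metadata_distance(x: BookRecord, y: BookRecord) -> float:
    """Mean length-normalized edit distance over the assumed fields (0.0 to 1.0)."""
    fields = ("title", "author", "publisher")
    total = 0.0
    for name in fields:
        u = getattr(x, name).lower()
        v = getattr(y, name).lower()
        longest = max(len(u), len(v)) or 1  # avoid division by zero
        total += levenshtein(u, v) / longest
    return total / len(fields)


def align_documents(bokmaal, nynorsk, max_distance=0.35):
    """Greedily pair each Bokmål book with its closest unused Nynorsk book.

    max_distance is an assumed cut-off; pairs above it are left unaligned.
    """
    pairs = []
    unused = list(nynorsk)
    for nb in bokmaal:
        if not unused:
            break
        best = min(unused, key=lambda nn: metadata_distance(nb, nn))
        if metadata_distance(nb, best) <= max_distance:
            pairs.append((nb.book_id, best.book_id))
            unused.remove(best)
    return pairs


# Invented records, for illustration only.
bokmaal_books = [
    BookRecord("nb1", "Matematikk 8 Grunnbok", "Ola Hansen", "Forlaget"),
    BookRecord("nb2", "Naturfag 9", "Kari Berg", "Forlaget"),
]
nynorsk_books = [
    BookRecord("nn1", "Naturfag 9", "Kari Berg", "Forlaget"),
    BookRecord("nn2", "Matematikk 8 Grunnbok nynorsk", "Ola Hansen", "Forlaget"),
]
aligned = align_documents(bokmaal_books, nynorsk_books)
```

Greedy matching keeps the sketch short; with many near-identical titles (for example, volumes in a series), a global assignment over all candidate pairs would likely be more robust.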