SoNaR Corpus (processed)

SoNaR corpus, version 1.2.1, Dutch, Taalunie, 2015, as well as the total content of the downloaded or supplied files, including but not limited to (i) any supplied or Product part software or computer information and (ii) related written materials or files for explanation;

The SoNaR Corpus 1.2.1 contains the final results of the STEVIN project SoNaR. The project resulted in two datasets: SoNaR-500 and SoNaR-1.

SoNaR-500 contains over 500 million words (i.e. word tokens) of full texts from a wide variety of text types. All texts have been tokenized, tagged for part of speech, and lemmatized, and the named entities in the same set have been labelled. All SoNaR-500 annotations were produced automatically; no manual verification took place.

The processed version of SoNaR-500 contains monolingual Dutch texts in 22 domains, extracted from the SoNaR data.