SoNaR Corpus

Last update: 2019-09-26

The SoNaR Corpus 1.2.1 contains the final results of the STEVIN project SoNaR. The project produced two datasets: SoNaR-500 and SoNaR-1.

SoNaR-500 contains over 500 million words (i.e. word tokens) of full texts from a wide variety of text types. All texts have been tokenized, tagged for part of speech and lemmatized, and the named entities in the same set have been labelled. All SoNaR-500 annotations were produced automatically; no manual verification took place.

Distribution

Availability: Available

Licences

Non-standard/Other Licence/Terms

Conditions: Non Commercial Use

Distribution Details

Contact Person

Carole Tiberius

text

Monolingual text corpus

Languages

Dutch; Flemish (nl)

Linguality

Linguality type: Monolingual

Text Format

XML

Size

500,000,000 Tokens

Character encoding

UTF-8

Resource Creation

Funding Project

Connecting Europe Facility-European Language Resource Coordination (CEF-ELRC - LANGUAGE RESOURCE COORDINATION-SMART 2014/1074-30-CE-0696785/00-64)

URL: http://www.lr-coordi...

Funding Type: Service Contract

Funder: European Commission

Funding Country: European Union (EU)

Project duration: 29/03/2015 - 16/04/2017

Metadata

Created: 12/04/2017

Last Updated: 12/04/2017

Metadata Language: English (en)

Metadata Creator

Fraser Bowen

Relations

Relation Type: Has Converted Version
