General Norwegian Crawling

The General Norwegian Crawling corpus is a 43-million-token Norwegian corpus for the general domain, built from the web by crawling a wide and diverse set of URLs. It consists of 43,424,915 tokens, 2,692,915 sentences, and 108,470 documents.
Documents are separated by single newlines.
The corpus was developed within the CEF project MT4ALL (http://ixa2.si.ehu.eus/mt4all/project).
We license the actual packaging of this data under a CC0 1.0 Universal License.
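
Since documents are separated by single newlines, the corpus can be streamed line by line. The following is a minimal sketch only, not an official loader. It assumes that each non-empty line holds exactly one document and that the text is UTF-8 encoded. The function names are hypothetical, and tokens are counted by splitting on whitespace, which may not reproduce the token count reported above because the tokenizer used to build the corpus is not stated here.

```python
from typing import Iterable, Iterator, Tuple


def iter_documents(lines: Iterable[str]) -> Iterator[str]:
    """Yield one document per non-empty line (assumed layout)."""
    for line in lines:
        doc = line.strip()
        if doc:
            yield doc


def corpus_stats(lines: Iterable[str]) -> Tuple[int, int]:
    """Return (document count, whitespace-token count).

    Whitespace splitting is an assumption; the corpus's own
    tokenization may differ.
    """
    n_docs = 0
    n_tokens = 0
    for doc in iter_documents(lines):
        n_docs += 1
        n_tokens += len(doc.split())
    return n_docs, n_tokens
```

To use it, open the corpus file in text mode with encoding="utf-8" and pass the file object to corpus_stats. Because the file is read one line at a time, the whole corpus never has to fit in memory.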