General German Crawling

General German Crawling is a 67-million-token German corpus for the general domain, built by crawling a wide and diverse set of URLs. It consists of 67,637,441 tokens, 4,880,000 sentences and 344,712 documents.
Documents are separated by single newlines.
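Given that layout, reading the corpus could look roughly like the sketch below. It assumes that each non-empty line holds exactly one document and that tokens can be approximated by whitespace splitting; neither the tokenizer behind the reported counts nor the corpus file name is stated here, so the helper names (`iter_documents`, `corpus_stats`) and the path in the comment are hypothetical, and the token count will likely differ from the official figure.

```python
import io
from typing import Iterable, Iterator, Tuple


def iter_documents(lines: Iterable[str]) -> Iterator[str]:
    """Yield one document per non-empty line (assumed layout)."""
    for line in lines:
        doc = line.strip()
        if doc:
            yield doc


def corpus_stats(lines: Iterable[str]) -> Tuple[int, int]:
    """Return (document count, whitespace-token count).

    Whitespace splitting is only an approximation of the
    tokenization used for the reported corpus statistics.
    """
    n_docs = 0
    n_tokens = 0
    for doc in iter_documents(lines):
        n_docs += 1
        n_tokens += len(doc.split())
    return n_docs, n_tokens


# Small in-memory sample standing in for the real corpus file.
sample = io.StringIO("Das ist das erste Dokument.\nHier steht das zweite.\n")
print(corpus_stats(sample))  # prints (2, 9)

# For the real data, stream the file instead of loading it whole
# (hypothetical path):
# with open("general_german_crawling.txt", encoding="utf-8") as f:
#     print(corpus_stats(f))
```

Streaming line by line keeps memory use flat, which matters for a corpus of this size.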
The corpus has been developed in the framework of the CEF project MT4ALL (
We license the actual packaging of this data under a CC0 1.0 Universal License.