General English Crawling

General English Crawling is a 93-million-token English corpus for the general domain, built from the web by targeting a wide and diverse set of URLs. It consists of 93,445,485 tokens, 5,917,753 sentences, and 358,900 documents.
The corpus has been developed in the framework of the CEF project MT4ALL (
We license the actual packaging of this data under a CC0 1.0 Universal License.
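
The size figures given above can be turned into per-sentence and per-document averages, which may help when deciding how to batch or sample the corpus. The following is a minimal sketch that uses only the three counts stated in the description; the function name `corpus_averages` and the dictionary keys are illustrative assumptions, not part of the corpus distribution.

```python
# Counts as stated in the corpus description.
TOKENS = 93_445_485
SENTENCES = 5_917_753
DOCUMENTS = 358_900


def corpus_averages(tokens: int, sentences: int, documents: int) -> dict:
    """Return simple average sizes derived from the raw counts.

    The function name and the keys are hypothetical, chosen for illustration.
    """
    return {
        "tokens_per_sentence": tokens / sentences,
        "sentences_per_document": sentences / documents,
        "tokens_per_document": tokens / documents,
    }


averages = corpus_averages(TOKENS, SENTENCES, DOCUMENTS)
for name, value in averages.items():
    print(f"{name}: {value:.1f}")
```

From these counts, the averages come out at roughly 16 tokens per sentence, 16 sentences per document, and 260 tokens per document. They are plain ratios of the published totals and say nothing about how lengths are distributed across documents.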