General Spanish Crawling

General Spanish Crawling is a general-domain Spanish corpus of roughly 16.7 million tokens, built from the web by targeting a wide and diverse set of URLs. It consists of 16,725,511 tokens, 895,644 sentences and 41,258 documents.
Documents are separated by a single newline.
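As a rough illustration of the document separator described above, the following Python sketch reads the corpus and yields one document per line. It assumes the corpus is a single UTF-8 plain-text file with exactly one document per line; the function names and the file name in the usage note are hypothetical and not part of the distributed package.

```python
from pathlib import Path
from typing import Iterator, Tuple


def iter_documents(path: str) -> Iterator[str]:
    """Yield one document per non-empty line (assumed layout)."""
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            text = line.strip()
            if text:  # skip blank lines, in case any are present
                yield text


def count_documents_and_tokens(path: str) -> Tuple[int, int]:
    """Count documents and whitespace-separated tokens.

    Whitespace splitting is only a rough proxy; the tokenizer used
    to compute the official corpus statistics is not known here.
    """
    n_docs = 0
    n_tokens = 0
    for doc in iter_documents(path):
        n_docs += 1
        n_tokens += len(doc.split())
    return n_docs, n_tokens
```

For example, count_documents_and_tokens("general_spanish_crawling.txt") (a hypothetical file name) would return a pair of counts. Because the tokenization here is a simple whitespace split, the token count may not match the published figure exactly.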
The corpus has been developed in the framework of the CEF project MT4ALL (
We license the actual packaging of this data under a CC0 1.0 Universal License.