Newswire Basque Crawling

Newswire Basque Crawling is a corpus of roughly 106 million tokens of Basque, built from the web by crawling targeted in-domain URLs from newspapers. It consists of 106,857,055 tokens, 7,448,381 sentences, and 437,563 documents.
Documents are separated by single new lines.
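
As an illustration of how the document boundaries might be used when reading the corpus, here is a minimal Python sketch. It rests on an assumption that this description does not confirm: that "single new lines" means one empty line between documents, with the text of each document on the lines in between. The function names iter_documents and count_documents_and_lines are hypothetical, as is the file name in the usage note. If each document instead occupies exactly one line, iterating over the file line by line is enough.

```python
from typing import Iterable, Iterator, List, Tuple


def iter_documents(lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield each document as a list of its non-empty lines.

    Assumption (not confirmed by the corpus description): documents
    are separated by an empty line.
    """
    current: List[str] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if line.strip():
            # Non-empty line: part of the current document.
            current.append(line)
        elif current:
            # Empty line: close the current document, if any.
            yield current
            current = []
    if current:
        # Last document when the input does not end with an empty line.
        yield current


def count_documents_and_lines(lines: Iterable[str]) -> Tuple[int, int]:
    """Return (number of documents, number of non-empty lines)."""
    n_docs = 0
    n_lines = 0
    for doc in iter_documents(lines):
        n_docs += 1
        n_lines += len(doc)
    return n_docs, n_lines
```

To use it on a file, open the file with encoding="utf-8" and pass the file object to iter_documents, for example with a hypothetical path such as "corpus.txt". Reading lazily in this way avoids loading the whole corpus into memory. Comparing the resulting document count with the 437,563 documents reported above is one way to check whether the assumed separator is right.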
The corpus has been developed in the framework of the CEF project MT4ALL (
We license the actual packaging of this data under a CC0 1.0 Universal License.