CURLICAT Slovenian corpus

This is the Slovenian-language subcorpus of the collection of curated and analysed language data compiled by the CURLICAT project. It consists of over 2 million sentences (43.5 million tokens), linguistically analysed and enriched with IATE and domain-specific terminology extracted from the subcorpus. In terms of sources, the corpus is dominated by longer texts from books and news media, covering the following domains: culture, economics, education, finance, health and politics. For more information, see the delivery reports D1.1 and D2 on the CURLICAT website (