MARCELL Slovenian legislative subcorpus v2

The Slovenian corpus contains 25 thousand documents (5 GB in size, 148 M tokens), dating from 1974 to 2020. The data was obtained from the Slovenian Open Data Portal. The source data is in JSON, with the individual documents embedded in HTML format. The text of the corpus was extracted from the HTML documents, tokenized with the Slovenian tokenizer Obeliks4j (Grcar et al., 2012), and lemmatized, tagged and dependency parsed with a fork of the StanfordNLP parser (Peng et al., 2018) trained on the ssj500k training corpus (Krek et al., 2017). Additional scripts were created to extract metadata and to annotate IATE terms and EuroVoc descriptors. The legislation is published in the Slovenian Open Data Portal under the CC-BY 4.0 license.
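
The first processing step described above (pulling plain text out of HTML documents stored in JSON) might look roughly like the following minimal sketch. It is not the project's actual extraction script: the JSON layout (a list of records) and the field name `html` are assumptions made for illustration, and only the Python standard library is used.

```python
import json
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text and skips <script> and <style> content."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text found outside skipped elements.
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    """Return the visible text of one HTML document as a single string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def extract_documents(json_text, html_field="html"):
    """Parse a JSON list of records and extract text from each record.

    The field name `html_field` is hypothetical; the real source files
    may use a different structure.
    """
    records = json.loads(json_text)
    return [html_to_text(record[html_field]) for record in records]


if __name__ == "__main__":
    sample = json.dumps(
        [{"html": "<html><body><h1>Zakon</h1><p>1. člen</p></body></html>"}]
    )
    print(extract_documents(sample))
```

The extracted text would then be passed on to tokenization and the later annotation steps, which this sketch does not cover.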

DSI Relevance: eJustice