MARCELL Croatian legislative subcorpus

The Croatian corpus consists of 33,561 documents representing national legislation from 1990 until October 2019. It is composed of legally binding acts (laws, regulations, decisions, orders, etc.) and internally binding acts (ordinances, recommendations, etc.). There are 12 different text types, of which ordinances (11,521), decisions (7,735) and laws (3,798) are the three most frequent. The data were obtained in collaboration with the Central State Office for the Development of the Digital Society of the Republic of Croatia (RDD), whose mission includes securing online access to all Croatian legal documentation. We received the final data set from their database in October 2019, and the figures presented here reflect the state of the database at that time.
Regarding copyright, the RDD web page carries the following statement: “Information for reuse on the website of the Central Catalogue of Official Documents of the Republic of Croatia of the Central Office for the Development of the Digital Society is available to users without restrictions and for free use with Open license. The Open licence shall allow the user to use any information to which it relates, including the spatial and temporal unlimited, free of charge, not exclusive and personal right to use the information subject to the licence. The open licence relates both to the content and structure of the dataset in question representing public sector information, as well as to metadata relating to the information concerned.”
The data were delivered in a proprietary XML format and had to be converted into the CoNLL-U Plus format; the relevant accompanying metadata were extracted from the RDD database.
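To illustrate the target format, the following is a minimal sketch of how a CoNLL-U Plus file can be read. It relies only on the format's defining feature, a `# global.columns = ...` header that names the tab-separated columns. The function name `parse_conllu_plus`, the four-column layout and the sample sentence are assumptions made for this example; they are not the column set or the conversion code used for the corpus.

```python
def parse_conllu_plus(text):
    """Parse CoNLL-U Plus text into (columns, sentences).

    Each sentence is a list of tokens; each token is a dict that maps
    the column names declared in the header to the values on that row.
    """
    columns = None
    sentences = []
    current = []
    for line in text.splitlines():
        if line.startswith("# global.columns ="):
            # The header declares the column names, separated by spaces.
            columns = line.split("=", 1)[1].split()
        elif line.startswith("#"):
            # Other comment lines (sentence ids, raw text) are skipped here.
            continue
        elif not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
        else:
            if columns is None:
                raise ValueError("missing '# global.columns' header")
            values = line.split("\t")
            if len(values) != len(columns):
                raise ValueError(f"expected {len(columns)} fields: {line!r}")
            current.append(dict(zip(columns, values)))
    if current:
        sentences.append(current)
    return columns, sentences


# Illustrative sample only (an assumed, reduced column set).
SAMPLE = "\n".join([
    "# global.columns = ID FORM LEMMA UPOS",
    "# text = Zakon o radu",
    "\t".join(["1", "Zakon", "zakon", "NOUN"]),
    "\t".join(["2", "o", "o", "ADP"]),
    "\t".join(["3", "radu", "rad", "NOUN"]),
    "",
])

columns, sentences = parse_conllu_plus(SAMPLE)
```

Because the column names come from the header rather than from fixed positions, the same reader would work unchanged if additional annotation columns were declared.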
The corpus was analysed with the Croatian Language Web Services: paragraphs and sentences were split, and tokens were identified and annotated morphologically and syntactically. An annotation tool is being developed to annotate IATE terms and EuroVoc descriptors in the corpus by matching them against single-word and multi-word expressions (SWEs/MWEs) in the text. The overall size of the corpus is almost 10.3 M sentences and around 102 M tokens.
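One possible way to match terms against single-word and multi-word expressions is a greedy longest-match lookup over the lemma sequence of each sentence. The sketch below shows that idea only; it is not the annotation tool under development. The function `match_terms`, the `max_len` limit, and the term identifiers `T1` and `T2` are hypothetical and do not correspond to real IATE or EuroVoc entries.

```python
def match_terms(lemmas, term_index, max_len=5):
    """Greedy longest-match lookup of terms in a lemma sequence.

    `term_index` maps tuples of lemmas to a term identifier.
    Returns a list of (start, end, term_id) with `end` exclusive.
    """
    matches = []
    i = 0
    while i < len(lemmas):
        found = None
        # Try the longest candidate span first, then shorter ones.
        for n in range(min(max_len, len(lemmas) - i), 0, -1):
            key = tuple(lemmas[i:i + n])
            if key in term_index:
                found = (i, i + n, term_index[key])
                break
        if found:
            matches.append(found)
            i = found[1]  # continue after the matched span
        else:
            i += 1
    return matches


# Hypothetical term list: one single-word and one multi-word entry.
TERMS = {
    ("zakon",): "T1",
    ("zakon", "o", "rad"): "T2",
}

multi = match_terms(["zakon", "o", "rad", "stupiti", "na", "snaga"], TERMS)
single = match_terms(["novi", "zakon"], TERMS)
```

Matching on lemmas rather than surface forms matters for a morphologically rich language such as Croatian, where a term can appear in many inflected forms. Preferring the longest match means the multi-word entry takes precedence over the single-word entry it contains.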