MARCELL Hungarian legislative subcorpus v2

The Hungarian corpus, which represents Hungarian national legislation, contains 26,821 documents retrieved from PDF files of the official gazette Magyar Közlöny, which is freely available for download online. The corpus covers 11 text types spanning different kinds of legal texts: laws, regulations, decrees, etc. The documents were published between 1991 and 2019.
The data was analysed with the e-magyar text processing system. The system was extended with detokenization functionality, developed specifically for the MARCELL project, which provides the SpaceAfter=No annotation indicating that there was no whitespace between two tokens in the original text. Additional scripts were created to extract the necessary metadata, to convert the output to CoNLL-U Plus format, to annotate IATE terms and EuroVoc descriptors in the text, and to classify the documents into top-level EuroVoc domains. EuroVoc MT codes corresponding to the EuroVoc descriptors were also added to the annotation.
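As an illustration of how the SpaceAfter=No annotation can be used, the following minimal Python sketch reads CoNLL-U Plus text and rebuilds the original surface string of each sentence. It is not part of the MARCELL tooling. The function names parse_conllu_plus and detokenize are hypothetical, the sample sentence is invented, and the sample uses a reduced column set (ID, FORM, LEMMA, MISC) rather than the corpus's actual columns. The sketch assumes only that the file starts with a "# global.columns" header, as the CoNLL-U Plus format requires, and that SpaceAfter=No is stored in a column named MISC.

```python
def parse_conllu_plus(text):
    """Yield sentences as lists of dicts keyed by the declared column names."""
    columns = None
    sentence = []
    for line in text.splitlines():
        if line.startswith("# global.columns ="):
            # CoNLL-U Plus declares its column names in this header line.
            columns = line.split("=", 1)[1].split()
        elif line.startswith("#"):
            continue  # other comment lines (e.g. "# text = ...") are skipped
        elif not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
        else:
            if columns is None:
                raise ValueError("missing '# global.columns' header")
            sentence.append(dict(zip(columns, line.split("\t"))))
    if sentence:
        yield sentence


def detokenize(sentence):
    """Rebuild the surface text, honouring SpaceAfter=No in the MISC column."""
    parts = []
    for token in sentence:
        parts.append(token["FORM"])
        misc = token.get("MISC", "_")
        if "SpaceAfter=No" not in misc.split("|"):
            parts.append(" ")
    return "".join(parts).rstrip()


# Invented sample with a reduced, hypothetical column set.
SAMPLE = "\n".join([
    "# global.columns = ID FORM LEMMA MISC",
    "# text = A törvény hatályba lép.",
    "\t".join(["1", "A", "a", "_"]),
    "\t".join(["2", "törvény", "törvény", "_"]),
    "\t".join(["3", "hatályba", "hatály", "_"]),
    "\t".join(["4", "lép", "lép", "SpaceAfter=No"]),
    "\t".join(["5", ".", ".", "_"]),
    "",
])

if __name__ == "__main__":
    for sent in parse_conllu_plus(SAMPLE):
        print(detokenize(sent))
```

Reading the column names from the header rather than hard-coding positions keeps the sketch independent of the exact set of additional annotation columns in the corpus files.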
The raw data comprises 31.2M tokens; the analysed corpus is 2.9 GB in CoNLL-U Plus format.