MARCELL Bulgarian legislative subcorpus v2

As of the end of March 2021, the MARCELL Bulgarian subcorpus consists of 29,648 documents classified into fifteen types. The documents span the period 1946–2021.
The data has been retrieved from the Bulgarian State Gazette (http://dv.parliament.bg), the official journal of the Bulgarian government, which publishes documents issued by official institutions such as the government, the National Assembly of Bulgaria, and the Constitutional Court. A C++ based NLP pipeline for Bulgarian, built to meet the project's requirements for autonomy and sustainability, continuously feeds the Bulgarian corpus with newly issued legislative documents. Data is extracted from a single web source and then transformed. The transformation phase converts the data format, filters document types, organises the data into structures, and enriches it with metadata and linguistic information. The annotation modules of the pipeline comprise a sentence splitter, a tokeniser, a part-of-speech tagger, a lemmatiser, a UD parser, a named entity recogniser, a noun phrase parser, an IATE term annotator, a EuroVoc descriptor annotator and a EuroVoc MT annotator (https://www.aclweb.org/anthology/2020.lrec-1.863/). The documents are classified into the EuroVoc Top Level Domains. The classification module is made available as part of the Bulgarian NLP Pipeline.
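
The extract, transform and annotate flow described above can be illustrated with a minimal sketch. It is written in Python purely for readability and is not the project's C++ implementation. The `Document` class, the `KEPT_TYPES` set, and the regex-based splitter and tokeniser are all hypothetical stand-ins, and only the first two annotation steps are shown.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Document:
    """Hypothetical container for one retrieved document."""
    doc_id: str
    doc_type: str
    text: str
    metadata: dict = field(default_factory=dict)
    sentences: list = field(default_factory=list)  # one token list per sentence


# Assumption: an illustrative whitelist; the real corpus uses fifteen types.
KEPT_TYPES = {"law", "decree"}


def split_sentences(text):
    """Toy sentence splitter: break after ., ! or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def tokenise(sentence):
    """Toy tokeniser: runs of word characters, or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", sentence)


def annotate(doc):
    """Attach sentence and token structure to a document."""
    doc.sentences = [tokenise(s) for s in split_sentences(doc.text)]
    return doc


def run_pipeline(raw_docs, source):
    """Filter by document type, add metadata, then annotate."""
    kept = [d for d in raw_docs if d.doc_type in KEPT_TYPES]
    for d in kept:
        d.metadata["source"] = source
    return [annotate(d) for d in kept]
```

In a fuller version, each further annotator (tagger, lemmatiser, parser, and so on) would be another function applied in sequence after `tokenise`.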

Project: https://marcell-project.eu