MARCELL Romanian legislative subcorpus v2

The Romanian corpus contains 163,274 files, which represent the body of national legislation from 1881 to 2021. The corpus consists mainly of governmental decisions, ministerial orders, decisions, decrees and laws. All the texts were obtained by crawling the public Romanian legislative portal. We do not distinguish between laws that are in force and those that are no longer in force, because this is difficult to do automatically and no external resource is available to tell them apart. The texts were extracted from the original HTML format and converted into TXT files.

Each file has multiple levels of annotation. First, the texts were tokenized, lemmatized and morphologically annotated using the Tokenizing, Tagging and Lemmatizing (TTL) text processing platform developed at RACAI. They were then dependency parsed with NLP-Cube. Named entities were identified using a NER tool developed at RACAI, nominal phrases were identified with TTL, and IATE terms and EuroVoc descriptors were identified using an internal tool. All processing tools were integrated into an end-to-end pipeline, available within the RELATE platform and as a dockerized version. The files were annotated with the latest version of the pipeline, completed within Activity 4 of the MARCELL project.
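As a rough illustration of how the annotation levels could be read back, the sketch below parses a token-per-line, tab-separated file whose column names are declared in a "# global.columns" header line, in the style of CoNLL-U Plus. The file format is an assumption, since this description does not specify it. The column set, the read_sentences helper and the sample sentences are hypothetical and are not taken from the corpus.

```python
from typing import Dict, Iterator, List


def read_sentences(text: str) -> Iterator[List[Dict[str, str]]]:
    """Yield sentences as lists of {column name: value} token dicts.

    Assumes a CoNLL-U Plus style layout: a '# global.columns = ...' header,
    one token per line with tab-separated fields, blank line between sentences.
    """
    columns: List[str] = []
    sentence: List[Dict[str, str]] = []
    for line in text.splitlines():
        if line.startswith("# global.columns ="):
            # Column names are whitespace-separated after the '=' sign.
            columns = line.split("=", 1)[1].split()
        elif line.startswith("#"):
            continue  # other comment or metadata lines are skipped
        elif not line.strip():
            if sentence:  # a blank line closes the current sentence
                yield sentence
                sentence = []
        else:
            sentence.append(dict(zip(columns, line.split("\t"))))
    if sentence:  # last sentence when the file has no trailing blank line
        yield sentence


# Invented sample with a reduced, hypothetical column set (not corpus data).
SAMPLE = (
    "# global.columns = ID FORM LEMMA UPOS HEAD DEPREL NER\n"
    "1\tGuvernul\tguvern\tNOUN\t2\tnsubj\tB-ORG\n"
    "2\tadoptă\tadopta\tVERB\t0\troot\tO\n"
    "\n"
    "1\tLegea\tlege\tNOUN\t0\troot\tO\n"
)

sentences = list(read_sentences(SAMPLE))
for token in sentences[0]:
    print(token["FORM"], token["LEMMA"], token["DEPREL"], token["NER"])
```

Reading the column names from the header rather than hard-coding them means that extra annotation columns (for example for nominal phrases, IATE terms or EuroVoc descriptors) would show up as additional keys without changes to the parser.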