Bulgarian web corpus MaCoCu-bg 1.0


3 Last update: 2023-08-03


MaCoCu-bg 1.0

https://macocu.eu/

The Bulgarian web corpus MaCoCu-bg 1.0 was built by crawling the ".bg" and ".бг" internet top-level domains in 2021, extending the crawl dynamically to other domains as well (https://github.com/macocu/MaCoCu-crawler).

Considerable effort was devoted to cleaning the extracted text to provide a high-quality web corpus. This was achieved by removing boilerplate (https://corpus.tools/wiki/Justext) and near-duplicate paragraphs (https://corpus.tools/wiki/Onion), and by discarding very short texts as well as texts that are not in the target language. The dataset includes extensive metadata that allows filtering based on text quality and other criteria (https://github.com/bitextor/monotextor), making the corpus highly useful for corpus linguistics studies, as well as for training language models and other language technologies.
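The cleaning steps described above can be illustrated with a toy sketch. It is not the MaCoCu pipeline: exact-duplicate hashing stands in for Onion's near-duplicate detection, a Cyrillic-letter ratio stands in for real language identification, and the function names and thresholds (`clean`, `cyrillic_share`, `min_words`, `min_cyrillic`) are hypothetical.

```python
import hashlib


def cyrillic_share(text):
    """Fraction of alphabetic characters in the Cyrillic block.

    Crude stand-in for language identification (assumption, not the
    method used to build the corpus).
    """
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    cyrillic = sum(1 for ch in letters if "\u0400" <= ch <= "\u04FF")
    return cyrillic / len(letters)


def clean(paragraphs, min_words=3, min_cyrillic=0.8):
    """Keep paragraphs that are long enough, look Cyrillic, and are not repeats."""
    seen = set()
    kept = []
    for text in paragraphs:
        # Normalize case and whitespace so trivial variants hash identically.
        normalized = " ".join(text.lower().split())
        if len(normalized.split()) < min_words:
            continue  # very short text
        if cyrillic_share(normalized) < min_cyrillic:
            continue  # probably not in the target language
        digest = hashlib.sha1(normalized.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # exact duplicate after normalization
        seen.add(digest)
        kept.append(text)
    return kept


sample = [
    "Това е примерен параграф.",
    "Това  е примерен ПАРАГРАФ.",     # duplicate after normalization
    "Меню",                           # too short
    "This is an English paragraph.",  # wrong script
]
cleaned = clean(sample)
```

A real near-duplicate detector compares overlapping n-grams rather than whole-paragraph hashes, so it also catches paragraphs that differ by a few words.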

Each document is accompanied by the following metadata: title, crawl date, URL, domain, file type of the original document, distribution of languages inside the document, and a fluency score (based on a language model). The text of each document is divided into paragraphs, each annotated with whether it is a heading, its quality and fluency, the automatically identified language of its text, and whether it contains personal information.
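As a minimal sketch of how the paragraph-level metadata could be used for filtering, the example below parses a small XML fragment and keeps only paragraphs in the target language, above a fluency threshold, and without personal information. The element and attribute names (`doc`, `p`, `lang`, `lm_score`, `sensitive`, `heading`, `quality`) and the sample values are assumptions made for illustration; the released files should be consulted for the actual schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; attribute names and values are illustrative only.
SAMPLE = """<corpus>
<doc title="Пример" url="https://example.bg/a" domain="example.bg" crawl_date="2021-06-15" file_type="html" lm_score="0.91">
<p heading="yes" quality="good" lm_score="0.95" lang="bg" sensitive="no">Заглавие</p>
<p heading="no" quality="good" lm_score="0.88" lang="bg" sensitive="yes">Текст с лични данни.</p>
<p heading="no" quality="short" lm_score="0.40" lang="en" sensitive="no">Cookie notice</p>
</doc>
</corpus>"""


def select_paragraphs(xml_text, lang="bg", min_score=0.5, drop_sensitive=True):
    """Return (document URL, paragraph text) pairs that pass all filters."""
    root = ET.fromstring(xml_text)
    selected = []
    for doc in root.iter("doc"):
        for p in doc.iter("p"):
            if p.get("lang") != lang:
                continue  # paragraph not in the requested language
            if float(p.get("lm_score", "0")) < min_score:
                continue  # fluency score below the threshold
            if drop_sensitive and p.get("sensitive") == "yes":
                continue  # paragraph flagged as containing personal data
            selected.append((doc.get("url"), (p.text or "").strip()))
    return selected


kept = select_paragraphs(SAMPLE)
```

For files of the size listed in this record, loading everything with `ET.fromstring` is unlikely to be practical; streaming with `xml.etree.ElementTree.iterparse` and clearing processed elements would be the more realistic choice.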

This action has received funding from the European Union's Connecting Europe Facility 2014-2020 - CEF Telecom, under Grant Agreement No. INEA/CEF/ICT/A2020/2278341. This communication reflects only the author’s view. The Agency is not responsible for any use that may be made of the information it contains.

DSI Relevance: BusinessRegistersInterconnectionSystem, Cybersecurity, ElectronicExchangeOfSocialSecurityInformation, Europeana, OnlineDisputeResolution, OpenDataPortal, eHealth, eJustice, eProcurement, saferInternet

Distribution

Availability: Under Review

Licences

CC0-1.0

Distribution Details

Download location: http://hdl.handle.ne...

Distribution Medium: Data Downloadable

Personal Data: YES

Contact Person

Miquel Esplà-Gomis

text

Monolingual text corpus

Languages

Bulgarian (bg)

Linguality

Linguality type: Monolingual

Text Format

XML

Size

10,526,510 Files

3,508,930,378 Words

Character encoding

UTF-8

Creation

Creation mode: Automatic

Resource Creation

Creation lasted: 01/06/2021 - 30/04/2022

Funding Project

MaCoCu - Massive collection and curation of monolingual and bilingual data: focus on under-resourced languages (MaCoCu - INEA/CEF/ICT/A2020/2278341)

URL: https://macocu.eu

Funding Type: EU Funds

Funder: European Union's Connecting Europe Facility 2014-2020 - CEF Telecom

Project duration: 01/06/2021 - 31/05/2023

Metadata

Created: 27/04/2022

Last Updated: 27/04/2022

Metadata Language: English (en)

Version

Version: 1.0

Last Updated: 27/04/2022

Relations

Relation Type: Is Part Of
