Corpus Crawler – ELRC-SHARE


Last update: 2019-10-18

Corpus Crawler

https://github.com/googlei18n/corpuscrawler

https://opensource.google.com/projects/corpuscrawler

Corpus Crawler is a tool for Corpus Linguistics. Modern linguistic research works on language corpora, which are large samples of “real world” text. This crawler helps to build such corpora: it follows links to publicly accessible web pages known to be written in a certain language; it removes boilerplate and HTML markup; finally, it writes its output into plaintext files. The crawler implements the Robots Exclusion Standard, and it is intentionally slow so it does not cause much load on the crawled web sites.
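The steps described above (check robots.txt, fetch a page, strip markup and boilerplate, pause between requests) can be sketched in a few lines of Python. This is an illustrative sketch only, not Corpus Crawler's actual code: the names `TextExtractor`, `html_to_text`, `make_robots` and `crawl`, the `example-bot` user agent, and the choice of which tags count as boilerplate are all assumptions made for this example. The `fetch` argument is a caller-supplied function, so the sketch runs without network access.

```python
import time
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


class TextExtractor(HTMLParser):
    """Collects text content, skipping a few boilerplate-like elements."""

    # Assumption: these tags are treated as boilerplate in this sketch.
    SKIP = {"script", "style", "nav", "header", "footer"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text found outside skipped elements.
        if self.skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    """Return the plain text of an HTML document, one fragment per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


def make_robots(robots_txt):
    """Build a robots.txt checker from the file's text."""
    robots = RobotFileParser()
    robots.parse(robots_txt.splitlines())
    return robots


def crawl(urls, fetch, robots, agent="example-bot", delay=0.0):
    """Fetch each allowed URL, convert it to text, and pause between requests.

    `fetch` is a hypothetical callable mapping a URL to its HTML string.
    """
    texts = {}
    for url in urls:
        if not robots.can_fetch(agent, url):
            continue  # respect the Robots Exclusion Standard
        texts[url] = html_to_text(fetch(url))
        time.sleep(delay)  # deliberate slowness to limit load on the site
    return texts
```

In real use, `fetch` would wrap an HTTP client, `delay` would be set to several seconds, and each extracted text would be written to a plaintext file rather than kept in memory.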

Distribution

Availability: Available

Licences

Distribution Details

IPR Holders

Contact Person

Tool/Service

Tool (Web Crawling)

Language Dependent

Resource Creation

Funding Project

Not Applicable (N/A)

Funding Type: Other

Metadata

Created: 14/05/2019

Last Updated: 14/05/2019

Metadata Language: English (en)
