ParaCrawl release 9 English-Polish - deferred files

Last update: 2021-11-25

ParaCrawl 9 en-pl

https://paracrawl.eu

This file contains the URLs and text hashes needed to form a parallel corpus, but not the sentences themselves. You probably want the actual parallel data; see the version without "deferred files" in the title. To reconstruct a parallel corpus, use the deferred crawling tool at https://github.com/bitextor/deferred-crawling, which downloads the pages and produces a corpus; the result will probably be smaller than the original because of link rot. This format is intended to support parties whose lawyers believe it is OK to scrape websites directly but not OK to copy them from a third party. The data is based on the English-Polish parallel data from release 9 of the ParaCrawl project, specifically "Continued Web-Scale Provision of Parallel Corpora for European Languages". This version is filtered with BiCleaner AI. Data was crawled from the web following robots.txt, as is standard practice. The crawl is not targeted at a particular domain and is intended to provide broad coverage.
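The idea of shipping hashes instead of text can be illustrated with a small sketch. It is not the deferred crawling tool's actual algorithm or file format: the hash function (SHA-256 here), the helper names `text_hash` and `recover_sentences`, and the sample sentences are all assumptions chosen for illustration. It shows how sentences from a freshly downloaded page might be kept only when their hash matches a published one, and why the result shrinks when a page has changed or disappeared.

```python
import hashlib


def text_hash(sentence: str) -> str:
    # Hypothetical choice: SHA-256 over the stripped UTF-8 text.
    # The real tool may use a different hash function and normalisation.
    return hashlib.sha256(sentence.strip().encode("utf-8")).hexdigest()


def recover_sentences(page_sentences, wanted_hashes):
    # Keep only the downloaded sentences whose hash was published.
    wanted = set(wanted_hashes)
    return [s for s in page_sentences if text_hash(s) in wanted]


# Sentences extracted from a page downloaded today (invented sample data).
page = ["Hello world.", "Unrelated navigation text", "Goodbye."]

# Hashes published in place of the text; one refers to a sentence
# that is no longer on the page (link rot or an edited page).
wanted = [
    text_hash("Hello world."),
    text_hash("Goodbye."),
    text_hash("This sentence vanished."),
]

recovered = recover_sentences(page, wanted)
print(recovered)
```

Only two of the three published hashes can be matched in this sample, which mirrors the note above that a reconstructed corpus is probably smaller than the original.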

DSI Relevance: BusinessRegistersInterconnectionSystem, Cybersecurity, ElectronicExchangeOfSocialSecurityInformation, Europeana, OnlineDisputeResolution, OpenDataPortal, eHealth, eJustice, eProcurement, saferInternet

Distribution

Availability: Under Review

Licences

CC0-1.0

Distribution Details

Download location: https://s3.amazonaws...

Distribution Medium: Data Downloadable

Personal Data: YES

Contact Person

Kenneth Heafield

text

Bilingual text corpus

Languages

Polish (pl)

English (en)

Linguality

Linguality type: Bilingual

Multi-linguality type: Parallel

Text Format

TMX

Size

40,082,037 Translation Units

Character encoding

UTF-8
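As a rough illustration of the format listed above, the following sketch streams translation units out of a UTF-8 TMX document with Python's standard library. The sample document, its header attribute values, and the helper name `iter_units` are invented for the example. The layout of the deferred files themselves is not described here, so this assumes plain TMX with `tu`, `tuv` and `seg` elements and tolerates a missing `seg`.

```python
import io
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Invented two-unit sample; real files hold millions of units.
SAMPLE_TMX = """<?xml version="1.0" encoding="UTF-8"?>
<tmx version="1.4">
  <header srclang="en" datatype="plaintext" segtype="sentence"
          adminlang="en" creationtool="example" creationtoolversion="1"
          o-tmf="none"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Good morning.</seg></tuv>
      <tuv xml:lang="pl"><seg>Dzień dobry.</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="en"><seg>Thank you.</seg></tuv>
      <tuv xml:lang="pl"><seg>Dziękuję.</seg></tuv>
    </tu>
  </body>
</tmx>
"""


def iter_units(source):
    # Stream the file so memory use stays flat for very large corpora.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "tu":
            unit = {}
            for tuv in elem.findall("tuv"):
                seg = tuv.find("seg")
                unit[tuv.get(XML_LANG)] = seg.text if seg is not None else None
            yield unit
            elem.clear()  # release the finished unit's children


units = list(iter_units(io.BytesIO(SAMPLE_TMX.encode("utf-8"))))
print(len(units), units[0])
```

Streaming with `iterparse` rather than loading the whole tree is a design choice suited to a file of this size; for a real download, pass an opened binary file object instead of the in-memory sample.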

Resource Creation

Funding Project

Continued Web-Scale Provision of Parallel Corpora for European Languages (ParaCrawl)

URL: http://paracrawl.eu/

Funding Type: EU Funds

Funder: European Commission

Metadata

Created: 29/09/2021

Last Updated: 29/09/2021

Metadata Language: English (en)
