Spanish-Italian website parallel corpus (Processed)

1 Last update: 2020-04-13

Attribution details: See COPYRIGHT file, which contains Source Owner

This parallel corpus of bilingual texts, crawled from multilingual websites, contains 3,319 translation units (TUs).
Date of crawling: 23/01/2017
The source data already went through a strict validation process, which discarded:
- TUs from crawled websites that do not comply with the PSI directive,
- TUs with more than 99% of misspelled tokens,
- TUs identified during the manual validation process, and all the TUs from websites whose error rate in the sample extracted for manual validation is strictly above the following thresholds:
  - 50% of TUs with language identification errors,
  - 50% of TUs with alignment errors,
  - 50% of TUs with tokenization errors,
  - 20% of TUs identified as machine translated content,
  - 50% of TUs with translation errors.
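
The threshold rule above can be read as a simple check. The following is a minimal sketch only, not the validation tooling actually used for this corpus. The dictionary keys and function names are hypothetical, and the sketch assumes that a website is discarded as soon as any one of its sampled error rates is strictly above the corresponding threshold.

```python
# Thresholds copied from the list above, expressed as fractions.
# The key names are hypothetical labels chosen for this sketch.
THRESHOLDS = {
    "language_identification": 0.50,
    "alignment": 0.50,
    "tokenization": 0.50,
    "machine_translated": 0.20,
    "translation": 0.50,
}

# A TU is dropped when more than 99% of its tokens are misspelled.
MISSPELLED_LIMIT = 0.99


def site_is_discarded(error_rates):
    """Return True if any sampled error rate is strictly above its threshold.

    `error_rates` maps an error type to the fraction of sampled TUs showing
    that error. Missing error types are treated as 0.0 (an assumption).
    """
    return any(
        error_rates.get(kind, 0.0) > limit
        for kind, limit in THRESHOLDS.items()
    )


def tu_is_discarded(misspelled_ratio):
    """Return True if the share of misspelled tokens in a TU exceeds 99%."""
    return misspelled_ratio > MISSPELLED_LIMIT
```

Because the comparison is strict, a website whose sample shows exactly 20% machine translated TUs would be kept under this reading, while one at 21% would be dropped.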

Distribution

Availability: Available

Licences

Open Under-PSI

Used for resources that fall under the scope of PSI (Public Sector Information) regulations, and for which no further information is required or available. For more information on the EU legislation on the reuse of Public Sector Information, see here: https://ec.europa.eu/digital-single-market/en/european-legislation-reuse-public-sector-information.

Distribution Details

Attribution Details: See COPYRIGHT file, which contains Source Owner

IPR Holders

See COPYRIGHT file

Contact Person

Arranz Victoria

text

Bilingual text corpus

Languages

Spanish; Castilian (es) (79,122 Tokens)

Italian (it) (71,272 Tokens)

Linguality

Linguality type: Bilingual

Multi-linguality type: Parallel

Text Format

TMX

Size

3,319 Translation Units

Character encoding

UTF-8
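
Since the corpus is distributed as UTF-8 encoded TMX, its translation units can be read with a standard XML parser. The following is a minimal sketch using Python's standard library. The function names are hypothetical, the sample content is invented for illustration and is not taken from the corpus, and the sketch assumes the usual TMX layout of `tu` elements containing `tuv` elements with an `xml:lang` attribute and a `seg` child.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def translation_units(root):
    """Collect each <tu> as a dict mapping language code to segment text."""
    units = []
    for tu in root.iter("tu"):
        unit = {}
        for tuv in tu.iter("tuv"):
            # Older TMX versions use a plain "lang" attribute instead.
            lang = tuv.get(XML_LANG) or tuv.get("lang")
            seg = tuv.find("seg")
            if lang is not None and seg is not None:
                # itertext() also gathers text nested in inline markup.
                unit[lang.lower()] = "".join(seg.itertext()).strip()
        units.append(unit)
    return units


def load_tmx(path):
    """Parse a TMX file from disk; `path` is whatever the download is named."""
    return translation_units(ET.parse(path).getroot())


# Invented two-language sample, only to show the expected structure.
SAMPLE = """<tmx version="1.4"><header/><body>
<tu><tuv xml:lang="es"><seg>Hola</seg></tuv><tuv xml:lang="it"><seg>Ciao</seg></tuv></tu>
</body></tmx>"""

units = translation_units(ET.fromstring(SAMPLE))
print(len(units), units[0]["es"], units[0]["it"])
```

Calling `len(load_tmx(path))` on the downloaded file gives a quick way to compare the number of units against the size stated in this record.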

Resource Creation

Created using ELRC Services

Funding Project

European Language Resource Coordination LOT3 (ELRC Data - Tools and Resources for CEF Automated Translation-LOT3 (SMART 2015/1091-30-CE-0816766/00-92))

URL: http://www.lr-coordi...

Funding Type: Service Contract

Funder: European Commission

Funding Country: European Union (EU)

Project duration: 13/12/2016 - 12/02/2020

Metadata

Created: 28/03/2017

Last Updated: 11/04/2019

Metadata Language: English (en)

Metadata Creator

Victoria Arranz