Spanish-Italian website parallel corpus (Processed) 
See the COPYRIGHT file, which identifies the source owner.
This is a parallel corpus of bilingual texts crawled from multilingual websites. It contains 3,319 translation units (TUs).
Date of crawling: 23/01/2017
A strict validation process was already followed for the source data, which resulted in discarding:
- TUs from crawled websites that do not comply with the PSI directive,
- TUs with more than 99% of misspelled tokens,
- TUs identified during the manual validation process, and all TUs from websites whose error rate in the sample extracted for manual validation is strictly above the following thresholds:
  - 50% of TUs with language identification errors,
  - 50% of TUs with alignment errors,
  - 50% of TUs with tokenization errors,
  - 20% of TUs identified as machine translated content,
  - 50% of TUs with translation errors.
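
The threshold rules above can be read as a simple per-website check. The following is a minimal sketch, not the tooling actually used to build this corpus: the function names, the dictionary keys, and the representation of error rates as fractions between 0 and 1 are all assumptions made for illustration. Only the threshold values and the "strictly above" comparison come from the description.

```python
# Hypothetical sketch of the discard rules described above.
# Threshold values come from the description; all names are illustrative.
SITE_THRESHOLDS = {
    "language_identification": 0.50,
    "alignment": 0.50,
    "tokenization": 0.50,
    "machine_translated": 0.20,
    "translation": 0.50,
}

# A TU is discarded when more than 99% of its tokens are misspelled.
MISSPELLED_LIMIT = 0.99


def site_is_discarded(sample_error_rates):
    """Return True if any sampled error rate is strictly above its threshold.

    sample_error_rates maps an error type to the fraction of TUs in the
    manually validated sample showing that error (missing keys count as 0).
    """
    return any(
        sample_error_rates.get(name, 0.0) > limit
        for name, limit in SITE_THRESHOLDS.items()
    )


def tu_is_discarded(misspelled_ratio):
    """Return True if the share of misspelled tokens in a TU exceeds 99%."""
    return misspelled_ratio > MISSPELLED_LIMIT
```

Because the comparison is strict, a website whose sample shows exactly 50% alignment errors would be kept under this reading, while one with 21% machine translated content would be discarded.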