Spanish-English website parallel corpus (Processed)
See the COPYRIGHT file, which lists the source owners.
This is a parallel corpus of bilingual texts crawled from multilingual websites. It contains 21,007 translation units (TUs).
Period of crawling: 15/11/2016 - 23/01/2017
A strict validation process has been followed, which resulted in discarding:
- TUs from crawled websites that do not comply with the PSI directive,
- TUs in which more than 99% of tokens are misspelled,
- TUs identified during the manual validation process and all the TUs from websites whose error rate in the sample extracted for manual validation is strictly above the following thresholds:
  - 50% of TUs with language identification errors,
  - 50% of TUs with alignment errors,
  - 50% of TUs with tokenization errors,
  - 20% of TUs identified as machine translated content,
  - 50% of TUs with translation errors.
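The per-website threshold rule above can be sketched in a few lines of Python. This is only an illustration of the stated thresholds, not the tooling actually used to build the corpus: the function name, the dictionary keys, and the input format are hypothetical, and the sketch assumes a website is discarded as soon as any single error rate is strictly above its threshold.

```python
# Thresholds taken from the description above, expressed as fractions.
# The key names are hypothetical labels chosen for this sketch.
THRESHOLDS = {
    "language_identification": 0.50,
    "alignment": 0.50,
    "tokenization": 0.50,
    "machine_translated": 0.20,
    "translation": 0.50,
}


def site_is_discarded(sample_error_rates):
    """Return True if any sampled error rate is strictly above its threshold.

    `sample_error_rates` is a hypothetical dict mapping an error type to the
    fraction (0.0 to 1.0) of manually validated TUs showing that error.
    Error types missing from the dict are treated as having a rate of 0.0.
    """
    return any(
        sample_error_rates.get(error_type, 0.0) > limit
        for error_type, limit in THRESHOLDS.items()
    )
```

Because the comparison is strict, a website whose sample shows exactly 50% alignment errors would be kept under this reading, while one with 25% machine translated content would be discarded.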
People who downloaded this resource also downloaded the following:
- Spanish-English website parallel corpus
- Portuguese-English bilingual corpus from Legislation concerning the Portuguese Parliament (Processed)
- Slovenian-English corpus with statistical reports from the Statistical Office of the Republic of Slovenia website (Processed)
- Spanish-French website parallel corpus (Processed)