Manufactured data based on ParaCrawl 8 v2

A synthetic corpus created by substituting dictionary words into the ParaCrawl 8 corpora. Both the substitution process and the dictionaries have been improved since the first release based on ParaCrawl 8.

(1) We identify foreign words that are rare in ParaCrawl, defined as:
- occurs 4-20 times in the corpus
- most probable translation has at least 30% probability mass
(based on fast_align alignments)
- English translation is not identical to the word
- English translation occurs no more than 100 times
- occurs at least 50 times in CommonCrawl; same for translation
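
The criteria above can be sketched as a single filter function. This is only an
illustration, not the tool used to build the corpus: the WordStats record and
its field names are hypothetical, the 4-20 range is assumed to be inclusive,
and the 100-occurrence limit is assumed to refer to the English side of
ParaCrawl.

```python
from dataclasses import dataclass


@dataclass
class WordStats:
    """Hypothetical per-word statistics; field names are illustrative only."""
    word: str                  # foreign word
    translation: str           # most probable English translation (from alignments)
    translation_prob: float    # probability mass of that translation, 0.0-1.0
    corpus_count: int          # occurrences of the foreign word in ParaCrawl
    translation_count: int     # occurrences of the translation (assumed: in ParaCrawl)
    cc_count: int              # occurrences of the foreign word in CommonCrawl
    cc_translation_count: int  # occurrences of the translation in CommonCrawl


def is_rare_candidate(s: WordStats) -> bool:
    """Return True if the word meets all rarity criteria listed above."""
    return (
        4 <= s.corpus_count <= 20          # rare in the corpus (assumed inclusive)
        and s.translation_prob >= 0.30     # translation is reasonably certain
        and s.translation != s.word        # not a copy of the foreign word
        and s.translation_count <= 100     # translation is itself not frequent
        and s.cc_count >= 50               # but the word is attested in CommonCrawl
        and s.cc_translation_count >= 50   # and so is its translation
    )
```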

(2) We use the synthesis tool to generate artificial sentence pairs, to wit:
- foreign word and replacement word have similar monolingual word embeddings;
  same for the English translation and its replacement word
- at most 1000 sentence pairs are generated from a word pair
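
The two constraints above might look roughly like this in code. This is a
minimal sketch and not the synthesis tool itself: cosine similarity, the 0.5
threshold, all function names, and the assumption that substitution is a
token-for-token swap on both sides of a sentence pair are hypothetical. Only
the cap of 1000 pairs per word pair comes from the description above.

```python
import math

MAX_PAIRS_PER_WORD_PAIR = 1000  # cap stated in the description above
SIM_THRESHOLD = 0.5             # hypothetical; the actual threshold is not documented


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def is_valid_replacement(src_emb, src_repl_emb, tgt_emb, tgt_repl_emb,
                         threshold=SIM_THRESHOLD):
    """Both sides must be similar in their own monolingual embedding space."""
    return (cosine(src_emb, src_repl_emb) >= threshold
            and cosine(tgt_emb, tgt_repl_emb) >= threshold)


def synthesize(sentence_pairs, old_src, new_src, old_tgt, new_tgt,
               limit=MAX_PAIRS_PER_WORD_PAIR):
    """Swap one word pair for another in tokenized sentence pairs.

    sentence_pairs is an iterable of (src_tokens, tgt_tokens) lists.
    Stops once `limit` synthetic pairs have been produced.
    """
    out = []
    for src, tgt in sentence_pairs:
        if old_src in src and old_tgt in tgt:
            out.append((
                [new_src if tok == old_src else tok for tok in src],
                [new_tgt if tok == old_tgt else tok for tok in tgt],
            ))
            if len(out) >= limit:
                break
    return out
```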

Synthetic parallel corpora are created for the 8 lowest-resource languages in
the official ParaCrawl release:
Estonian (et), Irish (ga), Croatian (hr), Icelandic (is), Latvian (lv),
Lithuanian (lt), Maltese (mt), Slovene (sl).

