The Translation Memory (TM) cleaning service is offered within the CEF Data Marketplace platform and aims to remove wrong or dirty translation units (TUs) from the TMs uploaded to the Marketplace. The TM-cleaner is based on the sentence embeddings provided by the LASER suite. Given a TU, sentence embeddings are extracted for both the source and the target sentence, each with respect to its own language. The cosine similarity between the source embedding and the target embedding is then computed: if it reaches a given threshold, the TU is labeled as clean; otherwise, it is labeled as dirty.
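The scoring step described above can be sketched as follows. This is a minimal illustration, not the TM-cleaner's actual code: it assumes the two embeddings have already been produced by LASER and are passed in as plain lists of floats. The function names `cosine_similarity` and `label_tu` and the default threshold of 0.7 are hypothetical, chosen only for the example.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Return the cosine similarity between two embedding vectors."""
    if len(u) != len(v):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # A zero vector has no direction; treat it as maximally dissimilar.
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def label_tu(src_emb: Sequence[float],
             tgt_emb: Sequence[float],
             threshold: float = 0.7) -> str:
    """Label a TU as 'clean' if the similarity reaches the threshold.

    The default threshold is an assumed placeholder, not a value
    taken from the service.
    """
    score = cosine_similarity(src_emb, tgt_emb)
    return "clean" if score >= threshold else "dirty"
```

In practice the threshold would presumably be tuned on held-out data, since a higher value discards more TUs and a lower value keeps more noisy ones.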
It is worth noting that LASER can handle at least 93 languages, which gives the tool multilingual support.