The Law on Higher Education and Science 2018 (Processed)
EN-PL dataset in TMX format (2679 TUs) extracted from the Act: The Law on Higher Education and Science (2018) laying down the rules for the functioning of higher education and science in the Republic of Poland.
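As a rough illustration of how translation units in a TMX file can be read, the sketch below parses a tiny inline sample with Python's standard library. The sample content, the function name read_tmx_pairs and the two-letter language matching are assumptions made for this example; the language codes actually used in the distributed file may differ (for instance "en-GB" rather than "en").

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical two-language sample; not taken from the actual dataset.
SAMPLE_TMX = """<tmx version="1.4">
  <header creationtool="example" creationtoolversion="1.0" segtype="sentence"
          o-tmf="none" adminlang="en" srclang="en" datatype="plaintext"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Hypothetical English segment.</seg></tuv>
      <tuv xml:lang="pl"><seg>Hipotetyczny segment polski.</seg></tuv>
    </tu>
  </body>
</tmx>"""


def read_tmx_pairs(tmx_text, src="en", tgt="pl"):
    """Return (source, target) segment pairs from a TMX document string."""
    root = ET.fromstring(tmx_text)
    pairs = []
    for tu in root.iter("tu"):
        segments = {}
        for tuv in tu.findall("tuv"):
            # Reduce codes such as "en-GB" to their primary subtag "en".
            lang = tuv.get(XML_LANG, "").lower().split("-")[0]
            seg = tuv.find("seg")
            if seg is not None:
                # itertext() also collects text nested in inline markup.
                segments[lang] = "".join(seg.itertext()).strip()
        if src in segments and tgt in segments:
            pairs.append((segments[src], segments[tgt]))
    return pairs


pairs = read_tmx_pairs(SAMPLE_TMX)
print(len(pairs), "translation unit(s) read")
```

For the real file, the same function could be fed the file contents read with UTF-8 decoding; the count printed should then match the number of TUs stated above if every unit carries both languages.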
First, text was extracted from the PDF. Custom scripts were then developed and applied to preserve the proper layout (e.g. merging the text of a paragraph that was split across two pages). Multilingual embeddings were used to identify pairs of parallel sentences/segments. Finally, parallel corpus filtering methods were applied to remove duplicates and translation units of limited or no use.
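The final filtering step can be pictured with a minimal sketch such as the one below. The specific rules (exact-duplicate removal after normalization, dropping empty, untranslated or letter-free units, and a length-ratio threshold) are common heuristics chosen here as assumptions; the description does not state which filters were actually used, and the names filter_units and max_len_ratio are hypothetical.

```python
import re
import unicodedata


def normalize(text):
    """Normalize Unicode form, whitespace and case for comparison only."""
    text = unicodedata.normalize("NFKC", text)
    return re.sub(r"\s+", " ", text).strip().casefold()


def has_letters(text):
    """True if the segment contains at least one alphabetic character."""
    return any(ch.isalpha() for ch in text)


def filter_units(pairs, max_len_ratio=3.0):
    """Keep the first occurrence of each useful (source, target) pair."""
    seen = set()
    kept = []
    for src, tgt in pairs:
        key = (normalize(src), normalize(tgt))
        if not key[0] or not key[1]:
            continue  # one side is empty
        if key in seen:
            continue  # duplicate translation unit
        if key[0] == key[1]:
            continue  # source copied to target, nothing translated
        if not (has_letters(src) and has_letters(tgt)):
            continue  # numbering or punctuation only
        longer = max(len(key[0]), len(key[1]))
        shorter = min(len(key[0]), len(key[1]))
        if longer / shorter > max_len_ratio:
            continue  # lengths too different, likely misaligned
        seen.add(key)
        kept.append((src, tgt))
    return kept


# Hypothetical sample units, not taken from the dataset.
sample = [
    ("Article 1", "Art. 1"),
    ("Article 1", "Art. 1"),
    ("12.", "12."),
    ("", "x"),
    ("The Act enters into force.", "Ustawa wchodzi w życie."),
]
filtered = filter_units(sample)
print(len(filtered), "of", len(sample), "units kept")
```

The length-ratio threshold is deliberately loose in this sketch; a stricter value would remove more misaligned pairs at the cost of discarding some legitimate short translations.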
DSI Relevance: OpenDataPortal