DSI-enriched ParaCrawl 9 en-nl corpus

This is a derivative work based on ParaCrawl release 9 English-Dutch (https://paracrawl.eu/). This version of the corpus adds a set of probabilities expressing the affinity of each segment pair to specific Digital Service Infrastructures (DSIs): Cybersecurity, Electronic Exchange of Social Security Information, E-health, E-justice, Europeana, Online Dispute Resolution, Open Data Portal and Safer Internet. The probabilities were assigned by a pre-trained language model (DeBERTa-v3-large) fine-tuned on a crawled corpus of English DSI-specific texts. More information is available on the corresponding GitHub page: https://github.com/RikVN/DSI. All other information from the original version of the corpus remains unchanged.
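
As an illustration of how the per-segment DSI probabilities might be used, the following minimal Python sketch selects segment pairs for a single DSI. The record layout (English text, Dutch text, mapping from DSI label to probability), the function name filter_by_dsi, the 0.5 threshold and the sample sentences and scores are all assumptions made for this example. They are not taken from the corpus, whose actual file format may differ.

```python
from typing import Dict, Iterable, Iterator, Tuple

# Hypothetical record layout: (english, dutch, {dsi_label: probability}).
# The real corpus files may store these fields differently.
Record = Tuple[str, str, Dict[str, float]]


def filter_by_dsi(records: Iterable[Record], dsi: str,
                  threshold: float = 0.5) -> Iterator[Record]:
    """Yield segment pairs whose probability for `dsi` is at least `threshold`."""
    for en, nl, probs in records:
        # A missing label is treated as probability 0.0.
        if probs.get(dsi, 0.0) >= threshold:
            yield en, nl, probs


# Invented toy records, for illustration only (not real corpus data).
sample = [
    ("Report a phishing e-mail.", "Meld een phishing-e-mail.",
     {"Cybersecurity": 0.91, "E-health": 0.02}),
    ("Book a doctor's appointment.", "Maak een afspraak bij de dokter.",
     {"Cybersecurity": 0.03, "E-health": 0.88}),
]

selected = list(filter_by_dsi(sample, "Cybersecurity", threshold=0.5))
```

The threshold is a free choice: a higher value would be expected to give a smaller, more domain-specific subset, and a lower value a larger, noisier one.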

This action has received funding from the European Union's Connecting Europe Facility 2014-2020 - CEF Telecom, under Grant Agreement No. INEA/CEF/ICT/A2020/2278341. This communication reflects only the author’s view. The Agency is not responsible for any use that may be made of the information it contains.

DSI Relevance: BusinessRegistersInterconnectionSystem, Cybersecurity, ElectronicExchangeOfSocialSecurityInformation, Europeana, OnlineDisputeResolution, OpenDataPortal, eHealth, eJustice, eProcurement, saferInternet