PKN Orlen Dataset (Processed)

2 Last update: 2019-01-16

Attribution details: PKN Orlen Dataset was created for the European Language Resources Coordination Action (ELRC) (http://lr-coordination.eu/) by Ogrodniczuk Maciej, Institute of Computer Science, Polish Academy of Sciences with primary data copyrighted by PKN Orlen and is licensed under "CC-BY 4.0" (https://creativecommons.org/licenses/by/4.0/).

Dataset of the Polish public sector company PKN Orlen, a major Polish oil refiner and petrol retailer. The dataset comprises 4 Polish-English files in XLIFF format, 100K word tokens in total:
1. Orlen Annual Report 2015
2. Gas (R)evolution in Poland
3. Orlen Shale Gas Report
4. European Energy Union
The texts were aligned semi-automatically at the level of translation segments (mostly sentences and short paragraphs) and manually verified. They are available in the XLIFF format, which preserves the original order of the aligned segments.
The dataset was then converted into an English-Polish resource in TMX format containing 2,343 translation units (TUs).
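As a rough illustration of how the TMX resource could be consumed, the sketch below reads English-Polish segment pairs with Python's standard library. The function name `read_tmx_pairs` and the `SAMPLE_TMX` content are hypothetical and invented for this example; they are not taken from the dataset, and the actual file layout may differ in details such as header attributes or language codes.

```python
import io
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def read_tmx_pairs(source, src_lang="en", tgt_lang="pl"):
    """Return (source, target) text pairs from a TMX path or binary file object."""
    pairs = []
    root = ET.parse(source).getroot()
    for tu in root.iter("tu"):
        segs = {}
        for tuv in tu.findall("tuv"):
            # TMX 1.4 uses xml:lang; older files may use a plain "lang" attribute.
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            seg = tuv.find("seg")
            if seg is not None:
                # Keep only the primary language subtag, e.g. "en-GB" -> "en".
                segs[lang.split("-")[0]] = "".join(seg.itertext()).strip()
        if src_lang in segs and tgt_lang in segs:
            pairs.append((segs[src_lang], segs[tgt_lang]))
    return pairs


# Invented sample content, used only to show the expected structure.
SAMPLE_TMX = """<?xml version="1.0" encoding="UTF-8"?>
<tmx version="1.4">
  <header creationtool="example" creationtoolversion="0" segtype="sentence"
          o-tmf="none" adminlang="en" srclang="en" datatype="plaintext"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Annual Report 2015</seg></tuv>
      <tuv xml:lang="pl"><seg>Raport roczny 2015</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="en"><seg>Shale gas</seg></tuv>
      <tuv xml:lang="pl"><seg>Gaz łupkowy</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="en"><seg>Unpaired segment</seg></tuv>
    </tu>
  </body>
</tmx>
"""

pairs = read_tmx_pairs(io.BytesIO(SAMPLE_TMX.encode("utf-8")))
for en, pl in pairs:
    print(en, "|", pl)
```

Run against the real file (by passing its path instead of the in-memory sample), `len(pairs)` would be expected to match the stated 2,343 TUs only if every unit carries both an English and a Polish variant.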

DSI Relevance: BusinessRegistersInterconnectionSystem

Distribution

Availability: Available

Licences

CC-BY-4.0

Conditions: Attribution

Distribution Details

Attribution Details: PKN Orlen Dataset was created for the European Language Resources Coordination Action (ELRC) (http://lr-coordination.eu/) by Ogrodniczuk Maciej, Institute of Computer Science, Polish Academy of Sciences with primary data copyrighted by PKN Orlen and is licensed under "CC-BY 4.0" (https://creativecommons.org/licenses/by/4.0/).

IPR Holders

PKN ORLEN

Contact Person

Maciej Ogrodniczuk

text

Bilingual text corpus

Languages

Polish (pl)

English (en)

Linguality

Linguality type: Bilingual

Multi-linguality type: Parallel

Text Format

TMX

Size

2,343 Translation Units

Character encoding

UTF-8

Domains

ENERGY (Eurovoc 66)

Annotation

Alignment

Segmentation level: Paragraph, Sentence

Creation

Creation mode details: The dataset was provided as a collection of four XLIFF (.xlf) files, which were merged into a single TMX file. In post-processing, several filters were applied to discard or annotate alignments that might be incorrect or of limited use for training MT systems.
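The merge-and-filter step described above might look roughly like the following sketch. It is not the tooling actually used for this dataset. It assumes the XLIFF 1.2 namespace, and the filters in `keep_pair` (empty segments, identical source and target, extreme length ratio) are illustrative guesses, since the record does not say which filters were applied. All function names and the sample content are hypothetical.

```python
import io
import xml.etree.ElementTree as ET

XLIFF_URI = "urn:oasis:names:tc:xliff:document:1.2"  # assumption: XLIFF 1.2 input
NS = {"x": XLIFF_URI}
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def read_xliff_units(source):
    """Yield (src_lang, src_text, tgt_lang, tgt_text) for each trans-unit."""
    root = ET.parse(source).getroot()
    for file_el in root.findall("x:file", NS):
        src_lang = file_el.get("source-language", "")
        tgt_lang = file_el.get("target-language", "")
        for unit in file_el.iter("{%s}trans-unit" % XLIFF_URI):
            src = unit.find("x:source", NS)
            tgt = unit.find("x:target", NS)
            if src is None or tgt is None:
                continue
            yield (src_lang, "".join(src.itertext()).strip(),
                   tgt_lang, "".join(tgt.itertext()).strip())


def keep_pair(src, tgt, max_ratio=3.0):
    """Illustrative filters only; the dataset's real filters are not documented here."""
    if not src or not tgt:
        return False  # one side is empty
    if src == tgt:
        return False  # likely untranslated
    longer, shorter = max(len(src), len(tgt)), min(len(src), len(tgt))
    return longer / shorter <= max_ratio  # drop implausible length ratios


def merge_to_tmx(sources):
    """Merge several XLIFF inputs into one TMX tree, preserving segment order."""
    tmx = ET.Element("tmx", version="1.4")
    ET.SubElement(tmx, "header", {
        "creationtool": "example", "creationtoolversion": "0",
        "segtype": "sentence", "o-tmf": "xliff", "adminlang": "en",
        "srclang": "pl", "datatype": "plaintext",
    })
    body = ET.SubElement(tmx, "body")
    for source in sources:
        for src_lang, src, tgt_lang, tgt in read_xliff_units(source):
            if not keep_pair(src, tgt):
                continue
            tu = ET.SubElement(body, "tu")
            for lang, text in ((src_lang, src), (tgt_lang, tgt)):
                tuv = ET.SubElement(tu, "tuv", {XML_LANG: lang})
                ET.SubElement(tuv, "seg").text = text
    return ET.ElementTree(tmx)


# Invented sample input: one usable unit and one with an empty target.
SAMPLE_XLIFF = """<?xml version="1.0" encoding="UTF-8"?>
<xliff version="1.2" xmlns="urn:oasis:names:tc:xliff:document:1.2">
  <file source-language="pl" target-language="en" datatype="plaintext" original="a">
    <body>
      <trans-unit id="1"><source>Raport roczny 2015</source><target>Annual Report 2015</target></trans-unit>
      <trans-unit id="2"><source>Gaz</source><target></target></trans-unit>
    </body>
  </file>
</xliff>
"""

sample_bytes = SAMPLE_XLIFF.encode("utf-8")
tree = merge_to_tmx([io.BytesIO(sample_bytes), io.BytesIO(sample_bytes)])
print(ET.tostring(tree.getroot(), encoding="unicode"))
```

This sketch discards filtered pairs outright. Annotating them instead, as the record also mentions, could be done by attaching a `prop` or `note` element to the `tu` rather than skipping it.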

Creation mode: Automatic

Resource Creation

Funding Project

European Language Resource Coordination LOT3 (ELRC Data - Tools and Resources for CEF Automated Translation-LOT3 (SMART 2015/1091-30-CE-0816766/00-92))

URL: http://www.lr-coordi...

Funding Type: Service Contract

Funder: European Commission

Funding Country: European Union (EU)

Project duration: 13/12/2016 - 12/02/2020

Metadata

Created: 19/09/2016

Last Updated: 15/12/2016

Metadata Language: English (en)

Metadata Creator

Vassilis Papavassiliou