Shallow Processing with Unification and Typed Feature Structures
SProUT is DFKI LT Lab's linguistic Swiss army knife, a flexible multi-purpose engine for domain-independent and domain-specific multilingual NLP tasks such as structured named entity recognition, information extraction, opinion mining, ontology extraction from text, and many more.
SProUT is also a platform for development of multilingual shallow text processing and information extraction systems.
It consists of several reusable, Unicode-capable online linguistic processing components for basic linguistic operations ranging from tokenization to coreference matching. Since typed feature structures (TFS) are used as a uniform data structure for representing the input and output of each of these processing resources, they can be flexibly combined into a pipeline that produces several streams of linguistically annotated structures, which serve as input for the shallow grammar interpreter applied at the next stage.
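The idea of components that exchange one uniform data structure can be illustrated with a minimal Python sketch. It is not SProUT's API: the function names (tokenize, classify, run_pipeline), the feature names (SURFACE, START, END, CLASS) and the use of plain untyped dictionaries in place of typed feature structures are all assumptions made for illustration.

```python
from typing import Callable, Dict, List

# Assumption: a feature structure is approximated by a plain dict (no type hierarchy).
FS = Dict[str, object]


def tokenize(text: str) -> List[FS]:
    """Hypothetical tokenizer: one feature structure per whitespace-separated token."""
    structures: List[FS] = []
    pos = 0
    for surface in text.split():
        start = text.index(surface, pos)
        end = start + len(surface)
        structures.append(
            {"TYPE": "token", "SURFACE": surface, "START": start, "END": end}
        )
        pos = end
    return structures


def classify(tokens: List[FS]) -> List[FS]:
    """Hypothetical token classifier: adds a coarse CLASS feature to each structure."""
    result: List[FS] = []
    for tok in tokens:
        surface = str(tok["SURFACE"])
        if surface.isdigit():
            token_class = "number"
        elif surface[:1].isupper():
            token_class = "capitalized"
        else:
            token_class = "other"
        result.append({**tok, "CLASS": token_class})
    return result


def run_pipeline(text: str, stages: List[Callable[[List[FS]], List[FS]]]) -> List[FS]:
    """Tokenize, then pass the structures through each stage in order."""
    data = tokenize(text)
    for stage in stages:
        data = stage(data)
    return data


annotated = run_pipeline("Anna paid 300 euros", [classify])
for fs in annotated:
    print(fs)
```

Because every stage consumes and produces the same kind of structure, stages can be reordered or swapped without changing the interfaces between them, which is the property the description above attributes to the TFS-based design.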
The grammar formalism in SProUT, called XTDL, is a blend of very efficient finite-state techniques and unification-based formalisms, which are known to guarantee transparency and expressiveness. A grammar in SProUT consists of pattern/action rules, where the LHS of a rule is a regular expression over TFSs with functional operators and coreferences, representing the recognition pattern, and the RHS of a rule is a TFS specification of the output structure.
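The pattern/action idea can be sketched in a few lines of Python. This is not XTDL syntax and not SProUT's implementation: feature structures are approximated by untyped nested dictionaries, the LHS is reduced to a plain sequence of patterns (no repetition, optionality, or functional operators), and coreferences are imitated by copying values inside the action function. The names unify and match_rule and all feature names are hypothetical.

```python
from typing import Callable, Dict, List, Optional

# Assumption: untyped nested dicts stand in for typed feature structures.
FS = Dict[str, object]


def unify(a: object, b: object) -> Optional[object]:
    """Unify two structures; return None on a clash.

    Simplification: None signals failure, so None must not occur as a feature value.
    """
    if isinstance(a, dict) and isinstance(b, dict):
        result = dict(a)
        for key, b_val in b.items():
            if key in result:
                merged = unify(result[key], b_val)
                if merged is None:
                    return None
                result[key] = merged
            else:
                result[key] = b_val
        return result
    # Atomic values unify only if they are equal.
    return a if a == b else None


def match_rule(
    pattern: List[FS], tokens: List[FS], action: Callable[[List[FS]], FS]
) -> List[FS]:
    """Slide the LHS pattern over the input; build an RHS structure for each match."""
    outputs: List[FS] = []
    n = len(pattern)
    for i in range(len(tokens) - n + 1):
        matched: List[FS] = []
        for pat, tok in zip(pattern, tokens[i:i + n]):
            unified = unify(pat, tok)
            if unified is None:
                break
            matched.append(unified)
        else:
            # Every pattern element unified with its token: fire the action.
            outputs.append(action(matched))
    return outputs


tokens = [
    {"SURFACE": "Anna", "CLASS": "capitalized"},
    {"SURFACE": "paid", "CLASS": "other"},
    {"SURFACE": "300", "CLASS": "number"},
    {"SURFACE": "euros", "CLASS": "other"},
]

# LHS: a number followed by the token "euros".
money_pattern = [{"CLASS": "number"}, {"SURFACE": "euros"}]


def money_action(matched: List[FS]) -> FS:
    """RHS: the output structure reuses the matched number's surface form."""
    return {"TYPE": "money", "AMOUNT": matched[0]["SURFACE"], "CURRENCY": "EUR"}


results = match_rule(money_pattern, tokens, money_action)
print(results)
```

In this sketch the pattern elements are underspecified structures, so a token matches whenever it carries at least the requested features with compatible values. A full formalism of the kind described above would additionally check type compatibility and support regular-expression operators over the pattern elements.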
Furthermore, SProUT comes with an integrated grammar development and testing environment.
Currently, the platform provides linguistic processing resources for several languages including, among others, English, German, French, Italian, Dutch, Spanish, Polish, Czech, Chinese, and Japanese.
Languages: German (de), French (fr), Italian (it), Dutch; Flemish (nl), Spanish; Castilian (es), Polish (pl), Czech (cs), Japanese (ja), English (en), Chinese (zh)