Shallow Processing with Unification and Typed Feature Structures


SProUT is DFKI LT Lab's linguistic Swiss army knife, a flexible multi-purpose engine for domain-independent and domain-specific multilingual NLP tasks such as structured named entity recognition, information extraction, opinion mining, ontology extraction from text, and many more.
SProUT is also a platform for development of multilingual shallow text processing and information extraction systems.
It consists of several reusable, Unicode-capable online linguistic processing components for basic linguistic operations ranging from tokenization to coreference matching. Because typed feature structures (TFS) serve as the uniform data structure for the input and output of each of these components, they can be flexibly combined into a pipeline that produces several streams of linguistically annotated structures. These streams serve as input for the shallow grammar interpreter, which is applied at the next stage.
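The idea of chaining components over one shared data structure can be illustrated with a minimal Python sketch. This is not SProUT code: plain dictionaries stand in for typed feature structures, and the names `tokenize`, `classify`, and `run_pipeline`, as well as the feature names `SURFACE`, `START`, `END`, and `CLASS`, are hypothetical choices made for this illustration only.

```python
from typing import Callable, Dict, List

# Assumption: a plain dict stands in for a typed feature structure.
FS = Dict[str, object]
Component = Callable[[List[FS]], List[FS]]


def tokenize(text: str) -> List[FS]:
    """Split on whitespace and record surface form plus character offsets."""
    out: List[FS] = []
    pos = 0
    for word in text.split():
        start = text.index(word, pos)
        pos = start + len(word)
        out.append({"TYPE": "token", "SURFACE": word, "START": start, "END": pos})
    return out


def classify(items: List[FS]) -> List[FS]:
    """Add a coarse token class to a copy of each structure."""
    result: List[FS] = []
    for fs in items:
        surface = str(fs["SURFACE"])
        if surface.isdigit():
            cls = "number"
        elif surface[:1].isupper():
            cls = "capitalized"
        else:
            cls = "lowercase"
        result.append({**fs, "CLASS": cls})
    return result


def run_pipeline(text: str, components: List[Component]) -> List[FS]:
    """Tokenize, then pass the structures through each component in order."""
    items = tokenize(text)
    for component in components:
        items = component(items)
    return items


annotated = run_pipeline("Anna paid 20 euros", [classify])
```

Because every component consumes and produces the same kind of structure, components can be reordered or added without changing the others, which is the property the paragraph above attributes to the uniform TFS representation.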
The grammar formalism in SProUT, called XTDL, blends very efficient finite-state techniques with unification-based formalisms, which are known to guarantee transparency and expressiveness. A grammar in SProUT consists of pattern/action rules: the left-hand side (LHS) of a rule is a regular expression over TFSs with functional operators and coreferences, representing the recognition pattern, and the right-hand side (RHS) is a TFS specification of the output structure.
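A rough Python sketch may help make the pattern/action idea concrete. It is a simplification under several assumptions and does not show XTDL syntax or SProUT's actual matching algorithm: feature structures are untyped nested dictionaries (no type hierarchy), the LHS is a plain sequence rather than a full regular expression, a string starting with `#` is treated as a coreference variable, and functional operators are omitted. All function and feature names are hypothetical.

```python
from typing import Dict, List, Optional

FS = Dict[str, object]  # assumption: untyped nested dicts with atomic leaves


def unify(a: object, b: object) -> Optional[object]:
    """Unify two structures; return the merged result or None on a clash.

    Assumption: atomic values are never None, so None can signal failure.
    """
    if isinstance(a, dict) and isinstance(b, dict):
        merged = dict(a)
        for key, value in b.items():
            if key in merged:
                sub = unify(merged[key], value)
                if sub is None:
                    return None
                merged[key] = sub
            else:
                merged[key] = value
        return merged
    return a if a == b else None


def match_element(pattern: FS, fs: FS, bindings: Dict[str, object]) -> Optional[Dict[str, object]]:
    """Match one LHS element against one input structure.

    Simplification: every feature named in the pattern must be present
    in the input. Returns the extended bindings, or None on failure.
    """
    new = dict(bindings)
    for feat, want in pattern.items():
        if feat not in fs:
            return None
        have = fs[feat]
        if isinstance(want, str) and want.startswith("#"):
            # Coreference variable: bind it, or check an earlier binding.
            if want in new and new[want] != have:
                return None
            new[want] = have
        elif unify(want, have) is None:
            return None
    return new


def apply_rule(lhs: List[FS], rhs: FS, tokens: List[FS]) -> Optional[FS]:
    """Try the LHS sequence at each position; build the RHS on the first match."""
    for start in range(len(tokens) - len(lhs) + 1):
        bindings: Optional[Dict[str, object]] = {}
        for pattern, fs in zip(lhs, tokens[start:]):
            bindings = match_element(pattern, fs, bindings)
            if bindings is None:
                break
        else:
            # Replace variables in the RHS with the values bound on the LHS.
            return {k: bindings.get(v, v) if isinstance(v, str) else v
                    for k, v in rhs.items()}
    return None


tokens = [
    {"SURFACE": "Anna", "CLASS": "capitalized"},
    {"SURFACE": "paid", "CLASS": "lowercase"},
    {"SURFACE": "20", "CLASS": "number"},
    {"SURFACE": "euros", "CLASS": "lowercase"},
]
lhs = [{"CLASS": "number", "SURFACE": "#amount"}, {"SURFACE": "euros"}]
rhs = {"TYPE": "money", "AMOUNT": "#amount", "CURRENCY": "EUR"}
result = apply_rule(lhs, rhs, tokens)
```

In this sketch the variable `#amount` carries the matched surface string from the recognition pattern into the output structure, which is the role coreferences play between the LHS and RHS of a rule.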
Furthermore, SProUT comes with an integrated grammar development and testing environment.
Currently, the platform provides linguistic processing resources for several languages, including, among others, English, German, French, Italian, Dutch, Spanish, Polish, Czech, Chinese, and Japanese.

Languages: German (de), French (fr), Italian (it), Dutch; Flemish (nl), Spanish; Castilian (es), Polish (pl), Czech (cs), Japanese (ja), English (en), Chinese (zh)