Apache Tika - a content analysis toolkit

Apache Tika is a toolkit for detecting and extracting metadata and structured text content from over a thousand different file types (such as PPT, XLS, and PDF). All of these file types can be parsed through a single interface, which makes Tika useful for search engine indexing, content analysis, translation, and similar tasks.
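
To illustrate what detecting a file type behind a single entry point can look like, here is a minimal Python sketch. It is not Tika's API or implementation: the names detect and MAGIC_SIGNATURES are hypothetical, and it only assumes the well-known leading bytes of PDF files and of the OLE2 container used by legacy PPT and XLS files.

```python
# Hypothetical sketch of signature-based type detection.
# None of these names come from Apache Tika; they are illustrative only.

# Each entry pairs a known leading byte sequence with a human-readable label.
MAGIC_SIGNATURES = [
    (b"%PDF-", "PDF"),
    # OLE2 compound file header, the container used by legacy .ppt and .xls files.
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "OLE2 compound file"),
]


def detect(data: bytes) -> str:
    """Return a label for the first signature that prefixes data, else 'unknown'."""
    for signature, label in MAGIC_SIGNATURES:
        if data.startswith(signature):
            return label
    return "unknown"


if __name__ == "__main__":
    # One call handles every supported type; the caller never picks a detector.
    print(detect(b"%PDF-1.7\n"))
    print(detect(b"plain text"))
```

A real toolkit goes well beyond leading bytes (file names, container inspection, and per-format parsers), so treat this only as a picture of the single-interface idea.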

Languages: Portuguese (pt), French (fr), Finnish (fi), Italian (it), Dutch (nl), Greek (el), English (en), Hungarian (hu), Norwegian Bokmål (nb), Swedish (sv), German (de), Spanish (es), Icelandic (is), Polish (pl), Danish (da), Estonian (et)