Gensim is a free Python library that automatically extracts semantic topics from documents. It is designed to process raw, unstructured digital texts (“plain text”).
The algorithms in Gensim, such as Word2Vec, FastText, Latent Semantic Analysis (LSA, also called LSI; see LsiModel) and Latent Dirichlet Allocation (LDA; see LdaModel), automatically discover the semantic structure of documents by examining statistical co-occurrence patterns within a corpus of training documents. These algorithms are unsupervised, which means no human annotation is necessary: you only need a corpus of plain text documents.
Once these statistical patterns are found, any plain text document (a sentence, phrase, word…) can be succinctly expressed in the new, semantic representation and queried for topical similarity against other documents (words, phrases…).