List of definitions of key types and concepts.
The project defines how a collection of documents is organized into a corpus and blocks:
- Corpus - collection of text documents organized into blocks;
- Block - subset of corpus, small enough to fit processing in memory;
- Document - piece of text with metadata, the most important of which is DocumentId;
- DocumentId - unique identifier of the document.
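
The relationship between these four concepts can be sketched in a few lines of Python. This is only an illustration: the class names mirror the definitions above, but the field names, the integer `DocumentId`, and the `make_corpus` helper are assumptions made for the example, not the project's actual types.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List

# Assumption: DocumentId is modelled as an int; any unique hashable value would do.
DocumentId = int


@dataclass
class Document:
    """A piece of text plus metadata; doc_id is the key piece of metadata."""
    doc_id: DocumentId
    text: str
    metadata: Dict[str, str] = field(default_factory=dict)


@dataclass
class Block:
    """A subset of the corpus, small enough to be processed in memory."""
    documents: List[Document]


@dataclass
class Corpus:
    """A collection of documents organized into blocks."""
    blocks: List[Block]

    def documents(self) -> Iterator[Document]:
        # Walk the corpus one block at a time.
        for block in self.blocks:
            yield from block.documents


def make_corpus(docs: List[Document], block_size: int) -> Corpus:
    """Hypothetical helper: split docs into blocks of at most block_size documents."""
    blocks = [Block(docs[i:i + block_size]) for i in range(0, len(docs), block_size)]
    return Corpus(blocks)
```

In this sketch the block size is counted in documents; a real implementation would more likely bound blocks by memory footprint.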
The project defines a number of types to process text documents organized in a corpus:
- Transformer - converts a corpus of documents, preserving the structure of the corpus but changing the representation of the documents: text parsing, cleaning, tokenization, etc.
- Indexer - builds an index from a corpus.
- Token is a tuple of term, document id and term's position in the document.
- BuildableIndex is a type used to build an index out of a list of tokens; it creates a SearchableIndex_.
- SearchableIndex supports search for a term in the corpus.
- Boolean Search Engine - performs text search in the corpus using the index. Supports AND/OR/NOT operators.
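
To make the flow from tokens to a searchable index concrete, here is a minimal Python sketch. The type names follow the list above, but every method name (`add`, `build`, `search`, `all_documents`), the whitespace tokenizer, and the `boolean_*` helpers are assumptions chosen for illustration; the project's real interfaces may differ.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple, Set


class Token(NamedTuple):
    """A term, the id of the document it occurs in, and its position there."""
    term: str
    doc_id: int
    position: int


def tokenize(doc_id: int, text: str) -> List[Token]:
    # Assumed stand-in for a Transformer step: lowercase, split on whitespace.
    return [Token(term, doc_id, pos) for pos, term in enumerate(text.lower().split())]


class SearchableIndex:
    """Read-only index: term -> {doc_id -> positions}."""

    def __init__(self, postings: Dict[str, Dict[int, List[int]]], all_docs: Set[int]):
        self._postings = postings
        self._all_docs = all_docs

    def search(self, term: str) -> Set[int]:
        # Ids of the documents that contain the term.
        return set(self._postings.get(term, {}))

    def all_documents(self) -> Set[int]:
        return set(self._all_docs)


class BuildableIndex:
    """Accumulates tokens, then produces a SearchableIndex."""

    def __init__(self) -> None:
        self._postings = defaultdict(lambda: defaultdict(list))
        self._all_docs: Set[int] = set()

    def add(self, tokens: Iterable[Token]) -> None:
        for t in tokens:
            self._postings[t.term][t.doc_id].append(t.position)
            self._all_docs.add(t.doc_id)

    def build(self) -> SearchableIndex:
        # Copy into plain dicts so the result is independent of the builder.
        frozen = {term: dict(docs) for term, docs in self._postings.items()}
        return SearchableIndex(frozen, set(self._all_docs))


# Boolean operators expressed as set operations on document ids.
def boolean_and(index: SearchableIndex, a: str, b: str) -> Set[int]:
    return index.search(a) & index.search(b)


def boolean_or(index: SearchableIndex, a: str, b: str) -> Set[int]:
    return index.search(a) | index.search(b)


def boolean_not(index: SearchableIndex, a: str) -> Set[int]:
    return index.all_documents() - index.search(a)
```

A possible usage: feed `tokenize(doc_id, text)` for each document into `BuildableIndex.add`, call `build()`, and then combine `search` results with the `boolean_*` helpers. Positions are stored but unused here; they would matter for phrase or proximity queries.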
The project also provides a set of types to build a corpus from a Wikipedia dump.