Information Retrieval

Definitions

List of definitions of key types and concepts.

The project defines how a collection of documents is organized into a corpus and blocks.

Corpus - collection of text documents organized into blocks;
Block - subset of the corpus, small enough to be processed in memory;
Document - piece of text with metadata, of which the most important is DocumentId;
DocumentId - unique identifier of the document.
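
The definitions above can be sketched as plain data types. This is only an illustration in Python, not the project's own code: the field names, the integer `DocumentId`, and the `make_corpus` helper with its `block_size` parameter are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Iterator, List

# Assumption: the source only says the id is unique; an int is used here.
DocumentId = int


@dataclass
class Document:
    doc_id: DocumentId   # the most important piece of metadata
    text: str
    title: str = ""      # hypothetical extra metadata field


@dataclass
class Block:
    # A subset of the corpus that is small enough to process in memory.
    documents: List[Document]


@dataclass
class Corpus:
    blocks: List[Block]

    def documents(self) -> Iterator[Document]:
        # Walk the corpus one block at a time, yielding each document.
        for block in self.blocks:
            yield from block.documents


def make_corpus(docs: List[Document], block_size: int) -> Corpus:
    # Hypothetical helper: split a flat list into fixed-size blocks.
    blocks = [
        Block(docs[i:i + block_size])
        for i in range(0, len(docs), block_size)
    ]
    return Corpus(blocks)
```

For example, `make_corpus(docs, 2)` applied to five documents gives three blocks, and `corpus.documents()` still visits the documents in their original order.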

The project defines a number of types to process text documents organized in a corpus.

Transformer - converts a corpus of documents, preserving the structure of the corpus but changing its representation: text parsing, cleaning, tokenization, etc.
Indexer - builds an index from a corpus.
Token - a tuple of a term, a document id and the term's position in the document.
BuildableIndex - a type used to build an index from a list of tokens; it creates a SearchableIndex.
SearchableIndex - supports searching for a term in the corpus.
Boolean Search Engine - performs text search in the corpus using the index. Supports AND/OR/NOT operators.
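
How these types could fit together is sketched below in Python. It is an illustration under stated assumptions, not the project's implementation: the whitespace tokenizer stands in for a real Transformer, and the method names (`add`, `build`, `search`) and the `search_and`/`search_or`/`search_not` helpers are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Set


class Token(NamedTuple):
    term: str
    doc_id: int
    position: int


def tokenize(doc_id: int, text: str) -> List[Token]:
    # Stand-in for a Transformer: lowercase and split on whitespace.
    return [Token(term, doc_id, pos)
            for pos, term in enumerate(text.lower().split())]


class SearchableIndex:
    def __init__(self, postings: Dict[str, Dict[int, List[int]]],
                 all_docs: Set[int]) -> None:
        self._postings = postings
        self.all_docs = all_docs

    def search(self, term: str) -> Set[int]:
        # Ids of the documents that contain the term.
        return set(self._postings.get(term.lower(), {}))


class BuildableIndex:
    def __init__(self) -> None:
        # term -> doc_id -> positions of the term in that document
        self._postings = defaultdict(lambda: defaultdict(list))
        self._docs: Set[int] = set()

    def add(self, token: Token) -> None:
        self._postings[token.term][token.doc_id].append(token.position)
        self._docs.add(token.doc_id)

    def build(self) -> SearchableIndex:
        # Copy into plain dicts so later calls to add() do not leak in.
        frozen = {term: {doc: list(pos) for doc, pos in docs.items()}
                  for term, docs in self._postings.items()}
        return SearchableIndex(frozen, set(self._docs))


# Boolean operators expressed as set operations on document ids.
def search_and(index: SearchableIndex, a: str, b: str) -> Set[int]:
    return index.search(a) & index.search(b)


def search_or(index: SearchableIndex, a: str, b: str) -> Set[int]:
    return index.search(a) | index.search(b)


def search_not(index: SearchableIndex, a: str) -> Set[int]:
    return index.all_docs - index.search(a)
```

In this sketch NOT is evaluated against the set of all indexed documents, which is why the searchable index keeps `all_docs`. A real engine would more likely combine NOT with another operand (for example `a AND NOT b`) to avoid materializing the whole document set.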

The project also provides a set of types to build a corpus from a Wikipedia dump.
