Update Documentation for Haystack 0.5.0 (#557)
* Add languages and preprocessing pages

* add content

* address review comments

* make link relative

* update api ref with latest docstrings

* move doc readme and update

* add generator API docs

* fix example code

* design and link fix

Co-authored-by: Malte Pietsch <[email protected]>
Co-authored-by: PiffPaffM <[email protected]>
3 people authored Nov 6, 2020
1 parent f94603c commit 99e924a
Showing 13 changed files with 476 additions and 207 deletions.
33 changes: 15 additions & 18 deletions docs/_src/api/api/README.md → docs/README.md
@@ -1,36 +1,33 @@
*******************************************************
# Haystack — Docstrings Generation
*******************************************************
# :ledger: Looking for the docs?
You find them here:
#### https://haystack.deepset.ai/docs/intromd


We use Pydoc-Markdown to create markdown files from the docstrings in our code.
# :computer: How to update docs?

## Usage / Guides etc.

Will be automatically deployed with every commit to the master branch.

Update docs with all latest docstrings?
=======================================
## API Reference

We use Pydoc-Markdown to create markdown files from the docstrings in our code.

### Update docstrings
Execute this in `/haystack/docs/_src/api/api`:
```shell
pip install 'pydoc-markdown>=3.0.0,<4.0.0'
pydoc-markdown pydoc-markdown-document-store.yml
pydoc-markdown pydoc-markdown-file-converters.yml
pydoc-markdown pydoc-markdown-preprocessor.yml
pydoc-markdown pydoc-markdown-reader.yml
pydoc-markdown pydoc-markdown-generator.yml
pydoc-markdown pydoc-markdown-retriever.yml
```

Update Docstrings of individual modules
==========================================

Every .yml file will generate a new markdown file. Run one of the following commands to generate the needed output:

- **Document store**: `pydoc-markdown pydoc-markdown-document-store.yml`
- **File converters**: `pydoc-markdown pydoc-markdown-file-converters.yml`
- **Preprocessor**: `pydoc-markdown pydoc-markdown-preprocessor.yml`
- **Reader**: `pydoc-markdown pydoc-markdown-reader.yml`
- **Retriever**: `pydoc-markdown pydoc-markdown-retriever.yml`
(Or run one of the commands above to update the docstrings only for a single module)

Configuration
============
### Configuration

Pydoc will read the configuration from a `.yml` file which is located in the current working directory. Our files contain three main sections:

12 changes: 9 additions & 3 deletions docs/_src/api/api/document_store.md
@@ -110,7 +110,7 @@ the vector embeddings are indexed in a FAISS Index.
#### \_\_init\_\_

```python
| __init__(sql_url: str = "sqlite:///", index_buffer_size: int = 10_000, vector_dim: int = 768, faiss_index_factory_str: str = "Flat", faiss_index: Optional[faiss.swigfaiss.Index] = None, **kwargs)
| __init__(sql_url: str = "sqlite:///", index_buffer_size: int = 10_000, vector_dim: int = 768, faiss_index_factory_str: str = "Flat", faiss_index: Optional[faiss.swigfaiss.Index] = None, return_embedding: Optional[bool] = True, **kwargs)
```

**Arguments**:
@@ -137,6 +137,7 @@ For more details see:
Benchmarks: XXX
- `faiss_index`: Pass an existing FAISS Index, i.e. an empty one that you configured manually
or one with docs that you used in Haystack before and want to load again.
- `return_embedding`: Whether to return the document embeddings with the results

<a name="faiss.FAISSDocumentStore.write_documents"></a>
#### write\_documents
@@ -200,7 +201,7 @@ None
#### query\_by\_embedding

```python
| query_by_embedding(query_emb: np.array, filters: Optional[dict] = None, top_k: int = 10, index: Optional[str] = None) -> List[Document]
| query_by_embedding(query_emb: np.array, filters: Optional[dict] = None, top_k: int = 10, index: Optional[str] = None, return_embedding: Optional[bool] = None) -> List[Document]
```

Find the document that is most similar to the provided `query_emb` by using a vector similarity metric.
@@ -212,6 +213,7 @@
Example: {"name": ["some", "more"], "category": ["only_one"]}
- `top_k`: How many documents to return
- `index`: (SQL) index name for storing the docs and metadata
- `return_embedding`: Whether to return the document embeddings with the results

**Returns**:

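The similarity lookup behind `query_by_embedding` can be pictured with a short plain-Python sketch (a toy stand-in for illustration only — the real store delegates this to a FAISS index):

```python
# Toy sketch of vector similarity search: score each stored document
# embedding against the query embedding with a dot product and return
# the top_k highest-scoring documents.

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def query_by_embedding_sketch(query_emb, docs, top_k=10):
    # docs: list of {"text": ..., "embedding": [...]} dicts
    scored = sorted(docs, key=lambda d: dot(query_emb, d["embedding"]), reverse=True)
    return scored[:top_k]

docs = [
    {"text": "doc a", "embedding": [1.0, 0.0]},
    {"text": "doc b", "embedding": [0.0, 1.0]},
    {"text": "doc c", "embedding": [0.7, 0.7]},
]
results = query_by_embedding_sketch([1.0, 0.2], docs, top_k=2)
```

With the query embedding above, "doc a" scores highest (1.0) and "doc c" second (0.84), so those two are returned.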
@@ -271,7 +273,7 @@ class ElasticsearchDocumentStore(BaseDocumentStore)
#### \_\_init\_\_

```python
| __init__(host: str = "localhost", port: int = 9200, username: str = "", password: str = "", index: str = "document", label_index: str = "label", search_fields: Union[str, list] = "text", text_field: str = "text", name_field: str = "name", embedding_field: str = "embedding", embedding_dim: int = 768, custom_mapping: Optional[dict] = None, excluded_meta_data: Optional[list] = None, faq_question_field: Optional[str] = None, scheme: str = "http", ca_certs: bool = False, verify_certs: bool = True, create_index: bool = True, update_existing_documents: bool = False, refresh_type: str = "wait_for", similarity="dot_product", timeout=30)
| __init__(host: str = "localhost", port: int = 9200, username: str = "", password: str = "", index: str = "document", label_index: str = "label", search_fields: Union[str, list] = "text", text_field: str = "text", name_field: str = "name", embedding_field: str = "embedding", embedding_dim: int = 768, custom_mapping: Optional[dict] = None, excluded_meta_data: Optional[list] = None, faq_question_field: Optional[str] = None, analyzer: str = "standard", scheme: str = "http", ca_certs: bool = False, verify_certs: bool = True, create_index: bool = True, update_existing_documents: bool = False, refresh_type: str = "wait_for", similarity="dot_product", timeout=30, return_embedding: Optional[bool] = True)
```

A DocumentStore using Elasticsearch to store and query the documents for our search.
@@ -294,6 +296,9 @@ If no Reader is used (e.g. in FAQ-Style QA) the plain content of this field will
- `embedding_field`: Name of field containing an embedding vector (Only needed when using a dense retriever (e.g. DensePassageRetriever, EmbeddingRetriever) on top)
- `embedding_dim`: Dimensionality of embedding vector (Only needed when using a dense retriever (e.g. DensePassageRetriever, EmbeddingRetriever) on top)
- `custom_mapping`: If you want to use your own custom mapping for creating a new index in Elasticsearch, you can supply it here as a dictionary.
- `analyzer`: Specify the default analyzer from one of the built-ins when creating a new Elasticsearch Index.
Elasticsearch also has built-in analyzers for different languages (e.g. impacting tokenization). More info at:
https://www.elastic.co/guide/en/elasticsearch/reference/7.9/analysis-analyzers.html
- `excluded_meta_data`: Name of fields in Elasticsearch that should not be returned (e.g. [field_one, field_two]).
Helpful if you have fields with long, irrelevant content that you don't want to display in results (e.g. embedding vectors).
- `scheme`: 'https' or 'http', protocol used to connect to your elasticsearch instance
@@ -312,6 +317,7 @@ More info at https://www.elastic.co/guide/en/elasticsearch/reference/6.8/docs-re
- `similarity`: The similarity function used to compare document vectors. 'dot_product' is the default since it is
more performant with DPR embeddings. 'cosine' is recommended if you are using a Sentence BERT model.
- `timeout`: Number of seconds after which an ElasticSearch request times out.
- `return_embedding`: Whether to return the document embeddings with the results

<a name="elasticsearch.ElasticsearchDocumentStore.write_documents"></a>
#### write\_documents
23 changes: 12 additions & 11 deletions docs/_src/api/api/preprocessor.md
@@ -5,7 +5,7 @@
#### eval\_data\_from\_file

```python
eval_data_from_file(filename: str) -> Tuple[List[Document], List[Label]]
eval_data_from_file(filename: str, max_docs: Union[int, bool] = None) -> Tuple[List[Document], List[Label]]
```

Read Documents + Labels from a SQuAD-style file.
@@ -14,6 +14,7 @@ Document and Labels can then be indexed to the DocumentStore and be used for eva
**Arguments**:

- `filename`: Path to file in SQuAD format
- `max_docs`: Maximum number of documents to load. Defaults to None, which reads in all available eval documents.

**Returns**:

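The SQuAD-style layout this function reads can be illustrated with a minimal parser sketch (a hypothetical helper, not the Haystack implementation):

```python
# Minimal sketch of walking a SQuAD-style dict: each entry in "data"
# holds paragraphs, each paragraph has a context (the document text)
# and question/answer pairs (the labels).

def squad_to_docs_and_labels(squad):
    docs, labels = [], []
    for article in squad["data"]:
        for para in article["paragraphs"]:
            docs.append(para["context"])
            for qa in para["qas"]:
                for ans in qa["answers"]:
                    labels.append((qa["question"], ans["text"]))
    return docs, labels

sample = {
    "data": [{
        "title": "Example",
        "paragraphs": [{
            "context": "Haystack is a QA framework.",
            "qas": [{
                "question": "What is Haystack?",
                "answers": [{"text": "a QA framework", "answer_start": 12}],
            }],
        }],
    }]
}
docs, labels = squad_to_docs_and_labels(sample)
```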
@@ -97,27 +98,27 @@ class PreProcessor(BasePreProcessor)
#### \_\_init\_\_

```python
| __init__(clean_whitespace: Optional[bool] = True, clean_header_footer: Optional[bool] = False, clean_empty_lines: Optional[bool] = True, split_by: Optional[str] = "passage", split_length: Optional[int] = 10, split_stride: Optional[int] = None, split_respect_sentence_boundary: Optional[bool] = False)
| __init__(clean_whitespace: Optional[bool] = True, clean_header_footer: Optional[bool] = False, clean_empty_lines: Optional[bool] = True, split_by: Optional[str] = "word", split_length: Optional[int] = 1000, split_stride: Optional[int] = None, split_respect_sentence_boundary: Optional[bool] = True)
```

**Arguments**:

- `clean_header_footer`: use heuristic to remove footers and headers across different pages by searching
- `clean_header_footer`: Use heuristic to remove footers and headers across different pages by searching
for the longest common string. This heuristic uses exact matches and therefore
works well for footers like "Copyright 2019 by XXX", but won't detect "Page 3 of 4"
or similar.
- `clean_whitespace`: strip whitespaces before or after each line in the text.
- `clean_empty_lines`: remove more than two empty lines in the text.
- `split_by`: split the document by "word", "sentence", or "passage". Set to None to disable splitting.
- `split_length`: n number of splits to merge as a single document. For instance, if n -> 10 & split_by ->
- `clean_whitespace`: Strip whitespaces before or after each line in the text.
- `clean_empty_lines`: Remove more than two empty lines in the text.
- `split_by`: Unit for splitting the document. Can be "word", "sentence", or "passage". Set to None to disable splitting.
- `split_length`: Max. number of the above split unit (e.g. words) that are allowed in one document. For instance, if n -> 10 & split_by ->
"sentence", then each output document will have 10 sentences.
- `split_stride`: length of striding window over the splits. For example, if split_by -> `word`,
- `split_stride`: Length of striding window over the splits. For example, if split_by -> `word`,
split_length -> 5 & split_stride -> 2, then the splits would be like:
[w1 w2 w3 w4 w5, w4 w5 w6 w7 w8, w7 w8 w9 w10 w11].
Set the value to None to disable striding behaviour.
- `split_respect_sentence_boundary`: whether to split in partial sentences when if split_by -> `word`. If set
to True, the individual split would always have complete sentence &
the number of words being less than or equal to the split_length.
- `split_respect_sentence_boundary`: Whether to split in partial sentences if split_by -> `word`. If set
to True, the individual split will always have complete sentences &
the number of words will be <= split_length.
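The interplay of `split_length` and `split_stride` described above can be sketched as follows (an illustrative toy, not the PreProcessor implementation):

```python
# Sliding-window split: windows of split_length units that overlap by
# split_stride units, i.e. each window starts (split_length - split_stride)
# units after the previous one. The final window may be shorter.

def split_with_stride(units, split_length, split_stride):
    step = split_length - split_stride
    return [units[i:i + split_length] for i in range(0, len(units), step)]

words = [f"w{i}" for i in range(1, 12)]  # w1 .. w11
splits = split_with_stride(words, split_length=5, split_stride=2)
```

This reproduces the example from the docstring: the first three windows are `w1..w5`, `w4..w8`, and `w7..w11`.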

<a name="base"></a>
# base
14 changes: 14 additions & 0 deletions docs/_src/api/api/pydoc-markdown-generator.yml
@@ -0,0 +1,14 @@
loaders:
  - type: python
    search_path: [../../../../haystack/generator]
    ignore_when_discovered: ['__init__']
processor:
  - type: filter
    expression: not name.startswith('_') and default()
    documented_only: true
    do_not_filter_modules: false
    skip_empty_modules: true
renderer:
  type: markdown
  descriptive_class_title: false
  filename: generator.md
4 changes: 3 additions & 1 deletion docs/_src/api/api/reader.md
@@ -314,7 +314,7 @@ See https://huggingface.co/models for full list of available QA models

**Arguments**:

- `model_name_or_path`: Directory of a saved model or the name of a public model e.g. 'bert-base-cased', 'deepset/bert-base-cased-squad2', 'deepset/bert-base-cased-squad2', 'distilbert-base-uncased-distilled-squad'. See https://huggingface.co/models for full list of available models.
- `model_name_or_path`: Directory of a saved model or the name of a public model e.g. 'bert-base-cased',
'deepset/bert-base-cased-squad2', 'distilbert-base-uncased-distilled-squad'.
See https://huggingface.co/models for full list of available models.
- `tokenizer`: Name of the tokenizer (usually the same as model)
- `context_window_size`: Num of chars (before and after the answer) to return as "context" for each answer.
The context usually helps users to understand if the answer really makes sense.
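The `context_window_size` behaviour can be sketched with a hypothetical helper (illustration only, not the FARM/Haystack code):

```python
# Return a window of roughly window_size characters centred on the
# answer span, clipped to the document boundaries.

def context_window(text, answer_start, answer_len, window_size):
    margin = max((window_size - answer_len) // 2, 0)
    start = max(answer_start - margin, 0)
    end = min(answer_start + answer_len + margin, len(text))
    return text[start:end]

text = "Paris is the capital and most populous city of France."
ctx = context_window(text, text.index("capital"), len("capital"), 30)
```

The returned snippet always contains the answer and is at most `window_size` characters long.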
8 changes: 5 additions & 3 deletions docs/_src/api/api/retriever.md
@@ -94,7 +94,7 @@ Karpukhin, Vladimir, et al. (2020): "Dense Passage Retrieval for Open-Domain Que
#### \_\_init\_\_

```python
| __init__(document_store: BaseDocumentStore, query_embedding_model: str = "facebook/dpr-question_encoder-single-nq-base", passage_embedding_model: str = "facebook/dpr-ctx_encoder-single-nq-base", max_seq_len_query: int = 64, max_seq_len_passage: int = 256, use_gpu: bool = True, batch_size: int = 16, embed_title: bool = True, use_fast_tokenizers: bool = True, similarity_function: str = "dot_product")
| __init__(document_store: BaseDocumentStore, query_embedding_model: Union[Path, str] = "facebook/dpr-question_encoder-single-nq-base", passage_embedding_model: Union[Path, str] = "facebook/dpr-ctx_encoder-single-nq-base", max_seq_len_query: int = 64, max_seq_len_passage: int = 256, use_gpu: bool = True, batch_size: int = 16, embed_title: bool = True, use_fast_tokenizers: bool = True, similarity_function: str = "dot_product")
```

Init the Retriever incl. the two encoder models from a local or remote model checkpoint.
@@ -327,8 +327,10 @@ position in the ranking of documents the correct document is.
| Returns a dict containing the following metrics:

- "recall": Proportion of questions for which correct document is among retrieved documents
- "mean avg precision": Mean of average precision for each question. Rewards retrievers that give relevant
documents a higher rank.
- "mrr": Mean of reciprocal rank. Rewards retrievers that give relevant documents a higher rank.
Only considers the highest ranked relevant document.
- "map": Mean of average precision for each question. Rewards retrievers that give relevant
documents a higher rank. Considers all retrieved relevant documents. (only with ``open_domain=False``)
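The recall and MRR definitions above can be sketched with a toy computation (illustration only, not the Haystack eval code):

```python
# Toy retriever eval over ranked result lists.
# Each entry: (ranked list of retrieved doc ids, id of the correct doc).

def recall_and_mrr(results):
    hits, rr = 0, 0.0
    for ranked_ids, correct_id in results:
        if correct_id in ranked_ids:
            hits += 1
            # reciprocal rank of the (first) correct document
            rr += 1.0 / (ranked_ids.index(correct_id) + 1)
    n = len(results)
    return {"recall": hits / n, "mrr": rr / n}

results = [
    (["d1", "d2", "d3"], "d1"),  # correct doc at rank 1 -> RR = 1.0
    (["d4", "d5", "d6"], "d5"),  # correct doc at rank 2 -> RR = 0.5
    (["d7", "d8", "d9"], "dX"),  # correct doc not retrieved -> RR = 0
]
metrics = recall_and_mrr(results)
```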

**Arguments**:

@@ -1,24 +1,24 @@
<!---
title: "Database"
title: "Document Store"
metaTitle: "Document Store"
metaDescription: ""
slug: "/docs/database"
slug: "/docs/documentstore"
date: "2020-09-03"
id: "databasemd"
id: "documentstoremd"
--->


# Document Stores
# DocumentStores

You can think of the Document Store as a "database" that:
You can think of the DocumentStore as a "database" that:
- stores your texts and meta data
- provides them to the retriever at query time

There are different DocumentStores in Haystack to fit different use cases and tech stacks.

## Initialisation

Initialising a new Document Store is straightforward.
Initialising a new DocumentStore is straightforward.

<div class="tabs tabsdsinstall">

@@ -75,10 +75,13 @@ document_store = SQLDocumentStore()
Each DocumentStore constructor allows for arguments specifying how to connect to existing databases and the names of indexes.
See API documentation for more info.

## Preparing Documents
## Input Format

DocumentStores expect Documents in dictionary form, like the example below.
They are loaded using the `DocumentStore.write_documents()` method.
See [Preprocessing](/docs/latest/preprocessingmd) for more information on how to best prepare your data.

[//]: # (Add link to preprocessing section)

```python
document_store = ElasticsearchDocumentStore()
@@ -91,28 +94,9 @@ dicts = [
document_store.write_documents(dicts)
```

## File Conversion

There are a range of different file converters in Haystack that can help get your data into the right format.
Haystack features support for txt, pdf and docx formats and there is even a converter that leverages Apache Tika.
See the File Converters section in the API docs for more information.

<!-- _comment: !! Code snippets for each type !! -->
Haystack also has a `convert_files_to_dicts()` utility function that will convert
all txt or pdf files in a given folder into this dictionary format.

```python
document_store = ElasticsearchDocumentStore()
dicts = convert_files_to_dicts(dir_path=doc_dir)
document_store.write_documents(dicts)
```

## Writing Documents
## Writing Documents (Sparse Retrievers)

Haystack allows you to store documents in an optimised fashion so that query times can be kept low.

### For Sparse Retrievers

For **sparse**, keyword based retrievers such as BM25 and TF-IDF,
you simply have to call `DocumentStore.write_documents()`.
The creation of the inverted index which optimises querying speed is handled automatically.
@@ -121,7 +105,7 @@
document_store.write_documents(dicts)
```
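The inverted index that `write_documents()` builds for sparse retrieval can be pictured with a toy sketch (illustration only, not how Elasticsearch implements it):

```python
# Toy inverted index: map each term to the ids of the documents that
# contain it. Keyword retrievers like BM25/TF-IDF score candidates from
# such an index instead of scanning every document at query time.
from collections import defaultdict

def build_inverted_index(docs):
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

docs = {1: "Haystack question answering", 2: "question answering at scale"}
index = build_inverted_index(docs)
```

A query term then maps directly to its candidate documents, which is what keeps keyword query times low.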

### For Dense Retrievers
## Writing Documents (Dense Retrievers)

For **dense** neural network based retrievers like Dense Passage Retrieval, or Embedding Retrieval,
indexing involves computing the Document embeddings which will be compared against the Query embedding.
@@ -139,9 +123,9 @@ Having GPU acceleration will significantly speed this up.
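The two-phase indexing for dense retrieval — write the documents first, then compute and store their embeddings — can be sketched with a toy stand-in for the DocumentStore/Retriever pair (hypothetical classes, illustration only):

```python
# Toy two-phase dense indexing: documents are written first, then an
# embed function (standing in for a retriever's encoder) computes and
# attaches an embedding to each stored document.

class ToyDocumentStore:
    def __init__(self):
        self.docs = []

    def write_documents(self, dicts):
        self.docs.extend(dicts)

    def update_embeddings(self, embed):
        for doc in self.docs:
            doc["embedding"] = embed(doc["text"])

def embed(text):
    # stand-in encoder: a 2-dim "embedding" from simple word statistics
    words = text.split()
    return [len(words), sum(len(w) for w in words)]

store = ToyDocumentStore()
store.write_documents([{"text": "hello world"}, {"text": "dense passage retrieval"}])
store.update_embeddings(embed)
```

In the real library the second phase is the expensive one, since every document must pass through the neural encoder.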

<!-- _comment: !! Diagrams of inverted index / document embeds !! -->
<!-- _comment: !! Make this a tab element to show how different datastores are initialized !! -->
## Choosing the right document store
## Choosing the Right Document Store

The Document stores have different characteristics. You should choose one depending on the maturity of your project, the use case and technical environment:
The Document Stores have different characteristics. You should choose one depending on the maturity of your project, the use case and technical environment:

<div class="tabs tabsdschoose">

@@ -213,7 +197,7 @@ The Document stores have different characteristics. You should choose one depend

</div>

#### Our recommendations
#### Our Recommendations

**Restricted environment:** Use the `InMemoryDocumentStore`, if you are just giving Haystack a quick try on a small sample and are working in a restricted environment that complicates running Elasticsearch or other databases
