Update Documentation for Haystack 0.5.0 (#557)
* Add languages and preprocessing pages

* add content

* address review comments

* make link relative

* update api ref with latest docstrings

* move doc readme and update

* add generator API docs

* fix example code

* design and link fix

Co-authored-by: Malte Pietsch <[email protected]>
Co-authored-by: PiffPaffM <[email protected]>
3 people authored Nov 6, 2020
1 parent f94603c commit 99e924a
Showing 13 changed files with 476 additions and 207 deletions.
33 changes: 15 additions & 18 deletions docs/_src/api/api/README.md → docs/README.md
@@ -1,36 +1,33 @@
*******************************************************
# Haystack — Docstrings Generation
*******************************************************
# :ledger: Looking for the docs?
You find them here:
#### https://haystack.deepset.ai/docs/intromd


We use Pydoc-Markdown to create markdown files from the docstrings in our code.
# :computer: How to update docs?

## Usage / Guides etc.

Will be automatically deployed with every commit to the master branch.

Update docs with all latest docstrings?
=======================================
## API Reference

We use Pydoc-Markdown to create markdown files from the docstrings in our code.

### Update docstrings
Execute this in `/haystack/docs/_src/api/api`:
```shell
pip install 'pydoc-markdown>=3.0.0,<4.0.0'
pydoc-markdown pydoc-markdown-document-store.yml
pydoc-markdown pydoc-markdown-file-converters.yml
pydoc-markdown pydoc-markdown-preprocessor.yml
pydoc-markdown pydoc-markdown-reader.yml
pydoc-markdown pydoc-markdown-generator.yml
pydoc-markdown pydoc-markdown-retriever.yml
```

Update Docstrings of individual modules
==========================================

Every .yml file will generate a new markdown file. Run one of the following commands to generate the needed output:

- **Document store**: `pydoc-markdown pydoc-markdown-document-store.yml`
- **File converters**: `pydoc-markdown pydoc-markdown-file-converters.yml`
- **Preprocessor**: `pydoc-markdown pydoc-markdown-preprocessor.yml`
- **Reader**: `pydoc-markdown pydoc-markdown-reader.yml`
- **Retriever**: `pydoc-markdown pydoc-markdown-retriever.yml`
(Or run one of the commands above to update the docstrings only for a single module)

Configuration
============
### Configuration

Pydoc will read the configuration from a `.yml` file which is located in the current working directory. Our files contain three main sections:

12 changes: 9 additions & 3 deletions docs/_src/api/api/document_store.md
@@ -110,7 +110,7 @@ the vector embeddings are indexed in a FAISS Index.
#### \_\_init\_\_

```python
| __init__(sql_url: str = "sqlite:///", index_buffer_size: int = 10_000, vector_dim: int = 768, faiss_index_factory_str: str = "Flat", faiss_index: Optional[faiss.swigfaiss.Index] = None, **kwargs)
| __init__(sql_url: str = "sqlite:///", index_buffer_size: int = 10_000, vector_dim: int = 768, faiss_index_factory_str: str = "Flat", faiss_index: Optional[faiss.swigfaiss.Index] = None, return_embedding: Optional[bool] = True, **kwargs)
```

**Arguments**:
@@ -137,6 +137,7 @@ For more details see:
Benchmarks: XXX
- `faiss_index`: Pass an existing FAISS Index, i.e. an empty one that you configured manually
or one with docs that you used in Haystack before and want to load again.
- `return_embedding`: Whether to return the document embeddings with the results

<a name="faiss.FAISSDocumentStore.write_documents"></a>
#### write\_documents
@@ -200,7 +201,7 @@ None
#### query\_by\_embedding

```python
| query_by_embedding(query_emb: np.array, filters: Optional[dict] = None, top_k: int = 10, index: Optional[str] = None) -> List[Document]
| query_by_embedding(query_emb: np.array, filters: Optional[dict] = None, top_k: int = 10, index: Optional[str] = None, return_embedding: Optional[bool] = None) -> List[Document]
```

Find the document that is most similar to the provided `query_emb` by using a vector similarity metric.
@@ -212,6 +213,7 @@
Example: {"name": ["some", "more"], "category": ["only_one"]}
- `top_k`: How many documents to return
- `index`: (SQL) index name for storing the docs and metadata
- `return_embedding`: Whether to return the document embeddings with the results

**Returns**:

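The similarity lookup behind `query_by_embedding` can be pictured with a short plain-Python sketch (a toy stand-in for illustration only — the real store delegates this to a FAISS index):

```python
# Toy sketch of vector similarity search: score each stored document
# embedding against the query embedding with a dot product and return
# the top_k highest-scoring documents.

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def query_by_embedding_sketch(query_emb, docs, top_k=10):
    # docs: list of {"text": ..., "embedding": [...]} dicts
    scored = sorted(docs, key=lambda d: dot(query_emb, d["embedding"]), reverse=True)
    return scored[:top_k]

docs = [
    {"text": "doc a", "embedding": [1.0, 0.0]},
    {"text": "doc b", "embedding": [0.0, 1.0]},
    {"text": "doc c", "embedding": [0.7, 0.7]},
]
results = query_by_embedding_sketch([1.0, 0.2], docs, top_k=2)
```

With the query embedding above, "doc a" scores highest (1.0) and "doc c" second (0.84), so those two are returned.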
@@ -271,7 +273,7 @@ class ElasticsearchDocumentStore(BaseDocumentStore)
#### \_\_init\_\_

```python
| __init__(host: str = "localhost", port: int = 9200, username: str = "", password: str = "", index: str = "document", label_index: str = "label", search_fields: Union[str, list] = "text", text_field: str = "text", name_field: str = "name", embedding_field: str = "embedding", embedding_dim: int = 768, custom_mapping: Optional[dict] = None, excluded_meta_data: Optional[list] = None, faq_question_field: Optional[str] = None, scheme: str = "http", ca_certs: bool = False, verify_certs: bool = True, create_index: bool = True, update_existing_documents: bool = False, refresh_type: str = "wait_for", similarity="dot_product", timeout=30)
| __init__(host: str = "localhost", port: int = 9200, username: str = "", password: str = "", index: str = "document", label_index: str = "label", search_fields: Union[str, list] = "text", text_field: str = "text", name_field: str = "name", embedding_field: str = "embedding", embedding_dim: int = 768, custom_mapping: Optional[dict] = None, excluded_meta_data: Optional[list] = None, faq_question_field: Optional[str] = None, analyzer: str = "standard", scheme: str = "http", ca_certs: bool = False, verify_certs: bool = True, create_index: bool = True, update_existing_documents: bool = False, refresh_type: str = "wait_for", similarity="dot_product", timeout=30, return_embedding: Optional[bool] = True)
```

A DocumentStore using Elasticsearch to store and query the documents for our search.
@@ -294,6 +296,9 @@ If no Reader is used (e.g. in FAQ-Style QA) the plain content of this field will
- `embedding_field`: Name of field containing an embedding vector (Only needed when using a dense retriever (e.g. DensePassageRetriever, EmbeddingRetriever) on top)
- `embedding_dim`: Dimensionality of embedding vector (Only needed when using a dense retriever (e.g. DensePassageRetriever, EmbeddingRetriever) on top)
- `custom_mapping`: If you want to use your own custom mapping for creating a new index in Elasticsearch, you can supply it here as a dictionary.
- `analyzer`: Specify the default analyzer from one of the built-ins when creating a new Elasticsearch Index.
Elasticsearch also has built-in analyzers for different languages (e.g. impacting tokenization). More info at:
https://www.elastic.co/guide/en/elasticsearch/reference/7.9/analysis-analyzers.html
- `excluded_meta_data`: Name of fields in Elasticsearch that should not be returned (e.g. [field_one, field_two]).
Helpful if you have fields with long, irrelevant content that you don't want to display in results (e.g. embedding vectors).
- `scheme`: 'https' or 'http', protocol used to connect to your elasticsearch instance
@@ -312,6 +317,7 @@ More info at https://www.elastic.co/guide/en/elasticsearch/reference/6.8/docs-re
- `similarity`: The similarity function used to compare document vectors. 'dot_product' is the default since it is
more performant with DPR embeddings. 'cosine' is recommended if you are using a Sentence BERT model.
- `timeout`: Number of seconds after which an ElasticSearch request times out.
- `return_embedding`: Whether to return the document embeddings with the results

<a name="elasticsearch.ElasticsearchDocumentStore.write_documents"></a>
#### write\_documents
23 changes: 12 additions & 11 deletions docs/_src/api/api/preprocessor.md
@@ -5,7 +5,7 @@
#### eval\_data\_from\_file

```python
eval_data_from_file(filename: str) -> Tuple[List[Document], List[Label]]
eval_data_from_file(filename: str, max_docs: Union[int, bool] = None) -> Tuple[List[Document], List[Label]]
```

Read Documents + Labels from a SQuAD-style file.
@@ -14,6 +14,7 @@ Document and Labels can then be indexed to the DocumentStore and be used for eva
**Arguments**:

- `filename`: Path to file in SQuAD format
- `max_docs`: Maximum number of documents to load. Defaults to None, which reads in all available eval documents.

**Returns**:

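The SQuAD-style layout this function reads can be illustrated with a minimal parser sketch (a hypothetical helper, not the Haystack implementation):

```python
# Minimal sketch of walking a SQuAD-style dict: each entry in "data"
# holds paragraphs, each paragraph has a context (the document text)
# and question/answer pairs (the labels).

def squad_to_docs_and_labels(squad):
    docs, labels = [], []
    for article in squad["data"]:
        for para in article["paragraphs"]:
            docs.append(para["context"])
            for qa in para["qas"]:
                for ans in qa["answers"]:
                    labels.append((qa["question"], ans["text"]))
    return docs, labels

sample = {
    "data": [{
        "title": "Example",
        "paragraphs": [{
            "context": "Haystack is a QA framework.",
            "qas": [{
                "question": "What is Haystack?",
                "answers": [{"text": "a QA framework", "answer_start": 12}],
            }],
        }],
    }]
}
docs, labels = squad_to_docs_and_labels(sample)
```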
@@ -97,27 +98,27 @@ class PreProcessor(BasePreProcessor)
#### \_\_init\_\_

```python
| __init__(clean_whitespace: Optional[bool] = True, clean_header_footer: Optional[bool] = False, clean_empty_lines: Optional[bool] = True, split_by: Optional[str] = "passage", split_length: Optional[int] = 10, split_stride: Optional[int] = None, split_respect_sentence_boundary: Optional[bool] = False)
| __init__(clean_whitespace: Optional[bool] = True, clean_header_footer: Optional[bool] = False, clean_empty_lines: Optional[bool] = True, split_by: Optional[str] = "word", split_length: Optional[int] = 1000, split_stride: Optional[int] = None, split_respect_sentence_boundary: Optional[bool] = True)
```

**Arguments**:

- `clean_header_footer`: use heuristic to remove footers and headers across different pages by searching
- `clean_header_footer`: Use heuristic to remove footers and headers across different pages by searching
for the longest common string. This heuristic uses exact matches and therefore
works well for footers like "Copyright 2019 by XXX", but won't detect "Page 3 of 4"
or similar.
- `clean_whitespace`: strip whitespaces before or after each line in the text.
- `clean_empty_lines`: remove more than two empty lines in the text.
- `split_by`: split the document by "word", "sentence", or "passage". Set to None to disable splitting.
- `split_length`: n number of splits to merge as a single document. For instance, if n -> 10 & split_by ->
- `clean_whitespace`: Strip whitespaces before or after each line in the text.
- `clean_empty_lines`: Remove more than two empty lines in the text.
- `split_by`: Unit for splitting the document. Can be "word", "sentence", or "passage". Set to None to disable splitting.
- `split_length`: Max. number of the above split unit (e.g. words) that are allowed in one document. For instance, if n -> 10 & split_by ->
"sentence", then each output document will have 10 sentences.
- `split_stride`: length of striding window over the splits. For example, if split_by -> `word`,
- `split_stride`: Length of striding window over the splits. For example, if split_by -> `word`,
split_length -> 5 & split_stride -> 2, then the splits would be like:
[w1 w2 w3 w4 w5, w4 w5 w6 w7 w8, w7 w8 w9 w10 w11].
Set the value to None to disable striding behaviour.
- `split_respect_sentence_boundary`: whether to split in partial sentences when if split_by -> `word`. If set
to True, the individual split would always have complete sentence &
the number of words being less than or equal to the split_length.
- `split_respect_sentence_boundary`: Whether to split in partial sentences if split_by -> `word`. If set
to True, the individual split will always have complete sentences &
the number of words will be <= split_length.
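The interplay of `split_length` and `split_stride` described above can be sketched as follows (an illustrative toy, not the PreProcessor implementation):

```python
# Sliding-window split: windows of split_length units that overlap by
# split_stride units, i.e. each window starts (split_length - split_stride)
# units after the previous one. The final window may be shorter.

def split_with_stride(units, split_length, split_stride):
    step = split_length - split_stride
    return [units[i:i + split_length] for i in range(0, len(units), step)]

words = [f"w{i}" for i in range(1, 12)]  # w1 .. w11
splits = split_with_stride(words, split_length=5, split_stride=2)
```

This reproduces the example from the docstring: the first three windows are `w1..w5`, `w4..w8`, and `w7..w11`.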

<a name="base"></a>
# base
14 changes: 14 additions & 0 deletions docs/_src/api/api/pydoc-markdown-generator.yml
@@ -0,0 +1,14 @@
loaders:
  - type: python
    search_path: [../../../../haystack/generator]
    ignore_when_discovered: ['__init__']
processor:
  - type: filter
    expression: not name.startswith('_') and default()
    documented_only: true
    do_not_filter_modules: false
    skip_empty_modules: true
renderer:
  type: markdown
  descriptive_class_title: false
  filename: generator.md
4 changes: 3 additions & 1 deletion docs/_src/api/api/reader.md
@@ -314,7 +314,7 @@ See https://huggingface.co/models for full list of available QA models

**Arguments**:

- `model_name_or_path`: Directory of a saved model or the name of a public model e.g. 'bert-base-cased', 'deepset/bert-base-cased-squad2', 'deepset/bert-base-cased-squad2', 'distilbert-base-uncased-distilled-squad'. See https://huggingface.co/models for full list of available models.
- `model_name_or_path`: Directory of a saved model or the name of a public model e.g. 'bert-base-cased',
'deepset/bert-base-cased-squad2', 'distilbert-base-uncased-distilled-squad'.
See https://huggingface.co/models for full list of available models.
- `tokenizer`: Name of the tokenizer (usually the same as model)
- `context_window_size`: Num of chars (before and after the answer) to return as "context" for each answer.
The context usually helps users to understand if the answer really makes sense.
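The `context_window_size` behaviour can be sketched with a hypothetical helper (illustration only, not the FARM/Haystack code):

```python
# Return a window of roughly window_size characters centred on the
# answer span, clipped to the document boundaries.

def context_window(text, answer_start, answer_len, window_size):
    margin = max((window_size - answer_len) // 2, 0)
    start = max(answer_start - margin, 0)
    end = min(answer_start + answer_len + margin, len(text))
    return text[start:end]

text = "Paris is the capital and most populous city of France."
ctx = context_window(text, text.index("capital"), len("capital"), 30)
```

The returned snippet always contains the answer and is at most `window_size` characters long.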
8 changes: 5 additions & 3 deletions docs/_src/api/api/retriever.md
@@ -94,7 +94,7 @@ Karpukhin, Vladimir, et al. (2020): "Dense Passage Retrieval for Open-Domain Que
#### \_\_init\_\_

```python
| __init__(document_store: BaseDocumentStore, query_embedding_model: str = "facebook/dpr-question_encoder-single-nq-base", passage_embedding_model: str = "facebook/dpr-ctx_encoder-single-nq-base", max_seq_len_query: int = 64, max_seq_len_passage: int = 256, use_gpu: bool = True, batch_size: int = 16, embed_title: bool = True, use_fast_tokenizers: bool = True, similarity_function: str = "dot_product")
| __init__(document_store: BaseDocumentStore, query_embedding_model: Union[Path, str] = "facebook/dpr-question_encoder-single-nq-base", passage_embedding_model: Union[Path, str] = "facebook/dpr-ctx_encoder-single-nq-base", max_seq_len_query: int = 64, max_seq_len_passage: int = 256, use_gpu: bool = True, batch_size: int = 16, embed_title: bool = True, use_fast_tokenizers: bool = True, similarity_function: str = "dot_product")
```

Init the Retriever incl. the two encoder models from a local or remote model checkpoint.
@@ -327,8 +327,10 @@ position in the ranking of documents the correct document is.
| Returns a dict containing the following metrics:

- "recall": Proportion of questions for which correct document is among retrieved documents
- "mean avg precision": Mean of average precision for each question. Rewards retrievers that give relevant
documents a higher rank.
- "mrr": Mean of reciprocal rank. Rewards retrievers that give relevant documents a higher rank.
Only considers the highest ranked relevant document.
- "map": Mean of average precision for each question. Rewards retrievers that give relevant
documents a higher rank. Considers all retrieved relevant documents. (only with ``open_domain=False``)
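The recall and MRR definitions above can be sketched with a toy computation (illustration only, not the Haystack eval code):

```python
# Toy retriever eval over ranked result lists.
# Each entry: (ranked list of retrieved doc ids, id of the correct doc).

def recall_and_mrr(results):
    hits, rr = 0, 0.0
    for ranked_ids, correct_id in results:
        if correct_id in ranked_ids:
            hits += 1
            # reciprocal rank of the (first) correct document
            rr += 1.0 / (ranked_ids.index(correct_id) + 1)
    n = len(results)
    return {"recall": hits / n, "mrr": rr / n}

results = [
    (["d1", "d2", "d3"], "d1"),  # correct doc at rank 1 -> RR = 1.0
    (["d4", "d5", "d6"], "d5"),  # correct doc at rank 2 -> RR = 0.5
    (["d7", "d8", "d9"], "dX"),  # correct doc not retrieved -> RR = 0
]
metrics = recall_and_mrr(results)
```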

**Arguments**:

@@ -1,24 +1,24 @@
<!---
title: "Database"
title: "Document Store"
metaTitle: "Document Store"
metaDescription: ""
slug: "/docs/database"
slug: "/docs/documentstore"
date: "2020-09-03"
id: "databasemd"
id: "documentstoremd"
--->


# Document Stores
# DocumentStores

You can think of the Document Store as a "database" that:
You can think of the DocumentStore as a "database" that:
- stores your texts and meta data
- provides them to the retriever at query time

There are different DocumentStores in Haystack to fit different use cases and tech stacks.

## Initialisation

Initialising a new Document Store is straightforward.
Initialising a new DocumentStore is straightforward.

<div class="tabs tabsdsinstall">

@@ -75,10 +75,13 @@ document_store = SQLDocumentStore()
Each DocumentStore constructor allows for arguments specifying how to connect to existing databases and the names of indexes.
See API documentation for more info.

## Preparing Documents
## Input Format

DocumentStores expect Documents in dictionary form, like the example below.
They are loaded using the `DocumentStore.write_documents()` method.
See [Preprocessing](/docs/latest/preprocessingmd) for more information on how to best prepare your data.

[//]: # (Add link to preprocessing section)

```python
document_store = ElasticsearchDocumentStore()
@@ -91,28 +94,9 @@ dicts = [
document_store.write_documents(dicts)
```

## File Conversion

There are a range of different file converters in Haystack that can help get your data into the right format.
Haystack features support for txt, pdf and docx formats and there is even a converter that leverages Apache Tika.
See the File Converters section in the API docs for more information.

<!-- _comment: !! Code snippets for each type !! -->
Haystack also has a `convert_files_to_dicts()` utility function that will convert
all txt or pdf files in a given folder into this dictionary format.

```python
document_store = ElasticsearchDocumentStore()
dicts = convert_files_to_dicts(dir_path=doc_dir)
document_store.write_documents(dicts)
```

## Writing Documents
## Writing Documents (Sparse Retrievers)

Haystack allows you to store documents in an optimised fashion so that query times can be kept low.

### For Sparse Retrievers

For **sparse**, keyword based retrievers such as BM25 and TF-IDF,
you simply have to call `DocumentStore.write_documents()`.
The creation of the inverted index which optimises querying speed is handled automatically.
@@ -121,7 +105,7 @@
document_store.write_documents(dicts)
```
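The inverted index that `write_documents()` builds for sparse retrieval can be pictured with a toy sketch (illustration only, not how Elasticsearch implements it):

```python
# Toy inverted index: map each term to the ids of the documents that
# contain it. Keyword retrievers like BM25/TF-IDF score candidates from
# such an index instead of scanning every document at query time.
from collections import defaultdict

def build_inverted_index(docs):
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

docs = {1: "Haystack question answering", 2: "question answering at scale"}
index = build_inverted_index(docs)
```

A query term then maps directly to its candidate documents, which is what keeps keyword query times low.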

### For Dense Retrievers
## Writing Documents (Dense Retrievers)

For **dense** neural network based retrievers like Dense Passage Retrieval, or Embedding Retrieval,
indexing involves computing the Document embeddings which will be compared against the Query embedding.
@@ -139,9 +123,9 @@ Having GPU acceleration will significantly speed this up.
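The two-phase indexing for dense retrieval — write the documents first, then compute and store their embeddings — can be sketched with a toy stand-in for the DocumentStore/Retriever pair (hypothetical classes, illustration only):

```python
# Toy two-phase dense indexing: documents are written first, then an
# embed function (standing in for a retriever's encoder) computes and
# attaches an embedding to each stored document.

class ToyDocumentStore:
    def __init__(self):
        self.docs = []

    def write_documents(self, dicts):
        self.docs.extend(dicts)

    def update_embeddings(self, embed):
        for doc in self.docs:
            doc["embedding"] = embed(doc["text"])

def embed(text):
    # stand-in encoder: a 2-dim "embedding" from simple word statistics
    words = text.split()
    return [len(words), sum(len(w) for w in words)]

store = ToyDocumentStore()
store.write_documents([{"text": "hello world"}, {"text": "dense passage retrieval"}])
store.update_embeddings(embed)
```

In the real library the second phase is the expensive one, since every document must pass through the neural encoder.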

<!-- _comment: !! Diagrams of inverted index / document embeds !! -->
<!-- _comment: !! Make this a tab element to show how different datastores are initialized !! -->
## Choosing the right document store
## Choosing the Right Document Store

The Document stores have different characteristics. You should choose one depending on the maturity of your project, the use case and technical environment:
The Document Stores have different characteristics. You should choose one depending on the maturity of your project, the use case and technical environment:

<div class="tabs tabsdschoose">

@@ -213,7 +197,7 @@ The Document stores have different characteristics. You should choose one depend

</div>

#### Our recommendations
#### Our Recommendations

**Restricted environment:** Use the `InMemoryDocumentStore`, if you are just giving Haystack a quick try on a small sample and are working in a restricted environment that complicates running Elasticsearch or other databases
