v0.9.0
⭐ Highlights
Long-Form Question Answering (LFQA)
Haystack now provides LFQA with a Seq2SeqGenerator for generative QA and a Retribert Retriever thanks to community member @vblagoje. #1086
If you want to ask questions whose answer is not a short phrase stated explicitly in one of the documents but a more elaborate answer, then LFQA is interesting for you. These elaborate answers are generated by combining information from multiple relevant documents.
Document Re-Ranking
For pure "semantic document search" use cases that do not need question answering functionality but only document ranking, there is now a new type of node: Ranker. While the Retriever is a perfect fit for document retrieval, we can further improve its results with the Ranker. #1025
To this end, the Ranker uses a pre-trained model to calculate the semantic similarity of the question and each of the top-k retrieved documents. Documents with a high semantic similarity are ranked higher. The combination of a Retriever and a Ranker is especially powerful if you combine a sparse retriever, e.g., ElasticsearchRetriever based on BM25, with a dense Ranker.
A pipeline with a Retriever and a Ranker can be set up in just a few lines of code:
...
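from haystack import Pipeline
from haystack.ranker import FARMRanker
from haystack.retriever.sparse import ElasticsearchRetriever
# Sparse BM25 retrieval first, then re-ranking of the retrieved documents
retriever = ElasticsearchRetriever(document_store=document_store)
ranker = FARMRanker(model_name_or_path="deepset/gbert-base-germandpr-reranking")
p = Pipeline()
p.add_node(component=retriever, name="ESRetriever", inputs=["Query"])
p.add_node(component=ranker, name="Ranker", inputs=["ESRetriever"])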
...
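To make the retrieve-then-rerank idea more concrete, here is a minimal, library-free sketch. It does not use Haystack's classes: `sparse_score` is a toy keyword-overlap score standing in for BM25, and `trigram_similarity` is a hypothetical stand-in for the pre-trained model a real Ranker would use. All function names are assumptions made for illustration.

```python
def sparse_score(query, doc):
    """Toy keyword-overlap score; a stand-in for BM25, not BM25 itself."""
    query_terms = set(query.lower().split())
    return sum(1 for term in doc.lower().split() if term in query_terms)


def retrieve(query, docs, top_k):
    """First stage: cheap scoring over all documents, keep the top_k."""
    ranked = sorted(docs, key=lambda d: sparse_score(query, d), reverse=True)
    return ranked[:top_k]


def trigram_similarity(a, b):
    """Jaccard overlap of character trigrams; a placeholder for a learned model."""
    def grams(s):
        s = s.lower()
        return {s[i:i + 3] for i in range(max(len(s) - 2, 1))}

    ga, gb = grams(a), grams(b)
    return len(ga & gb) / len(ga | gb)


def rerank(query, candidates, similarity):
    """Second stage: re-order only the retrieved candidates by similarity."""
    return sorted(candidates, key=lambda d: similarity(query, d), reverse=True)


docs = [
    "Berlin is the capital of Germany",
    "Germany borders nine countries",
    "Paris is the capital of France",
]
candidates = retrieve("capital of Germany", docs, top_k=2)
print(rerank("capital of Germany", candidates, trigram_similarity))
```

The point of the two stages is cost: the expensive similarity function only runs on the `top_k` candidates, not on the whole corpus.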
Weaviate
Thanks to a contribution by our community member @venuraja79, Weaviate is now integrated into Haystack as another DocumentStore. #1064
It allows a combination of vector search and scalar filtering, i.e., you can filter for a certain tag and do dense retrieval on that subset. After starting a Weaviate server with Docker, it's as simple as:
from haystack.document_store import WeaviateDocumentStore
document_store = WeaviateDocumentStore()
Haystack uses the most recent Weaviate version, 1.4.0, and updating embeddings has also been optimized. #1181
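The combination of scalar filtering and dense retrieval mentioned above can be sketched without any server. The snippet below is a conceptual illustration only and not Weaviate's or Haystack's implementation: the function name `filtered_vector_search`, the document layout, and the filter format (a dict mapping a metadata field to a list of allowed values) are assumptions chosen for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def filtered_vector_search(query_vec, docs, filters, top_k=3):
    """Keep documents whose metadata matches every filter, then rank by similarity."""
    subset = [
        d for d in docs
        if all(d["meta"].get(field) in allowed for field, allowed in filters.items())
    ]
    ranked = sorted(subset, key=lambda d: cosine(query_vec, d["embedding"]), reverse=True)
    return ranked[:top_k]


docs = [
    {"text": "A", "meta": {"tag": "news"}, "embedding": [1.0, 0.0]},
    {"text": "B", "meta": {"tag": "blog"}, "embedding": [1.0, 0.0]},
    {"text": "C", "meta": {"tag": "news"}, "embedding": [0.0, 1.0]},
]
hits = filtered_vector_search([1.0, 0.1], docs, filters={"tag": ["news"]}, top_k=2)
print([d["text"] for d in hits])
```

Document "B" has the same embedding as "A" but is excluded before any similarity is computed, which is the behavior the filter is meant to provide.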
Query Classifier
Some search applications need to distinguish between incoming keyword queries and longer textual questions. To maximize the accuracy of results and minimize computation effort and cost, you may want to route only the longer questions to the Reader branch and send keyword queries to a Document Retriever. You can now do that with a QueryClassifier node, thanks to a contribution by @shahrukhx01. #1099
You could use it as shown in this example pipeline:
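As a rough, library-free illustration of the routing idea, consider the sketch below. It is not Haystack's QueryClassifier, which ships with trained baseline models (#1099); the heuristic in `classify_query`, the `route` helper, and the two branch functions are hypothetical stand-ins used only to show how a query is sent down one of two branches.

```python
QUESTION_WORDS = {"who", "what", "when", "where", "why", "how", "which"}


def classify_query(query):
    """Hypothetical heuristic: a trained classifier would replace this."""
    tokens = query.lower().split()
    if query.strip().endswith("?") or (tokens and tokens[0] in QUESTION_WORDS):
        return "question"
    return "keyword"


def route(query, keyword_branch, question_branch):
    """Send the query to exactly one branch, depending on its class."""
    if classify_query(query) == "question":
        return question_branch(query)
    return keyword_branch(query)


def keyword_branch(query):
    # Placeholder for a cheap document retrieval step
    return ("retriever_only", query)


def question_branch(query):
    # Placeholder for the more expensive retriever + reader step
    return ("retriever_and_reader", query)


for q in ["arya stark father", "Who is the father of Arya Stark?"]:
    print(route(q, keyword_branch, question_branch))
```

Only queries classified as questions reach the expensive branch, which is where the savings in computation come from.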
New Tutorials
⚠️ Breaking Changes
- Remove Python 3.6 support #1059
- Refactor REST APIs to use Pipelines #922
- Bump to FARM 0.8.0, torch 1.8.1 and transformers 4.6.1 #1192
🤓 Detailed Changes
Connector
- Add crawler to get texts from websites #775
Preprocessor
- Add white space normalization warning #1022
- Preserve whitespace during PreProcessor.split() #1121
- Fix equality check in preprocessor #969
Pipeline
- Add validation for root node in Pipeline #987
- Fix passing a list as parameter value in Pipeline YAML #952
- Add export of Pipeline YAML config #1003
- Add config to JoinDocuments node to allow yaml export in pipelines #1134
Document Stores
- Integrate Weaviate as another DocumentStore #957 #1064
- Add OpenDistro init #1101
- Rename all document stores' delete_all_documents() method to delete_documents() #1047
- Fix Elasticsearch connection for non-admin users #1028
- Fix update_embeddings() for FAISSDocumentStore #978
- Feature: Enable AWS Elasticsearch IAM connection #965
- Fix optional FAISS import #971
- Make FAISS import conditional #970
- Benchmark milvus #850
- Improve Milvus HNSW Performance #1127
- Update Milvus benchmarks #1128
- Upgrade milvus to 1.1.0 #1066
- Update tests for FAISSDocumentStore #999
- Add L2 support for FAISS HNSW #1138
- Improve the speed of FAISSDocumentStore.delete_documents() #1095
- Add options for handling duplicate documents (skip, fail, overwrite) #1088
- Update Embeddings - Use update instead of replace #1181
- Improve the progress bar in update_embeddings() + Fix filters in update_embeddings() #1063
- Using text hash as id to prevent document duplication #1000
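Two of the items above, the duplicate-handling options (#1088) and the use of a text hash as ID (#1000), fit together and can be sketched in a few lines. This is a conceptual illustration only, not the code from those pull requests: the dict-based store, the `text_id` and `write_documents` helpers, the parameter name `duplicate_documents`, and the choice of SHA-256 are all assumptions made for the example.

```python
import hashlib


class DuplicateDocumentError(Exception):
    """Raised when a duplicate is written and the policy is 'fail'."""


def text_id(text):
    """Derive a deterministic ID from the document text (hash choice is an assumption)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def write_documents(store, docs, duplicate_documents="overwrite"):
    """Write docs into a dict-based store, applying the chosen duplicate policy."""
    if duplicate_documents not in ("skip", "overwrite", "fail"):
        raise ValueError(f"Unknown policy: {duplicate_documents}")
    for doc in docs:
        doc_id = text_id(doc["text"])
        if doc_id in store:
            if duplicate_documents == "skip":
                continue  # keep the version that is already stored
            if duplicate_documents == "fail":
                raise DuplicateDocumentError(doc_id)
        store[doc_id] = doc  # new document, or overwrite of an existing one
    return store


store = {}
write_documents(store, [{"text": "hello", "meta": {"v": 1}}])
write_documents(store, [{"text": "hello", "meta": {"v": 2}}], duplicate_documents="skip")
print(len(store), store[text_id("hello")]["meta"])
```

Because the ID depends only on the text, writing the same text twice maps to the same key, so the policy decides what happens instead of a second copy being stored.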
Retriever
- DPR Training parameter #989
- Removed single_model_path; added infer_tokenizer to dpr load() #1060
- Integrate sentence transformers into benchmarks #843
- added use_amp to the train method, in order to use mixed precision training #1048
Ranker
- Re-ranking component for document search without QA #1025
- Remove quickfix from reader and ranker #1196
- Distinguish labels for calculating similarity scores #1124
Query Classifier
- Fix typo in Query Classifier Exception Message #1190
- Add QueryClassifier incl. baseline models #1099
Reader
- Filtering duplicate answers #1021
- Add ONNXRuntime support #157
- Remove unused function _get_pseudo_prob #1201
Generator
- Integrate LFQA with Haystack - inferencing #1086
Evaluation Nodes
- Reduce precision in pipeline eval print functions #943
- Fix division by zero error in EvalRetriever #938
- Add evaluation nodes for Pipelines #904
- Add More top_k handling to EvalDocuments #1133
- Prevent merge of same questions on different documents during evaluation #1119
REST API
- adding root_path option #982
- Add PDF converter dependencies Docker #1107
- Disable Gunicorn preload option #960
User Interface
- change file-upload response to sidebar #1018
- Add File Upload Functionality in UI #995
- Streamlit UI Evaluation mode #920
- Fix evaluation mode in UI #1024
- Fix typo in streamlit UI #1106
Documentation and Tutorials
- Add about sections to Tutorial 12 #1195
- Tutorial update #1166
- Documentation update #1162
- Add FAQ page #1151
- Refresh API docs #1152
- Add docu of confidence scores and calibration method #1131
- Adding indentation to markup files #947
- Update preprocessing.md #1087
- Add badges to readme #1136
- Regen api docs #1015
- Docs: Add usage information details for AWS Elasticsearch Service #1008
- Add tutorial pages #1013
- Pipelines tutorial #991
- knowledge graph documentation #979
- knowledge graph example #934
- Add Milvus to the retriever / document store table #931
- New docs version #964
- Update Documentation #976
- update api markdown files and add markdown file for ranker #1198
- Reformat FAQ page #1177
- Minor change with a link to the Weaviate docs #1180
- Add links to GitHub Discussion and SO #984
- Update milvus links and docstrings #959
- Fixed link to dpr #962
- Removed comma from last item in json list #1114
- Fixing inconsistency #926
Misc
- Squad tools #1029
- Bugfix setting of device by defaulting to "cpu" #1182
- Fixing issues caused due to mypy upgrade #1165
- Remove Duplicate Benchmark Run #1132
- Fixing grpcio-tools to version of colab's pre-installed grpcio #1113
- Update farm version #936
🙏 Big thanks to all contributors! ❤️
A big thank you to all the contributors for this release: @PiffPaffM @oryx1729 @jacksbox @guillim @Timoeller @aantti @tholor @brandenchan @julian-risch @bhadreshpsavani @akkefa @mosheber @lalitpagaria @Avi777 @MichaelBitard @AlviseSembenico @shahrukhx01 @venuraja79 @bobvanluijt @vblagoje @cvgoudar
We would like to thank everyone who participated in the insightful discussions on GitHub and our community Slack!