Top K retrieval with bias for recent docs #33251
All documents that match the query will be ranked by the first-phase ranking function, which can express any recency bias as a mathematical expression. However, as an optimization, some query operators do their own internal scoring to determine which documents they match, and there a recency bias may lead either to better results (by not filtering out documents that would be scored higher by a recency-biased first-phase function) or to better performance (by matching fewer documents that would be scored low by it).

One such query operator is weakAnd. Vespa's weakAnd will (in contrast to standard wand) match all documents that score the same. This ensures that a lack of recency bias inside weakAnd will not degrade result quality. For example, given a query for "Joe Biden" in a news corpus, all documents containing both those terms will be matched, which ensures that the final result set will contain all the newest documents containing "Joe Biden" if a recency-biased rank function like the above is used. However, a recency bias could still improve efficiency by exposing fewer of these matches to the first-phase function.

Vespa also provides a feature specifically for adaptively limiting matches by some attribute: match-phase. This might improve efficiency further in this case by limiting the number of hits exposed to weakAnd.

Lastly, we recently introduced new tuning parameters for weakAnd which let it adaptively limit the number of matches based on term statistics: https://docs.vespa.ai/en/reference/schema-reference.html#weakand These can be used to further improve efficiency here. (A detailed article on this is coming shortly.)

In summary:
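The concrete expression referenced above was lost in extraction and can't be recovered from the thread. As a hedged sketch only (the field name `timestamp` and the weighting are assumptions, not the original example), a recency-biased first-phase function in a Vespa rank profile might look like:

```
rank-profile recency_biased {
    first-phase {
        # freshness(timestamp) is close to 1.0 for brand-new documents
        # and decays toward 0 as the document ages
        expression: nativeRank(title, body) + 10 * freshness(timestamp)
    }
}
```

Here the recency term is just another additive component of the first-phase expression, so all matched documents get the bias applied regardless of which operator matched them.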
Thanks for the explanations. A few questions below: Regarding the statement:
If I understand correctly, there could be many documents with the same match score, such that documents falling both inside and outside the top-k all score the same. In this scenario, Vespa's weakAnd will still consider all such docs in first-phase ranking regardless of what …

My next question is: why would there be so many documents with the exact same score? Likely because Vespa's weakAnd term-significance score does not depend on individual document-level statistics like term frequency? Rather, it is based on index-wide term statistics (similar to IDF), which don't change depending on the current document being scored. Is my understanding correct here?

The solutions you mention make sense for queries using Vespa's weakAnd. However, what would the solution be for queries using regular …
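The hypothesis in this question can be sketched in a few lines. Assuming (as the question does) that the operator's internal score sums only corpus-wide, IDF-like term significances, every document containing the same set of query terms receives an identical internal score, producing large tie groups. This is a simplification for illustration, not Vespa's actual weakAnd implementation:

```python
import math

# Hypothetical corpus-wide statistics (all numbers are made up)
N = 100_000_000                                   # corpus size
doc_freq = {"joe": 2_000_000, "biden": 1_500_000}  # documents per term

def significance(term):
    # IDF-like: depends only on index-wide statistics
    return math.log(N / doc_freq[term])

def internal_score(doc_terms, query_terms):
    # No per-document term frequency involved, so any two documents
    # containing the same subset of query terms score identically
    return sum(significance(t) for t in query_terms if t in doc_terms)

query = {"joe", "biden"}
doc_a = {"joe", "biden", "election"}
doc_b = {"joe", "biden", "speech"}
print(internal_score(doc_a, query) == internal_score(doc_b, query))  # ties
```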
Is your feature request related to a problem? Please describe.
I don't believe there is a way for the retrieval/matching phase to have a bias towards recency without using strict filters. Is there a way for the matching function to take into account both lexical match and recency?
As an example, imagine a News index that contains 100M docs and queries that may match a large % of them relative to other queries (e.g., queries with political figures that appear in the news often). For those types of queries that tend to lexically match many docs, I want to ensure that there is some bias towards recency so that later ranking phases actually see relevant documents (both textually relevant and recent) in the top K returned.
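The failure mode described here can be sketched as follows. This is a toy simulation under assumed numbers (100k matching docs, constant lexical score), showing that a pure-lexical top-K cutoff selects arbitrarily among ties while a retrieval-time recency term lets only recent documents survive the cutoff:

```python
import random

random.seed(0)
# All documents match "Joe Biden" equally well lexically
docs = [{"id": i, "lexical": 10.0, "age_days": random.uniform(0, 3650)}
        for i in range(100_000)]

K = 100
# Pure lexical top-K: ties broken arbitrarily, so the newest
# documents are unlikely to make it into the top K
top_lexical = sorted(docs, key=lambda d: d["lexical"], reverse=True)[:K]

# Retrieval-time recency bias (a simple decay, chosen for illustration)
def biased(d):
    return d["lexical"] + 5.0 / (1.0 + d["age_days"] / 30.0)

# With the bias, only recent documents survive the top-K cutoff
top_biased = sorted(docs, key=biased, reverse=True)[:K]
```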
Describe the solution you'd like
Is there a way to use the `rank` function where I boost lexical matches by a simple recency function? Ideally, this would be incorporated into weakAnd scoring to allow for efficient query execution. Lucene implemented something related some time ago: https://issues.apache.org/jira/browse/LUCENE-8340
Both Elasticsearch and OpenSearch also have the `distance_feature` query, which allows efficient use of a recency feature in top-K retrieval: https://www.elastic.co/blog/distance-feature-query-time-and-geo-in-elasticsearch-result-ranking
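For context, the `distance_feature` query scores documents as `boost * pivot / (pivot + distance)`, where distance is the document's age relative to an origin. The property that makes it prunable at retrieval time is that the score is bounded above by `boost` and decays monotonically with distance, sketched here:

```python
# Sketch of the distance_feature scoring shape:
# score = boost * pivot / (pivot + distance)
def distance_feature_score(age_seconds, pivot_seconds, boost=1.0):
    return boost * pivot_seconds / (pivot_seconds + age_seconds)

week = 7 * 24 * 3600
print(distance_feature_score(0, week))         # brand-new doc: full boost, 1.0
print(distance_feature_score(week, week))      # one pivot old: 0.5
print(distance_feature_score(10 * week, week)) # old docs decay toward 0
```

Because the score never exceeds `boost`, the engine can skip whole blocks of old documents once the top-K heap's minimum score exceeds the best possible score in that block.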