
Top K retrieval with bias for recent docs #33251

Open
arasraj opened this issue Feb 3, 2025 · 2 comments
arasraj commented Feb 3, 2025

Is your feature request related to a problem? Please describe.
I don't believe there is a way for the retrieval/matching phase to have a bias towards recency without using strict filters. Is there a way for the matching function to take into account both lexical match and recency?

As an example, imagine a News index that contains 100M docs and queries that may match a large % of them relative to other queries (e.g., queries with political figures that appear in the news often). For those types of queries that tend to lexically match many docs, I want to ensure that there is some bias towards recency so that later ranking phases actually see relevant documents (both textually relevant and recent) in the top K returned.

Describe the solution you'd like
Is there a way to use the rank function where I boost lexical matches by a simple recency function? Ideally, this would be incorporated into weakAnd scoring to allow for efficient query execution.

Lucene implemented something related some time ago: https://issues.apache.org/jira/browse/LUCENE-8340

Both Elasticsearch and OpenSearch also have distance_feature_query, which allows a recency feature to be used efficiently in top-K retrieval: https://www.elastic.co/blog/distance-feature-query-time-and-geo-in-elasticsearch-result-ranking

bratseth (Member) commented Feb 4, 2025

All documents that match the query will be ranked by the first-phase ranking function, which can express any recency bias as a mathematical expression, e.g. `(0.75 * bm25(title) + 0.25 * bm25(body)) * freshness(publish_time).logscale`.
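As a sketch, a schema wiring up such a first-phase expression could look like the following (the schema name and the field names `title`, `body`, and `publish_time` are assumptions for illustration; `bm25` requires `enable-bm25` on the index, and `freshness` expects a numeric attribute holding seconds since epoch):

```
schema news {
    document news {
        field title type string {
            indexing: index | summary
            index: enable-bm25
        }
        field body type string {
            indexing: index | summary
            index: enable-bm25
        }
        # Seconds since epoch; must be an attribute for freshness()
        field publish_time type long {
            indexing: attribute | summary
        }
    }
    rank-profile recency inherits default {
        first-phase {
            expression: (0.75 * bm25(title) + 0.25 * bm25(body)) * freshness(publish_time).logscale
        }
    }
}
```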

However, as an optimization, some query operators do their own internal scoring to determine which documents they match. Here a recency bias may lead either to better results (by not filtering out documents that would be scored higher by a recency-biased first-phase function), or to better performance (by matching fewer documents that would be scored low by it).

One such query operator is weakAnd. Vespa's weakAnd will (in contrast to standard wand) match all documents that score the same. This ensures that a lack of recency bias inside weakAnd will not degrade result quality. For example, given a query for "Joe Biden" in a news corpus, all documents containing both terms will be matched, which ensures that the final result set contains all the newest documents containing "Joe Biden" when a recency-biased rank function like the above is used. However, a recency bias could still improve efficiency by exposing fewer of these matches to the first-phase function.

Vespa also provides a feature specifically for adaptively limiting matches by some attribute, match-phase. This might work well in improving efficiency further in this case by limiting the number of hits exposed to weakAnd.
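As a rough sketch, configuring match-phase on the timestamp attribute could look like this (attribute and profile names assumed; match-phase requires a single-value numeric attribute with fast-search, and `max-hits` here is an arbitrary illustrative value):

```
field publish_time type long {
    indexing: attribute | summary
    attribute: fast-search   # required for match-phase
}

rank-profile recency-limited inherits recency {
    match-phase {
        # When the query matches more than max-hits documents, prefer
        # the ones with the highest publish_time (i.e., the newest).
        attribute: publish_time
        order: descending
        max-hits: 10000
    }
}
```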

Lastly, we recently introduced new tuning parameters for weakAnd which let it adaptively limit the number of matches based on term statistics: https://docs.vespa.ai/en/reference/schema-reference.html#weakand These can be used to further improve efficiency here. (A detailed article on this is coming shortly.)

In summary:

  • Using weakAnd in combination with any recency-biased ranking function will produce all the recent best matches in all cases (i.e., even when there are also many equally good older matches).
  • For further efficiency, experiment with adding match-phase on the timestamp attribute and/or tuning weakAnd.

arasraj (Author) commented Feb 5, 2025

Thanks for the explanations. A few questions below:

Regarding the statement:

Vespa's weakAnd will (in contrast to standard wand) match all documents that score the same.

If I understand correctly, there could be many documents with the same match score, such that documents both inside and outside the top-k all share that score. In this scenario, Vespa's weakAnd will still consider all such docs in first-phase ranking regardless of what k was set to. Is this accurate?

My next question is: why would there be so many documents with the exact same score? Is it because Vespa's weakAnd term-significance score does not depend on individual document-level statistics like term frequency, but rather on index-wide term statistics (similar to IDF) that don't change with the document being scored? Is my understanding correct here?

The solutions you mention make sense for queries using Vespa's weakAnd. However, what would the solution be for queries using regular wand? For example, if using wand for a SPLADE-like query, I think the recency problem still remains? Maybe the proposed solution of adding match-phase on the timestamp attribute is necessary here?
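For reference, such a wand query would look something like the YQL below (the field name `splade_terms`, the terms, and their weights are made up for illustration); since match-phase is configured in the rank profile rather than the query, my understanding is it would apply to this query operator as well:

```
select * from news where
    {targetHits: 1000}wand(splade_terms, {"biden": 1.2, "election": 0.8})
```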
