-
Notifications
You must be signed in to change notification settings - Fork 198
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix DenseRetrievalExactSearch
evaluation
#154
base: main
Are you sure you want to change the base?
Fix DenseRetrievalExactSearch
evaluation
#154
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not fully understanding yet, maybe you can help me out 😅🧐
@@ -45,6 +46,9 @@ def search(self, | |||
logger.info("Sorting Corpus by document length (Longest first)...") | |||
|
|||
corpus_ids = sorted(corpus, key=lambda k: len(corpus[k].get("title", "") + corpus[k].get("text", "")), reverse=True) | |||
if ignore_identical_ids: | |||
# We remove the query from results if it exists in corpus | |||
corpus_ids = [cid for cid in corpus_ids if cid not in query_ids] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't this make the task "easier" by removing all other queries as options for each query?
I.e. previously, given query1 the model could wrongly retrieve query2 (if it was also in the corpus).
Now the model cannot retrieve any of the other queries which makes it easier assuming the answer is never another query.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think thus option was for Quora: You want to find paraphrases of queries, but not the original start query. But this original query will always be ranked first at it is also part of the corpus
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Which is why we have the ignore_identical_ids
option I think. This PR only tries to fix ignore_identical_ids=True
case
cos_scores_top_k_values, cos_scores_top_k_idx = torch.topk(cos_scores, min(top_k+1, len(cos_scores[1])), dim=1, largest=True, sorted=return_sorted) | ||
cos_scores_top_k_values, cos_scores_top_k_idx = torch.topk(cos_scores, min(top_k, len(cos_scores[1])), dim=1, largest=True, sorted=return_sorted) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You write that Which means some queries would have top_k retrieved documents, while others have top_k-1 retrieved documents.
, but didn't this +1
ensure that that does not happen cuz we retrieve top_k+1
but then only allow top_k
lateron?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIUC, the problem comes from this line
if len(result_heaps[query_id]) < top_k: |
So we only keep the top_k (which sometimes include the query inside the retrieved docs)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, I thought the if corpus_id != query_id:
would ensure that the query would never be added to result_heaps[query_id]
🧐
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, then why do we get different results? 🧐
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's easy to check, we just have to assert that number of results of each query is top_k
. Can you check that please @Muennighoff ?
I noticed there was a problem in the way we handled queries that exist in the retrieval corpus. By default we have
ignore_identical_ids=True
which pops these duplicated queries from theresults
. Which means some queries would havetop_k
retrieved documents, while others havetop_k-1
retrieved documents.Fixing this behaviour gives a noticeable change in scores. Here's the difference in scores noticed for
"intfloat/e5-large"
on ArguAna evaluated using MTEB:Scores before fix:
Scores after fix:
cc @thakur-nandan