
Add Query for reranking KnnFloatVectorQuery with full-precision vectors #14009

Open · wants to merge 15 commits into base: main

Conversation

dungba88
Contributor

@dungba88 dungba88 commented Nov 22, 2024

Description

fixes #13564

Added a new Query that wraps KnnFloatVectorQuery and re-ranks results from a quantized index using full-precision vectors. The idea is to first run KnnFloatVectorQuery with an over-sampled k (e.g. 1.5x, 2x, 5x), then re-rank those docs using the full-precision (original, non-quantized) vectors, and finally take the top k.
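To illustrate the two-phase idea in isolation, here is a hypothetical, self-contained sketch (not the Lucene API and not this PR's implementation; class and method names are made up): given the over-sampled candidates from the approximate phase, re-score them with exact dot products against the target and truncate to k.

```java
import java.util.*;

public class TwoPhaseRerank {
    // Re-rank candidate doc ids (from the over-sampled approximate phase)
    // by exact dot product against the target vector, and return the best
    // k ids, highest score first.
    static int[] rerank(float[][] fullPrecisionVectors, int[] candidates,
                        float[] target, int k) {
        Integer[] docs = Arrays.stream(candidates).boxed().toArray(Integer[]::new);
        // sort descending by exact (full-precision) similarity
        Arrays.sort(docs, (a, b) -> Float.compare(
                dot(fullPrecisionVectors[b], target),
                dot(fullPrecisionVectors[a], target)));
        int[] top = new int[Math.min(k, docs.length)];
        for (int i = 0; i < top.length; i++) top[i] = docs[i];
        return top;
    }

    static float dot(float[] a, float[] b) {
        float s = 0f;
        for (int i = 0; i < a.length; i++) s += a[i] * b[i];
        return s;
    }
}
```

The approximate phase would hand this method roughly `k * oversample` candidates, so the exact re-scoring touches only a small set of full-precision vectors per query.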

Questions:

  • Should we expose the target vector inside KnnFloatVectorQuery so that users don't need to pass it twice? Currently it only exposes getTargetCopy(), which requires an array copy and is thus inefficient, but I assume the intention is to encapsulate the array so that it can't be modified from outside?
  • Maybe out of scope for this PR, but I'm curious what people think about using mlock to prevent the quantized vectors from being swapped out, as loading full-precision vectors (although only a small set per query) puts more pressure on RAM.

Usage:

KnnFloatVectorQuery knnQuery = ...; // create the KnnFloatVectorQuery with some over-sampled k
RerankKnnFloatVectorQuery query = new RerankKnnFloatVectorQuery(knnQuery, targetVector, k);
TopDocs topDocs = searcher.search(query, k);

@dungba88
Contributor Author

dungba88 commented Nov 22, 2024

The build fails with "The import org.apache.lucene.codecs.lucene100 cannot be resolved"; I thought this was already in mainline. Will check.

Edit: it has been moved to backward codecs. Will use something more stable.

@dungba88
Contributor Author

I have a preliminary benchmark here (top-k=100, fanout=0) using the Cohere 768-dimension dataset.

(screenshot: preliminary benchmark results)

In any case, I see two things that should be addressed:

  • If we access the full-precision vectors, that access will evict memory allocated (either through preloading or through mmap) for the quantized vectors (the main search phase) when there isn't enough RAM. Eventually some fraction of the quantized index will be swapped out, which will slow down the search. If we have to load all full-precision vectors into memory, that rather defeats the purpose of quantization. I wonder whether there is a way to access full-precision vectors without interfering with the memory used by the quantized vectors.
  • The latency could be better. With oversample=1.5 (second dot) for 4-bit, we get roughly the same latency and recall as the baseline. One could argue that we save memory compared to the baseline, but with the new two-phase access pattern that saving might be diminished. Otherwise this seems to offer little benefit over plain HNSW.

}
Weight weight = indexSearcher.createWeight(rewritten, ScoreMode.COMPLETE_NO_SCORES, 1.0f);
HitQueue queue = new HitQueue(k, false);
for (var leaf : reader.leaves()) {
Contributor


Should this be switched to parallel execution similar to AbstractKnnVectorQuery?

Contributor Author

@dungba88 dungba88 Nov 27, 2024


Good question. I used a single thread as a simple first version to benchmark the latency, since multi-threading could add some overhead as well. This class only does vector loading and similarity computation for a small set of vectors (k * oversample), so it's not as CPU-intensive as AbstractKnnVectorQuery.

I'll also try multi-threading and run the benchmark again. From the benchmark below, the re-ranking phase adds only a trivial amount of latency, so parallelizing might not help much. Also, the benchmark code seems to force-merge so there's only a single segment; we need to change it so that there are multiple segments.

@dungba88
Contributor Author

dungba88 commented Nov 27, 2024

Edit: my previous benchmark was wrong because the vectors were corrupted.

The first benchmark shows the recall improvement for each oversample value with reranking. It now aligns with what was produced in #13651.

(screenshot: recall vs. oversample benchmark)

The second benchmark compares the latency across all algorithms. We are still adding only a small latency for the reranking phase.

(screenshot: latency comparison benchmark)

For the last benchmark, I ran oversampling without reranking, but still cut off at the original k (so it acts similarly to fanout). This is just to make sure the reranking phase actually adds value. As expected, recall does not improve much compared to reranking.

(screenshot: oversample-without-rerank benchmark)

@dungba88 dungba88 changed the title Add Query for reranking KnnFloatVectorQuery Add Query for reranking KnnFloatVectorQuery with full-precision vectors Nov 27, 2024
@dungba88
Contributor Author

Also, this is the luceneutil branch I used for benchmarking: https://github.com/dungba88/luceneutil/tree/dungba88/two-phase-search. It incorporates the test for the BQ implementation by @benwtrent and the two-phase search.

Comment on lines +107 to +113
float expectedScore = VECTOR_SIMILARITY_FUNCTION.compare(targetVector, docVector);
Assert.assertEquals(
"Score does not match expected similarity for doc ord: " + scoreDoc.doc + ", id: " + id,
expectedScore,
scoreDoc.score,
1e-5);
}

We can test that the results are sorted by exact distance.

Maybe we can also test that the result of the same query with oversampling is "at least as good or better" than without oversampling? By "better" I mean higher recall. But I'm not sure whether that's deterministic.
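A hedged sketch of what such a recall comparison could assert (the helper below is hypothetical, not part of the PR or Lucene): recall@k is simply the fraction of the exact top-k docs that appear in the retrieved set, so the test would compute it for both runs and check the oversampled one is not worse.

```java
import java.util.*;

public class RecallSketch {
    // recall@k: fraction of ground-truth top-k docs present in the
    // retrieved result set.
    static double recall(int[] retrieved, int[] groundTruth) {
        Set<Integer> truth = new HashSet<>();
        for (int d : groundTruth) truth.add(d);
        int hits = 0;
        for (int d : retrieved) {
            if (truth.contains(d)) hits++;
        }
        return (double) hits / groundTruth.length;
    }
}
```

A test could then assert `recall(oversampled, exact) >= recall(plain, exact)`, with the caveat raised above that HNSW search may not make this deterministic.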


Thinking again, the docs should be sorted by ord, so my first point should be irrelevant.

Comment on lines +63 to +64
HitQueue queue = new HitQueue(k, false);
for (var leaf : reader.leaves()) {
Contributor

@shubhamvishu shubhamvishu Nov 28, 2024


Here we have access to IndexSearcher#getTaskExecutor and could use it to parallelize the work across segments (like we did earlier with some other query rewrites). But the HitQueue here isn't thread-safe. I don't know whether using concurrency after making insertWithOverflow thread-safe would really help, since the added cost looks cheap. Or maybe it would?

Contributor Author


That's right. To apply parallelism we need a per-segment queue, then merge them like AbstractKnnVectorQuery.mergeLeafResults does. I think the added latency is already low, but I still want to try whether this helps.
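A minimal sketch of the per-segment-queue-then-merge idea, using a plain ExecutorService rather than Lucene's TaskExecutor (all names are hypothetical and the code is deliberately dependency-free; it is not the PR's implementation): each task builds a private top-k for its leaf, and the merge happens single-threaded afterwards, avoiding any shared, non-thread-safe queue.

```java
import java.util.*;
import java.util.concurrent.*;

public class PerLeafTopK {
    // Collect a private top-k per leaf (segment) in parallel, then merge
    // into one global top-k, similar in spirit to
    // AbstractKnnVectorQuery.mergeLeafResults. Scores stand in for
    // (score, doc) entries to keep the sketch short.
    static float[] topK(float[][] leafScores, int k) {
        ExecutorService pool = Executors.newFixedThreadPool(leafScores.length);
        try {
            List<Future<float[]>> perLeaf = new ArrayList<>();
            for (float[] leaf : leafScores) {
                perLeaf.add(pool.submit(() -> {
                    // each task only touches its own leaf's scores
                    float[] copy = leaf.clone();
                    Arrays.sort(copy); // ascending
                    int n = Math.min(k, copy.length);
                    float[] top = new float[n];
                    for (int i = 0; i < n; i++) top[i] = copy[copy.length - 1 - i];
                    return top; // this leaf's top-k, descending
                }));
            }
            // merge phase: single-threaded and cheap (at most k entries per leaf)
            List<Float> all = new ArrayList<>();
            for (Future<float[]> f : perLeaf) {
                for (float s : f.get()) all.add(s);
            }
            all.sort(Comparator.reverseOrder());
            float[] out = new float[Math.min(k, all.size())];
            for (int i = 0; i < out.length; i++) out[i] = all.get(i);
            return out;
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```

Since no queue is shared between tasks, insertWithOverflow would not need to be made thread-safe under this design.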


This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the [email protected] list. Thank you for your contribution!

@github-actions github-actions bot added the Stale label Dec 14, 2024
Successfully merging this pull request may close these issues.

Add refinement of quantized vector scores with fp distance calculations