
Add Query for reranking KnnFloatVectorQuery with full-precision vectors #14009

Open · wants to merge 15 commits into base: main

Conversation

dungba88
Contributor

@dungba88 dungba88 commented Nov 22, 2024

Description

fixes #13564

Added a new Query that wraps KnnFloatVectorQuery and re-ranks results from a quantized index using full-precision vectors. The idea is to first run KnnFloatVectorQuery with an over-sampled k (e.g. 1.5x, 2x, 5x), then re-rank those docs using the full-precision (original, non-quantized) vectors, and finally take the top k.
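To illustrate the two-phase idea in isolation, here is a hypothetical, self-contained sketch (not the Lucene API and not this PR's implementation; class and method names are made up): given the over-sampled candidates from the approximate phase, re-score them with exact dot products against the target and truncate to k.

```java
import java.util.*;

public class TwoPhaseRerank {
    // Re-rank candidate doc ids (from the over-sampled approximate phase)
    // by exact dot product against the target vector, and return the best
    // k ids, highest score first.
    static int[] rerank(float[][] fullPrecisionVectors, int[] candidates,
                        float[] target, int k) {
        Integer[] docs = Arrays.stream(candidates).boxed().toArray(Integer[]::new);
        // sort descending by exact (full-precision) similarity
        Arrays.sort(docs, (a, b) -> Float.compare(
                dot(fullPrecisionVectors[b], target),
                dot(fullPrecisionVectors[a], target)));
        int[] top = new int[Math.min(k, docs.length)];
        for (int i = 0; i < top.length; i++) top[i] = docs[i];
        return top;
    }

    static float dot(float[] a, float[] b) {
        float s = 0f;
        for (int i = 0; i < a.length; i++) s += a[i] * b[i];
        return s;
    }
}
```

The approximate phase would hand this method roughly `k * oversample` candidates, so the exact re-scoring touches only a small set of full-precision vectors per query.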

Questions:

  • Should we expose the target vector inside KnnFloatVectorQuery so that users don't need to pass it twice? Currently it only exposes getTargetCopy(), which requires an array copy and is thus inefficient, but I assume the intention is to encapsulate the array so that it can't be modified from outside?
  • Maybe out of scope for this PR, but I'm curious what people think about using mlock to prevent the quantized vectors from being swapped out, as loading full-precision vectors (although only a small set per query) puts more pressure on RAM.

Usage:

KnnFloatVectorQuery knnQuery = ...; // create the KnnFloatVectorQuery with some over-sampled k
RerankKnnFloatVectorQuery query = new RerankKnnFloatVectorQuery(knnQuery, targetVector, k);
TopDocs topDocs = searcher.search(query, k);

@dungba88
Contributor Author

dungba88 commented Nov 22, 2024

The build fails with "The import org.apache.lucene.codecs.lucene100 cannot be resolved"; I thought this was already in mainline. Will check.

Edit: it has been moved to backward codecs. Will use something more stable.

@dungba88
Contributor Author

I have a preliminary benchmark here (top-k=100, fanout=0) using the Cohere 768-dimension dataset.

(screenshot: preliminary benchmark results)

In any case, I see two things that should be addressed:

  • If we access the full-precision vectors, that access will evict memory allocated (either through preloading or through mmap) for the quantized vectors (the main search phase) when there isn't enough RAM. Eventually some fraction of the quantized index will be swapped out, which will slow down the search. If we have to load all full-precision vectors into memory, that rather defeats the purpose of quantization. I wonder whether there is a way to access full-precision vectors without interfering with the memory used by the quantized vectors.
  • The latency could be better. With oversample=1.5 (second dot) for 4-bit, we get roughly the same latency and recall as the baseline. One could argue that we save memory compared to the baseline, but with the new two-phase access pattern that saving might be diminished. Otherwise this seems to offer little benefit over plain HNSW.

}
Weight weight = indexSearcher.createWeight(rewritten, ScoreMode.COMPLETE_NO_SCORES, 1.0f);
HitQueue queue = new HitQueue(k, false);
for (var leaf : reader.leaves()) {
Contributor


Should this be switched to parallel execution similar to AbstractKnnVectorQuery?

Contributor Author

@dungba88 dungba88 Nov 27, 2024


Good question. I used a single thread as a simple first version to benchmark the latency, since multi-threading could add some overhead as well. This class only does vector loading and similarity computation for a small set of vectors (k * oversample), so it's not as CPU-intensive as AbstractKnnVectorQuery.

I'll also try multi-threading and run the benchmark again. From the benchmark below, the re-ranking phase adds only a trivial amount of latency, so parallelizing might not help much. Also, the benchmark code seems to force-merge so there's only a single segment; we need to change it so that there are multiple segments.

@dungba88
Contributor Author

dungba88 commented Nov 27, 2024

Edit: my previous benchmark was wrong because the vectors were corrupted.

The first benchmark shows the recall improvement for each oversample value with reranking. It now aligns with what was produced in #13651.

(screenshot: recall vs. oversample benchmark)

The second benchmark compares the latency across all algorithms. We are still adding only a small latency for the reranking phase.

(screenshot: latency comparison benchmark)

For the last benchmark, I ran oversampling without reranking, but still cut off at the original k (so it acts similarly to fanout). This is just to make sure the reranking phase actually adds value. As expected, recall does not improve much compared to reranking.

(screenshot: oversample-without-rerank benchmark)

@dungba88 dungba88 changed the title Add Query for reranking KnnFloatVectorQuery Add Query for reranking KnnFloatVectorQuery with full-precision vectors Nov 27, 2024
@dungba88
Contributor Author

Also, this is the luceneutil branch I used for benchmarking: https://github.com/dungba88/luceneutil/tree/dungba88/two-phase-search. It incorporates the test for the BQ implementation by @benwtrent and the two-phase search.

Comment on lines +107 to +113
float expectedScore = VECTOR_SIMILARITY_FUNCTION.compare(targetVector, docVector);
Assert.assertEquals(
"Score does not match expected similarity for doc ord: " + scoreDoc.doc + ", id: " + id,
expectedScore,
scoreDoc.score,
1e-5);
}

We can test that the results are sorted by exact distance.

Maybe we can also test that the result of the same query with oversampling is "at least as good or better" than without oversampling? By "better" I mean higher recall. But I'm not sure whether that's deterministic.
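A hedged sketch of what such a recall comparison could assert (the helper below is hypothetical, not part of the PR or Lucene): recall@k is simply the fraction of the exact top-k docs that appear in the retrieved set, so the test would compute it for both runs and check the oversampled one is not worse.

```java
import java.util.*;

public class RecallSketch {
    // recall@k: fraction of ground-truth top-k docs present in the
    // retrieved result set.
    static double recall(int[] retrieved, int[] groundTruth) {
        Set<Integer> truth = new HashSet<>();
        for (int d : groundTruth) truth.add(d);
        int hits = 0;
        for (int d : retrieved) {
            if (truth.contains(d)) hits++;
        }
        return (double) hits / groundTruth.length;
    }
}
```

A test could then assert `recall(oversampled, exact) >= recall(plain, exact)`, with the caveat raised above that HNSW search may not make this deterministic.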


Thinking again, the docs should be sorted by ord, so my first point should be irrelevant.

Comment on lines +63 to +64
HitQueue queue = new HitQueue(k, false);
for (var leaf : reader.leaves()) {
Contributor

@shubhamvishu shubhamvishu Nov 28, 2024


Here we have access to IndexSearcher#getTaskExecutor and could use it to parallelize the work across segments (like we did earlier with some other query rewrites). But the HitQueue here isn't thread-safe. I don't know whether using concurrency after making insertWithOverflow thread-safe would really help, since the added cost looks cheap. Or maybe it would?

Contributor Author


That's right. To apply parallelism we need a per-segment queue, then merge them like AbstractKnnVectorQuery.mergeLeafResults does. I think the added latency is already low, but I still want to try whether this helps.
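A minimal sketch of the per-segment-queue-then-merge idea, using a plain ExecutorService rather than Lucene's TaskExecutor (all names are hypothetical and the code is deliberately dependency-free; it is not the PR's implementation): each task builds a private top-k for its leaf, and the merge happens single-threaded afterwards, avoiding any shared, non-thread-safe queue.

```java
import java.util.*;
import java.util.concurrent.*;

public class PerLeafTopK {
    // Collect a private top-k per leaf (segment) in parallel, then merge
    // into one global top-k, similar in spirit to
    // AbstractKnnVectorQuery.mergeLeafResults. Scores stand in for
    // (score, doc) entries to keep the sketch short.
    static float[] topK(float[][] leafScores, int k) {
        ExecutorService pool = Executors.newFixedThreadPool(leafScores.length);
        try {
            List<Future<float[]>> perLeaf = new ArrayList<>();
            for (float[] leaf : leafScores) {
                perLeaf.add(pool.submit(() -> {
                    // each task only touches its own leaf's scores
                    float[] copy = leaf.clone();
                    Arrays.sort(copy); // ascending
                    int n = Math.min(k, copy.length);
                    float[] top = new float[n];
                    for (int i = 0; i < n; i++) top[i] = copy[copy.length - 1 - i];
                    return top; // this leaf's top-k, descending
                }));
            }
            // merge phase: single-threaded and cheap (at most k entries per leaf)
            List<Float> all = new ArrayList<>();
            for (Future<float[]> f : perLeaf) {
                for (float s : f.get()) all.add(s);
            }
            all.sort(Comparator.reverseOrder());
            float[] out = new float[Math.min(k, all.size())];
            for (int i = 0; i < out.length; i++) out[i] = all.get(i);
            return out;
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```

Since no queue is shared between tasks, insertWithOverflow would not need to be made thread-safe under this design.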


This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the [email protected] list. Thank you for your contribution!

@github-actions github-actions bot added the Stale label Dec 14, 2024
Successfully merging this pull request may close these issues.

Add refinement of quantized vector scores with fp distance calculations