SortedSet DV Multi Range query #13974

mkhludnev · 2024-11-03T19:54:54Z

Description

Here's a sketch of a doc-values (DV) analog of MultiRangeQuery. This draft expands the multiple ranges into a bitset.
It passes the randomized test, but the internals are still rough.
What do you think of the idea overall?

@atris, may I request your comments as an author of MultiRangeQuery?

fails 2nd pass on -Dtests.seed=C73CAF65600D946E

github-actions · 2024-12-07T00:24:18Z

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution!

gsmiller

Thanks @mkhludnev ! I haven't fully combed through this change but did an initial pass and wanted to share some thoughts since I meant to look at this a while ago and it slipped through the cracks. I appreciate you taking this on!

lucene/sandbox/src/java/org/apache/lucene/sandbox/search/SortedSetMultiRangeQuery.java

gsmiller · 2024-12-19T00:57:37Z

+                  return empty(); // no bitset yet, give up
+                }
+                minOrd = termsEnum.ord();
+                if (skipper != null) {


I'm not really sure I understand how the skipper helps us here. Any ordinal you get from the terms enum is going to be present somewhere in the segment, so by definition, I don't see how the global min or max from the skipper could provide tighter bounds. Maybe I'm missing something?

I need to think this through. Honestly, I have never seen DocValuesSkipper before and have only a vague idea of how to test it.

@gsmiller How can I test the skipper logic in a test like https://github.com/apache/lucene/pull/13974/files#diff-5c4a4738d4643b48f34fb00fa3b67098f9bf4f1744e12955372df0f32812670d ?

lucene/sandbox/src/java/org/apache/lucene/sandbox/search/SortedSetMultiRangeQuery.java

gsmiller · 2024-12-19T01:26:09Z

+                      if (ord >= finalMinOrd
+                          && ((finalMatchesAbove < values.getValueCount()
+                                  && ord >= finalMatchesAbove)
+                              || finalMatchingOrdsShifted.get(ord - finalMinOrd))) {


This approach of densely memorizing all ordinals that fall within the ranges has me a bit worried. This could potentially be quite a large bitset. I wonder if we could instead just keep track of all the ranges actually represented in the query and then check each doc ord against them? Like what we do in SortedSetDocValuesRangeQuery, but keeping track of multiple min/max tuples to check against instead of only one? Is there a reason you didn't go with that approach and feel the dense bitset is necessary here?

This was my first thought too. However, I then found DocValuesRewriteMethod, which keeps a set over ordinals. So, if that is feasible for a set of point values, why not do the same for a set of ranges?
Perhaps it has to set too many 1s for ranges that cover all ords? I don't know.
With Numeric or Binary DVs there are no ords, so we would have to search a tree of ranges. Also, I intend this for queries with more than 1000 ranges, because smaller sets can be handled with a plain BooleanQuery of SHOULD clauses.
Anyway, I'm really open to suggestions.
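
To make the trade-off concrete, here is a minimal sketch of the dense approach discussed above. It is not the PR's code: it uses `java.util.BitSet` instead of Lucene's own bitset classes, and the class name `DenseOrdMatcher` and its methods are hypothetical. It assumes the query ranges have already been resolved to inclusive ordinal ranges and that at least one range is given.

```java
import java.util.BitSet;

// Hypothetical illustration: remembers every matching ordinal in a bitset
// that is shifted by the smallest ordinal, so bit 0 corresponds to minOrd.
class DenseOrdMatcher {
  private final long minOrd;
  private final BitSet matchingOrdsShifted;

  // Each element is an inclusive {lowOrd, highOrd} pair; ranges may overlap.
  DenseOrdMatcher(long[][] ordRanges) {
    long min = Long.MAX_VALUE;
    long max = Long.MIN_VALUE;
    for (long[] r : ordRanges) {
      min = Math.min(min, r[0]);
      max = Math.max(max, r[1]);
    }
    this.minOrd = min;
    // Memory grows with the span (max - min), not with the number of ranges.
    this.matchingOrdsShifted = new BitSet((int) (max - min + 1));
    for (long[] r : ordRanges) {
      // BitSet.set(from, to) treats the upper bound as exclusive.
      matchingOrdsShifted.set((int) (r[0] - min), (int) (r[1] - min) + 1);
    }
  }

  // Constant-time check for a single document ordinal.
  boolean matches(long ord) {
    long idx = ord - minOrd;
    return idx >= 0 && idx <= Integer.MAX_VALUE && matchingOrdsShifted.get((int) idx);
  }
}
```

The per-document check is a single bit lookup, which is the search-time gain; the cost is that wide ranges set many bits up front.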

Right. As you point out, there are definitely other situations where we use this bitset approach. My only thought here is that we should be able to use less memory by keeping track of the range min/max values and not needing the dense representation (this isn't really as feasible with DocValuesRewriteMethod since it's encoding a set of unique terms that aren't particularly likely to be contiguous ranges). If we track ranges, we can take advantage of them being sorted and the doc-level ordinals also being sorted. A binary search approach might be a good fit, and we can keep pushing up the lower bound of the search based on the last range that matched (since doc-level ordinals are sorted). I suspect this would still be less efficient than the dense bitset approach, but I bet it wouldn't be that big of a difference.
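
As an illustration only (not code from the PR), the range-tracking idea could look roughly like the sketch below. It assumes the query ranges were already resolved to disjoint ordinal ranges sorted by their minimum, and that a document's ordinals arrive in ascending order, as they do from sorted-set doc values. `SortedRangeMatcher` and `matchesAny` are hypothetical names.

```java
import java.util.Arrays;

// Hypothetical illustration: binary search over sorted, disjoint ranges,
// pushing up the lower bound of the search as doc ordinals increase.
class SortedRangeMatcher {
  private final long[] mins; // ascending
  private final long[] maxs; // maxs[i] is the inclusive upper bound paired with mins[i]

  SortedRangeMatcher(long[] mins, long[] maxs) {
    this.mins = mins;
    this.maxs = maxs;
  }

  // docOrds must be sorted ascending.
  boolean matchesAny(long[] docOrds) {
    int from = 0; // ranges before this index can no longer match
    for (long ord : docOrds) {
      int pos = Arrays.binarySearch(mins, from, mins.length, ord);
      if (pos >= 0) {
        return true; // ord is exactly the lower bound of a range
      }
      int insertion = -pos - 1; // index of the first range whose min is above ord
      int candidate = insertion - 1; // last range whose min is below ord, if any
      if (candidate >= from && ord <= maxs[candidate]) {
        return true;
      }
      // Later ordinals are larger, so every range before `insertion` is ruled out.
      from = insertion;
    }
    return false;
  }
}
```

Memory here is two longs per range regardless of how wide the ranges are, at the price of a logarithmic search per ordinal instead of a bit lookup.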

Without a way to benchmark and test though, it's just a bit of a guess. I'm OK with keeping it the way you have it for now if that's your preference. We can always benchmark alternatives in the future and evolve this if we want. Should we do that for now?

I mostly agree. Thanks for sharing your considerations. Let me clarify your points:

> isn't really as feasible with DocValuesRewriteMethod

Why can't this "set" query use a tree of unique values with log(n) lookup? The two queries might be quite close to each other. In the edge case where ranges are narrow, covering only 1 or 2 ordinals, SsDvMultiRange would behave the same as TermsInSet. If ranges are wide, it definitely spends more time setting 1s in the bitset, but maybe that is not fatal thanks to SIMD intrinsics (who knows), and it pays off in the per-doc check at search time.
I agree this is an area for benchmarking and for picking optimizations dynamically (later). Something like a KD-tree might also be applicable here.

> we can take advantage of them being sorted and the doc-level ordinals also being sorted.

Frankly speaking, I suppose single-valued docs are more common than docs with long sets of values, so I don't think it would gain much for many users.

> less memory by keeping track of the range min/max values and not needing the dense representation

I just realized that if we merge overlapping ranges (which turns out to be not so easy, but feasible), every remaining non-overlapping range has a well-defined smaller and greater neighbor, which means we can simply look ranges up by ordinal in a TreeSet.
Is that worth exploring instead of the dense bitset?
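
A rough sketch of that merge-then-lookup idea follows, under stated assumptions: it is not taken from the PR, it swaps the TreeSet mentioned above for a `java.util.TreeMap` keyed by each merged range's minimum (so the maximum can be stored as the value), and `MergedRangeLookup` is a hypothetical name. Ranges are inclusive ordinal pairs.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration: merge overlapping ranges, then find the one
// candidate range for an ordinal with a single floor lookup.
class MergedRangeLookup {
  // Key: inclusive min of a merged range; value: its inclusive max.
  private final TreeMap<Long, Long> maxByMin = new TreeMap<>();

  MergedRangeLookup(long[][] ranges) {
    long[][] sorted = ranges.clone();
    Arrays.sort(sorted, Comparator.comparingLong(r -> r[0]));
    for (long[] r : sorted) {
      Map.Entry<Long, Long> last = maxByMin.lastEntry();
      if (last != null && r[0] <= last.getValue()) {
        // Overlaps the previous merged range: extend its upper bound if needed.
        maxByMin.put(last.getKey(), Math.max(last.getValue(), r[1]));
      } else {
        maxByMin.put(r[0], r[1]);
      }
    }
  }

  // After merging, only the range with the greatest min <= ord can contain ord.
  boolean contains(long ord) {
    Map.Entry<Long, Long> e = maxByMin.floorEntry(ord);
    return e != null && ord <= e.getValue();
  }

  int size() {
    return maxByMin.size();
  }
}
```

Because the input is sorted by minimum, merging is a single pass that only ever compares against the last merged range; the lookup is then logarithmic in the number of merged ranges.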

mkhludnev · 2024-12-19T13:21:48Z

@gsmiller thanks for reviewing. Looking into it!

mkhludnev · 2024-12-27T10:19:52Z

lucene/sandbox/src/java/org/apache/lucene/sandbox/search/DocValuesMultiRangeQuery.java

+   *   <li>field values have fixed width
+   * </ul>
+   */
+  public static class SordedSetFieldValueFixedBuilder


@gsmiller what do you think about such a verbose way of constructing the query? Is it worth piggybacking on the nice java.util.function interfaces?

mikhail-khludnev and others added 7 commits October 30, 2024 00:36

first failed test. just an api

7fb1b0f

silly bug fxd

0957905

add some pivot randomnes

af41025

horrible impl.

b74dd36

fails 2nd pass on -Dtests.seed=C73CAF65600D946E

it works

e310500

some cleanup

79971e5

some cleanup

cbc588a

mkhludnev mentioned this pull request Nov 6, 2024

[Feature Request] relax max Clauses Count limitation of termS query over IP field opensearch-project/OpenSearch#16200

Closed

mkhludnev mentioned this pull request Nov 16, 2024

Support more than 1024 IP/masks with indexed field opensearch-project/OpenSearch#16391

Merged

3 tasks

mkhludnev added 2 commits November 21, 2024 23:28

sweep

f0b2700

javadoc

3ad7140

mkhludnev marked this pull request as ready for review November 22, 2024 06:35

mkhludnev changed the title ~~DRAFT: SortedSet DV Multi Range query~~ SortedSet DV Multi Range query Nov 22, 2024

github-actions bot added the Stale label Dec 7, 2024

gsmiller reviewed Dec 19, 2024

github-actions bot removed the Stale label Dec 20, 2024

mkhludnev added 6 commits December 23, 2024 10:11

in the middle of PR feedback

eb2990a

review in progress

8b2655d

- renamed Builder

4fe35c0

- copied range into

added duel with PointsMultiRange

416f085

tidy

9fcd549

expose only builder with java function interface.

db161d4

mkhludnev commented Dec 27, 2024
