SortedSet DV Multi Range query #13974
fails 2nd pass on -Dtests.seed=C73CAF65600D946E
This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the [email protected] list. Thank you for your contribution!
Thanks @mkhludnev ! I haven't fully combed through this change but did an initial pass and wanted to share some thoughts since I meant to look at this a while ago and it slipped through the cracks. I appreciate you taking this on!
lucene/sandbox/src/java/org/apache/lucene/sandbox/search/SortedSetMultiRangeQuery.java
return empty(); // no bitset yet, give up
}
minOrd = termsEnum.ord();
if (skipper != null) {
I'm not really sure I understand how the skipper helps us here. Any ordinal you get from the terms enum is going to be present somewhere in the segment, so by definition, I don't see how the global min or max from the skipper could provide tighter bounds. Maybe I'm missing something?
I need to think this through. Honestly, I have never seen DocValuesSkipper before and have only a vague idea of how to test it.
@gsmiller How can I test the skipper logic with a test like https://github.com/apache/lucene/pull/13974/files#diff-5c4a4738d4643b48f34fb00fa3b67098f9bf4f1744e12955372df0f32812670d ?
if (ord >= finalMinOrd
&& ((finalMatchesAbove < values.getValueCount()
&& ord >= finalMatchesAbove)
|| finalMatchingOrdsShifted.get(ord - finalMinOrd))) {
This approach of densely memorizing all ordinals that fall within the ranges has me a bit worried. This could potentially be quite a large bitset. I wonder if we could instead just keep track of all the ranges actually represented in the query and then check each doc ord against them? Like what we do in SortedSetDocValuesRangeQuery, but keeping track of multiple min/max tuples to check against instead of only one? Is there a reason you didn't go with that approach and feel the dense bitset is necessary here?
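To make the memory concern concrete, here is a minimal sketch of the dense approach under discussion. It is not the PR's code: `DenseOrdMatcher` and its methods are hypothetical names, it uses `java.util.BitSet` rather than any Lucene bitset class, and it assumes the ranges have already been resolved to inclusive ordinal pairs.

```java
import java.util.BitSet;

/** Hypothetical illustration of the dense-bitset idea; not the PR's implementation. */
final class DenseOrdMatcher {
  private final long minOrd;     // smallest ordinal covered by any range
  private final BitSet matching; // bit i set => ordinal (minOrd + i) matches

  /** ranges: non-empty array of inclusive {min, max} ordinal pairs, possibly overlapping. */
  DenseOrdMatcher(long[][] ranges) {
    long lo = Long.MAX_VALUE;
    long hi = Long.MIN_VALUE;
    for (long[] r : ranges) {
      lo = Math.min(lo, r[0]);
      hi = Math.max(hi, r[1]);
    }
    minOrd = lo;
    matching = new BitSet(Math.toIntExact(hi - lo + 1));
    for (long[] r : ranges) {
      // BitSet.set(from, to) treats 'to' as exclusive, hence the + 1
      matching.set((int) (r[0] - lo), (int) (r[1] - lo) + 1);
    }
  }

  /** Constant-time check of a single document ordinal. */
  boolean matches(long ord) {
    long shifted = ord - minOrd;
    return shifted >= 0 && shifted < matching.length() && matching.get((int) shifted);
  }

  /** Bits spanned by the bitset: grows with the ordinal span, not with the range count. */
  int spanInBits() {
    return matching.length();
  }
}
```

In this sketch two narrow ranges that sit far apart still force a bitset covering everything between them, which is the cost the comment above is pointing at; the per-ordinal check, in exchange, is a single bit lookup.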
This was my first thought too. However, I then found DocValuesRewriteMethod, which keeps a set over ordinals. So, if it's feasible for a set of point values, why not do the same for a set of ranges?
Perhaps it needs to set too many 1s for ranges that fully cover all ords? I don't know.
If we think about Numeric or Binary DVs, there are no ords, and we would have to search in a tree of ranges. Also, in my mind this is intended for queries with more than 1000 ranges, because smaller sets can be handled via plain BooleanQuery.SHOULD.
Anyway, I'm really open to suggestions.
Right. As you point out, there are definitely other situations where we use this bitset approach. My only thought here is that we should be able to use less memory by keeping track of the range min/max values and not needing the dense representation (this isn't really as feasible with DocValuesRewriteMethod since it's encoding a set of unique terms that aren't particularly likely to be contiguous ranges). If we track ranges, we can take advantage of them being sorted and the doc-level ordinals also being sorted. A binary search approach might be a good fit, and we can keep pushing up the lower bound of the search based on the last range that matched (since doc-level ordinals are sorted). I suspect this would still be less efficient than the dense bitset approach, but I bet it wouldn't be that big of a difference.
Without a way to benchmark and test though, it's just a bit of a guess. I'm OK with keeping it the way you have it for now if that's your preference. We can always benchmark alternatives in the future and evolve this if we want. Should we do that for now?
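A rough sketch of the binary-search idea described above, assuming the query ranges have already been resolved to disjoint ordinal ranges sorted by their minimum. `SortedRangesMatcher` is a hypothetical name, and plain `long[]` arrays stand in for the ordinals a `SortedSetDocValues` iterator would return in ascending order for one document.

```java
/** Hypothetical illustration of matching sorted doc ordinals against sorted disjoint ranges. */
final class SortedRangesMatcher {
  private final long[] mins; // ascending range minimums
  private final long[] maxs; // maxs[i] >= mins[i] and maxs[i] < mins[i + 1]

  SortedRangesMatcher(long[] mins, long[] maxs) {
    this.mins = mins;
    this.maxs = maxs;
  }

  /** docOrds must be ascending, as doc-level ordinals are. */
  boolean matchesAny(long[] docOrds) {
    int from = 0; // ranges before this index end below every remaining ordinal
    for (long ord : docOrds) {
      // Binary search for the first range at index >= from whose max is >= ord.
      int lo = from;
      int hi = mins.length;
      while (lo < hi) {
        int mid = (lo + hi) >>> 1;
        if (maxs[mid] < ord) {
          lo = mid + 1;
        } else {
          hi = mid;
        }
      }
      if (lo == mins.length) {
        return false; // ord is above every range, and so are all later ordinals
      }
      if (mins[lo] <= ord) {
        return true; // ord falls inside range lo
      }
      from = lo; // push the lower bound up for the next, larger ordinal
    }
    return false;
  }
}
```

Memory here is two longs per range regardless of how wide the ranges are, at the price of a logarithmic search per ordinal instead of a bit lookup. Whether that trade is worthwhile is exactly the benchmarking question left open in the comment.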
I mostly agree. Thanks for sharing your considerations. Let me respond to your points:
"isn't really as feasible with DocValuesRewriteMethod"
Why can't this "set" query use a tree of unique values with log(n) access complexity? The two queries might be quite close to each other. In the edge case where the ranges are narrow, covering 1 or 2 ordinals each, SsDvMultyRange would behave the same as TermsInSet. If the ranges are wide, it definitely spends more time setting 1s in the bitset, but maybe that's not fatal thanks to SIMD intrinsics (who knows), and it gains on the per-doc check at search time.
I agree this is a matter for benchmarking and for picking optimizations dynamically (later). Also, something like a KD-tree might be applicable here.
"we can take advantage of them being sorted and the doc-level ordinals also being sorted."
Frankly speaking, I suppose single-valued docs are more common than docs with long sets of values, so I don't think this would bring much gain for many users.
"less memory by keeping track of the range min/max values and not needing the dense representation"
I just realized that if we merge overlapping ranges (which turns out to be not so easy, but feasible), each of the remaining non-overlapping ranges has a well-defined smaller and greater neighbor, which means we can simply look up ranges by ordinal in a TreeSet.
Is this worth exploring instead of the dense bitset?
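The merge-then-lookup idea could look roughly like the sketch below. It is an illustration only: `MergedRanges` is a hypothetical name, ranges are inclusive `{min, max}` ordinal pairs held in `long[]` arrays, and only overlapping ranges are merged (adjacent ones such as [1,3] and [4,6] are left separate, which is still correct).

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.TreeSet;

/** Hypothetical illustration: merge overlapping ordinal ranges, then look up by floor(). */
final class MergedRanges {
  private static final Comparator<long[]> BY_MIN = Comparator.comparingLong((long[] r) -> r[0]);

  // Disjoint inclusive {min, max} ranges, ordered by min.
  private final TreeSet<long[]> ranges = new TreeSet<>(BY_MIN);

  MergedRanges(long[][] input) {
    long[][] sorted = input.clone();
    Arrays.sort(sorted, BY_MIN);
    long[] current = null;
    for (long[] r : sorted) {
      if (current == null) {
        current = new long[] {r[0], r[1]};
      } else if (r[0] <= current[1]) {
        current[1] = Math.max(current[1], r[1]); // overlap: extend the running range
      } else {
        ranges.add(current); // gap: the running range is final
        current = new long[] {r[0], r[1]};
      }
    }
    if (current != null) {
      ranges.add(current);
    }
  }

  /** The only candidate is the range with the greatest min that is <= ord. */
  boolean matches(long ord) {
    long[] floor = ranges.floor(new long[] {ord, ord});
    return floor != null && ord <= floor[1];
  }

  int size() {
    return ranges.size();
  }
}
```

Because the merged ranges are disjoint, a single `floor` lookup is enough: any range starting at or below the ordinal other than the floor must also end below the floor's start, so it cannot contain the ordinal.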
@gsmiller thanks for reviewing. Looking into it!
* <li>field values have fixed width
* </ul>
*/
public static class SordedSetFieldValueFixedBuilder
@gsmiller what do you think about such a verbose way of constructing the query? Is it worth piggybacking on the nice java.util.function interfaces?
Description
Here's a sketch of a DV analog of MultiRangeQuery. This draft uses a bitset for expanding multiple ranges.
It passes the randomized test, but the internals are still rough.
What do you think about this idea overall?
@atris, may I request your comments as the author of MultiRangeQuery?