SortedSet DV Multi Range query #13974

Open
wants to merge 15 commits into
base: main
from

Conversation

mkhludnev
Member

Description

Here's a sketch of a DocValues (DV) analog of MultiRangeQuery. This draft uses a bitset to expand the multiple ranges.
It passes the randomized test, but the internals are still rough.
What do you think about this idea overall?

@atris, may I request your comments as an author of MultiRangeQuery?

@mkhludnev mkhludnev marked this pull request as ready for review November 22, 2024 06:35
@mkhludnev mkhludnev changed the title DRAFT: SortedSet DV Multi Range query SortedSet DV Multi Range query Nov 22, 2024

github-actions bot commented Dec 7, 2024

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the [email protected] list. Thank you for your contribution!

@github-actions github-actions bot added the Stale label Dec 7, 2024
Contributor

@gsmiller gsmiller left a comment

Thanks @mkhludnev ! I haven't fully combed through this change but did an initial pass and wanted to share some thoughts since I meant to look at this a while ago and it slipped through the cracks. I appreciate you taking this on!

return empty(); // no bitset yet, give up
}
minOrd = termsEnum.ord();
if (skipper != null) {
Contributor

I'm not really sure I understand how the skipper helps us here. Any ordinal you get from the terms enum is going to be present somewhere in the segment, so by definition, I don't see how the global min or max from the skipper could provide tighter bounds. Maybe I'm missing something?

Member Author

I need to think this through. Honestly, I have never seen DocValuesSkipper before and have only a vague idea of how to test it.

Member Author

if (ord >= finalMinOrd
&& ((finalMatchesAbove < values.getValueCount()
&& ord >= finalMatchesAbove)
|| finalMatchingOrdsShifted.get(ord - finalMinOrd))) {
Contributor

This approach of densely memorizing all ordinals that fall within the ranges has me a bit worried. This could potentially be quite a large bitset. I wonder if we could instead just keep track of all the ranges actually represented in the query and then check against them for each doc ord? Like what we do in SortedSetDocValuesRangeQuery, but keeping track of multiple min/max tuples to check against instead of only one? Is there a reason you didn't go with that approach and feel the dense bitset is necessary here?

Member Author

This was my first thought too. However, I then found DocValuesRewriteMethod, which uses a set over ordinals. So, if it's feasible for a set of point values, why not do the same for a set of ranges?
Perhaps it needs to set too many 1s for ranges that cover all ords? I don't know.
If we think about Numeric or Binary DVs, there are no ords, and we would have to search in a tree of ranges. Also, in my mind this is intended for queries with more than 1000 ranges, because smaller sets can be handled via a plain BooleanQuery with SHOULD clauses.
Anyway, I'm really open to suggestions.
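
For readers following the thread, the dense approach discussed here can be illustrated with a minimal, standalone sketch. It is not the PR's code: the class name `DenseOrdRanges`, the inclusive `{min, max}` pair layout, and the use of `java.util.BitSet` instead of a Lucene bitset are all assumptions made for illustration.

```java
import java.util.BitSet;

/** Hypothetical illustration only; names and layout are not taken from the PR. */
final class DenseOrdRanges {
  private final long minOrd;  // smallest ordinal covered by any range
  private final BitSet bits;  // bit i set means ordinal (minOrd + i) matches

  /** Each element of ranges is an inclusive pair {minOrd, maxOrd}. */
  DenseOrdRanges(long[][] ranges) {
    if (ranges.length == 0) {
      throw new IllegalArgumentException("at least one range is required");
    }
    long min = Long.MAX_VALUE;
    long max = Long.MIN_VALUE;
    for (long[] r : ranges) {
      min = Math.min(min, r[0]);
      max = Math.max(max, r[1]);
    }
    this.minOrd = min;
    // Memory grows with the ordinal span (max - min), not with the number of ranges.
    this.bits = new BitSet(Math.toIntExact(max - min + 1));
    for (long[] r : ranges) {
      // BitSet.set(from, to) treats 'to' as exclusive, hence the + 1.
      bits.set(Math.toIntExact(r[0] - min), Math.toIntExact(r[1] - min + 1));
    }
  }

  /** Constant-time membership check for a single document ordinal. */
  boolean matches(long ord) {
    long shifted = ord - minOrd;
    return shifted >= 0 && shifted <= Integer.MAX_VALUE && bits.get((int) shifted);
  }
}
```

The trade-off raised in the review shows up directly in the sketch: the per-ordinal check is a single bit lookup, while construction cost and memory scale with how wide the ranges are.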

Contributor

Right. As you point out, there are definitely other situations where we use this bitset approach. My only thought here is that we should be able to use less memory by keeping track of the range min/max values and not needing the dense representation (this isn't really as feasible with DocValuesRewriteMethod since it's encoding a set of unique terms that aren't particularly likely to be contiguous ranges). If we track ranges, we can take advantage of them being sorted and the doc-level ordinals also being sorted. A binary search approach might be a good fit, and we can keep pushing up the lower bound of the search based on the last range that matched (since doc-level ordinals are sorted). I suspect this would still be less efficient than the dense bitset approach, but I bet it wouldn't be that big of a difference.

Without a way to benchmark and test though, it's just a bit of a guess. I'm OK with keeping it the way you have it for now if that's your preference. We can always benchmark alternatives in the future and evolve this if we want. Should we do that for now?
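
The range-tracking alternative described in this comment could look roughly like the sketch below. It is a guess at the idea, not code from the PR or from Lucene: `SortedRangeMatcher`, the parallel `mins`/`maxs` arrays, and the assumption that the ranges are already sorted and disjoint are all hypothetical. It relies on the per-document ordinals arriving in ascending order, which lets the binary search's lower bound only ever move forward.

```java
/** Hypothetical illustration only; assumes ranges are sorted and non-overlapping. */
final class SortedRangeMatcher {
  private final long[] mins; // inclusive lower bounds, ascending
  private final long[] maxs; // inclusive upper bounds, ascending

  SortedRangeMatcher(long[] mins, long[] maxs) {
    if (mins.length != maxs.length) {
      throw new IllegalArgumentException("mins and maxs must have the same length");
    }
    this.mins = mins;
    this.maxs = maxs;
  }

  /** docOrds must be sorted ascending. Returns true if any ordinal falls in a range. */
  boolean matchesAny(long[] docOrds) {
    int from = 0; // ranges before this index end below every remaining ordinal
    for (long ord : docOrds) {
      // Binary search for the first range at index >= from whose max is >= ord.
      int lo = from;
      int hi = maxs.length;
      while (lo < hi) {
        int mid = (lo + hi) >>> 1;
        if (maxs[mid] < ord) {
          lo = mid + 1;
        } else {
          hi = mid;
        }
      }
      if (lo == maxs.length) {
        return false; // all remaining ranges end below ord, and later ords are larger
      }
      if (mins[lo] <= ord) {
        return true; // ord lies inside range lo
      }
      from = lo; // push the lower bound up for the next, larger ordinal
    }
    return false;
  }
}
```

Memory here is proportional to the number of ranges rather than the ordinal span, at the cost of a logarithmic search per document ordinal.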

Member Author

I mostly agree. Thanks for sharing your considerations. Let me clarify your points:

isn't really as feasible with DocValuesRewriteMethod

Why can't this "set" query use a tree of unique values with log(n) access complexity? These queries might be quite close to each other. In the edge case, the ranges are quite narrow, with 1 or 2 ordinals each, so SsDvMultyRange would be the same as TermsInSet. If the ranges are wide, it definitely spends more time setting 1s in the bitset, but maybe that's not lethal thanks to SIMD intrinsics (who knows), with a gain in search time per doc check.
I agree this is a field for benchmarking and for picking optimizations dynamically (later). Also, something like a KD-tree might be applicable here.

Member Author

we can take advantage of them being sorted and the doc-level ordinals also being sorted.

Frankly speaking, I suppose single values are more common than long sets of values, so I don't think it would bring much gain for many users.

Member Author

less memory by keeping track of the range min/max values and not needing the dense representation

I just realized that if we merge overlapping ranges (which turns out to be not so easy, but feasible), the remaining non-overlapping ranges always have a smaller and a greater neighbor, which means we can simply search the ranges by ordinal in a TreeSet.
Is it worth exploring instead of the dense bitset?
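
A minimal sketch of this merge-then-lookup idea follows, under stated assumptions. It uses a `TreeMap` keyed by range minimum (rather than the `TreeSet` mentioned in the comment) so that `floorEntry` can return the candidate range directly; `MergedRangeLookup` and its methods are hypothetical names, and nothing here is taken from the PR.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical illustration only; each range is an inclusive pair {min, max}. */
final class MergedRangeLookup {
  // Maps the minimum of each merged, non-overlapping range to its maximum.
  private final TreeMap<Long, Long> maxByMin = new TreeMap<>();

  MergedRangeLookup(List<long[]> ranges) {
    List<long[]> sorted = new ArrayList<>(ranges);
    sorted.sort((a, b) -> Long.compare(a[0], b[0]));
    long[] cur = null;
    for (long[] r : sorted) {
      if (cur != null && r[0] <= cur[1]) {
        cur[1] = Math.max(cur[1], r[1]); // overlaps the current range: extend it
      } else {
        if (cur != null) {
          maxByMin.put(cur[0], cur[1]);
        }
        cur = r.clone(); // start a new merged range without mutating the input
      }
    }
    if (cur != null) {
      maxByMin.put(cur[0], cur[1]);
    }
  }

  /** O(log n) check: find the range with the greatest min <= ord, then test its max. */
  boolean matches(long ord) {
    Map.Entry<Long, Long> e = maxByMin.floorEntry(ord);
    return e != null && ord <= e.getValue();
  }

  /** Number of ranges left after merging. */
  int size() {
    return maxByMin.size();
  }
}
```

Because the merged ranges are disjoint, the range with the greatest minimum not exceeding the ordinal is the only one that can contain it, so a single `floorEntry` call is enough.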

@mkhludnev
Member Author

@gsmiller thanks for reviewing. Looking into!

@github-actions github-actions bot removed the Stale label Dec 20, 2024
* <li>field values have fixed width
* </ul>
*/
public static class SordedSetFieldValueFixedBuilder
Member Author

@gsmiller what do you think about such a verbose way of constructing the query? Is it worth piggybacking on the nice java.util.function interfaces?
