Replace Marshal by Granular_marshal in ocaml-index #1875

Open · wants to merge 3 commits into main
Conversation

@Lucccyo commented Dec 13, 2024

The current implementation of ocaml-index uses Marshal to store its data on disk.
Searching for occurrences in large projects is time-consuming because the search must load every data structure from disk before it can run.

This pull request replaces Marshal with a granular version to make ocaml-index more efficient at reading.
It comes with granular implementations of the set and map data structures, based on the Stdlib implementations.
During a search operation, the program lazily loads only the required parts of the ocaml-index.
This works because the heavy nodes of granular_map and granular_set sit behind link indirections,
which introduce serialization boundaries and allow Marshal to delay the deserialization of their children.
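To make the idea concrete, here is a minimal sketch of a link indirection (the type and function names below are illustrative assumptions, not Granular_marshal's actual interface):

```ocaml
(* A value behind a [link] is either already in memory or still on disk.
   Marshalling a node stores only a small on-disk pointer for each child,
   so the subtree is deserialized lazily on first access. *)
type 'a link =
  | Loaded of 'a                 (* child already in memory *)
  | On_disk of { offset : int }  (* serialization boundary *)

type 'a tree =
  | Leaf
  | Node of { l : 'a tree link; v : 'a; r : 'a tree link }

(* Hypothetical fetch: [read_at] deserializes the value stored at a file
   offset, and is only called when the child is actually needed. *)
let fetch (read_at : int -> 'a) (lnk : 'a link) : 'a =
  match lnk with
  | Loaded v -> v
  | On_disk { offset } -> read_at offset
```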

@art-w (Contributor) left a comment:

This looks very good, thank you! I left some comments for your consideration :)

I'm slightly worried by the overhead of reachable_words in my benchmarks, but a small change should reduce its cost to the point of being negligible... although it was hard to explain the optimization without giving away the solution, so let me know if it's unclear!

src/index-format/granular_map.ml (resolved)
val elements: t -> elt list
val schema: Granular_marshal.iter ->
(elt -> unit) -> s Granular_marshal.link -> unit
end
Contributor:

Running dune build @fmt should fix the inconsistent formatting :)


let index_schema (iter: Granular_marshal.iter) index =
Uid_map.schema iter (fun _ v -> lidset_schema iter v) index.defs;
Uid_map.schema iter (fun _ v -> lidset_schema iter v) index.approximated
Contributor:

Suggested change
-Uid_map.schema iter (fun _ v -> lidset_schema iter v) index.approximated
+Uid_map.schema iter (fun iter _ v -> lidset_schema iter v) index.approximated

While the code runs correctly today because the iter value is constant across recursive calls, it is slightly wrong and could cause problems once we make changes to the granular marshal algorithms (e.g. the optimization suggested for reachable_words). It would be safer for Uid_map.schema to pass the latest iter value so that lidset_schema can use the right one.
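For reference, a sketch of what the threaded signature could look like (the module and type names are assumptions, mirroring the set signature quoted above):

```ocaml
(* Sketch only: the exact types are assumptions. The point is that the
   per-binding callback receives the iter given to [schema], rather than
   capturing an outer one that may be stale. *)
module type Granular_map = sig
  type key
  type 'a t
  val schema :
    Granular_marshal.iter ->
    (Granular_marshal.iter -> key -> 'a -> unit) ->
    'a t Granular_marshal.link ->
    unit
end
```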

Author:

Does commit 453a936 answer what you mean?

Contributor:

Yes, thank you! I think the same change could be made for sets :)

@@ -85,7 +87,7 @@ struct
       true
     end
   else begin
-    false
+    false (* TODO: should dispose + remove ? *)
Contributor:

TODO reminder ^^ Indeed, could we dispose of the expired file and remove it from the Hashtbl?

Author:

I agree that we can dispose and remove, but was that a question?
Should we do it here? I think the check function is not supposed to update the Hashtbl, but I might be wrong.

Contributor:

Right, it would also be possible to just remove the TODO :) Otherwise I think it's fine for check to clean up the hashtbl once it discovers expired elements, since they won't ever be useful.
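A sketch of what that cleanup could look like (the still_valid and dispose parameters are hypothetical stand-ins, not the PR's actual code):

```ocaml
(* Hypothetical: [check] returns whether the cached file is still usable,
   and evicts expired entries as it discovers them. *)
let check tbl key ~still_valid ~dispose =
  match Hashtbl.find_opt tbl key with
  | None -> false
  | Some file when still_valid file -> true
  | Some file ->
      dispose file;            (* release the expired resource *)
      Hashtbl.remove tbl key;  (* an expired entry will never be useful *)
      false
```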

and write_child : type a. out_channel -> a link -> a schema -> a -> unit =
fun fd lnk schema v ->
schema iter v;
if Stdlib.Obj.(reachable_words (repr v)) > 500 then (
Contributor:

After running some benchmarks, it looks like reachable_words is essential to keeping files small, but also very expensive. Since we compute the size of every child link before computing the size of its parent, we can speed up the parent's reachable_words computation by avoiding a re-traversal of its children (if we mark them as visited and keep track of their total size elsewhere).

(I can share my experiments if needed, but it requires fixing the schema definitions to use the correct iter, to precisely track which link is a child of which parent node instead of using the same iter everywhere.)

Author:

I see the point, and I agree. Would it be efficient to keep a side hashtable (or another data structure) storing each child's weight after serialization, so that when the parent is processed we only need to retrieve the weights and add them up?

Contributor:

The hashtbl would also enable us to de-duplicate, but I tested this a bit and it's very expensive... I was hoping that we could do something simpler:

  • Now that you have fixed the map schema, the call to schema iter v will call iter.yield on every child link of v. This will recursively compute the size of every child link (and maybe marshal the big ones). Instead of forgetting the Small children sizes, we could sum them into a reference so that their parent knows the total size of its children.
  • To speed up the parent reachable_words computation, we need to stop it from recursing over the children (since we already accumulated their total size). This can be done by "hiding" the Small children, temporarily replacing them with a Placeholder value; once the parent size has been computed, each Placeholder is restored back to its original Small value (see the sketch below).
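A rough sketch of the hide-and-restore trick (every name below is hypothetical; in the PR the children sizes would be accumulated during the recursive schema traversal rather than on the spot):

```ocaml
(* Measure a parent without re-traversing its children:
   1. sum the children sizes once,
   2. hide the children behind a constant placeholder,
   3. measure the parent alone (reachable_words stops at the placeholder),
   4. restore the children and add the accumulated sum. *)
let parent_size (parent : Obj.t) (children : Obj.t ref list) : int =
  let children_size =
    List.fold_left (fun acc c -> acc + Obj.reachable_words !c) 0 children
  in
  let saved = List.map (fun c -> !c) children in
  let placeholder = Obj.repr () in
  List.iter (fun c -> c := placeholder) children;
  let size = Obj.reachable_words parent in
  List.iter2 (fun c v -> c := v) children saved;
  size + children_size
```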

@art-w (Contributor) commented Dec 20, 2024

Thanks for the fixes! I rebased your PR + cleaned up formatting at https://github.com/art-w/merlin/tree/lucccyo-marshal with small changes to address my remaining comments :) Happy holidays everyone!
