Replace Marshal by Granular_marshal in ocaml-index #1875
base: main
Conversation
This looks very good, thank you! I left some comments for your consideration :)
I'm slightly worried by the overhead of reachable_words in my benchmarks, but a small change should reduce its cost to negligible... although it was hard to explain the optimization without giving away the solution, so let me know if it's unclear!
val elements: t -> elt list
val schema: Granular_marshal.iter ->
  (elt -> unit) -> s Granular_marshal.link -> unit
end
Running dune build @fmt should fix the inconsistent formatting :)
src/index-format/index_format.ml (Outdated)
let index_schema (iter: Granular_marshal.iter) index =
  Uid_map.schema iter (fun _ v -> lidset_schema iter v) index.defs;
  Uid_map.schema iter (fun _ v -> lidset_schema iter v) index.approximated
- Uid_map.schema iter (fun _ v -> lidset_schema iter v) index.approximated
+ Uid_map.schema iter (fun iter _ v -> lidset_schema iter v) index.approximated
While the code runs correctly today because the iter value is constant across recursive calls, it's slightly wrong and could cause problems once we make changes to the granular marshal algorithms (e.g. the optimization suggested for reachable_words). It would be safer for Uid_map.schema to pass the latest iter value so that lidset_schema can use the right one.
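The suggested change can be illustrated with a toy model (the iter record and its yield field below are illustrative assumptions, not the actual Granular_marshal API): when the map schema passes its current iter on to the callback, nested schemas keep working even if the library later substitutes a different iterator for recursive calls.

```ocaml
(* Toy model of iterator threading. *)
type iter = { yield : string -> unit }

(* The element schema uses whatever iter it is handed: *)
let elt_schema iter name = iter.yield name

(* A map schema in the style of the suggestion above: the callback
   receives the iter in effect for *this* call, so nested schemas
   stay in sync if the library swaps in a per-child iterator. *)
let map_schema iter f bindings =
  List.iter (fun (k, v) -> f iter k v) bindings
```

A caller then mirrors the suggested diff by writing `(fun iter _k v -> elt_schema iter v)` rather than capturing the outer iterator in the closure.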
Does commit 453a936 answer what you mean?
Yes, thank you! I think the same change could be made for sets :)
@@ -85,7 +87,7 @@ struct
        true
      end
    else begin
-      false
+      false (* TODO: should dispose + remove ? *)
TODO reminder ^^ Indeed, we can dispose of the expired file and remove it from the Hashtbl?
I agree that we can dispose and remove, but was it a question? Should we do that here? I think the function check is not supposed to update the Hashtbl, but I might be wrong.
Right, it would also be possible to just remove the TODO :) Otherwise I think it's fine for check to clean up the hashtbl once it discovers expired elements, since they won't ever be useful.
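A minimal sketch of that cleanup, assuming a table of cache entries with a staleness test (the entry type, its mtime field, and the dispose callback are illustrative stand-ins, not the PR's actual code):

```ocaml
(* Hypothetical cache entry: a timestamp plus a resource-release hook. *)
type entry = { mtime : float; dispose : unit -> unit }

(* [check] returns whether the cached entry is still fresh. When it
   discovers an expired entry, it disposes of it and removes it from
   the table, since an expired entry will never be useful again. *)
let check table file ~current_mtime =
  match Hashtbl.find_opt table file with
  | None -> false
  | Some e ->
    if e.mtime = current_mtime then true
    else begin
      e.dispose ();              (* release the associated resources *)
      Hashtbl.remove table file; (* drop the expired entry *)
      false
    end
```

The design choice debated above is whether `check` should stay a pure query; mutating the table here is safe precisely because the removed entries can never become valid again.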
and write_child : type a. out_channel -> a link -> a schema -> a -> unit =
 fun fd lnk schema v ->
  schema iter v;
  if Stdlib.Obj.(reachable_words (repr v)) > 500 then (
After running some benchmarks, it looks like reachable_words is essential to keep files small, but also very expensive. Since we compute the size of every child link before computing the size of its parent, we can speed up the parent's reachable_words computation by avoiding a re-traversal of its children (if we mark them as visited and keep track of their total size elsewhere).
(I can share my experiments if needed, but it requires fixing the schema definitions to use the correct iter, to precisely track which link is a child of which parent node instead of using the same iter everywhere.)
I got the point, and I agree. Would it be efficient to have a side hashtable or another data structure to store the child weights after serialization (or not), and to retrieve them so that only an addition remains to be done when the parent is processed?
The hashtbl would also enable us to de-duplicate, but I tested this a bit and it's very expensive... I was hoping that we could do something simpler:
- Now that you have fixed the map schema, the call to schema iter v will call iter.yield on every child link of v. This will recursively compute the size of every child link (and maybe marshal the big ones). Instead of forgetting the Small children sizes, we could sum them into a reference such that their parent knows their total children size.
- To speed up the parent reachable_words computation, we'll need to avoid it recursing over the children (since we already accumulated their total size): this can be done if we can "hide" the Small children, by temporarily replacing them with a Placeholder value. Once the parent size has been computed, we'll need to restore the Placeholders back to their original Small values.
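The two steps above could be sketched as follows on a toy tree (the child type, Placeholder constructor, and parent_size function are illustrative assumptions, not the real link/schema machinery):

```ocaml
(* Toy children: either real data or a placeholder hiding it. *)
type child = Data of int array | Placeholder

(* Total heap size in words reachable from a value. *)
let size v = Stdlib.Obj.(reachable_words (repr v))

let parent_size (children : child array) payload =
  (* 1. Size every child first, accumulating the total instead of
        forgetting it. *)
  let children_size =
    Array.fold_left (fun acc c -> acc + size c) 0 children
  in
  (* 2. Hide the children behind placeholders so the parent's own
        reachable_words traversal does not revisit them. *)
  let saved = Array.copy children in
  Array.fill children 0 (Array.length children) Placeholder;
  let own = size (children, payload) in
  (* 3. Restore the placeholders back to the original children. *)
  Array.blit saved 0 children 0 (Array.length children);
  own + children_size
```

Assuming no sharing between children, the accumulated total plus the parent's own size equals the naive reachable_words of the whole tree, while each block is only traversed once.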
Thanks for the fixes! I rebased your PR + cleaned up formatting at https://github.com/art-w/merlin/tree/lucccyo-marshal with small changes to address my remaining comments :) Happy holidays everyone!
The current implementation of ocaml-index uses Marshal to store the data on disk. Searching for occurrences on massive projects is time-consuming because the search loads all the data structures from the disk to perform the search.

This Pull Request aims to replace Marshal with a granular version to make the ocaml-index more efficient at reading. It comes with two granular implementations of the data structures set and map, based on the Stdlib implementations.

During a search operation, the program lazily loads only the required parts of the ocaml-index. It works because the heavy nodes of the granular_map and granular_set have link indirections, introducing serialization boundaries, which allows Marshal to delay the deserialization of their children.