Replace Marshal by Granular_marshal in ocaml-index #1875

Open · wants to merge 3 commits into main
Conversation

@Lucccyo commented Dec 13, 2024

The current implementation of ocaml-index uses Marshal to store its data on disk.
Searching for occurrences in large projects is time-consuming because the search must load every data structure from disk before it can run.

This pull request replaces Marshal with a granular version to make ocaml-index more efficient at reading.
It comes with granular implementations of the set and map data structures, based on the Stdlib implementations.
During a search operation, the program lazily loads only the required parts of the ocaml-index.
This works because the heavy nodes of granular_map and granular_set sit behind link indirections,
which introduce serialization boundaries and allow Marshal to delay the deserialization of their children.
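To make the idea concrete, here is a minimal sketch of a link indirection (the type and function names below are illustrative assumptions, not Granular_marshal's actual interface):

```ocaml
(* A value behind a [link] is either already in memory or still on disk.
   Marshalling a node stores only a small on-disk pointer for each child,
   so the subtree is deserialized lazily on first access. *)
type 'a link =
  | Loaded of 'a                 (* child already in memory *)
  | On_disk of { offset : int }  (* serialization boundary *)

type 'a tree =
  | Leaf
  | Node of { l : 'a tree link; v : 'a; r : 'a tree link }

(* Hypothetical fetch: [read_at] deserializes the value stored at a file
   offset, and is only called when the child is actually needed. *)
let fetch (read_at : int -> 'a) (lnk : 'a link) : 'a =
  match lnk with
  | Loaded v -> v
  | On_disk { offset } -> read_at offset
```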

@art-w (Contributor) left a comment:

This looks very good, thank you! I left some comments for your consideration :)

I'm slightly worried by the overhead of reachable_words in my benchmarks, but a small change should reduce its cost to the point of being negligible... although it was hard to explain the optimization without giving away the solution, so let me know if it's unclear!

src/index-format/granular_map.ml (resolved)
val elements: t -> elt list
val schema: Granular_marshal.iter ->
(elt -> unit) -> s Granular_marshal.link -> unit
end
Contributor:

Running dune build @fmt should fix the inconsistent formatting :)


let index_schema (iter: Granular_marshal.iter) index =
Uid_map.schema iter (fun _ v -> lidset_schema iter v) index.defs;
Uid_map.schema iter (fun _ v -> lidset_schema iter v) index.approximated
Contributor:

Suggested change
-Uid_map.schema iter (fun _ v -> lidset_schema iter v) index.approximated
+Uid_map.schema iter (fun iter _ v -> lidset_schema iter v) index.approximated

While the code runs correctly today because the iter value is constant across recursive calls, it is slightly wrong and could cause problems once we make changes to the granular marshal algorithms (e.g. the optimization suggested for reachable_words). It would be safer for Uid_map.schema to pass the latest iter value so that lidset_schema can use the right one.
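For reference, a sketch of what the threaded signature could look like (the module and type names are assumptions, mirroring the set signature quoted above):

```ocaml
(* Sketch only: the exact types are assumptions. The point is that the
   per-binding callback receives the iter given to [schema], rather than
   capturing an outer one that may be stale. *)
module type Granular_map = sig
  type key
  type 'a t
  val schema :
    Granular_marshal.iter ->
    (Granular_marshal.iter -> key -> 'a -> unit) ->
    'a t Granular_marshal.link ->
    unit
end
```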

Author:

Does commit 453a936 answer what you mean?

Contributor:

Yes, thank you! I think the same change could be made for sets :)

@@ -85,7 +87,7 @@ struct
       true
     end
   else begin
-    false
+    false (* TODO: should dispose + remove ? *)
Contributor:

TODO reminder ^^ Indeed, could we dispose of the expired file and remove it from the Hashtbl?

Author:

I agree that we can dispose and remove, but was that a question?
Should we do it here? I think the check function is not supposed to update the Hashtbl, but I might be wrong.

Contributor:

Right, it would also be possible to just remove the TODO :) Otherwise I think it's fine for check to clean up the hashtbl once it discovers expired elements, since they won't ever be useful.
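A sketch of what that cleanup could look like (the still_valid and dispose parameters are hypothetical stand-ins, not the PR's actual code):

```ocaml
(* Hypothetical: [check] returns whether the cached file is still usable,
   and evicts expired entries as it discovers them. *)
let check tbl key ~still_valid ~dispose =
  match Hashtbl.find_opt tbl key with
  | None -> false
  | Some file when still_valid file -> true
  | Some file ->
      dispose file;            (* release the expired resource *)
      Hashtbl.remove tbl key;  (* an expired entry will never be useful *)
      false
```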

and write_child : type a. out_channel -> a link -> a schema -> a -> unit =
fun fd lnk schema v ->
schema iter v;
if Stdlib.Obj.(reachable_words (repr v)) > 500 then (
Contributor:

After running some benchmarks, it looks like reachable_words is essential to keeping files small, but also very expensive. Since we compute the size of every child link before computing the size of its parent, we can speed up the parent's reachable_words computation by avoiding a re-traversal of its children (if we mark them as visited and keep track of their total size elsewhere).

(I can share my experiments if needed, but it requires fixing the schema definitions to use the correct iter, to precisely track which link is a child of which parent node instead of using the same iter everywhere.)

Author:

I see the point, and I agree. Would it be efficient to keep a side hashtable (or another data structure) storing each child's weight after serialization, so that when the parent is processed we only need to retrieve the weights and add them up?

Contributor:

The hashtbl would also enable us to de-duplicate, but I tested this a bit and it's very expensive... I was hoping that we could do something simpler:

  • Now that you have fixed the map schema, the call to schema iter v will call iter.yield on every child link of v. This will recursively compute the size of every child link (and maybe marshal the big ones). Instead of forgetting the Small children sizes, we could sum them into a reference so that their parent knows the total size of its children.
  • To speed up the parent reachable_words computation, we need to stop it from recursing over the children (since we already accumulated their total size). This can be done by "hiding" the Small children, temporarily replacing them with a Placeholder value; once the parent size has been computed, each Placeholder is restored back to its original Small value (see the sketch below).
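A rough sketch of the hide-and-restore trick (every name below is hypothetical; in the PR the children sizes would be accumulated during the recursive schema traversal rather than on the spot):

```ocaml
(* Measure a parent without re-traversing its children:
   1. sum the children sizes once,
   2. hide the children behind a constant placeholder,
   3. measure the parent alone (reachable_words stops at the placeholder),
   4. restore the children and add the accumulated sum. *)
let parent_size (parent : Obj.t) (children : Obj.t ref list) : int =
  let children_size =
    List.fold_left (fun acc c -> acc + Obj.reachable_words !c) 0 children
  in
  let saved = List.map (fun c -> !c) children in
  let placeholder = Obj.repr () in
  List.iter (fun c -> c := placeholder) children;
  let size = Obj.reachable_words parent in
  List.iter2 (fun c v -> c := v) children saved;
  size + children_size
```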

@art-w (Contributor) commented Dec 20, 2024

Thanks for the fixes! I rebased your PR + cleaned up formatting at https://github.com/art-w/merlin/tree/lucccyo-marshal with small changes to address my remaining comments :) Happy holidays everyone!
