-
Notifications
You must be signed in to change notification settings - Fork 33
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Boxed roots #22
base: master
Are you sure you want to change the base?
Boxed roots #22
Conversation
I believe this RFC can interest you: @stedolan, @kayceesrk, @sadiqj, @Sudha247, @damiendoligez (otherwise, please apologize me for the spam). |
(also @let-def obviously, as the author of the related work on CAMLroots) |
I find this proposal very nice, thanks @gadmm (and @gasche). First: performance aside, could the boxroot API be implemented using just a doubly linked list for bookkeeping? The description makes it sound as if it’s more complex than the current implementation of roots. Second: do you think a weak root could be added in the future? Managing complex data structures with global roots in C can cause subtle memory leaks. Using weak roots and ensuring that all C objects are dominated by OCaml ones solves this problem (the GC can now see the precise dependencies between objects, without going through roots). In other words, the root API fulfils two purposes: referencing an OCaml object from C and keeping alive an OCaml object from C. Weak roots would allow doing only the former. I’m not asking for any change, but it would be nice to keep this use case in mind during future work on the proposal, to ease a possible integration in the future. Thanks. |
Yes -- we would use a generational design two linked lists, one that may contain young objects and one that may only contain old objects. I could try to implement this to get performance numbers. If you don't care about performance, another basic implementation, that is easy to have as a user-side library, is to simply malloc each root, and register the malloced address as a global generational root. In our benchmarks this performs exactly as well (or badly) as generational global roots, when both can be used, the cost due to the extra malloc is minimal.
One naive way to implement that is to allocate a boxroot that contains a weak pointer. The question is whether a more merged solution can avoid doubling the indirections / book-keeping costs. I think yes: in addition to our young/old distinction, we could do a strong/weak distinction, where strong roots are scanned by the GC, but weak roots are only used to update moved values, just like the existing weak-pointer implementation. I don't know if there are enough hooks to let us implement this as a user library (as our current prototype), this sounds like something that would be much easier to do inside the runtime, but I think it would not be too hard to add to the implementation, and it would not add a noticeable overhead. (If I understand correctly, currently weak pointers are only allocated in the major heap, and we would follow that strategy as well.) Note: another question I had in the background was whether some of the quality-of-life improvements proposed by CAMLroots could be integrated into boxroots as well. Currently our main use-case is Rust FFI, where we can already enforce safety by judiciously using the lifetime discipline, we have not explored how much of the CAMLroots-style bug-avoidance we get with the boxroot API, and whether we can do better with dynamic checking. |
To add to Gabriel's answer (one can already root a malloced value globally), this is not correct since boxroot will also lift some expressiveness constraints regarding cross-thread deallocation, which might be harder to do efficiently with current global roots or doubly-linked lists. (As explained in the RFC, cross-thread deallocation covers two use-cases: 1) being able to drop a root from a thread that does not hold the lock with systhreads, which would be hard to ensure with a Rust-style type system and might be impractical from an ownership perspective, and 2) being able to pass ownership of roots between domains in multicore.) To play along with your train of thoughts, if efficiency was completely out of the picture, this could be done simply by protecting global roots (or a doubly linked list) with a mutex. So efficiency is indeed a crucial aspect to the proposal, but for an apples-to-apples comparison in terms of complexity one must ask how cross-thread deallocation would be implemented with a would-be simpler proposal. (The prototype is not currently thread-safe, but it is unlikely to be much more complicated once it is.)
Is there something bad with rooting a weak pointer? |
The proposed API looks good to me! It's quite similar to the caml_root API in multicore, which had similar motivations. I have two minor questions / suggestions:
Regarding the implementation, have you considered reusing the remembered set? ( |
If I understand correctly, the proposal is to use a non-generational allocator (create new boxroots, either old or new, in a single container (pool, doubly-linked list, etc.), and in addition insert young boxroots to the remembered set? I remember we considered it when a For the latter usage, one downside I see is that roots in the remembered set / reftable are promoted unconditionally, which would be bad for short-lived boxroots (typically if used instead of local roots). And I don't see how to track the position of a young-boxroot in the reftable to "remove" it on boxroot deletion. So I have the impression that our generational approach should be faster for short-lived young boxroots, which are important. Did you have something else in mind? |
Performance and convenience, which are nice to have and in line with this PR. I did what you suggest and would still prefer some assistance from the API. (Actually with a dynamically resized weak array to pool the allocations, quite like this PR). Let me try to emphasise why weak roots are worth supporting. When used in lieu of local roots, the current proposal offers a sensible alternative that enables a safe and efficient Rust API. However the story is not the same when working with data structures. As soon as a finalizer (either from a custom object or from a call to Gc.finalize) is used to release a root, collection behaves differently. The indirection of going through a root hides the dependencies from the GC. Adding a scanning function to custom objects would be even better. Stable pointers can address the issue too: as long as the caller ensures that the stable pointer outlive the external data structure, both can interoperate without affecting the performance profile. Any of these solutions would simplify binding of external libraries that manipulate structured data, and weak roots seems to be the simplest way to achieve that. (Personally, I encountered these issues while working with C libraries that construct graph structures: chipmunk for 2d physic simulation and a few UI toolkits). I hope I am not going too off-topic, I am open to more discussions on other channels. |
I did have something else in mind! Here's a slightly more fleshed-out proposal for the sort of implementation I was imagining: Boxroots are word-sized, containing either a user value (if allocated) or a pointer to the next free boxroot (if deallocated). Free boxroots then form a singly-linked list, used as a stack. The operations on major values are:
As well as the freelist described above, use a second freelist for boxroots that are registered in the remembered set:
This way, short-lived boxroots are just pushed and popped onto either freelist, which is fast. Minor GC will traverse each boxroot that held a minor pointer since the last minor GC. (This is a no-op if the boxroot no longer contains a minor pointer). |
@stedolan this sounds similar to our implementation, but a bit different. (I think our "pools" are your "chunks", we have "minor" pools that contain a mix of young and old values, and "major" pools that contain only old values. When all minor pools are full, we "demote" a major pool with low occupancy. Instead you propose to allocate the minor root in a major pool, using the remembered set. My intuition is that it would be slower, opposite to yours it seems.) Could we discuss these implementation details on the implementation repository directly? Feel free to open an issue at https://gitlab.com/ocaml-rust/ocaml-boxroot/-/issues , or if you give me your Gitlab handle I can open an issue and cc: you.
This is related to the design issues underlying the RFC, so I'm happy to discuss it here. Interesting discussions are a better situation than standard RFCs with no discussion.
If I understand correctly, your idea would be that, instead of using global roots, custom values capturing OCaml values could provide a callback to scan their inner references to OCaml values on the heap. Sounds interesting, and possibly simpler than a workaround using weak roots. Would you like to submit an RFC? :p (I would be happy to propose to experiment with weak roots, but I don't see how to do it outside the runtime, and also the runtime ephemeron runtime support is noticeably more complex than the old weak-pointers support, so it is a bit less tempting.) |
A read barrier as in the concurrent multicore GC would not change the signature of
While it is functionally equivalent to deleting and creating, it does offer the guarantee that it always modifies the boxroot in place except at most once between two minor collections, which is why it is close to modifying the boxroot in place (also you do not need to test whether it is NULL after modification unlike a creation function). I assume in your suggestion, |
I will try to rephrase the issue you are describing. Roots naturally tend to have a reference-counting flavour to them in the sense that they lead to leaks if you let them create cycles of dependencies, which can happen naturally by exchanging values back and forth across the FFI. From this point of view, it makes sense to offer "weak" variants to break the cycles by hand, that would have similar performance characteristics.
Again to be sure I understand:
I think this is on-topic, especially as it gives an example of something that would be easier to do if |
Thanks, this is a very interesting idea. A couple points:
For the performance in light workloads (only a couple of roots allocated between two minor collections), we were already thinking about using the remembered set for the first few young roots after a minor collection before switching to young pool allocation, but not in a way that had a chance to scale to heavier workloads. Due to the simpler implementation it is definitely something I would like to try, and only the performance figures can decide. |
Yes, that's precisely that.
You are right that the three techniques achieve different things, sorry for conflating them in my previous message and thanks for the clarification. Of all three, i think weak roots & custom scanning would help the most. I indeed assume "solved" the problem of tracking values contained in a foreign data structure (there is no generic answer to that question, it is very library specific; and, worse, non-rust libraries rarely specify which value is retained in which scope). A large part of my "Cuite" (Qt bindings) experiment is doing that: (pooled) weak roots to refer to OCaml values from C-side, and OCaml collections that mirror the foreign dependency graph to enable efficient Garbage Collection. With auto-generated code and hooks in Qt runtime to automate that. The goal was to make foreign data structures behave like OCaml ones. This was quite difficult to achieve, so I think it would be worth providing appropriate API to help bindings implementations. |
The motivation for the proposal and the performance looks good to me. From a multicore perspective:
The one bit of the proposal I worry about in a multi-threaded environment is what will happen if one thread is updating a boxroot with It looks like the authors have thought about the multi-threading needed, have they tried things out against the multicore runtime? Multicore is now up at 4.12 and we have a variant |
An update about ocaml-boxroot as I worked on it these past few days. Let me also answer to a few remarks (I had this on my to-do list since last year). @ctk21 and others: I have improved the support for hooks in multicore at ocaml/ocaml#10965. Following this, I have implemented rudimentary ocaml-boxroot support for OCaml multicore (it still uses a global lock). We would like to implement an efficient version inspired by concurrent allocators; for this we need to efficiently know if we have the domain lock and so I have opened the PR ocaml/ocaml#11138. I would like to try to do it before multicore is released, to ensure that we have all that we need in terms of hooks. (I am eyeing at I think boxroot would be easier, simpler and more feature-full to implement it as part of OCaml multicore. In a first time I am fine with having it as a separate library if the addition/adaptation of hooks keeps going well. The thread-safety of Gabriel has tried with a doubly-linked list implementation (see the The efficiency of pools is only one part, the other is making it efficiently thread-safe, for which we think pool can help too. Regarding weak roots, in terms of Rust API it might be quite different. But let us consider it as something doable if boxroot was integrated with OCaml (or implemented by someone motivated enough by proposing an implementation in the library, which includes creating the appropriate hooks in the OCaml runtime). We (Gabriel and I) have now tried fairly extensively to implement and optimise your proposed generational strategy. It works well and gave a simpler implementation than we had at the time, but we now see that a better (faster, simpler) strategy is to have an inlined fast path for allocation and deallocation that does not even tests for This amounts to moving the We have observed that inlining has a much greater impact than the choice of generational strategy. The (not so) naive strategy has creation and deletion functions that take 10-ish x86-64 instructions, compared to 30-40ish with your strategy (implemented the best I could), and proportionally fewer branches (and only statically predictable ones). In the context inlined code, I imagine that the gains in terms of code size and branch prediction can be important. (Our benchmarks are fairly small in terms of code size, so not well-suited to test this hypothesis; nevertheless we get slightly better performance with inlining alone than with inlining+remembered set.) However, the simplicity of your approach (first implemented by Gabriel) pushed me to reconsider the complicated scheme I had before. Please see https://gitlab.com/ocaml-rust/ocaml-boxroot/-/merge_requests/43 and linked discussions if you want to know or discuss more. I have a question regarding:
The |
For now, we will try to have a working ocaml-boxroot release with multicore as a library, and add necessary support for it (hooks, etc.). Once we have a better picture towards what integration into OCaml looks like, I will update the RFC. |
Note: we discussed this RFC again at our last development meeting, and the consensus was in favor of us submitting a PR to consider upstreaming a Boxroot implementation. Note that this still requires an important amount of work, as we want to revisit our implementation to reuse existing runtime code where possible. |
Rendered: Movable OCaml roots for a more efficient and more flexible FFI