Any thoughts on supporting something better than pickling? #45
Comments
No thoughts at the moment. I'd love to hear suggestions, and later on contributions. :) I think any alternative should be opt-in rather than a wholesale switch, and it should not restrict the data returned. Recall that the main use case of this package is to save very long calls - minutes and upwards.
I know this is an old issue, but my request is somewhat similar, regarding alternative backend support. I will open an issue separately if you recommend! I use https://github.com/uqfoundation/dill to pickle a wider range of objects, and I was hoping it could be used here as a drop-in replacement. So I went ahead and hacked it in, literally swapping the `pickle` calls for `dill` ones. This is a very quick hack, not the intended final feature implementation, and more importantly I haven't run any tests whatsoever, except for the one place I'm using the caching right now.
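For concreteness, a minimal sketch of the kind of hack described above, relying only on the fact that dill mirrors the stdlib pickle API; the helper names and path are illustrative, not cachier's actual internals:

```python
import dill  # drop-in pickle replacement with wider object support


def _dump(value, path):
    # dill.dump has the same signature as pickle.dump
    with open(path, "wb") as f:
        dill.dump(value, f)


def _load(path):
    with open(path, "rb") as f:
        return dill.load(f)


# e.g. a lambda, which stdlib pickle refuses to serialize:
adder = lambda x: x + 1
_dump(adder, "/tmp/cache_entry.pkl")
assert _load("/tmp/cache_entry.pkl")(41) == 42
```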
If there were a better backend in place, I know I'd benefit in my typical use case: caching API calls for cost and performance savings (e.g. to OpenAI, the way I did in this gist).
I'm interested in helping to implement this, while heeding these recent words of wisdom from a Parsl maintainer.
So my dill hack definitely works in my fork, but I think the level of maintenance available to the project will determine the total new feature surface area added here to facilitate dill as a backend, or other arbitrary backends. I think an extension interface makes sense, where user code can supply functions which conform to a certain protocol or shape, and then cachier calls them in place of the stdlib pickle pair (a sketch of one such protocol follows this comment).

The first idea that comes to mind is allowing custom loader/dumper functions as decorator arguments. Also, maybe interfacing at the core level is an option.

So for the "who's going to maintain this in a year's time" question, I think I could manage to maintain it if it's a simple loader/dumper swap. And for the "what does it break" for those who "aren't using" this or "don't want to" question, I think this could be implemented in such a way that the default is pickling, and behavior only changes when the user passes some additional serialization primitives to the decorator.

Also, while on the topic of extension, I think the hash function extension concept could be neatly extended by designating "handlers" for certain object types. Instead of overriding the entire hash method altogether, users could supply handlers which take an object and return a Hashable, which could be fed into a built-in pre-hasher stage that just applies all hash handlers. Sorry, this is a different concept, but I encountered it at the same time as the caching backend concept and wanted to address it here; it would probably be a different issue and feature effort altogether.
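A sketch of what such a protocol could look like, using only stdlib `typing.Protocol`; the `Serializer` name and the class shapes are assumptions of this sketch, not an existing cachier interface:

```python
import pickle
from typing import Any, Protocol

import dill


class Serializer(Protocol):
    """The shape user-supplied serialization logic would conform to."""

    def dumps(self, obj: Any) -> bytes: ...
    def loads(self, data: bytes) -> Any: ...


class PickleSerializer:
    """The default: effectively what the pickle core does today."""

    def dumps(self, obj: Any) -> bytes:
        return pickle.dumps(obj)

    def loads(self, data: bytes) -> Any:
        return pickle.loads(data)


class DillSerializer:
    """Opt-in replacement for objects stdlib pickle rejects."""

    def dumps(self, obj: Any) -> bytes:
        return dill.dumps(obj)

    def loads(self, data: bytes) -> Any:
        return dill.loads(data)
```

Keeping `PickleSerializer` as the default means users who never pass anything see no behavior change.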
I like …
I like the …
Alright, I think I can start formulating this. I'll start off in a draft PR this coming weekend, and after this reply I will confine further discussion to the eventual draft PR. I have questions with respect to implementation, but maybe it'll be easier to ask the user-facing questions first.

**User-facing API**

It seems sensible that a new decorator keyword argument should receive the custom serialization/deserialization logic from the user, which will be used in caching. I think the new argument should receive a coupled custom loader/dumper pair, perhaps as a named tuple (a sketch of possible usage follows this comment).

**External/internal boundary**

It is expected that the user-supplied loader/dumper are coupled, and caches created by this combination of special loader/dumper are expected to be re-used only by this same loader/dumper implementation in future code runs. This may be signaled by embedding this information in the cache file somehow, to detect (and warn? or raise?) when a different loader/dumper is used, or else to silently create a different cache file. Customization of cache file naming can get hairy pretty quickly; if we're not careful, a bunch of specifics are going to be embedded in the cache file name, and that seems a bit fragile. Both options have trade-offs.

**New internals**

I guess the minimally-invasive initial implementation would introduce a new core, e.g. a pickle-like core that takes the user-supplied loader/dumper (the mongo core could be addressed later). To be a bit more specific, it looks like the existing pickle core is the natural template to start from.

**Generalizing existing internals**

All of this would probably mean generalizing the pickle core so that its serialization calls are injectable parameters rather than hard-coded.

**Documentation**

There's also docs changes needed for this, which would document the new parameter and probably a guided example. I would probably use dill for the example.

**Definition of acceptable user-supplied loaders and dumpers**

In addition to this, a specification as to what constitutes acceptable user-supplied loaders/dumpers should be pointed to, e.g. the data stream format laid out in the `pickle` module documentation. Maybe there's a way to say, "User-supplied loaders/dumpers should conform to the `pickle.loads`/`pickle.dumps` interface," and leave it at that.

**Conclusion**

Just let me know what you think about some of the details given here, and I'll start gradually working towards an implementation. I'm working on this in hobby time, so it's probably not going to progress at breakneck speed, but at least there will be ample time to think on these details.
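A hedged sketch of the user-facing API floated above, with a named tuple carrying the coupled loader/dumper; the `serializer` keyword is a proposal from this thread and does not exist in released cachier:

```python
from collections import namedtuple

import dill
from cachier import cachier

# A coupled loader/dumper pair, passed as one object so the two
# halves can't drift apart.
Serialization = namedtuple("Serialization", ["loads", "dumps"])


# NOTE: `serializer` is a hypothetical keyword sketched for this
# discussion; passing it to current cachier releases would raise a
# TypeError.
@cachier(serializer=Serialization(loads=dill.loads, dumps=dill.dumps))
def slow_query(x):
    ...
```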
Random thoughts:

- **User-facing API:** Yup, a new keyword is in order. And a named tuple makes sense.
- **External/internal boundary:** Regarding how to embed information about the de/serializing package in the file, what about the file extension? About the mongo core - yes, I think that is for later.
- **Generalizing existing internals:** Good points overall.
- **Documentation:** Sounds good.
- **Definition of acceptable user-supplied loaders and dumpers:** I think you worry too much about this. We should accept only drop-in replacements for `pickle` and leave it at that.
- **Conclusion:** Sounds good. Work at it at your own pace and let me know when you need advice.
Thanks @shaypal5 for redirecting my nervous energy about specification and scope 😅. I'm becoming more interested in contributing lately, and my unfamiliarity with external codebases and processes manifests as overthinking it all. Just continue to nudge me if I'm thinking too much about a particular thing.

Regarding embedding loader/dumper information in cache filenames/extensions: should the signature be a hash, or the loader/dumper function names? Embedding some kind of hash may be unwieldy and overly specific, but embedding just the loader/dumper function names may be brittle (e.g. their implementation changes but the names stay the same). Maybe a minimal approach is to not encode the custom loader/dumper into the cache file at all, and just let decoding fail if the user changed the loader/dumper between runs. The user can decide how to handle that exception, e.g. reset the cache. This would avoid adding more exception handling or logic to the cache read path, which is probably being called by user code in hot loops. The only issue with this minimal approach could be that data successfully unpickles but is silently corrupted by reading with a slightly different loader.
Regarding file extensions: I think function names might not be unique. A hash sounds sensible. There should be hashing somewhere in there already. Thinking about the use cases: …
Regarding silent failing: I think your point is crucial. It's too minimal. It's not a promise by cachier that should be broken silently. So I basically agree with your analysis and stand by your hunch to go the safe route and do this properly. :)
Great, I'll gradually work these things into the PR. I'll implement the naming scheme in a way that addresses these points (a sketch of one option follows this comment). I think you're already embedding a hash in the filename when …
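One possible shape for embedding a loader/dumper signature in the cache file name, as discussed above. Hashing the pair's qualified names is just one of the options weighed here (hashing their implementations is another), and all names are illustrative:

```python
import hashlib

import dill


def backend_tag(loads, dumps):
    """Derive a short, filename-safe tag identifying a loader/dumper pair.

    Hashes the qualified names rather than the implementations, so the
    tag is stable across unrelated code edits but, as noted above, would
    not catch a changed implementation behind an unchanged name.
    """
    ident = (
        f"{loads.__module__}.{loads.__qualname__}"
        f"|{dumps.__module__}.{dumps.__qualname__}"
    )
    return hashlib.sha256(ident.encode()).hexdigest()[:8]


tag = backend_tag(dill.loads, dill.dumps)
# A cache file could then be named along the lines of:
#   my_func_<args-hash>_<tag>.pkl
print(tag)
```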
Pickling is about as slow as it can get; some benchmarks here.

Any future plans for adding support for something faster, like msgpack, or even json?

I realize that some choices would require the cachier-wrapped functions to restrict the data that they return, but given how slow pickle is, and that the point of caching is at least in part to speed things up (and especially not to slow them down), I think this would be a good feature.

Any thoughts on this?
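A quick stdlib-only way to sanity-check this trade-off on one's own payloads; relative results vary considerably with payload shape, pickle protocol, and Python version, so the linked benchmarks are best reproduced locally:

```python
import json
import pickle
import timeit

# A toy payload; swap in something shaped like your real return values.
payload = {"ids": list(range(10_000)), "label": "example"}

t_pickle = timeit.timeit(lambda: pickle.dumps(payload), number=1_000)
t_json = timeit.timeit(lambda: json.dumps(payload), number=1_000)
print(f"pickle: {t_pickle:.3f}s  json: {t_json:.3f}s")
```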