Status on cloudpickle non-determinism #453
The pickle instructions you're displaying seem to be reconstruction instructions for
We could introduce a constructor parameter to implement a slower, deterministic version of cloudpickle. But indeed, I am not sure how we could do that with the subclass of the C implementation of the CPython pickler. Note that
Cloudpickle's Pickler class inherits from pickle.Pickler. pickle.Pickler is either the C implementation of the CPython pickler or a pure-Python pickler. Only the pure-Python pickler supports customizing how built-in types are pickled.
This change introduces a PurePythonPickler class, which inherits from pickle._Pickler and supports customizing how built-in types are pickled. The Pickler class continues to inherit from the faster C implementation when it is available.
Providing a means of customizing how built-in types are pickled enables users to implement deterministic pickling for set and frozenset. See: cloudpipe#453
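As a rough illustration of the idea in this commit message, here is a minimal sketch built only on the standard library's pickle._Pickler. It is not the PurePythonPickler from the change itself: the names SortedSetPickler and deterministic_dumps are hypothetical, and ordering elements by repr() is an assumption that only holds for element types whose repr is stable (ints, strings, and similar).

```python
import io
import pickle


class SortedSetPickler(pickle._Pickler):
    """Hypothetical pure-Python pickler that writes set elements in a fixed order."""

    # Copy the dispatch table so the stdlib pure-Python pickler is left untouched.
    dispatch = pickle._Pickler.dispatch.copy()

    def _save_sorted(self, obj):
        # Assumption: repr() gives a stable ordering key for the elements in use.
        items = sorted(obj, key=repr)
        # Pickle the set as "call set/frozenset on this ordered list".
        self.save_reduce(type(obj), (items,), obj=obj)

    dispatch[set] = _save_sorted
    dispatch[frozenset] = _save_sorted


def deterministic_dumps(obj, protocol=pickle.DEFAULT_PROTOCOL):
    """Pickle obj with the sketch pickler and return the bytes."""
    buffer = io.BytesIO()
    SortedSetPickler(buffer, protocol).dump(obj)
    return buffer.getvalue()
```

Two equal sets built in different insertion orders then yield the same bytes, at the cost of using the slower pure-Python pickler. The per-type dispatch override is only available on the pure-Python class, which is the limitation the commit message describes.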
Apache Beam uses either cloudpickle or dill to save code as part of a workflow graph specification. Google uses Apache Beam with a best-effort cache to avoid starting workflows from scratch after they get interrupted. Increasing the determinism of pickling increases the cache hit rate, saves resources, and avoids delays from starting workflows again from scratch. Apache Beam is transitioning from dill to cloudpickle and would like to contribute two changes to increase the pickling determinism:
Ideally these changes can be part of cloudpickle so Apache Beam can minimize the changes to its vendored copy. Note: the goal is not to guarantee complete determinism. Mostly-deterministic pickling is useful enough.
Hello,
@ogrisel mentioned in this comment (#385 (comment)):
However, @ogrisel also pushed a PR (#428), released as part of cloudpickle 2.0.0, that tried to address non-determinism owing to dictionary ordering.
I wanted to confirm the official status of the project regarding non-determinism, because I am still seeing non-deterministic pickles in cloudpickle 2.0.0.
Here is the pickletools.dis output of a function:
Pickle of the function on second attempt:
As you can see, the entries are all the same, but shuffled around.
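A comparison like this can be automated with a small sketch along the following lines. It uses pickletools.genops from the standard library; the helper names opcode_entries and same_entries_shuffled are hypothetical. It is only a rough check: in real pickles, memo indices attached to opcodes can also change when entries move, in which case the multisets would no longer match exactly.

```python
import pickle
import pickletools


def opcode_entries(payload):
    """Return the pickle's opcodes as (name, repr(argument)) pairs, in stream order."""
    return [(op.name, repr(arg)) for op, arg, _pos in pickletools.genops(payload)]


def same_entries_shuffled(first, second):
    """True if both pickles hold the same opcode entries but in a different order."""
    a = opcode_entries(first)
    b = opcode_entries(second)
    return a != b and sorted(a) == sorted(b)
```

For example, same_entries_shuffled(pickle.dumps([1, 2], protocol=4), pickle.dumps([2, 1], protocol=4)) reports a reordering, whereas two byte-identical pickles do not.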
This function is part of a large project, so unfortunately I can't produce a short test case right now.
Note that Kubeflow Pipelines implements caching by checking that the pickle of the function hasn't changed. (There is an option not to use pickle as well, but it has its own problems.) A non-deterministic cloudpickle invalidates the cache every time, making that feature useless.
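To make the cache problem concrete, here is a minimal sketch of a pickle-based cache key. It is not Kubeflow's actual implementation: cache_key is a hypothetical helper, and it uses the standard pickle module plus a SHA-256 digest purely for illustration.

```python
import hashlib
import pickle


def cache_key(obj, protocol=4):
    """Hypothetical cache key: a digest of the object's pickled bytes."""
    payload = pickle.dumps(obj, protocol=protocol)
    return hashlib.sha256(payload).hexdigest()
```

Any change in the pickled bytes, including a mere reordering of otherwise identical entries, produces a different key, so a cache built this way misses whenever the pickler's output is not reproducible.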
Thanks.