cloudpickle breaks dill deserialization across servers. #82
Comments
Bumping this up as users report being hit by these side-effects, for example:
Can the cloudpickle maintainers comment on whether it is possible to release a newer version of cloudpickle that does not create a side effect on import? I think this problem makes it difficult for users to work with libraries that depend on cloudpickle inside interactive environments when they also use libraries that depend on different picklers in the same session.
+1, this also affects TFX users.
Sent: #337
### What changes were proposed in this pull request?

This PR aims to upgrade PySpark's embedded cloudpickle to the latest cloudpickle v1.5.0 (See https://github.com/cloudpipe/cloudpickle/blob/v1.5.0/cloudpickle/cloudpickle.py)

### Why are the changes needed?

There are many bug fixes. For example, the bug described in the JIRA:

dill unpickling fails because they define `types.ClassType`, which is undefined in dill. This results in the following error:

```
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/apache_beam/internal/pickler.py", line 279, in loads
    return dill.loads(s)
  File "/usr/local/lib/python3.6/site-packages/dill/_dill.py", line 317, in loads
    return load(file, ignore)
  File "/usr/local/lib/python3.6/site-packages/dill/_dill.py", line 305, in load
    obj = pik.load()
  File "/usr/local/lib/python3.6/site-packages/dill/_dill.py", line 577, in _load_type
    return _reverse_typemap[name]
KeyError: 'ClassType'
```

See also cloudpipe/cloudpickle#82. This was fixed for cloudpickle 1.3.0+ (cloudpipe/cloudpickle#337), but PySpark's cloudpickle.py doesn't have this change yet.

More notably, now it supports C pickle implementation with Python 3.8 which hugely improve performance. This is already adopted in another project such as Ray.

### Does this PR introduce _any_ user-facing change?

Yes, as described above, the bug fixes. Internally, users also could leverage the fast cloudpickle backed by C pickle.

### How was this patch tested?

Jenkins will test it out.

Closes #29114 from HyukjinKwon/SPARK-32094.

Authored-by: HyukjinKwon <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
Following up on this issue on Stackoverflow:
http://stackoverflow.com/questions/42960637/python-3-5-dill-pickling-unpickling-on-different-servers-keyerror-classtype/43006034#43006034
In a nutshell, with Python 3.5:

- Server A imports `cloudpickle`; this causes `types.ClassType` to become defined.
- Server B does not import `cloudpickle`, so `types.ClassType` is left undefined.
- Objects which are serialized on server A also seem to serialize a reference to `ClassType`. Then, when they are deserialized on server B, we encounter a `KeyError: 'ClassType'`.

I've found a workaround, which you can see on Stackoverflow.
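To make the sequence above concrete, here is a minimal, stdlib-only sketch of how a name-based type registry could produce this failure. It is a simplified model, not dill's or cloudpickle's actual code: `build_typemaps`, `types_on_a`, and `types_on_b` are hypothetical names, and the two `SimpleNamespace` objects merely stand in for the `types` module as seen on each server.

```python
from types import FunctionType, SimpleNamespace


def build_typemaps(namespace):
    """Return (type -> name, name -> type) maps for every type in a namespace."""
    name_to_type = {
        name: obj for name, obj in vars(namespace).items() if isinstance(obj, type)
    }
    type_to_name = {obj: name for name, obj in name_to_type.items()}
    return type_to_name, name_to_type


# Hypothetical stand-ins for the `types` module on each server (assumption).
types_on_a = SimpleNamespace(FunctionType=FunctionType, ClassType=type)  # patched
types_on_b = SimpleNamespace(FunctionType=FunctionType)  # unpatched

a_type_to_name, _ = build_typemaps(types_on_a)
_, b_name_to_type = build_typemaps(types_on_b)

# Server A serializes a reference to `type` by the name it has registered.
payload = a_type_to_name[type]

# Server B has no entry under that name, so the lookup fails.
try:
    b_name_to_type[payload]
    error = None
except KeyError as exc:
    error = exc

# One way out in this model: register the extra name on the receiver first.
b_name_to_type["ClassType"] = type
restored = b_name_to_type[payload]
```

In this model the sender's extra `ClassType` entry leaks into the payload, and the receiver only recovers once it knows the same name. Registering the name on the receiver is shown purely as an illustration and is not necessarily the workaround from the Stack Overflow answer.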
Here's my question:
`types.ClassType` was removed in Python 3, yet cloudpickle re-adds it. Is this strictly necessary? It seems to be having side effects.