cloudpickle breaks dill deserialization across servers. #82

Closed
wmarshall484 opened this issue Mar 24, 2017 · 3 comments

@wmarshall484

Following up on this Stack Overflow question:

http://stackoverflow.com/questions/42960637/python-3-5-dill-pickling-unpickling-on-different-servers-keyerror-classtype/43006034#43006034

In a nutshell, with Python 3.5:

Server A imports cloudpickle, which causes types.ClassType to become defined.

>>> import types
>>> dir(types)
  ['BuiltinFunctionType',
   'BuiltinMethodType',
   'ClassType',
   'CodeType',
   ...
  ]

Server B does not import cloudpickle, so types.ClassType is left undefined.

>>> import types
>>> dir(types)
  ['BuiltinFunctionType',
   'BuiltinMethodType',
   'CodeType',
   ...
  ]
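The difference between the two sessions can be checked programmatically. The following is a minimal sketch using only the standard library; the helper name `names_added_to_types` is hypothetical and not part of cloudpickle or dill. It reports which attributes appear on the `types` module as a result of importing a given module.

```python
import importlib
import types


def names_added_to_types(module_name):
    """Return the names that appear in `types` after importing `module_name`.

    Hypothetical helper for illustration. If the module was already
    imported earlier in the session, its side effects have already
    happened and the result will be empty.
    """
    before = set(vars(types))
    importlib.import_module(module_name)
    return set(vars(types)) - before


if __name__ == "__main__":
    # A well-behaved module leaves the `types` namespace untouched.
    print(names_added_to_types("json"))
```

Calling `names_added_to_types("cloudpickle")` in a fresh interpreter on an affected installation would be expected to include `ClassType`, based on the sessions shown above.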

Objects serialized on server A also seem to carry a reference to ClassType. When they are deserialized on server B, we encounter the following error:

Traceback (most recent call last):
 File "/home/streamsadmin/git/streamsx.topology/test/python/topology/deleteme2.py", line 40, in <module>
   a = dill.loads(base64.b64decode(a.encode()))
 File "/home/streamsadmin/anaconda3/lib/python3.5/site-packages/dill/dill.py", line 277, in loads
   return load(file)
 File "/home/streamsadmin/anaconda3/lib/python3.5/site-packages/dill/dill.py", line 266, in load
   obj = pik.load()
 File "/home/streamsadmin/anaconda3/lib/python3.5/site-packages/dill/dill.py", line 524, in _load_type
   return _reverse_typemap[name]
KeyError: 'ClassType'
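The failure can be reproduced without either library. The sketch below is a simplified model, not dill's actual code: it assumes the pickler stores types by name, builds its name table from the `builtins` and `types` namespaces, and that the extra name is an alias for `type`. All function and variable names here are hypothetical.

```python
import builtins
import types


def build_reverse_typemap(namespace):
    """Map name -> type for every type object found in `namespace`."""
    return {name: obj for name, obj in namespace.items() if isinstance(obj, type)}


def dump_type(obj, reverse_typemap):
    """Serialize a type as a name; if several names alias it, the last one wins."""
    forward = {t: name for name, t in reverse_typemap.items()}
    return forward[obj]


def load_type(name, reverse_typemap):
    """Deserialize a type by looking up its name."""
    return reverse_typemap[name]


# Server B: clean namespace (drop the alias in case the session has it).
server_b_ns = {**vars(builtins), **vars(types)}
server_b_ns.pop("ClassType", None)

# Server A: same namespace plus the extra alias (assumption: ClassType = type).
server_a_ns = dict(server_b_ns)
server_a_ns["ClassType"] = type

reverse_a = build_reverse_typemap(server_a_ns)
reverse_b = build_reverse_typemap(server_b_ns)

# Server A writes the alias name; server B has no entry for it.
name_on_wire = dump_type(type, reverse_a)
try:
    load_type(name_on_wire, reverse_b)
except KeyError as exc:
    print("KeyError:", exc)  # prints: KeyError: 'ClassType'
```

Under these assumptions, the payload is valid only on machines whose name table contains the same alias, which matches the behavior described for servers A and B.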

I've found a workaround, which you can see on Stack Overflow.

Here's my question: types.ClassType does not exist in Python 3, yet cloudpickle re-adds it. Is this strictly necessary? It seems to be having side effects.

@wmarshall484 wmarshall484 changed the title cloudpickle break dill deserialization across servers. cloudpickle breaks dill deserialization across servers. Mar 24, 2017
@tvalentyn
Contributor

Bumping this up as users report being hit by these side-effects, for example:

Can the cloudpickle maintainers comment on whether it is possible to release a new version of cloudpickle that has no side effects on import?

I think this problem makes it difficult for users to work with libraries that depend on cloudpickle inside interactive environments, when they also interact with other libraries that depend on different picklers in the same session.

@zhitaoli

zhitaoli commented Feb 6, 2020

+1, this also affects TFX users.

@tvalentyn
Contributor

Sent: #337

@ogrisel ogrisel closed this as completed in 4a95948 Feb 7, 2020
HyukjinKwon added a commit to apache/spark that referenced this issue Jul 17, 2020
### What changes were proposed in this pull request?

This PR aims to upgrade PySpark's embedded cloudpickle to the latest cloudpickle v1.5.0 (See https://github.com/cloudpipe/cloudpickle/blob/v1.5.0/cloudpickle/cloudpickle.py)

### Why are the changes needed?

There are many bug fixes. For example, the bug described in the JIRA:

dill unpickling fails because cloudpickle defines `types.ClassType`, which is undefined in dill. This results in the following error:

```
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/apache_beam/internal/pickler.py", line 279, in loads
    return dill.loads(s)
  File "/usr/local/lib/python3.6/site-packages/dill/_dill.py", line 317, in loads
    return load(file, ignore)
  File "/usr/local/lib/python3.6/site-packages/dill/_dill.py", line 305, in load
    obj = pik.load()
  File "/usr/local/lib/python3.6/site-packages/dill/_dill.py", line 577, in _load_type
    return _reverse_typemap[name]
KeyError: 'ClassType'
```

See also cloudpipe/cloudpickle#82. This was fixed for cloudpickle 1.3.0+ (cloudpipe/cloudpickle#337), but PySpark's cloudpickle.py doesn't have this change yet.

More notably, it now supports the C pickle implementation on Python 3.8, which hugely improves performance. This has already been adopted by other projects such as Ray.

### Does this PR introduce _any_ user-facing change?

Yes: the bug fixes described above. Internally, users can also leverage the faster cloudpickle backed by the C pickle implementation.

### How was this patch tested?

Jenkins will test it out.

Closes #29114 from HyukjinKwon/SPARK-32094.

Authored-by: HyukjinKwon <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>