cythonized pydantic objects in __main__ cannot be pickled #408
Comments
Could you please edit the bug report to include the full traceback?

Also, is this problem happening with the current master branch of cloudpickle?

I believe this was fixed by #409, as I cannot reproduce anymore. We still need to make a release, though.

I still get the same error using the cloudpickle version from master. The fix from #409 only seems to target Python versions < 3.7.
Edited to use cloudpickle from master. This issue should be reopened. The difference between environments, and likely the reason @ogrisel was unable to reproduce this, is that pydantic can be installed with or without Cython support. The Cython version of pydantic is unsurprisingly significantly faster than the pure-Python version and is also the default install (at least for platforms for which wheels exist). Here are two examples using virtualenv that should be reproducible, using the same script that @marco-neumann-by defined initially:

```python
# example.py
import cloudpickle
import pydantic
import pickle

class Bar(pydantic.BaseModel):
    a: int

pickle.loads(pickle.dumps(Bar(a=1)))  # This works well
cloudpickle.loads(cloudpickle.dumps(Bar(a=1)))  # This fails with the error below
```

Non-cython Pydantic

Note that the virtualenv installs pydantic with `--no-binary`, so it is built without Cython:

```
virtualenv .venv
source ./.venv/bin/activate
pip install git+https://github.com/cloudpipe/cloudpickle pydantic --no-binary pydantic
```

Here you can tell that there are no cython files:
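(The file listing itself wasn't captured here; a minimal way to check, sketched below, relies on pydantic v1's `compiled` flag and a glob for compiled extension modules.)

```python
# Sketch: detect whether the installed pydantic v1 is cythonized.
import pathlib
import pydantic

print(pydantic.compiled)  # False for a pure-Python (--no-binary) install
pkg = pathlib.Path(pydantic.__file__).parent
# Compiled builds ship binary extensions (.so on Linux/macOS, .pyd on Windows)
print(sorted(p.name for p in pkg.glob("*.so")))  # [] when not cythonized
```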
And the example passes without issue.
Cython-based Pydantic

Now we install pydantic without use of `--no-binary`, so the default cythonized wheel is used:

```
deactivate
rm -rf .venv
virtualenv .venv
source ./.venv/bin/activate
pip install git+https://github.com/cloudpipe/cloudpickle pydantic
```

Now you can see that there are built C libraries included with pydantic (the check above would list `.so` files and print `compiled: True`).

And running our example again, we can see that it fails:
I can also reproduce this. However:

```python
# example.py
import cloudpickle
import pickle
from models import Bar

pickle.loads(pickle.dumps(Bar(a=1)))  # This works well
cloudpickle.loads(cloudpickle.dumps(Bar(a=1)))  # This now works too
```

```python
# models.py
import pydantic

class Bar(pydantic.BaseModel):
    a: int
```

This works fine, so a quick workaround is to always define pydantic models in a separate file.
I'm still having this issue in cloudpickle 2.0.0.

@ogrisel I am also still seeing this issue in 2.0.0. The workaround in #408 (comment) works for me, but I believe this issue should be reopened.
I have this issue with pydantic and pyspark:

```
../../opt/anaconda3/envs/graphlet/lib/python3.10/site-packages/pyspark/sql/pandas/map_ops.py:91: in mapInPandas
    udf_column = udf(*[self[col] for col in self.columns])
../../opt/anaconda3/envs/graphlet/lib/python3.10/site-packages/pyspark/sql/udf.py:276: in wrapper
    return self(*args)
../../opt/anaconda3/envs/graphlet/lib/python3.10/site-packages/pyspark/sql/udf.py:249: in __call__
    judf = self._judf
../../opt/anaconda3/envs/graphlet/lib/python3.10/site-packages/pyspark/sql/udf.py:215: in _judf
    self._judf_placeholder = self._create_judf(self.func)
../../opt/anaconda3/envs/graphlet/lib/python3.10/site-packages/pyspark/sql/udf.py:224: in _create_judf
    wrapped_func = _wrap_function(sc, func, self.returnType)
../../opt/anaconda3/envs/graphlet/lib/python3.10/site-packages/pyspark/sql/udf.py:50: in _wrap_function
    pickled_command, broadcast_vars, env, includes = _prepare_for_python_RDD(sc, command)
../../opt/anaconda3/envs/graphlet/lib/python3.10/site-packages/pyspark/rdd.py:3345: in _prepare_for_python_RDD
    pickled_command = ser.dumps(command)
../../opt/anaconda3/envs/graphlet/lib/python3.10/site-packages/pyspark/serializers.py:458: in dumps
    return cloudpickle.dumps(obj, pickle_protocol)
../../opt/anaconda3/envs/graphlet/lib/python3.10/site-packages/pyspark/cloudpickle/cloudpickle_fast.py:73: in dumps
    cp.dump(obj)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <pyspark.cloudpickle.cloudpickle_fast.CloudPickler object at 0x7ff5f0410700>
obj = (<function test_graphlet_etl.<locals>.horror_to_movie at 0x7ff5d0e81480>, StructType([StructField('entity_id', StringT...ld('length', LongType(), False), StructField('gross', LongType(), False), StructField('rating', StringType(), False)]))

    def dump(self, obj):
        try:
>           return Pickler.dump(self, obj)
E           _pickle.PicklingError: Can't pickle <cyfunction str_validator at 0x7ff5b0461220>: it's not the same object as pydantic.validators.str_validator

../../opt/anaconda3/envs/graphlet/lib/python3.10/site-packages/pyspark/cloudpickle/cloudpickle_fast.py:602: PicklingError
```
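For reference, a hypothetical minimal shape of this failure (illustrative names and schema, assuming a cythonized pydantic v1 install): a `mapInPandas` function whose closure captures a pydantic model defined in `__main__`, which Spark then has to serialize with its vendored cloudpickle.

```python
# Hypothetical repro sketch: Spark cloudpickles the mapInPandas function,
# dragging in the __main__-defined pydantic class captured by its closure.
from pyspark.sql import SparkSession
import pydantic

class Movie(pydantic.BaseModel):  # defined in __main__
    title: str

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("It",)], ["title"])

def validate(batches):
    # mapInPandas passes an iterator of pandas DataFrames
    for pdf in batches:
        yield pdf.assign(title=[Movie(title=t).title for t in pdf["title"]])

# Serializing `validate` (and thus Movie) is where the PicklingError surfaces.
df.mapInPandas(validate, schema="title string").collect()
```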
I've just been bitten by this. @ogrisel, can we reopen this issue? The workaround is not an option if you are defining your objects inside a jupyter notebook. |
@brettc as a workaround, you can define custom serializers to pack and unpack pydantic objects. This might help your use case.
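A minimal sketch of that pack/unpack idea (assumptions: pydantic v1's `.dict()` API, and Ray's `ray.util.register_serializer` hook shown commented out; dask has analogous registration mechanisms):

```python
# Sketch: serialize pydantic models as plain field data instead of letting
# cloudpickle pickle the cythonized class machinery by value.
import pydantic

class Bar(pydantic.BaseModel):
    a: int

def pack(model: Bar) -> dict:
    # pydantic v1; use model.model_dump() on pydantic v2
    return model.dict()

def unpack(data: dict) -> Bar:
    return Bar(**data)

# With Ray, the pair can be registered for the type, e.g.:
# import ray
# ray.util.register_serializer(Bar, serializer=pack, deserializer=unpack)

assert unpack(pack(Bar(a=1))) == Bar(a=1)
```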
@simon-mo thanks for the tip -- this looks very promising! The error occurs for me when I'm using dask, so I guess you had the same issues in ray. (BTW, ray is amazing. I chose dask for this job because ray seemed like overkill).
I'm still struggling to find a workaround for this issue. My code does not directly define any pydantic types (although pydantic is used by dependent libraries). Is there a version upgrade/downgrade that might be the cause? It is unclear where the actual issue is occurring. In my case it looks to be in the chain of uvicorn and kserve.
This still happens. I have to define pydantic models in another file, otherwise I get this error. Even in a simple file where I define a pydantic param class and a Ray actor with a single method, this happens. Using the latest ray, pydantic, etc.
I agree this issue still existed, but I believe it is actually fixed in pydantic 2.5 (see the linked issue and PR) if you run your script with plain Python. An issue still exists inside Jupyter/IPython: pydantic/pydantic#8232. If you get a similar error to the one below, it likely means you are using a pydantic version older than 2.5.

In this case, the simplest workaround seems to be to define your pydantic model in a separate file, as noted in #408 (comment).
Can someone remind me of what it means if this is fixed? I think it means Spark can serialize numpy arrays? |
Abstract

The following code snippet (the example.py quoted in the comments above) fails with cloudpickle but works with stock pickle if pydantic is cythonized (either via a platform-specific wheel or by having Cython installed when calling `setup.py`).

When using the file via `__main__`, the error message is a PicklingError of the form (reconstructed from the tracebacks in this thread):

Can't pickle <cyfunction int_validator at 0x...>: it's not the same object as pydantic.validators.int_validator

Note that the issue does NOT appear when a non-cythonized pydantic version is used.

Also note that the issue does NOT appear when the file is not `__main__`, for example when the model is imported from a separate module (as in the models.py example quoted above).

Environment
Technical Background

In contrast to pickle, cloudpickle pickles the actual class when it resides in `__main__`; see the following note in the README:

I THINK that might be the reason why this happens. What's somewhat weird is that the object in question is `pydantic.validators.int_validator`, which CAN actually be pickled:
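A minimal check of that last claim (a sketch, assuming a cythonized pydantic v1 install):

```python
# Stock pickle serializes the module-level cyfunction by reference
# (module + qualified name), so it round-trips to the same object.
import pickle
import pydantic.validators

blob = pickle.dumps(pydantic.validators.int_validator)
assert pickle.loads(blob) is pydantic.validators.int_validator
# cloudpickle only trips over it when the enclosing class lives in __main__
# and is therefore pickled by value, dragging the cyfunction along.
```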
References

This was first reported in #403.