Skip to content

Dynamic class reset state on every deserialization #439

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
sklam opened this issue Sep 9, 2021 · 7 comments
Open

Dynamic class reset state on every deserialization #439

sklam opened this issue Sep 9, 2021 · 7 comments

Comments

@sklam
Copy link

sklam commented Sep 9, 2021

Reproducer:

# Tested with cloudpickle 1.6.0
from cloudpickle import dumps, loads


class Klass:
    classvar = None

def mutator():
    Klass.classvar = 100

def check():
    print("checking....")
    print(f"   Klass.classvar [{hex(id(Klass))}] = {Klass.classvar}")


def failing_case():
    print("Klass", hex(id(Klass)))
    saved = dumps(Klass)
    mutator()
    check()
    loads(saved)
    check()
    loads(saved)
    check()



if __name__ == '__main__':
    failing_case()

Prints:

Klass 0x7fc698719980
checking....
   Klass.classvar [0x7fc698719980] = 100
checking....
   Klass.classvar [0x7fc698719980] = None
checking....
   Klass.classvar [0x7fc698719980] = None

After each loads(saved), the state in Klass is being reset unexpectedly.

This problem can appear like a tricky race condition in distributed, multi-threaded framework, such as Dask. See example https://gist.github.com/sklam/98e7c98ce909e76a3fa7904754db7bd9.

I created a patch for this in the vendored cloudpickle in Numba: numba/numba#7388. Please let me know if there will be problems with the way I am fixing it. If it is okay, I can submit the PR here.

@sklam
Copy link
Author

sklam commented Sep 9, 2021

I forgot to mention that the Klass above is treated as a dynamic class only when the file is run as the __main__ module.

@pierreglaser
Copy link
Member

Hi @sklam,

As your reproducer shows, loading a pickle containing Klass does not affect the id of the Klass in the global namespace: this is because #246, when loading a pickle bytestring containing a serialized (interactively defined) class, cloudpickle checks whether the original nonserialized class already exists in the global namespace of the interpreter loading the pickle. If so, the cloudpickle "reconciles" the serialized class and the original class, e.g. updates the state of the original, already existing existing class with the state of its serialized counterpart contained in the pickle string. This is important for compatibility withisinstance and Enum semantics, see #246 and references therein.

We could make it possible for users to turn off "class tracking", but knowing now the motivations and the benefits of class tracking, do you still think turning it off would be desirable for your use case? Do you have another "fix" in mind?

@sklam
Copy link
Author

sklam commented Sep 10, 2021

@pierreglaser, the id should stay the same. The actual problem happens at

_make_skeleton_class, _class_getnewargs(obj), _class_getstate(obj),
None, None, _class_setstate
. _class_setstate is always called and it reset the state of Klass. It is an unexpected side-effect for cloudpickle.loads() to always reset the class state.

In Numba, I am attempting to patch the class tracking in cloudpickle to bypass the _class_setstate code if the class definition (class_def) is cached: https://github.com/numba/numba/pull/7388/files#diff-b5b29cab05bde889add65b4715bbaef7379165ef9b3fac1e744674276764cd75R128

@pierreglaser
Copy link
Member

pierreglaser commented Sep 10, 2021

reset the state of Klass

what cloudpickle does it that it "updates" the state of an already existing class with the state of its serialized counterpart.

When you try to reconcile a serialized version of Klass and an existing version of Klass into the same Python object, you have to make a choice about which of the two classes states you end up keeping. There is no right or wrong choice here: what is "unexpected behavior" for you is expected behavior for others.

For instance, if the loads(saved) are made by a remote worker receiving a Python class iteratively updated by a user inside a local interactive python session, you want to update the state of the child worker class with the one of the (newer) serialized class. Because the use case I just described is traditionally considered as one of the canonical use cases of cloudpickle, the class reconciling logic we decided to implement follows what is considered "expected behavior" for this use case.

I am not sure of what we should do upstream to support your use case.
As I said, there is no right or wrong choice, and we might want to let users toggle between the two behaviors. @ogrisel, do you have any thoughts on this?

@sklam
Copy link
Author

sklam commented Sep 10, 2021

Consider the case in a distributed system and there are two threads running in a remote process, both of these threads are executing this:

def worker(serialized_bytes):
     Klass = cloudpickle.loads(serialize_bytes)
     Klass.count += 1
     do_something_else()
     return Klass.count

With random delay in starting the execution and in running do_something_else(), Klass maybe reconciled before the read to Klass.count. (This is similar to the case in cell 10 of https://gist.github.com/sklam/98e7c98ce909e76a3fa7904754db7bd9).

The fact that a deserialization in a different thread can at a random time modify a state can lead to very hard to debug situations.

@sklam
Copy link
Author

sklam commented Sep 10, 2021

hm.... the register_pickle_by_value(module)//unregister_pickle(module) feature in cloudpickle 2.0.0 may be able to workaround this problem.

@pierreglaser
Copy link
Member

pierreglaser commented Sep 12, 2021

hm.... the register_pickle_by_value(module)//unregister_pickle(module) feature in cloudpickle 2.0.0 may be able to workaround this problem

My feeling is that it won't help, since this feature allows only to add modules in which serialization by value will be enforced (and not the other way around), while your problem could maybe be solved by serializing Klass by reference (provided that it is importable in the remote processes).

The fact that a deserialization in a different thread can at a random time modify a state can lead to very hard to debug situations.

I agree, but the only possible alternative I'm seeing right now is to add some programmatic switching between the current behavior and a behavior prioritizing the child process. I don't have a definitive opinion on whether such a switch is worth adding, and would need to discuss with other core devs of cloudpickle/distributed frameworks to get a better idea of how much people would benefit from it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants