Add option to treat resources from specific modules as dynamic #391

kinghuang · 2020-07-08T21:20:50Z

Cloudpickle generally considers objects that have a module (i.e., the __module__ attribute) that is in sys.modules to be importable at unpickling time, and only stores a reference to the object instead of fully pickling the underlying code/data. This behaviour is inconvenient in specific cases. This PR adds an option for registering specific modules as "dynamic", so that cloudpickle will fully pickle objects belonging to module instead of just a reference.

Problem to solve

Dask.distributed

In Dask.distributed, code is pickled and sent from clients to workers for processing. For example, you might apply a function to every row in a dataframe like so.

df.apply(lambda x: x.sum(), axis=1)

The lambda function will get pickled and sent to workers. This is fine for simple, self contained lambdas. But, if the Dask code is organized in a module, and the apply function is passed a function in the module, then the module has to be installed on Dask workers because cloudpickle will only send a reference to the function.

def sum_everything(row):
    return sum(row)

df.apply(sum_everything, axis=1)

This is highly inconvenient as sum_everything in this example is just a part of the Dask graph definition, and not a shared piece of code meant to be distributed and installed on the Dask workers.

Dagster

Dagster pipelines can be defined in Python modules and run on remote clusters. Similar to the Dask use case, if a pipeline solid makes a reference to a function in its module instead of being written all inline, then that function will only get pickled as a reference. Again, this makes it necessary to distribute a module that's not meant to be distributed, just to satisfy the pickling process.

Interactive Development

Pickling code from modules is also convenient during development, where it can be cumbersome or time-consuming to build and install modules repeatedly in remote environments.

Solution

This PR adds a register_dynamic_module(module) function that takes in a module or name of a module to treat as dynamic. The module's named is stored in a set named _CUSTOM_DYNAMIC_MODULES_BY_NAME.

For the Dask example above, let's say the code was in a module named some_module.info. Before computing the Dask graph, the module can be registered as dynamic with the following.

import cloudpickle
from foo import bar

cloudpickle.register_dynamic_module(bar)        # Method 1: Use module object
cloudpickle.register_dynamic_module("foo.bar")  # Method 2: Specify module name
cloudpickle.register_dynamic_module("foo")      # Alternative: Pass in parent module or name

In _lookup_module_and_qualname(obj, name=None), a new condition is added after the check for module_name is None and module_name == "__main__" to see if _is_dynamic_module(module_name) returns True. If so, then None is returned, and the obj is fully pickled just like the None and __main__ cases.

The _is_dynamic_module(module, submodules=True) function will return true if the module argument is in the _CUSTOM_DYNAMIC_MODULES_BY_NAME set. If submodules is true (the default), then the function will also return true if it is a submodule of a registered dynamic module.

Given the submodule handling, the entire foo could be registered as dynamic, which would cover foo.bar and any other submodules.

Testing

I'm not sure how best to write test case(s) for this. For local spot testing, I added a submodule under cloudpickle named test, and verified things like so.

root@fbba985331f1:/project# python
Python 3.8.3 (default, Jun  9 2020, 17:39:39) 
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import cloudpickle
>>> from cloudpickle import test
>>> cloudpickle.dumps(test.demo)
b'\x80\x05\x95\x1d\x00\x00\x00\x00\x00\x00\x00\x8c\x10cloudpickle.test\x94\x8c\x04demo\x94\x93\x94.'
>>> cloudpickle.register_dynamic_module(test)
>>> cloudpickle.dumps(test.demo)
b'\x80\x05\x95"\x02\x00\x00\x00\x00\x00\x00\x8c\x17cloudpickle.cloudpickle\x94\x8c\r_builtin_type\x94\x93\x94\x8c\nLambdaType\x94\x85\x94R\x94(h\x02\x8c\x08CodeType\x94\x85\x94R\x94(K\x00K\x00K\x00K\x00K\x01KCC\x04d\x01S\x00\x94N\x8c\x01a\x94\x86\x94))\x8c%/project/cloudpickle/test/__init__.py\x94\x8c\x04demo\x94K\x01C\x02\x00\x01\x94))t\x94R\x94}\x94(\x8c\x0b__package__\x94\x8c\x10cloudpickle.test\x94\x8c\x08__name__\x94h\x13\x8c\x08__path__\x94]\x94\x8c\x19/project/cloudpickle/test\x94a\x8c\x08__file__\x94\x8c%/project/cloudpickle/test/__init__.py\x94uNNNt\x94R\x94\x8c\x1ccloudpickle.cloudpickle_fast\x94\x8c\x12_function_setstate\x94\x93\x94h\x1b}\x94}\x94(h\x14h\r\x8c\x0c__qualname__\x94h\r\x8c\x0f__annotations__\x94}\x94\x8c\x0e__kwdefaults__\x94N\x8c\x0c__defaults__\x94N\x8c\n__module__\x94h\x13\x8c\x07__doc__\x94N\x8c\x0b__closure__\x94N\x8c\x17_cloudpickle_submodules\x94]\x94\x8c\x0b__globals__\x94}\x94u\x86\x94\x86R0.'

Other

I've been using a variation of this, written as a runtime patch on cloudpickle, for developing Dask code. It's worked well for me!

Closes #206

Create a set that will be used to declare modules whose resources should be treated as dynamic during pickling. The default behaviour is to reference module-backed resources, instead of pickling them, which is inconvenient in certain use cases.

register_dynamic_module takes a module object or a module name and adds it to the _CUSTOM_DYNAMIC_MODULES_BY_NAME set. Registered modules can be remove with unregister_dynamic_module.

codecov · 2020-07-08T21:21:44Z

Codecov Report

Merging #391 into master will decrease coverage by 1.81%.
The diff coverage is 52.17%.

@@            Coverage Diff             @@
##           master     #391      +/-   ##
==========================================
- Coverage   91.40%   89.59%   -1.82%     
==========================================
  Files           3        3              
  Lines         640      663      +23     
  Branches      134      139       +5     
==========================================
+ Hits          585      594       +9     
- Misses         34       44      +10     
- Partials       21       25       +4

Impacted Files	Coverage Δ
cloudpickle/cloudpickle.py	`83.38% <52.17%> (-3.07%)`	⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update da2a604...f789f2f. Read the comment docs.

_is_dynamic_module returns True if the specified module is in the set of modules that should be considered dynamic for pickling purposes. If submodules is True, then the module's parent modules will be searched to see if they were registered as dynamic.

Add a condition while looking up modules to see if the module is registered as dynamic, returning None if so. This means resources belonging to modules registered as dynamic will be treated as if they had no module, or belonged to __main__.

ogrisel · 2020-07-15T08:51:03Z

This seems like a valid use case.

For testing, one solution would be to pickle a function defined in a module registered as dynamic, load the pickled function in a subprocess and check that the module of that function is not in sys.modules (the module has not been imported in the subprocess).

Have a look at test that use from testutils import subprocess_worker for inspiration.

On top of the missing test, I think it would also be nice to be able to register dynamic modules on a specific CloudPickler instance for users who would like to write code without module level side effects. WDYT?

kinghuang · 2020-11-15T21:02:22Z

Ah, I totally forgot about this PR. 😬 I'll try to find some time this week to add tests.

bonelli · 2020-11-15T21:10:38Z

Ah, I totally forgot about this PR. 😬 I'll try to find some time this week to add tests.

Please do, this PR would be extremely useful to me: https://stackoverflow.com/questions/64848079/sharing-a-dask-cluster-between-projects-with-different-module-versions

bonelli · 2020-11-16T15:37:16Z

@kinghuang how would this work in Dask Distributed if 2 different clients connected to the same scheduler/workers and used at the same time different versions of the same module (hence different source code but same names)?

Would there be some issue for name clashing, or each pickled source is managed separately during computation for their respective task?

pierreglaser · 2020-11-17T22:03:24Z

Use-case and proposed implementation look reasonable, I can make a deeper review once you added some tests @kinghuang.

About naming: I've been trying to phase out the term dynamic which is not the defining characteristic that should trigger a switch between the two cloudpickle serialization logics (by attribute/by value), see #357.

I would be tempted to use the terms deep/shallow serialization to distinguish between the two logics, and thus to use something like cloudpickle.force_deep_serialization(module_list). But I'm open to suggestions.

jiangtianli91 · 2021-01-11T22:09:47Z

Hello, I wonder whether there is any update on this pull request? Thanks!

alexbui91 · 2021-01-13T09:35:25Z

Hello. Are there plans to merge this branch to master?

michaeldoron59 · 2021-02-14T21:02:08Z

Can somebody please help this PR to be accepted?
I would be happy to do so, but I lack the technical knowledge about what should be done.
Thanks

nsmith- · 2021-02-18T17:36:55Z

I might give the tests a shot, but I have some questions on the requirements:

Suppose foo.bar contains a function like:

def func():
    from . import baz
    return baz.something()

Should cloudpickle.register_dynamic_module("foo.bar") imply that all of foo is pickled so that baz can be accessed or is it up to the user to ensure the dynamic module is registered at the appropriate scope depth?

Is there any distinction to be made whether foo is a distributed package (i.e. pkg_resources.get_distribution("foo") returns something) or simply a file or directory in sys.path? Do we want to support pickling distributions? What if they are not pure python? Basically we probably want to preempt the question "Why can't I cloudpickle.register_dynamic_module("tensorflow")" or something similarly outrageous :)

michaeldoron59 · 2021-02-18T20:55:41Z

I might give the tests a shot, but I have some questions on the requirements:
* Suppose `foo.bar` contains a function like:
def func():
    from . import baz
    return baz.something()
Should cloudpickle.register_dynamic_module("foo.bar") imply that all of foo is pickled so that baz can be accessed or is it up to the user to ensure the dynamic module is registered at the appropriate scope depth?
* Is there any distinction to be made whether `foo` is a distributed package (i.e. `pkg_resources.get_distribution("foo")` returns something) or simply a file or directory in `sys.path`? Do we want to support pickling distributions? What if they are not pure python? Basically we probably want to preempt the question "Why can't I `cloudpickle.register_dynamic_module("tensorflow")`" or something similarly outrageous :)

In my opinion, the register_dynamic_module should pickle all the necessities of the pickled object, like globals and locals, and the code of the pickled object. It shouldn't go further and pickle every single object of the module. Only what's relevant to the pickled object, and it should be recursive, so that every call to these objects won't cause any crashes.
Regarding your foo.bar.baz example, I think it should crash in case the baz object won't be installed on the server side. In this case as a developer I would make the import outside of the func scope, enabling cloudpickle to recognize it in the globals scope.

Of course we are not targeting to allow bizarre calls like the one you mentioned with tensorflow (nor cloudpickle), this is way out of the scope of this project, and shouldn't be considered at all. All we want to support is the functionality of the developer on the cloud in real time, with pure python, and without him needing to upload his/her code to the cloud for every single change. Allowing to send the code by demand. Anything that's more bizarre that requires installations and heavy stuff, can still be pickle by the reduce_ex function, and should be self implemented.
Again, registering a module doesn't mean we actually want to send the whole module, it just means that in case we are pickling anything, it should serialize the code and not assume that it exists on the server side.

Samreay · 2021-03-09T11:02:45Z

Stubmled onto this after trying to pickle up a local function that calls another local function thats in different file. This would be great for our use case of saying "These modules are local code, pickle these so you dont try to import them on the cloud". Any movement on this?

mam619 · 2021-03-09T11:40:50Z

Currently working on a few projects with mlflow. It's been hard to find a way to upload extra functions that I need to run my models on the cloud. This would make life much easier..!

Samreay · 2021-04-20T09:21:11Z

Given there hasnt been movement on the PR for a few months, Ive reimplemented the code changes and added tests over in PR417

ogrisel · 2021-06-22T16:23:56Z

Closing in favor of #417.

kinghuang added 2 commits July 8, 2020 12:06

Add functions for registering dynamic modules

0858ab2

register_dynamic_module takes a module object or a module name and adds it to the _CUSTOM_DYNAMIC_MODULES_BY_NAME set. Registered modules can be remove with unregister_dynamic_module.

kinghuang added 2 commits July 8, 2020 15:29

Check for dynamic module

f789f2f

Add a condition while looking up modules to see if the module is registered as dynamic, returning None if so. This means resources belonging to modules registered as dynamic will be treated as if they had no module, or belonged to __main__.

kinghuang force-pushed the 206-custom-dynamic-modules branch from adfb986 to f789f2f Compare July 8, 2020 21:29

kinghuang mentioned this pull request Jul 8, 2020

[New feature] full code pickle for arbitrary modules #206

Closed

kinghuang changed the title ~~WIP: Add option to treat resources from specific modules as dynamic~~ Add option to treat resources from specific modules as dynamic Jul 8, 2020

pierreglaser mentioned this pull request Nov 17, 2020

Class-level attributes dropped during serialization across processes #398

Closed

jrbourbeau mentioned this pull request Nov 17, 2020

Imported object's method cannot be run properly when submited to dask client dask/distributed#4233

Closed

This was referenced Mar 25, 2021

Import not working in a cluster ray-project/ray#3116

Closed

How to submit a task which is implemented in a separate module? ray-project/ray#7133

Closed

Samreay mentioned this pull request Apr 20, 2021

Add ability to register modules to be deeply serialized #417

Merged

ogrisel closed this Jun 22, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add option to treat resources from specific modules as dynamic #391

Add option to treat resources from specific modules as dynamic #391

kinghuang commented Jul 8, 2020 •

edited

Loading

codecov bot commented Jul 8, 2020 •

edited

Loading

ogrisel commented Jul 15, 2020

kinghuang commented Nov 15, 2020

bonelli commented Nov 15, 2020

bonelli commented Nov 16, 2020 •

edited

Loading

pierreglaser commented Nov 17, 2020

jiangtianli91 commented Jan 11, 2021

alexbui91 commented Jan 13, 2021

michaeldoron59 commented Feb 14, 2021

nsmith- commented Feb 18, 2021 •

edited

Loading

michaeldoron59 commented Feb 18, 2021 •

edited

Loading

Samreay commented Mar 9, 2021

mam619 commented Mar 9, 2021

Samreay commented Apr 20, 2021

ogrisel commented Jun 22, 2021

Add option to treat resources from specific modules as dynamic #391

Add option to treat resources from specific modules as dynamic #391

Conversation

kinghuang commented Jul 8, 2020 • edited Loading

Problem to solve

Dask.distributed

Dagster

Interactive Development

Solution

Testing

Other

codecov bot commented Jul 8, 2020 • edited Loading

Codecov Report

ogrisel commented Jul 15, 2020

kinghuang commented Nov 15, 2020

bonelli commented Nov 15, 2020

bonelli commented Nov 16, 2020 • edited Loading

pierreglaser commented Nov 17, 2020

jiangtianli91 commented Jan 11, 2021

alexbui91 commented Jan 13, 2021

michaeldoron59 commented Feb 14, 2021

nsmith- commented Feb 18, 2021 • edited Loading

michaeldoron59 commented Feb 18, 2021 • edited Loading

Samreay commented Mar 9, 2021

mam619 commented Mar 9, 2021

Samreay commented Apr 20, 2021

ogrisel commented Jun 22, 2021

kinghuang commented Jul 8, 2020 •

edited

Loading

codecov bot commented Jul 8, 2020 •

edited

Loading

bonelli commented Nov 16, 2020 •

edited

Loading

nsmith- commented Feb 18, 2021 •

edited

Loading

michaeldoron59 commented Feb 18, 2021 •

edited

Loading