Add option to treat resources from specific modules as dynamic #391
Conversation
Create a set that will be used to declare modules whose resources should be treated as dynamic during pickling. The default behaviour is to reference module-backed resources, instead of pickling them, which is inconvenient in certain use cases.
register_dynamic_module takes a module object or a module name and adds it to the _CUSTOM_DYNAMIC_MODULES_BY_NAME set. Registered modules can be removed with unregister_dynamic_module.
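As a rough illustration of the registry this describes (a sketch consistent with the summary above, not the PR's actual code):

```python
import types

# Module-level registry of names whose resources are pickled by value.
_CUSTOM_DYNAMIC_MODULES_BY_NAME = set()

def register_dynamic_module(module):
    """Mark a module (object or dotted name) as dynamic for pickling."""
    name = module.__name__ if isinstance(module, types.ModuleType) else module
    _CUSTOM_DYNAMIC_MODULES_BY_NAME.add(name)

def unregister_dynamic_module(module):
    """Remove a previously registered module (object or dotted name)."""
    name = module.__name__ if isinstance(module, types.ModuleType) else module
    _CUSTOM_DYNAMIC_MODULES_BY_NAME.discard(name)
```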
Codecov Report
@@            Coverage Diff             @@
##           master     #391      +/-   ##
==========================================
- Coverage   91.40%   89.59%    -1.82%
==========================================
  Files           3        3
  Lines         640      663      +23
  Branches      134      139       +5
==========================================
+ Hits          585      594       +9
- Misses         34       44      +10
- Partials       21       25       +4
Continue to review full report at Codecov.
_is_dynamic_module returns True if the specified module is in the set of modules that should be considered dynamic for pickling purposes. If submodules is True, then the module's parent modules will be searched to see if they were registered as dynamic.
Add a condition while looking up modules to see if the module is registered as dynamic, returning None if so. This means resources belonging to modules registered as dynamic will be treated as if they had no module, or belonged to __main__.
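For context on what "belonged to __main__" implies, here is a small, hedged illustration of cloudpickle's existing by-reference vs. by-value behaviour (the function name is made up):

```python
import os.path
import cloudpickle

def defined_in_main(x):
    # Lives in __main__ when run as a script, so cloudpickle serializes the
    # full code object rather than an import path.
    return x * 2

by_reference = cloudpickle.dumps(os.path.join)   # importable module -> reference only
by_value = cloudpickle.dumps(defined_in_main)    # __main__ -> pickled by value

# The by-value payload is noticeably larger because it carries the bytecode.
print(len(by_reference), len(by_value))
```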
adfb986 to f789f2f (compare)
This seems like a valid use case. For testing, one solution would be to pickle a function defined in a module registered as dynamic, load the pickled function in a subprocess, and check that the module of that function is not in … Have a look at tests that use … On top of the missing test, I think it would also be nice to be able to register dynamic modules on a specific …
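A hedged sketch of that testing suggestion (the `mymodule` module, the `some_func` helper, and the `register_dynamic_module` call are assumptions, not the PR's actual test code):

```python
import base64
import subprocess
import sys

import cloudpickle
import mymodule  # hypothetical: importable on the "client" side only

cloudpickle.register_dynamic_module(mymodule)
payload = base64.b64encode(cloudpickle.dumps(mymodule.some_func)).decode()

# Unpickle in a fresh interpreter and verify the module was never imported there.
child = (
    "import base64, pickle, sys; "
    f"func = pickle.loads(base64.b64decode('{payload}')); "
    "assert 'mymodule' not in sys.modules; "
    "print(func())"
)
subprocess.run([sys.executable, "-c", child], check=True)
```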
Ah, I totally forgot about this PR. 😬 I'll try to find some time this week to add tests.
Please do, this PR would be extremely useful to me: https://stackoverflow.com/questions/64848079/sharing-a-dask-cluster-between-projects-with-different-module-versions
@kinghuang how would this work in Dask Distributed if 2 different clients connected to the same scheduler/workers and, at the same time, used different versions of the same module (hence different source code but the same names)? Would there be some issue with name clashing, or is each pickled source managed separately during computation for its respective task?
Use-case and proposed implementation look reasonable; I can make a deeper review once you have added some tests @kinghuang. About naming: I've been trying to phase out the term … I would be tempted to use the terms …
Hello, I wonder whether there is any update on this pull request? Thanks!
Hello. Are there plans to merge this branch to master?
Can somebody please help get this PR accepted?
I might give the tests a shot, but I have some questions on the requirements:
```python
def func():
    from . import baz
    return baz.something()
```

Should …
Of course we are not targeting bizarre calls like the one you mentioned with tensorflow (nor is cloudpickle); that is way out of the scope of this project and shouldn't be considered at all. All we want to support is letting a developer work against the cloud in real time, in pure Python, without needing to upload their code to the cloud for every single change, allowing code to be sent on demand. Anything more exotic that requires installations and heavy machinery can still be pickled via the reduce_ex function, and should be implemented by the user.
Stumbled onto this after trying to pickle a local function that calls another local function that's in a different file. This would be great for our use case of saying "these modules are local code, pickle them so you don't try to import them on the cloud". Any movement on this?
Currently working on a few projects with mlflow. It's been hard to find a way to upload the extra functions I need to run my models on the cloud. This would make life much easier!
Given there hasn't been movement on the PR for a few months, I've reimplemented the code changes and added tests over in PR #417.
Closing in favor of #417. |
Cloudpickle generally considers objects that have a module (i.e., the `__module__` attribute) that is in `sys.modules` to be importable at unpickling time, and only stores a reference to the object instead of fully pickling the underlying code/data. This behaviour is inconvenient in specific cases. This PR adds an option for registering specific modules as "dynamic", so that cloudpickle will fully pickle objects belonging to the module instead of storing just a reference.

Problem to solve

Dask.distributed
In Dask.distributed, code is pickled and sent from clients to workers for processing. For example, you might apply a function to every row in a dataframe like so.
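The code example referenced here was lost in the page extraction; a minimal sketch of what it might have looked like (the dataframe contents and the `sum_everything` helper are stand-ins):

```python
import dask.dataframe as dd
import pandas as pd

df = dd.from_pandas(pd.DataFrame({"a": range(10), "b": range(10)}), npartitions=2)

# A self-contained lambda: cloudpickle serializes it in full, so the workers
# need nothing beyond dask/pandas.
totals = df.apply(lambda row: row.a + row.b, axis=1, meta=("total", "int64")).compute()

# A function defined in an importable module is pickled only as a reference,
# so `some_module.info` would have to be installed on every worker.
# from some_module.info import sum_everything
# totals = df.apply(sum_everything, axis=1, meta=("total", "int64")).compute()
```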
The lambda function will get pickled and sent to workers. This is fine for simple, self-contained lambdas. But if the Dask code is organized in a module, and the apply function is passed a function from that module, then the module has to be installed on the Dask workers, because cloudpickle will only send a reference to the function.
This is highly inconvenient, as `sum_everything` in this example is just a part of the Dask graph definition, and not a shared piece of code meant to be distributed and installed on the Dask workers.

Dagster
Dagster pipelines can be defined in Python modules and run on remote clusters. Similar to the Dask use case, if a pipeline solid makes a reference to a function in its module instead of being written all inline, then that function will only get pickled as a reference. Again, this makes it necessary to distribute a module that's not meant to be distributed, just to satisfy the pickling process.
Interactive Development
Pickling code from modules is also convenient during development, where it can be cumbersome or time-consuming to build and install modules repeatedly in remote environments.
Solution
This PR adds a `register_dynamic_module(module)` function that takes a module, or the name of a module, to treat as dynamic. The module's name is stored in a set named `_CUSTOM_DYNAMIC_MODULES_BY_NAME`.

For the Dask example above, let's say the code was in a module named `some_module.info`. Before computing the Dask graph, the module can be registered as dynamic with the following.
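The registration snippet itself did not survive the page extraction; a minimal sketch, assuming the function is exposed at the package level as described above:

```python
import cloudpickle

# Mark the module as dynamic so everything it defines is pickled by value.
cloudpickle.register_dynamic_module("some_module.info")

# ... then build and compute the Dask graph as usual; sum_everything now
# travels inside the pickle payload instead of as an import reference.
```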
In `_lookup_module_and_qualname(obj, name=None)`, a new condition is added after the checks for `module_name is None` and `module_name == "__main__"` to see if `_is_dynamic_module(module_name)` returns True. If so, then `None` is returned, and the `obj` is fully pickled just like the `None` and `__main__` cases.
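For orientation, a rough sketch of where that condition sits relative to the existing checks (a paraphrase of the description above, not the literal diff):

```python
def _lookup_module_and_qualname(obj, name=None):
    # ... existing logic resolves `name` and the object's qualified name ...
    module_name = getattr(obj, "__module__", None)

    if module_name is None:
        return None   # no module: pickle by value
    if module_name == "__main__":
        return None   # __main__ objects are always pickled by value
    if _is_dynamic_module(module_name):
        return None   # new: registered dynamic modules behave like __main__

    # ... otherwise import the module and return (module, qualname) as before ...
```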
The `_is_dynamic_module(module, submodules=True)` function will return true if the `module` argument is in the `_CUSTOM_DYNAMIC_MODULES_BY_NAME` set. If `submodules` is true (the default), then the function will also return true if it is a submodule of a registered dynamic module.

Given the submodule handling, the entire `foo` could be registered as dynamic, which would cover `foo.bar` and any other submodules.
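A minimal sketch of a check matching that description, including the submodule walk (an illustration, not the PR's exact code):

```python
def _is_dynamic_module(module, submodules=True):
    """Return True if `module` (a name or module object) was registered as dynamic."""
    name = getattr(module, "__name__", module)
    if name in _CUSTOM_DYNAMIC_MODULES_BY_NAME:
        return True
    if submodules:
        # Walk up the dotted name: "foo.bar.baz" matches if "foo.bar" or "foo"
        # was registered.
        parts = name.split(".")
        while len(parts) > 1:
            parts.pop()
            if ".".join(parts) in _CUSTOM_DYNAMIC_MODULES_BY_NAME:
                return True
    return False
```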
Testing

I'm not sure how best to write test case(s) for this. For local spot testing, I added a submodule under cloudpickle named `test`, and verified things like so.
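The verification snippet did not survive the page extraction; a hedged reconstruction of what such a spot check might look like (the `cloudpickle.test` submodule and the `some_helper` function are stand-ins):

```python
import pickle

import cloudpickle
from cloudpickle import test as cp_test   # local spot-testing submodule described above

cloudpickle.register_dynamic_module("cloudpickle.test")

payload = cloudpickle.dumps(cp_test.some_helper)   # some_helper is a stand-in name
restored = pickle.loads(payload)

# The function round-trips with its code included, rather than as a bare
# reference to cloudpickle.test.some_helper.
print(restored)
```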
Other

I've been using a variation of this, written as a runtime patch on cloudpickle, for developing Dask code. It's worked well for me!
Closes #206