Improve dynamic module recognition #357

pierreglaser
Member

Fixes #354

To emulate package capabilities while being a single file, extension modules (for instance a mod.so file) can dynamically create a module object (most likely using the *package_name*.*parent_module_name*.*submodule_name* naming convention to be compatible with import semantics).

Internally, they call PyImport_AddModule(submodule_qualified_name), which creates a module object and adds it to sys.modules. Such a module is dynamic but also importable, which shows that, despite what cloudpickle's logic seems to imply, the two properties are not mutually exclusive.

This PR adds support to treat these dynamic modules as importable. We probably should rename _is_dynamic to _is_importable, as the latter quality is really what matters here.
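
The situation described above can be approximated in pure Python. This is only a sketch: types.ModuleType plus a sys.modules entry stands in for what PyImport_AddModule does from C, and the names fake_pkg and fake_pkg.sub are hypothetical.

```python
import importlib
import sys
import types

# Hypothetical names, for illustration only. Creating the module objects by
# hand and registering them in sys.modules roughly mimics what an extension
# module does through PyImport_AddModule.
parent = types.ModuleType("fake_pkg")
child = types.ModuleType("fake_pkg.sub")
parent.sub = child
sys.modules["fake_pkg"] = parent
sys.modules["fake_pkg.sub"] = child

# The submodule is dynamic: it is not backed by any file...
is_file_backed = hasattr(child, "__file__")

# ...yet the regular import machinery finds it in the sys.modules cache.
imported = importlib.import_module("fake_pkg.sub")
```

Here imported is the very same object as child, even though no file backs it.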

@codecov

codecov bot commented Apr 1, 2020

Codecov Report

Merging #357 into master will decrease coverage by 0.03%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master     #357      +/-   ##
==========================================
- Coverage   92.95%   92.91%   -0.04%     
==========================================
  Files           2        2              
  Lines         809      805       -4     
  Branches      164      164              
==========================================
- Hits          752      748       -4     
  Misses         29       29              
  Partials       28       28              
Impacted Files                     Coverage Δ
cloudpickle/cloudpickle.py         91.76% <100.00%> (-0.06%) ⬇️
cloudpickle/cloudpickle_fast.py    95.72% <100.00%> (ø)

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update d1b1007...a080dd9.

@pierreglaser
Member Author

@henryiii this should hopefully fix the bug you reported. On my machine I get:

In [1]: import boost_histogram._core.hist as h

In [2]: h
Out[2]: <module 'boost_histogram._core.hist'>

In [3]: from cloudpickle import loads, dumps

In [4]: loads(dumps(h))
Out[4]: <module 'boost_histogram._core.hist'>

In [5]: assert loads(dumps(h)) is h

If you can, feel free to check out this PR locally to confirm that it fixes your bug.

@henryiii

henryiii commented Apr 2, 2020

I'll check this within a few hours, thank you!

@henryiii

henryiii commented Apr 2, 2020

Looks good to me,

import boost_histogram as bh
import cloudpickle
h = bh.Histogram(bh.axis.Regular(50, 0, 20))
h.fill([1, 2, 3, 4, 5])
cloudpickle.dumps(h)

works on this branch!

Contributor

@ogrisel ogrisel left a comment

Quick comment in passing:

Contributor

@ogrisel ogrisel left a comment

First pass of comments and suggestions for improvements.

@pierreglaser
Member Author

pierreglaser commented Apr 22, 2020

Hmm, I just realised that the importability of a module, as I referred to it in this PR, conflates two concepts:

importability via module attribute lookup:

associated semantic: from package.module import submodule. Python typically does an attribute lookup inside module to find submodule. Because this syntax does not require submodule to be a module (it could be any object), any submodule that is registered as an attribute of its parent (but not necessarily added to sys.modules) will be importable this way.

importability using module-specific import machinery:

associated semantic: import package.module.submodule. In this case, submodule is expected to be an importable module, and thus must be either file-backed or added to sys.modules during the import of its parent.

This PR actually confuses the two: _module_is_importable checks the validity of the module with respect to the first semantic (from ... import ...), but if it evaluates to true, cloudpickle will rely on the second semantic (through __import__ in subimport)...
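
The difference between the two semantics can be reproduced with a small sketch. The names demo_pkg and demo_pkg.sub are hypothetical, and the modules are built by hand with types.ModuleType rather than by a real extension module.

```python
import importlib
import sys
import types

# Hypothetical parent module with a dynamic submodule attached as an
# attribute. Only the parent is registered in sys.modules.
pkg = types.ModuleType("demo_pkg")
sub = types.ModuleType("demo_pkg.sub")
pkg.sub = sub
sys.modules["demo_pkg"] = pkg

# First semantic: "from ... import ..." falls back to attribute lookup.
from demo_pkg import sub as via_attribute

# Second semantic: the import machinery needs a sys.modules entry (or a
# file-backed module), so this fails for now.
try:
    importlib.import_module("demo_pkg.sub")
    import_machinery_ok = True
except ImportError:
    import_machinery_ok = False

# Once the submodule is registered, the second semantic works as well.
sys.modules["demo_pkg.sub"] = sub
via_machinery = importlib.import_module("demo_pkg.sub")
```

A check based on the first semantic can therefore return True for a module that the __import__ call in subimport would fail to load.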

@ogrisel
Contributor

ogrisel commented Apr 27, 2020

Indeed, it would be better to have the code make the distinction between the 2 notions and only check for the one needed where appropriate.

@henryiii

What's the status of this? I could put in a hack in boost-histogram that would get this working, but I'd rather it be fixed properly for everyone.

@pierreglaser pierreglaser force-pushed the fix-pickling-submodule-in-sharedobject-file branch from fb6d34a to da60d2d Compare May 20, 2020 21:07
@pierreglaser
Member Author

rebased.

@pierreglaser
Member Author

pierreglaser commented May 20, 2020

After iterating a bit more and adding some test cases, my opinion on this PR has changed somewhat.

If #354 appeared in the first place, it is because cloudpickle does not assume that a module being inside sys.modules (used as a cache by importlib) is a sufficient condition for it to be importable. I'm unsure what the rationale behind this is, or what use case it supports.

As of now, _module_is_importable fixes another case that was never reported in the first place, which is to treat as importable a dynamic module imported using a from ... import statement. But the more I think about this, the more I think we should not do this: module objects do not have __module__ attributes, so it is impossible to cheaply know where a dynamic module was imported from unless we adopt (as we do now in _module_is_importable) a best-effort, brittle heuristic that trusts the submodule name, provided that this name was chosen to follow a hierarchical parent_module_name.submodule_name convention (even though that convention has no effect whatsoever on the behavior of importlib).
I wonder whether cloudpickle should try to outsmart traditional object pickling by pickling such dynamic modules as importable. We don't try to pickle by attribute all objects accessed using from module import MyObject anyway, so why do it for dynamic modules? (This echoes my previous comment about cloudpickle detecting from ... import semantics.)

The more I think about this, the more I believe cloudpickle should treat modules inside sys.modules as importable - _module_is_importable would simply become a wrapper around module.__name__ in sys.modules.

WDYT @ogrisel?

FYI: I tried having _module_is_importable wrap module.__name__ in sys.modules -- the test suite just passes :)
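
For reference, the simplified check proposed above could look roughly like the following. This is a sketch only: module_is_importable is a hypothetical stand-in for the private helper, not the code that was merged.

```python
import sys
import types


def module_is_importable(module):
    """Hypothetical stand-in for _module_is_importable as proposed above."""
    # A module is considered importable if its name is a key of the
    # sys.modules cache, whether or not it is backed by a file.
    return module.__name__ in sys.modules


# A hand-made module that has not been registered anywhere.
orphan = types.ModuleType("orphan_mod")

before = module_is_importable(orphan)
sys.modules["orphan_mod"] = orphan
after = module_is_importable(orphan)
```

With this check, registering a module in sys.modules is all it takes to have it treated as importable.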

@pierreglaser pierreglaser added the ci downstream Signal the CI to run the test suite of all registered cloudpickle downstream projects. label May 24, 2020
@pierreglaser
Member Author

The distributed downstream build failure looks unrelated.

@pierreglaser
Member Author

Pinging downstream library developers: @suquark @mrocklin @HyukjinKwon @ogrisel - I would like to have your opinion before merging this:

This PR proposes to simplify module pickling: when cloudpickle is asked to pickle a module, it checks whether this module is importable (and not dynamically created) by looking for it inside sys.modules. If so, cloudpickle pickles the module by writing a sequence of instructions amounting to import <module> in the pickle string.
Previously, cloudpickle's module importability check was far more complicated than simply looking inside sys.modules. My opinion is that looking inside sys.modules is simpler and more failure-proof.

Does the assumption module in sys.modules.values() <=> module is importable sound OK to you? Or is there any use case of yours where this assumption breaks (for instance, knowingly removing an importable module from sys.modules)?
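
To make the proposal concrete, here is a rough standalone sketch of pickling a module by reference whenever it is found in sys.modules. It is not cloudpickle's code: the ByReferenceModulePickler class and the dumps helper are hypothetical, and reducer_override requires Python 3.8 or later.

```python
import importlib
import io
import json
import pickle
import sys
import types


class ByReferenceModulePickler(pickle.Pickler):
    """Hypothetical pickler, not part of cloudpickle."""

    def reducer_override(self, obj):
        # Modules registered in sys.modules are pickled as
        # "call importlib.import_module(name) at load time".
        if isinstance(obj, types.ModuleType) and obj.__name__ in sys.modules:
            return importlib.import_module, (obj.__name__,)
        # Everything else goes through the regular pickling logic.
        return NotImplemented


def dumps(obj):
    buffer = io.BytesIO()
    ByReferenceModulePickler(buffer).dump(obj)
    return buffer.getvalue()


# The payload only stores the module name; loading re-imports the module.
restored = pickle.loads(dumps(json))
```

The restored object is the json module itself, since the loader simply looks it up again on the receiving side.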

@mrocklin
Contributor

mrocklin commented May 25, 2020 via email

Contributor

@ogrisel ogrisel left a comment

LGTM. I agree that just checking sys.modules to know whether a module should be importable or not should be a good enough heuristic.

Code that generates non-importable dynamic modules registered in sys.modules is probably pathological and I don't think we should try to support it.

-    if _is_dynamic(obj):
+    if _is_importable(obj):
+        return subimport, (obj.__name__,)
+    else:
         obj.__dict__.pop('__builtins__', None)
         return dynamic_subimport, (obj.__name__, vars(obj))
Contributor

This could probably be renamed to _make_dynamic_module, as we actually do not import the module at all.

We should still keep dynamic_subimport as an alias for _make_dynamic_module to preserve backward compat with old pickle files.

Contributor

The subimport function could also be renamed to _import_module to be more explicit. But we would also need to keep an alias named subimport pointing to _import_module for backward compat as well.

@ogrisel
Contributor

ogrisel commented Jun 5, 2020

#377 confirms that the failure observed in the distributed tests ("OSError: [Errno 101] Network is unreachable") is not related to the changes in this PR.

@ogrisel
Contributor

ogrisel commented Jun 5, 2020

@pierreglaser I put a comment on #357 (comment) but as this name was not introduced in this PR, I am fine with not addressing it here.

Could you please add an entry to the changelog before merging this?

Contributor

@ogrisel ogrisel left a comment

Let's merge!

@ogrisel ogrisel merged commit 0c0b8d4 into cloudpipe:master Jun 8, 2020
@albertcthomas

Code that generates non-importable dynamic modules registered in sys.modules is probably pathological and I don't think we should try to support it.

This makes sense.

I investigated a related issue recently where modules are imported from a source file. If you strictly follow the importlib package doc to import a source file you add the module to sys.modules but then if the module is non-importable cloudpickle (and of course pickle) fails. Not adding the module to sys.modules made it work with cloudpickle, although it was not completely clear to me what was happening under the hood. See this comment in paris-saclay-cds/ramp-workflow#232. I was working with cloudpickle<1.5.
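
A minimal sketch of the scenario described above, as I understand it, follows. It uses the importlib recipe for importing a source file directly; the module name standalone_demo_mod and the file name are hypothetical.

```python
import importlib.util
import os
import sys
import tempfile

module_name = "standalone_demo_mod"  # hypothetical name

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "some_file.py")
    with open(path, "w") as f:
        f.write("VALUE = 42\n")

    # Recipe from the importlib documentation for importing a source file.
    spec = importlib.util.spec_from_file_location(module_name, path)
    module = importlib.util.module_from_spec(spec)
    sys.modules[module_name] = module
    spec.loader.exec_module(module)

# The module is now in the sys.modules cache, so a check based on
# sys.modules treats it as importable...
in_cache = module_name in sys.modules

# ...but without that cache entry (as in a fresh process), the import
# system cannot locate it, because the file is not on sys.path.
del sys.modules[module_name]
findable = importlib.util.find_spec(module_name) is not None
```

In this sketch a by-reference pickle of the module would load in the current process but not in a worker that never ran the loading code, which may be why skipping the sys.modules registration made the module picklable by value.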

Labels
ci downstream Signal the CI to run the test suite of all registered cloudpickle downstream projects.
Development

Successfully merging this pull request may close these issues.

_is_dynamic fails on submodule in .so file
5 participants