Commit 343da11

Samreay, pierreglaser, Samuel Hinton, and ogrisel authored
Add ability to register modules to be deeply serialized (#417)
Co-authored-by: Pierre Glaser <[email protected]>
Co-authored-by: Samuel Hinton <[email protected]>
Co-authored-by: Olivier Grisel <[email protected]>
1 parent 0c62ae0 commit 343da11

9 files changed, +503 -29 lines changed

Diff for: CHANGES.md

+5 -1
@@ -6,9 +6,13 @@ dev
 
 - Python 3.5 is no longer supported.
 
+- Support for registering modules to be serialised by value. This will
+  allow for code defined in local modules to be serialised and executed
+  remotely without those local modules installed on the remote machine.
+  ([PR #417](https://github.com/cloudpipe/cloudpickle/pull/417))
+
 - Fix a side effect altering dynamic modules at pickling time.
   ([PR #426](https://github.com/cloudpipe/cloudpickle/pull/426))
-
 - Support for pickling type annotations on Python 3.10 as per [PEP 563](
   https://www.python.org/dev/peps/pep-0563/)
   ([PR #400](https://github.com/cloudpipe/cloudpickle/pull/400))

Diff for: README.md

+53
@@ -66,6 +66,59 @@ Pickling a function interactively defined in a Python shell session
 85
 ```
 
+
+Overriding pickle's serialization mechanism for importable constructs:
+----------------------------------------------------------------------
+
+An important difference between `cloudpickle` and `pickle` is that
+`cloudpickle` can serialize a function or class **by value**, whereas `pickle`
+can only serialize it **by reference**. Serialization by reference treats
+functions and classes as attributes of modules, and pickles them through
+instructions that trigger the import of their module at load time.
+Serialization by reference is thus limited in that it assumes that the module
+containing the function or class is available/importable in the unpickling
+environment. This assumption breaks when pickling constructs defined in an
+interactive session, a case that is automatically detected by `cloudpickle`,
+which pickles such constructs **by value**.
+
+Another case where the importability assumption is expected to break is when
+developing a module in a distributed execution environment: the worker
+processes may not have access to the said module, for example if they live on a
+different machine than the process in which the module is being developed.
+By itself, `cloudpickle` cannot detect such "locally importable" modules and
+switch to serialization by value; instead, it relies on its default mode,
+which is serialization by reference. However, since `cloudpickle 1.7.0`, one
+can explicitly specify modules for which serialization by value should be used,
+using the `register_pickle_by_value(module)`/`unregister_pickle_by_value(module)` API:
+
+```python
+>>> import cloudpickle
+>>> import my_module
+>>> cloudpickle.register_pickle_by_value(my_module)
+>>> cloudpickle.dumps(my_module.my_function)  # my_function is pickled by value
+>>> cloudpickle.unregister_pickle_by_value(my_module)
+>>> cloudpickle.dumps(my_module.my_function)  # my_function is pickled by reference
+```
+
+Using this API, there is no need to re-install the new version of the module on
+all the worker nodes nor to restart the workers: restarting the client Python
+process with the new source code is enough.
+
+Note that this feature is still **experimental**, and may fail in the following
+situations:
+
+- If the body of a function/class pickled by value contains an `import` statement:
+  ```python
+  >>> def f():
+  ...     from another_module import g
+  ...     # calling f in the unpickling environment may fail if another_module
+  ...     # is unavailable
+  ...     return g() + 1
+  ```
+
+- If a function pickled by reference uses a function pickled by value during its execution.
+
+
 Running the tests
 -----------------
 
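
The by-reference behaviour described in the README addition above can be observed with the standard library alone. The following is a minimal sketch that uses only `pickle` (not `cloudpickle`); the helper name `pickles_by_reference` is hypothetical and exists only for this illustration. It suggests that the payload of an importable function carries the module and attribute names rather than the function body, and that a function without an importable name cannot be pickled this way.

```python
import json
import pickle


def pickles_by_reference(obj):
    """Return True if the standard pickle module can serialize obj.

    For functions and classes, the standard library only pickles by
    reference, so success here means "importable by name".
    (Hypothetical helper, for illustration only.)
    """
    try:
        pickle.dumps(obj)
    except (pickle.PicklingError, AttributeError):
        return False
    return True


# json.dumps is an attribute of an importable module: the payload stores the
# names "json" and "dumps", not the bytecode of the function.
payload = pickle.dumps(json.dumps)
print(b"json" in payload, b"dumps" in payload)

# Loading re-imports the module and fetches the attribute, so the very same
# function object comes back.
print(pickle.loads(payload) is json.dumps)

# A lambda has no importable name, so by-reference pickling is not possible.
print(pickles_by_reference(lambda x: x + 1))
```

This is the limitation that `register_pickle_by_value` is meant to work around for modules that exist only on the client side.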

Diff for: cloudpickle/cloudpickle.py

+106 -10
@@ -88,6 +88,9 @@ def g():
 # communication speed over compatibility:
 DEFAULT_PROTOCOL = pickle.HIGHEST_PROTOCOL
 
+# Names of modules whose resources should be treated as dynamic.
+_PICKLE_BY_VALUE_MODULES = set()
+
 # Track the provenance of reconstructed dynamic classes to make it possible to
 # reconstruct instances from the matching singleton class definition when
 # appropriate and preserve the usual "isinstance" semantics of Python objects.
@@ -124,6 +127,77 @@ def _lookup_class_or_track(class_tracker_id, class_def):
     return class_def
 
 
+def register_pickle_by_value(module):
+    """Register a module to make its functions and classes picklable by value.
+
+    By default, functions and classes that are attributes of an importable
+    module are to be pickled by reference, that is relying on re-importing
+    the attribute from the module at load time.
+
+    If `register_pickle_by_value(module)` is called, all its functions and
+    classes are subsequently to be pickled by value, meaning that they can
+    be loaded in Python processes where the module is not importable.
+
+    This is especially useful when developing a module in a distributed
+    execution environment: restarting the client Python process with the new
+    source code is enough: there is no need to re-install the new version
+    of the module on all the worker nodes nor to restart the workers.
+
+    Note: this feature is considered experimental. See the cloudpickle
+    README.md file for more details and limitations.
+    """
+    if not isinstance(module, types.ModuleType):
+        raise ValueError(
+            f"Input should be a module object, got {str(module)} instead"
+        )
+    # In the future, cloudpickle may need a way to access any module registered
+    # for pickling by value in order to introspect relative imports inside
+    # functions pickled by value. (see
+    # https://github.com/cloudpipe/cloudpickle/pull/417#issuecomment-873684633).
+    # This access can be ensured by checking that module is present in
+    # sys.modules at registering time and assuming that it will still be in
+    # there when accessed during pickling. Another alternative would be to
+    # store a weakref to the module. Even though cloudpickle does not implement
+    # this introspection yet, in order to avoid a possible breaking change
+    # later, we still enforce the presence of module inside sys.modules.
+    if module.__name__ not in sys.modules:
+        raise ValueError(
+            f"{module} was not imported correctly, have you used an "
+            f"`import` statement to access it?"
+        )
+    _PICKLE_BY_VALUE_MODULES.add(module.__name__)
+
+
+def unregister_pickle_by_value(module):
+    """Unregister that the input module should be pickled by value."""
+    if not isinstance(module, types.ModuleType):
+        raise ValueError(
+            f"Input should be a module object, got {str(module)} instead"
+        )
+    if module.__name__ not in _PICKLE_BY_VALUE_MODULES:
+        raise ValueError(f"{module} is not registered for pickle by value")
+    else:
+        _PICKLE_BY_VALUE_MODULES.remove(module.__name__)
+
+
+def list_registry_pickle_by_value():
+    return _PICKLE_BY_VALUE_MODULES.copy()
+
+
+def _is_registered_pickle_by_value(module):
+    module_name = module.__name__
+    if module_name in _PICKLE_BY_VALUE_MODULES:
+        return True
+    while True:
+        parent_name = module_name.rsplit(".", 1)[0]
+        if parent_name == module_name:
+            break
+        if parent_name in _PICKLE_BY_VALUE_MODULES:
+            return True
+        module_name = parent_name
+    return False
+
+
 def _whichmodule(obj, name):
     """Find the module an object belongs to.
 
@@ -170,18 +244,35 @@ def _whichmodule(obj, name):
     return None
 
 
-def _is_importable(obj, name=None):
-    """Dispatcher utility to test the importability of various constructs."""
-    if isinstance(obj, types.FunctionType):
-        return _lookup_module_and_qualname(obj, name=name) is not None
-    elif issubclass(type(obj), type):
-        return _lookup_module_and_qualname(obj, name=name) is not None
+def _should_pickle_by_reference(obj, name=None):
+    """Test whether a function or a class should be pickled by reference
+
+    Pickling by reference means that the object (typically a function or a
+    class) is an attribute of a module that is assumed to be importable in the
+    target Python environment. Loading will therefore rely on importing the
+    module and then calling `getattr` on it to access the function or class.
+
+    Pickling by reference is the only option to pickle functions and classes
+    in the standard library. In cloudpickle the alternative option is to
+    pickle by value (for instance for interactively or locally defined
+    functions and classes or for attributes of modules that have been
+    explicitly registered to be pickled by value).
+    """
+    if isinstance(obj, types.FunctionType) or issubclass(type(obj), type):
+        module_and_name = _lookup_module_and_qualname(obj, name=name)
+        if module_and_name is None:
+            return False
+        module, name = module_and_name
+        return not _is_registered_pickle_by_value(module)
+
     elif isinstance(obj, types.ModuleType):
         # We assume that sys.modules is primarily used as a cache mechanism for
         # the Python import machinery. Checking if a module has been added in
-        # is sys.modules therefore a cheap and simple heuristic to tell us whether
-        # we can assume that a given module could be imported by name in
-        # another Python process.
+        # sys.modules is therefore a cheap and simple heuristic to tell us
+        # whether we can assume that a given module could be imported by name
+        # in another Python process.
+        if _is_registered_pickle_by_value(obj):
+            return False
         return obj.__name__ in sys.modules
     else:
         raise TypeError(
@@ -839,10 +930,15 @@ def _decompose_typevar(obj):
 
 
 def _typevar_reduce(obj):
-    # TypeVar instances have no __qualname__ hence we pass the name explicitly.
+    # TypeVar instances require the module information hence why we
+    # are not using the _should_pickle_by_reference directly
     module_and_name = _lookup_module_and_qualname(obj, name=obj.__name__)
+
     if module_and_name is None:
         return (_make_typevar, _decompose_typevar(obj))
+    elif _is_registered_pickle_by_value(module_and_name[0]):
+        return (_make_typevar, _decompose_typevar(obj))
+
     return (getattr, module_and_name)
 
 

Diff for: cloudpickle/cloudpickle_fast.py

+6 -6
@@ -28,7 +28,7 @@
 from .compat import pickle, Pickler
 from .cloudpickle import (
     _extract_code_globals, _BUILTIN_TYPE_NAMES, DEFAULT_PROTOCOL,
-    _find_imported_submodules, _get_cell_contents, _is_importable,
+    _find_imported_submodules, _get_cell_contents, _should_pickle_by_reference,
     _builtin_type, _get_or_create_tracker_id, _make_skeleton_class,
     _make_skeleton_enum, _extract_class_dict, dynamic_subimport, subimport,
     _typevar_reduce, _get_bases, _make_cell, _make_empty_cell, CellType,
@@ -352,7 +352,7 @@ def _memoryview_reduce(obj):
 
 
 def _module_reduce(obj):
-    if _is_importable(obj):
+    if _should_pickle_by_reference(obj):
         return subimport, (obj.__name__,)
     else:
         # Some external libraries can populate the "__builtins__" entry of a
@@ -414,7 +414,7 @@ def _class_reduce(obj):
         return type, (NotImplemented,)
     elif obj in _BUILTIN_TYPE_NAMES:
         return _builtin_type, (_BUILTIN_TYPE_NAMES[obj],)
-    elif not _is_importable(obj):
+    elif not _should_pickle_by_reference(obj):
         return _dynamic_class_reduce(obj)
     return NotImplemented
 
@@ -559,7 +559,7 @@ def _function_reduce(self, obj):
         As opposed to cloudpickle.py, There no special handling for builtin
         pypy functions because cloudpickle_fast is CPython-specific.
         """
-        if _is_importable(obj):
+        if _should_pickle_by_reference(obj):
             return NotImplemented
         else:
             return self._dynamic_function_reduce(obj)
@@ -763,7 +763,7 @@ def save_global(self, obj, name=None, pack=struct.pack):
                 )
             elif name is not None:
                 Pickler.save_global(self, obj, name=name)
-            elif not _is_importable(obj, name=name):
+            elif not _should_pickle_by_reference(obj, name=name):
                 self._save_reduce_pickle5(*_dynamic_class_reduce(obj), obj=obj)
             else:
                 Pickler.save_global(self, obj, name=name)
@@ -775,7 +775,7 @@ def save_function(self, obj, name=None):
             Determines what kind of function obj is (e.g. lambda, defined at
             interactive prompt, etc) and handles the pickling appropriately.
             """
-            if _is_importable(obj, name=name):
+            if _should_pickle_by_reference(obj, name=name):
                 return Pickler.save_global(self, obj, name=name)
             elif PYPY and isinstance(obj.__code__, builtin_code_type):
                 return self.save_pypy_builtin_func(obj)
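
Several reducers in this file either return a custom reduce tuple or return `NotImplemented` to fall back to the default pickling path. The sketch below illustrates that pattern with the standard library only, assuming the `reducer_override` hook of `pickle.Pickler` (available since Python 3.8). The class `ModuleAwarePickler` and the helper `dumps` are hypothetical, and `importlib.import_module` is swapped in where cloudpickle uses its own `subimport` helper. It covers only the by-reference branch for modules, loosely mirroring `_module_reduce`.

```python
import importlib
import io
import json
import pickle
import sys
import types


class ModuleAwarePickler(pickle.Pickler):
    """Toy pickler that adds by-reference support for module objects."""

    def reducer_override(self, obj):
        if isinstance(obj, types.ModuleType) and obj.__name__ in sys.modules:
            # By reference: only the module name is stored, and the module
            # is re-imported at load time.
            return importlib.import_module, (obj.__name__,)
        # NotImplemented hands the object back to pickle's default logic.
        return NotImplemented


def dumps(obj):
    """Serialize obj with the toy pickler and return the bytes."""
    buffer = io.BytesIO()
    ModuleAwarePickler(buffer).dump(obj)
    return buffer.getvalue()


# The standard pickler rejects module objects; the override makes them work,
# while the integer still goes through the default code path.
restored = pickle.loads(dumps({"module": json, "answer": 42}))
print(restored["module"] is json, restored["answer"])
```

In the commit itself, the condition guarding the by-reference branch is `_should_pickle_by_reference(obj)`, which additionally returns `False` for modules registered for pickling by value.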
