Support for sub-interpreters #5564
include/pybind11/detail/internals.h
@@ -260,7 +260,25 @@ struct type_info {
/// Each module locally stores a pointer to the `internals` data. The data
/// itself is shared among modules with the same `PYBIND11_INTERNALS_ID`.
inline internals **&get_internals_pp() {
#if defined(PYPY_VERSION) || defined(GRAALVM_PYTHON) || PY_VERSION_HEX < 0x030C0000 \
What about Emscripten/WASI? I'm assuming iOS and Android would behave like normal CPython, but does Wasm support this?
I'm not very familiar with non-CPython implementations, but it looks like Pyodide has some support for subinterpreters at least.
Very excited to see this! I have a couple of comments/questions:
We'll need some docs, too. Maybe we should do a full test run with the define on, too.
This is a complex PR. I need to find more time later for a full review.
High-level questions:
- Could it be useful to split this PR: 1. multi-phase init only. 2. multi-interpreter support? That would make it easier to do the reviews now, and understand the development steps in the future. It might also help us deal with bugs after this change is released.
- Are you still working on new Python tests, to exercise the new multi-interpreter support?
- Is there a potential for bug or feature interference between free-threading and multi-interpreter functionality? — I think we'll need tests for all combinations of (free-threading on/off) x (multi-interpreter support on/off); not for all platforms, but maybe one each: Linux, macOS, Windows.
@@ -427,9 +445,6 @@ PYBIND11_NOINLINE internals &get_internals() {
    return **internals_pp;
}

#if defined(PYBIND11_SIMPLE_GIL_MANAGEMENT)
I put this here on purpose, with the idea that we don't overlook this when we make changes in or around pybind11/gil.h. I don't see changes in gil.h. Could we keep this as-is?
Sure. I have created #5574 for multi-phase init only. I will keep updating this PR for sub-interpreter support, and will remove the multiphase init from this branch shortly.
I'm not currently, but I can add a few more after/along with some additional changes from comments.
No one would ever say less testing :) I think the potential problems are small. While they have similar goals, free threading and own-gil-sub-interpreters are fairly different and could be used together. Sub-interpreters were originally created (I think) to offer sandboxing features. So even with free threading, the idea that a module is used in two different sandboxes is still valid, and it still needs separation for each instance. I also considered whether free-threading support is a superset of sub-interpreter support. I think it is not, because the two have slightly different implications for a module. With free threading (only) it is perfectly reasonable to have, for example, a global/static atomic variable. With sub-interpreters that is probably incorrect, because each sub-interpreter should have its own separate state. However, if a module is free-threading safe (so, thread safe) and it is multi-interpreter aware (has no global state), then it should also be own-gil-sub-interpreter safe. That is, it doesn't need GIL synchronization across the multiple sub-interpreter states, which must be true or it would not be free-threading safe.
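The distinction between "thread safe" and "interpreter aware" can be illustrated with a minimal, standard-library-only sketch. Everything here is hypothetical and not pybind11 or CPython API: `InterpId` stands in for whatever identifies an interpreter, and `PerInterpreterCalls` stands in for state that a real module would keep in per-interpreter storage.

```cpp
#include <atomic>
#include <cstdint>
#include <mutex>
#include <unordered_map>

// Hypothetical stand-in for an interpreter identity.
using InterpId = std::int64_t;

// Thread safe, so acceptable under free threading alone, but every
// interpreter that loads the module sees and mutates the same value.
static std::atomic<int> g_calls{0};

int bump_global() { return ++g_calls; }

// Interpreter-aware variant: one counter per interpreter. The mutex only
// protects the map itself; each interpreter still gets separate state.
class PerInterpreterCalls {
public:
    int bump(InterpId interp) {
        std::lock_guard<std::mutex> lock(mutex_);
        return ++calls_[interp]; // a missing entry starts at 0
    }

private:
    std::mutex mutex_;
    std::unordered_map<InterpId, int> calls_;
};
```

With the global atomic, two interpreters calling `bump_global()` once each observe 1 and 2. With `PerInterpreterCalls`, each interpreter observes 1 for its own first call.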
My 2cts: TLS in shared libraries is a real disaster (especially C++ So adding more TLS to the internals data structure in general sounds like a pretty significant performance sink. I would encourage you to thoroughly benchmark function/method calling on Windows/macOS/Linux to see how bad this is, and to what extent these costs can be mitigated.
FYI, this is a perfect example of where I'd personally always rebase and force push. ;)
Since we can avoid them by checking this atomic, the cmake config conditional shouldn't be necessary. The slower path (with thread_locals and extra checks) only comes in when a second interpreter is actually instantiated.
Uses two extra threads to demonstrate that neither shares a GIL.
Luckily, these are all zero-initialized pointers, so no constructors. And I think multiple interpreters in multiple threads at the same time might qualify as a complex case. Still, point taken, on this expert advice I have made a bunch of changes to get the Unfortunately, in the multi-interpreter case there really isn't any choice but to use
Looks like, with the rewrite, the cost is an extra 0.22ns per call to get_internals (an increase of about 15%). IMO that is a very small cost; it is unlikely that any real-world usage will notice it. The cost is much more significant when a subinterpreter is actually created: the cost to access the internals triples. In my opinion, that's just the cost of using subinterpreters. ... I've added this information to the PR description.
// that do not have subinterpreters. Nothing breaks if this is defined but the impl does not
// actually support subinterpreters.
#if PY_VERSION_HEX >= 0x030C0000 && !defined(PYPY_VERSION) && !defined(GRAALVM_PYTHON)
# define PYBIND11_SUBINTERPRETER_SUPPORT
Hm ... what are the pros-and-cons of forced-in (what you have here) vs opt-in (my suggestion from a few days ago)?
How widely used and mature is the subinterpreter feature in general?
Also, is there someone you know who might be available to fully review this PR? It looks like a lot of work, at least for me. I don't know anything specific about subinterpreters.
The initial version of this PR had some potential performance problems, but they're eliminated now IMO so it doesn't seem necessary to complicate the feature with cmake options and compile defines. These preprocessor lines just consolidate the different circumstances where sub-interpreters are not supported into one place... could be removed as the resulting define here is only used in 2 other places.
Sub-interpreters have been around for a long time but not widely used. Python 3.12 introduced per-interpreter GIL support into the stable API (PEP 684). It is stable in 3.12, while free threading is still experimental in 3.13 and requires opt-in flags in CPython. If someone is writing "production" code in 2025 they should not be using free threading, but they could consider using sub-interpreters with per-interpreter GIL to get some benefits of parallelism.
I don't know anyone else who specifically knows a lot about sub-interpreters. The main things to know about working with sub-interpreters:
- Python objects (except immortal objects) should never be mixed between sub-interpreters. There is an API (similar to multiprocessing) that can allow limited communication between sub-interpreters.
- Native code modules need to be careful about global state. If global state needs to exist, it needs to be protected by locks (as in free threading). But most of the global state should actually be per-interpreter state. (This is what this PR is really about: `internals` and `local_internals` are global state which really need to be per-interpreter state.)
> could be removed as the resulting define here is only used in 2 other places.
I feel strongly it should stay here, for clarity, and future maintainability (including refactoring).
I'm still on the fence, force-in vs opt-in. @henryiii do you have an opinion?
Hi, I wonder if you have read these:
To support sub-interpreters, I think we also need to implement PEP 573 and PEP 630. To clarify:
I think it would be really hard to make
The goal of these is to get rid of global state, and replace it with state that is tied to the module instance. While implementing this according to the Python guides will definitely accomplish that goal (and that might be the best way to do it), that is not the only way to accomplish it. I think strict adherence to these would require major rewrites of several parts of pybind11, but I don't think that is necessary to support sub-interpreters/multi-interpreters. Definitely following PEP 573 would make a module work with sub-interpreters. It is a sufficient condition but not a necessary condition. Pybind11's global state is entirely contained within
I don't think this is required, since Pybind11's types are managed by its internals structures, they already are not globally static in the strict sense. Converting them to use Type_Spec and Slots is IMO unrelated to this PR. (Edit: or, maybe pybind11 is already doing heap types? At least, some of them are...)
I agree that is probably impossible. My goal here isn't full module isolation; pybind11 already doesn't have module isolation and it can't be added. That doesn't mean it can't support sub-interpreters. Maybe another way to think about it is that this PR adds interpreter isolation without adding module isolation. The examples you linked explain the kinds of problems that non-isolated modules have, which existing pybind11 modules all have, and they would continue to have after this PR.
I've run @wjakob's benchmarks for nanobind on this PR. The binary size might be a bit off, since I can't run strip due to needing undefined symbols, but I don't think that has any effect here anyway. I couldn't get nanobind to load, so I had to take it off the runtime plot. This seems to have a noticeable impact on debug (unoptimized) performance, but not really noticeable on runtime, probably within the uncertainty margins. I'd love for the runtime cost to go down (there's an old PR that did that, but not usable anymore) instead of up, but this looks acceptable to avoid complications building.
@@ -18,6 +18,7 @@
#include <pybind11/conduit/pybind11_platform_abi_id.h>
#include <pybind11/pytypes.h>

#include <atomic>
Note to self since I checked: we are already using atomic in object.h (using atomic requires linking to libatomic on some platforms, like armv7l). I think we are actually missing that currently (https://github.com/scikit-hep/boost-histogram/blob/38ae735c07a9bbbbc80ca5ad9b57af106f61ef43/CMakeLists.txt#L91-L93 for example), but this isn't a new issue.
Description
This PR adds the ability for pybind11 modules to support subinterpreters. This support requires 2 things:
- Multiphase init
- `internals` and `local_internals` have to have an instance per-interpreter (can no longer be static singletons), which is (now) the primary subject of this PR
Multiphase init
The PR adds `mod_per_interpreter_gil` and `mod_multi_interpreter_one_gil` tags which can be passed as the 3rd argument to the macro (in addition to the existing `mod_gil_not_used`). If neither is specified, the module continues to do multiphase init but reports `Py_MOD_MULTIPLE_INTERPRETERS_NOT_SUPPORTED`.
When a module is imported a second time in a sub-interpreter, the module's `exec` slot is run again. For pybind11 this means the user's module init function is re-run in the sub-interpreter. That's good, because the sub-interpreter needs its own type_info for all of the bindings.
internals
This presents the problem that the place that pybind11 stores these is currently a singleton. But sub-interpreters need this state to be per-interpreter. That means the proper instance of these (for the current active interpreter) needs to be retrieved from the interpreter state dict. Fortunately, the internals pointer-to-pointer is already stored in the state dict.
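As a rough illustration of per-interpreter lookup through a state dict, here is a standard-library-only model. It is a sketch under stated assumptions, not pybind11's implementation: `Interpreter`, `InternalsSlot`, `internals_slot`, `internals_for`, and the string key are all hypothetical names, and the `unordered_map` merely stands in for the interpreter state dict.

```cpp
#include <memory>
#include <string>
#include <unordered_map>

// Simplified stand-in for pybind11's internals record.
struct Internals {
    int registered_types = 0;
};

// Plays the role of the `internals *` that the pointer-to-pointer refers to.
using InternalsSlot = std::unique_ptr<Internals>;

// Hypothetical model of an interpreter: only its state dict matters here.
struct Interpreter {
    std::unordered_map<std::string, InternalsSlot> state_dict;
};

// Returns the address of the slot stored under `key`, creating an empty slot
// on first use. Pointers to unordered_map elements stay valid across rehashes.
InternalsSlot *internals_slot(Interpreter &interp, const std::string &key) {
    return &interp.state_dict[key];
}

// Looks up (or lazily creates) the internals that belong to `interp`.
Internals &internals_for(Interpreter &interp, const std::string &key) {
    InternalsSlot *slot = internals_slot(interp, key);
    if (!*slot) {
        slot->reset(new Internals());
    }
    return **slot;
}
```

In this model, two lookups with the same key in the same interpreter return the same object, while the same key in a second interpreter resolves to a separate object.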
In order to minimize performance costs, we can detect whether or not multiple subinterpreters are present by counting how many times the module has been imported. If it has only been imported once then it can only possibly have one `internals` (even if there are other subinterpreters where it was not imported). When it has been imported more times, then we need to do additional work to make sure the right internals object is used (the one associated with the current interpreter in the current thread). We can switch between these two cases with a single simple branch, thus causing minimal performance overhead for existing code.
In the multi-interpreter case we would still like to minimize the cost of accessing internals; we don't want to have to reach into the interpreter state dict every time. So we cache the value in a `thread_local` along with the pointer to the `PyThreadState` to which it belongs. This means that the slow path (acquiring the GIL, doing a dict lookup, etc.) is only taken when the active `PyThreadState` changes (or the first time `get_internals` is called in an OS thread). So the fast path merely checks that the PyThreadState hasn't changed, and then returns the previously looked up value.
local_internals
`local_internals` is a slightly larger change. It was also a global singleton and was not stored in the interpreter state dict at all. It has changed to be much more like the `internals` code, and both have been refactored a little bit to share some of that code.
`local_internals` needs to be per-module-per-interpreter (unlike internals), so we have to formulate a unique key for the state dict. Other than the key, the two now work almost identically.
Performance
On the current version of master, `detail::get_internals()` takes about 1.7ns per call on my machine.
On this PR without multiple subinterpreters present, `detail::get_internals()` takes about 1.95ns per call on my machine (about 15% slower).
On this PR with multiple subinterpreters present, `detail::get_internals()` takes about 5.13ns per call on my machine (about 3x the baseline, i.e. roughly 200% slower).
So multiple subinterpreters definitely introduce a cost when the feature is used, but merely supporting them has only a small additional cost. The 15% additional cost of this very small function is unlikely to be noticeable in any meaningful program.
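The two code paths behind these numbers (a single branch on the import count, then a `thread_local` cache keyed on the active thread state) can be modeled with the standard library alone. This is a hedged sketch, not the PR's code: `ThreadState` stands in for `PyThreadState`, `t_current` for the active thread state, and `slow_lookup` replaces the real GIL acquisition and state dict lookup with a counter increment and a pointer read. All names are hypothetical.

```cpp
#include <atomic>

struct Internals {
    int id;
};

// Hypothetical stand-in for PyThreadState; it records which internals
// its interpreter owns.
struct ThreadState {
    Internals *owner_internals;
};

static std::atomic<int> g_import_count{0};            // times the module was imported
static Internals *g_single_internals = nullptr;       // valid while there is one import
static thread_local ThreadState *t_current = nullptr; // models the active thread state
static std::atomic<int> g_slow_lookups{0};            // instrumentation for this sketch

// Called once per module import (once per interpreter that imports it).
void on_module_import(Internals *internals) {
    if (g_import_count.fetch_add(1) == 0) {
        g_single_internals = internals;
    }
}

// Models the expensive path (GIL + state dict lookup in the real code).
Internals *slow_lookup(ThreadState *ts) {
    ++g_slow_lookups;
    return ts->owner_internals;
}

Internals *get_internals_sketch() {
    // Fast path: a single import means a single possible internals object.
    if (g_import_count.load(std::memory_order_relaxed) <= 1) {
        return g_single_internals;
    }
    // Multi-interpreter path: redo the lookup only when the thread state changed.
    thread_local ThreadState *cached_ts = nullptr;
    thread_local Internals *cached = nullptr;
    if (cached_ts != t_current) {
        cached = slow_lookup(t_current);
        cached_ts = t_current;
    }
    return cached;
}
```

In this model the slow lookup runs once per change of `t_current` on a given OS thread; repeated calls under the same thread state only pay for one pointer comparison on top of the import-count branch.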
Memory management / Future work
This PR does not add support for creating / deleting / switching between sub-interpreters.
In embed, pybind11 only cleans up the internals and local_internals associated with the main interpreter (when it is finalized). Since it doesn't currently manage any subinterpreters, it can't clean up after them.
Suggested changelog entry:
The guide needs to add a short mention of `py::mod_multi_interpreter_one_gil()` and `py::mod_per_interpreter_gil()` tags.