
Make pyvenv style virtual environments easier to configure when embedding Python #66409

Open
grahamd mannequin opened this issue Aug 17, 2014 · 66 comments
Labels
docs Documentation in the Doc dir tests Tests in the Lib/test dir topic-venv Related to the venv module type-feature A feature request or enhancement

Comments


grahamd mannequin commented Aug 17, 2014

BPO 22213
Nosy @ncoghlan, @pitrou, @vstinner, @methane, @ericsnowcurrently, @zooba, @ndjensen, @LeslieGerman, @M-Kerr, @abrunner73
Dependencies
  • bpo-22257: PEP 432 (PEP 587): Redesign the interpreter startup sequence
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = None
    created_at = <Date 2014-08-17.11:42:41.318>
    labels = ['type-feature', '3.8']
    title = 'Make pyvenv style virtual environments easier to configure when embedding Python'
    updated_at = <Date 2021-12-08.21:26:34.716>
    user = 'https://bugs.python.org/grahamd'

    bugs.python.org fields:

    activity = <Date 2021-12-08.21:26:34.716>
    actor = 'ndjensen'
    assignee = 'none'
    closed = False
    closed_date = None
    closer = None
    components = []
    creation = <Date 2014-08-17.11:42:41.318>
    creator = 'grahamd'
    dependencies = ['22257']
    files = []
    hgrepos = []
    issue_num = 22213
    keywords = []
    message_count = 31.0
    messages = ['225434', '225436', '225437', '225739', '225742', '225771', '225774', '225890', '334926', '334948', '335015', '335468', '335470', '335479', '335484', '335648', '335650', '335688', '335692', '335749', '336793', '343636', '352905', '354856', '354857', '354858', '361600', '361869', '362260', '366570', '384496']
    nosy_count = 13.0
    nosy_names = ['ncoghlan', 'pitrou', 'vstinner', 'pyscripter', 'grahamd', 'methane', 'eric.snow', 'steve.dower', 'Henning.von.Bargen', 'ndjensen', 'Leslie', 'M.Kerr', 'abrunner73']
    pr_nums = []
    priority = 'normal'
    resolution = None
    stage = None
    status = 'open'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue22213'
    versions = ['Python 3.8']



    grahamd mannequin commented Aug 17, 2014

    In an embedded system, the 'python' executable itself is not run; the Python interpreter is initialised in-process explicitly using Py_Initialize(). To find the location of the Python installation, an elaborate sequence of checks is run, as implemented in calculate_path() in Modules/getpath.c.

    The primary mechanism is usually to search for a 'python' executable on PATH and use that as a starting point. From that it backtracks up the file system from the bin directory to arrive at what would be the perceived equivalent of PYTHONHOME. The lib/pythonX.Y directory under that, for the matching version X.Y of Python being initialised, would then be used.

    Problems can often occur with the way this search is done, though.

    For example, suppose someone is not using the system Python installation but has installed a different version of Python under /usr/local. At run time, the correct Python shared library would be loaded from /usr/local/lib, but because the 'python' executable is found in /usr/bin, /usr is used as sys.prefix instead of /usr/local.

    This can cause two distinct problems.

    The first is that there is no Python installation at all under /usr corresponding to the Python version which was embedded, with the result that the 'site' module cannot be imported and initialisation fails.

    The second is that there is a Python installation of the same major/minor version but potentially a different patch revision, or one compiled with different binary API flags or a different Unicode character width. The Python interpreter in this case may well be able to start up, but the mismatch between the Python modules or extension modules and the core Python library actually linked can cause odd errors or crashes.

    Anyway, that is the background.

    For an embedded system, the way this problem was overcome was to use Py_SetPythonHome() to forcibly override what should be used for PYTHONHOME, so that the correct installation was found and used at runtime.

    Now this works quite happily even for Python virtual environments constructed using 'virtualenv', allowing the embedded system to be run in a separate virtual environment distinct from the main Python installation it was created from.

    Although this works for Python virtual environments created using 'virtualenv', it doesn't work if the virtual environment was created using pyvenv.

    One can easily illustrate the problem without even using an embedded system.

    $ which python3.4
    /Library/Frameworks/Python.framework/Versions/3.4/bin/python3.4
    
    $ pyvenv-3.4 py34-pyvenv
    
    $ py34-pyvenv/bin/python
    Python 3.4.1 (v3.4.1:c0e311e010fc, May 18 2014, 00:54:21)
    [GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] on darwin
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import sys
    >>> sys.prefix
    '/private/tmp/py34-pyvenv'
    >>> sys.path
    ['', '/Library/Frameworks/Python.framework/Versions/3.4/lib/python34.zip', '/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4', '/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/plat-darwin', '/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/lib-dynload', '/private/tmp/py34-pyvenv/lib/python3.4/site-packages']
    
    $ PYTHONHOME=/tmp/py34-pyvenv python3.4
    Fatal Python error: Py_Initialize: unable to load the file system codec
    ImportError: No module named 'encodings'
    Abort trap: 6

    The basic problem is that in a pyvenv virtual environment nothing is duplicated under lib/pythonX.Y; the only thing in there is the site-packages directory.

    When you start up the 'python' executable directly from the pyvenv virtual environment, the startup sequence checks know this and consult pyvenv.cfg to extract the:

    home = /Library/Frameworks/Python.framework/Versions/3.4/bin

    setting and from that derive where the actual run time files are.
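The layout and the cfg file are easy to inspect with the standard venv module; a minimal sketch (the printed 'home' value depends on the local installation):

```python
import tempfile
import venv
from pathlib import Path

# Create a throwaway pyvenv-style environment (no pip, for speed).
env_dir = Path(tempfile.mkdtemp()) / "demo-venv"
venv.create(env_dir, with_pip=False)

# The environment does not duplicate the standard library; instead
# pyvenv.cfg records where the base installation's binaries live.
cfg = {}
for line in (env_dir / "pyvenv.cfg").read_text().splitlines():
    key, _, value = line.partition("=")
    cfg[key.strip()] = value.strip()

print(cfg["home"])  # the base interpreter's bin directory
```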

    When PYTHONHOME or Py_SetPythonHome() is used, then the getpath.c checks blindly believe that is the authoritative value:

    /* Step 2. See if the $PYTHONHOME environment variable points to the
     * installed location of the Python libraries. If $PYTHONHOME is set, then
     * it points to prefix and exec_prefix. $PYTHONHOME can be a single
     * directory, which is used for both, or the prefix and exec_prefix
     * directories separated by a colon.
     */
        /* If PYTHONHOME is set, we believe it unconditionally */
        if (home) {
            wchar_t *delim;
            wcsncpy(prefix, home, MAXPATHLEN);
            prefix[MAXPATHLEN] = L'\0';
            delim = wcschr(prefix, DELIM);
            if (delim)
                *delim = L'\0';
            joinpath(prefix, lib_python);
            joinpath(prefix, LANDMARK);
            return 1;
        }
    Because of this, the problem above occurs: the proper runtime directories aren't included in sys.path, with the result that the 'encodings' module cannot even be found.

    What I believe should occur is that PYTHONHOME should not be believed unconditionally. Instead, there should be a check to see whether that directory contains a pyvenv.cfg file; if there is one, recognise it as a pyvenv style virtual environment and make the same sort of adjustments that would be made based on what that pyvenv.cfg file contains.
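In outline, the suggested behaviour could look like the following Python sketch (for clarity only; the real change would have to live in C inside calculate_path(), and resolve_home is a hypothetical name):

```python
import os

def resolve_home(home):
    """Sketch of the proposed check (resolve_home is a hypothetical name).

    If 'home' is a pyvenv-style environment, follow its pyvenv.cfg back
    to the base installation; otherwise believe it unconditionally, as
    getpath.c does today.
    """
    cfg_path = os.path.join(home, "pyvenv.cfg")
    if not os.path.isfile(cfg_path):
        return home
    with open(cfg_path) as f:
        for line in f:
            key, _, value = line.partition("=")
            if key.strip() == "home":
                # pyvenv.cfg's 'home' names the base bin directory; its
                # parent is the prefix that holds lib/pythonX.Y.
                return os.path.dirname(value.strip())
    return home
```

With a check like this, a PYTHONHOME pointing at a venv would be redirected to the base prefix, while a plain installation directory is returned unchanged.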

    For the record, this issue is affecting Apache/mod_wsgi, and right now the only workaround I have is to tell people that, in addition to setting the configuration setting corresponding to PYTHONHOME, they should use configuration settings with the same effect as doing:

    PYTHONPATH=/Library/Frameworks/Python.framework/Versions/3.4/lib/python34.zip:/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4:/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/plat-darwin:/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/lib-dynload

    so that the correct runtime files are found.

    I am still trying to work out a more permanent workaround I can add to the mod_wsgi code itself, since I can't rely on a fix for existing Python versions with pyvenv support.

    The only other option is to tell people not to use pyvenv and to use virtualenv instead.

    Right now I can offer no actual patch, as that getpath.c code is scary enough that I am not even sure at this point where the check should be incorporated, or how.

    The only thing I can surmise is that the current check for pyvenv.cfg coming before the search for the prefix means that it isn't consulted.

    /* Search for an environment configuration file, first in the
       executable's directory and then in the parent directory.
       If found, open it for use when searching for prefixes.
    */

    {
        wchar_t tmpbuffer[MAXPATHLEN+1];
        wchar_t *env_cfg = L"pyvenv.cfg";
        FILE * env_file = NULL;

        wcscpy(tmpbuffer, argv0_path);
        joinpath(tmpbuffer, env_cfg);
        env_file = _Py_wfopen(tmpbuffer, L"r");
        if (env_file == NULL) {
            errno = 0;
            reduce(tmpbuffer);
            reduce(tmpbuffer);
            joinpath(tmpbuffer, env_cfg);
            env_file = _Py_wfopen(tmpbuffer, L"r");
            if (env_file == NULL) {
                errno = 0;
            }
        }
        if (env_file != NULL) {
            /* Look for a 'home' variable and set argv0_path to it, if found */
            if (find_env_config_value(env_file, L"home", tmpbuffer)) {
                wcscpy(argv0_path, tmpbuffer);
            }
            fclose(env_file);
            env_file = NULL;
        }
    }

    pfound = search_for_prefix(argv0_path, home, _prefix, lib_python);

    @ncoghlan (Contributor)

    Yeah, PEP-432 (my proposal to redesign the startup sequence) could just as well be subtitled "getpath.c hurts my brain" :P

    One tricky part here is going to be figuring out how to test this - perhaps adding a new test option to _testembed and then running it both inside and outside a venv.

    @ncoghlan (Contributor)

    Graham pointed out that setting PYTHONHOME ends up triggering the same control flow through getpath.c as calling Py_SetPythonHome, so this can be tested just with pyvenv and a suitably configured environment.

    It may still be a little tricky though, since we normally run the pyvenv tests in isolated mode to avoid spurious failures due to bad environment settings...

    @ncoghlan (Contributor)

    Some more experiments, comparing an installed vs uninstalled Python. One failure mode is that setting PYTHONHOME just plain breaks running from a source checkout (setting PYTHONHOME to the checkout directory also fails):

    $ ./python -m venv --without-pip /tmp/issue22213-py35
    
    $ /tmp/issue22213-py35/bin/python -c "import sys; print(sys.base_prefix, sys.base_exec_prefix)"
    /usr/local /usr/local
    
    $ PYTHONHOME=/usr/local /tmp/issue22213-py35/bin/python -c "import sys; print(sys.base_prefix, sys.base_exec_prefix)"
    Fatal Python error: Py_Initialize: Unable to get the locale encoding
    ImportError: No module named 'encodings'
    Aborted (core dumped)

    Trying after running "make altinstall" (which I had previously done for 3.4) is a bit more enlightening:

    $ python3.4 -m venv --without-pip /tmp/issue22213-py34
    
    $ /tmp/issue22213-py34/bin/python -c "import sys; print(sys.base_prefix, sys.base_exec_prefix)"
    /usr/local /usr/local
    
    $ PYTHONHOME=/usr/local /tmp/issue22213-py34/bin/python -c "import sys; print(sys.base_prefix, sys.base_exec_prefix)"
    /usr/local /usr/local
    
    $ PYTHONHOME=/tmp/issue22213-py34 /tmp/issue22213-py34/bin/python -c "import sys; print(sys.base_prefix, sys.base_exec_prefix)"
    Fatal Python error: Py_Initialize: Unable to get the locale encoding
    ImportError: No module named 'encodings'
    Aborted (core dumped)
    
    $ PYTHONHOME=/tmp/issue22213-py34:/usr/local /tmp/issue22213-py34/bin/python -c "import sys; print(sys.base_prefix, sys.base_exec_prefix)"
    Fatal Python error: Py_Initialize: Unable to get the locale encoding
    ImportError: No module named 'encodings'
    Aborted (core dumped)
    [ncoghlan@lancre py34]$ PYTHONHOME=/usr/local:/tmp/issue22213-py34/bin /tmp/issue22213-py34/bin/python -c "import sys; print(sys.base_prefix, sys.base_exec_prefix)"
    /usr/local /tmp/issue22213-py34/bin

    I think what this is actually showing is that there's a fundamental conflict between mod_wsgi's expectation of being able to set PYTHONHOME to point to the virtual environment, and the way PEP-405 virtual environments actually work.

    With PEP-405, all the operations in getpath.c expect to execute while pointing to the *base* environment: where the standard library lives. It is then up to site.py to later adjust the base prefix location, as can be demonstrated by the fact that pyvenv.cfg isn't processed if the site module is disabled:

    $ /tmp/issue22213-py34/bin/python -c "import sys; print(sys.prefix, sys.exec_prefix)"
    /tmp/issue22213-py34 /tmp/issue22213-py34
    $ /tmp/issue22213-py34/bin/python -S -c "import sys; print(sys.prefix, sys.exec_prefix)"
    /usr/local /usr/local

    At this point in time, there isn't an easy way for an embedding application to say "here's the standard library, here's the virtual environment with user packages" - it's necessary to just override the path calculations entirely.

    Allowing that kind of more granular configuration is one of the design goals of PEP-432, so adding that as a dependency here.


    grahamd mannequin commented Aug 23, 2014

    It is actually very easy for me to work around and I released a new mod_wsgi version today which works.

    When I get a Python home option, instead of calling Py_SetPythonHome() with it, I append '/bin/python' to it and call Py_SetProgramName() instead.
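In other words, the substitution is just string manipulation (a sketch; Py_SetPythonHome() and Py_SetProgramName() are the actual C API calls being swapped, and home_to_program_name is a hypothetical helper name):

```python
def home_to_program_name(python_home):
    # Rather than Py_SetPythonHome(python_home), pass the path of the
    # environment's own interpreter to Py_SetProgramName(); getpath.c
    # then notices the adjacent pyvenv.cfg and resolves the base
    # installation itself, as it does for a normally launched venv python.
    return python_home.rstrip("/") + "/bin/python"

print(home_to_program_name("/tmp/py34-pyvenv"))  # /tmp/py34-pyvenv/bin/python
```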

    @ncoghlan (Contributor)

    Excellent! If I recall correctly, that works because we resolve the symlink when looking for the standard library, but not when looking for the venv configuration file.
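That symlink is easy to observe on POSIX (assuming symlink-based venv creation; Windows venvs use launchers or copies instead):

```python
import os
import tempfile
import venv

# A POSIX venv's bin/python is normally a symlink back to the base
# interpreter; resolving it leads out of the environment entirely.
env = os.path.join(tempfile.mkdtemp(), "env")
venv.create(env, symlinks=True, with_pip=False)

exe = os.path.join(env, "bin", "python")
print(os.path.islink(exe))    # True on typical POSIX setups
print(os.path.realpath(exe))  # path of the base interpreter
```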

    I also suspect this is all thoroughly broken on Windows - there are so many configuration operations and platform specific considerations that need to be accounted for in getpath.c these days that it has become close to incomprehensible :(

    One of my main goals with PEP-432 is actually to make it possible to rewrite the path configuration code in a more maintainable way - my unofficial subtitle for that PEP is "getpath.c must die!" :)

    @ncoghlan ncoghlan changed the title pyvenv style virtual environments unusable in an embedded system Make pyvenv style virtual environments easier to configure when embedding Python Aug 24, 2014
    @ncoghlan ncoghlan added the type-feature A feature request or enhancement label Aug 24, 2014

    grahamd mannequin commented Aug 24, 2014

    I only make the change to Py_SetProgramName() on UNIX and not Windows. This is because back in mod_wsgi 1.0 I actually did use Py_SetProgramName(), but it didn't seem to work in a sane way on Windows, so I changed to Py_SetPythonHome(), which worked on both Windows and UNIX. The latest versions of mod_wsgi haven't been updated yet to even build on Windows, so I'm not worrying about Windows right now.


    pitrou commented Aug 25, 2014

    That workaround would definitely deserve being wrapped in a higher-level API invokable by embedding applications, IMHO.


    ncoghlan commented Feb 6, 2019

    (Added Victor, Eric, and Steve to the nosy list here, as I'd actually forgotten about this until issue bpo-35706 reminded me)

    Core of the problem: the embedding APIs don't currently offer a Windows-compatible way of setting up "use this base Python and this venv site-packages", and the way of getting it to work on other platforms is pretty obscure.


    zooba commented Feb 6, 2019

    Victor may be thinking about it from time to time (or perhaps it's time to make the rest of the configuration change plans concrete so we can all help out?), but I'd like to see this as either:

    • a helper function to fill out the core config structure from a pyvenv.cfg file (rather than hiding it deeper as it currently is), or better yet,
    • remove the dependency on all non-frozen imports at initialization and let embedders define Python code to do the initialization

    In the latter case, the main python.exe also gets to define its behavior. So for the most part, we should be able to remove getpath[p].c and move it into the site module, then make that our Python initialization step.

    This would also mean that if you are embedding Python but not allowing imports (e.g. as only a calculation engine), you don't have to do the dance of _denying_ all lookups, you simply don't initialize them.

    But as far as I know, we don't have a concrete vision for "how will consumers embed Python in their apps" that can translate into work - we're still all individually pulling in slightly different directions. Sorting that out is most important - having someone willing to do the customer engagement work to define an actual set of requirements and desirables would be fantastic.

    @zooba zooba added the 3.8 (EOL) end of life label Feb 6, 2019

    ncoghlan commented Feb 7, 2019

    Yeah, I mainly cc'ed Victor and Eric since making this easier ties into one of the original design goals for PEP-432 (even though I haven't managed to persuade either of them to become co-authors of that PEP yet).

    @vstinner (Member)

    PEP-432 will allow fine control over the parameters used to initialize Python. Sadly, I failed to agree with Nick Coghlan and Eric Snow on the API. The current implementation (_PyCoreConfig and _PyMainInterpreterConfig) has some flaws (it doesn't clearly separate the early initialization from the Unicode-ready state; the interpreter contains both main and core configs, with some options duplicated in both; etc.).

    See also bpo-35706.


    zooba commented Feb 13, 2019

    I just closed 35706 as a duplicate of this one (the titles are basically identical, which feels like a good hint ;) )

    It seems that the disagreement about the design is fundamentally a disagreement between a "quick, painful but complete fix" and "slow, careful improvements with a transition period". Both are valid approaches, and since Victor is putting actual effort in right now he gets to "win", but I do think we can afford to move faster.

    It seems the main people who will suffer from the pain here are embedders (who are already suffering pain) and the core developers (who explicitly signed up for pain!). But without knowing the end goal, we can't accelerate.

    Currently PEP-432 is the best description we have, and it looks like Victor has been heading in that direction too (deliberately? I don't know :) ). But it seems like a good time to review it, replace the "here's the current state of things" with "here's an imaginary ideal state of things" and fill the rest with "here are the steps to get there without breaking the world".

    By necessity, it touches a lot of people's contributions to Python, but it also has the potential to seriously improve even more people's ability to _use_ Python (for example, I know an app that you all would recognize the name of who is working on embedding Python right now and would _love_ certain parts of this side of things to be improved).

    Nick - has the steering council been thinking about ways to promote collaborative development of ideas like this? I'm thinking an Etherpad style environment for the brainstorm period (in lieu of an in-person whiteboard session) that's easy for us all to add our concerns to, that can then be turned into something more formal.

    Nick, Victor, Eric, (others?) - are you interested in having a virtual whiteboard session to brainstorm how the "perfect" initialization looks? And probably a follow-up to brainstorm how to get there without breaking the world? I don't think we're going to get to be in the same room anytime before the language summit, and it would be awesome to have something concrete to discuss there.

    @vstinner (Member)

    It seems that the disagreement about the design is fundamentally a disagreement between a "quick, painful but complete fix" and "slow, careful improvements with a transition period". Both are valid approaches, and since Victor is putting actual effort in right now he gets to "win", but I do think we can afford to move faster.

    Technically, the API already exists and is exposed as a private API:

    • "_PyCoreConfig" structure
    • "_PyInitError _Py_InitializeFromConfig(const _PyCoreConfig *config)" function
    • "void _Py_FatalInitError(_PyInitError err)" function (should be called on failure)

    I'm not really sure of the benefit compared to the current initialization API using Py_xxx global configuration variables (ex: Py_IgnoreEnvironmentFlag) and Py_Initialize().

    _PyCoreConfig basically exposes *all* input parameters used to initialize Python, much more than just the global configuration variables and the few functions that can be called before Py_Initialize():
    https://docs.python.org/dev/c-api/init.html

    Currently PEP-432 is the best description we have, and it looks like Victor has been heading in that direction too (deliberately? I don't know :) ).

    Well, it's a strange story. At the beginning, I had a very simple use case... it took me more or less one year to implement it :-) My use case was to add... a new -X utf8 command line option:

    • parsing the command line requires decoding bytes using an encoding
    • the encoding depends on the locale, environment variables and options on the command line
    • environment variables depend on the command line (the -E option)

    If the utf8 mode is enabled (PEP-540), the encoding must be set to UTF-8, the configuration read so far must be discarded, and the whole configuration (env vars, cmdline, etc.) must be read again from scratch :-)
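The motivating option can be observed from a running Python; a quick check (sys.flags.utf8_mode requires Python 3.7+):

```python
import subprocess
import sys

# -X utf8 (PEP 540) forces the UTF-8 mode; detecting it mid-startup is
# what requires re-reading the whole configuration from scratch.
out = subprocess.run(
    [sys.executable, "-X", "utf8", "-c",
     "import sys; print(sys.flags.utf8_mode)"],
    capture_output=True, text=True, check=True,
)
print(out.stdout.strip())  # 1
```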

    To be able to do that, I had to collect *every single* thing which has an impact on the Python initialization: all things that I moved into _PyCoreConfig.

    ... but I didn't want to break the backward compatibility, so I had to keep support for Py_xxx global configuration variables... and also the few initialization functions like Py_SetPath() or Py_SetStandardStreamEncoding().

    Later it became very murky: my goal became very unclear, and I looked at PEP-432 :-)

    Well, I wanted to expose _PyCoreConfig somehow, so I looked at PEP-432 to see how it could be exposed.

    By necessity, it touches a lot of people's contributions to Python, but it also has the potential to seriously improve even more people's ability to _use_ Python (for example, I know an app that you all would recognize the name of who is working on embedding Python right now and would _love_ certain parts of this side of things to be improved).

    _PyCoreConfig "API" makes some things way simpler. Maybe it was already possible to do them previously but it was really hard, or maybe it was just not possible.

    If a _PyCoreConfig field is set, it has priority over any other way to initialize that field: _PyCoreConfig has the highest priority.

    For example, _PyCoreConfig allows embedders to completely bypass the code which computes sys.path (and related variables) by setting the "path configuration" directly:

    • nmodule_search_path, module_search_paths: the sys.path entries (count and array)
    • executable: sys.executable
    • prefix: sys.prefix
    • base_prefix: sys.base_prefix
    • exec_prefix: sys.exec_prefix
    • base_exec_prefix: sys.base_exec_prefix
    • (Windows only) dll_path: the Windows DLL path

    The code which initializes these fields is really complex. Without _PyCoreConfig, it's hard to make sure that these fields are properly initialized as an embedder would like.
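For orientation, each of these fields surfaces as a public attribute of sys in a running interpreter (this only inspects the public mirror of the config, not the private struct itself):

```python
import sys

# Public sys attributes fed by the path configuration fields listed above.
path_config = {
    "module_search_paths": sys.path,
    "executable": sys.executable,
    "prefix": sys.prefix,
    "base_prefix": sys.base_prefix,
    "exec_prefix": sys.exec_prefix,
    "base_exec_prefix": sys.base_exec_prefix,
}
for field, value in path_config.items():
    print(f"{field} = {value}")
```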

    Nick, Victor, Eric, (others?) - are you interested in having a virtual whiteboard session to brainstorm how the "perfect" initialization looks? And probably a follow-up to brainstorm how to get there without breaking the world? I don't think we're going to get to be in the same room anytime before the language summit, and it would be awesome to have something concrete to discuss there.

    Sorry, I'm not sure of the API / structures, but when I discussed with Eric Snow at the latest sprint, we identified different steps in the Python initialization:

    • only use bytes (no encoding), no access to the filesystem (not needed at this point)
    • encoding defined, can use Unicode
    • use the filesystem
    • configuration converted as Python objects
    • Python is fully initialized

    --

    At one point I experimented with reorganizing _PyCoreConfig and _PyMainInterpreterConfig to avoid redundancy: adding a _PyPreConfig which contains only the fields needed before _PyMainInterpreterConfig. With that change, _PyMainInterpreterConfig (and _PyPreConfig) *contained* _PyCoreConfig.

    But the change became very large, and I wasn't sure it was a good idea, so I abandoned it.

    --

    Ok, something else. _PyCoreConfig (and _PyMainInterpreterConfig) contain memory allocated on the heap. Problem: Python initialization changes the memory allocator. Code using _PyCoreConfig requires some "tricks" to ensure that the memory is *freed* with the same allocator used to *allocate* memory.

    I created bpo-35265 "Internal C API: pass the memory allocator in a context" to pass a "context" to a lot of functions, context which contains the memory allocator but can contain more things later.

    The idea of "a context" came up during the discussion about a new C API: stop relying on any global variable or "shared state", and instead *explicitly* pass a context to all functions. With that, it becomes possible to imagine two interpreters running in the same thread "at the same time".

    Honestly, I'm not really sure that it's fully possible to implement this idea... Python has *so much* shared state, like, *everywhere*. It's really a giant project to move all this shared state into structures and pass pointers to those structures.

    So again, I abandoned my experimental change:
    #10574

    --

    Memory allocator, context, different structures for configuration... it's really not an easy topic :-( There are so many constraints put into a single API!

    The conservative option at this point is to keep the API private.

    ... Maybe we can explain how to use the private API but very explicitly warn that this API is experimental and can be broken at any time... And I plan to break it, to avoid redundancy between the core and main configurations, for example.

    ... I hope that these explanations give you a better idea of the big picture and the challenges :-)


    zooba commented Feb 14, 2019

    Thanks, Victor, that's great information.

    Memory allocator, context, different structures for configuration... it's really not an easy topic :-( There are so many constraints put into a single API!

    This is why I'm keen to design the ideal *user* API first (that is, write the examples of how you would use it) and then figure out how we can make it fit. It's kind of the opposite approach from what you've been doing to adapt the existing code to suit particular needs.

    For example, imagine instead of all the PySet*() functions followed by Py_Initialize() you could do this:

    PyObject *runtime = PyRuntime_Create();

    /* optional calls */
    PyRuntime_SetAllocators(runtime, &my_malloc, &my_realloc, &my_free);
    PyRuntime_SetHashSeed(runtime, 12345);

    /* sets this as the current runtime via a thread local */
    auto old_runtime = PyRuntime_Activate(runtime);
    assert(old_runtime == NULL);

    /* pretend triple quoting works in C for a minute ;) */
    const char *init = """
    import os.path
    import sys

    sys.executable = argv0
    sys.prefix = os.path.dirname(argv0)
    sys.path = [os.getcwd(), sys.prefix, os.path.join(sys.prefix, "Lib")]

    pyvenv = os.path.join(sys.prefix, "pyvenv.cfg")
    try:
        with open(pyvenv, "r", encoding="utf-8") as f:  # *only* utf-8 support at this stage
            for line in f:
                if line.startswith("home"):
                    sys.path.append(line.partition("=")[2].strip())
                    break
    except FileNotFoundError:
        pass

    if sys.platform == "win32":
        sys.stdout = open("CONOUT$", "w", encoding="utf-8")
    else:
        # no idea if this is right, but you get the idea
        sys.stdout = open("/dev/tty", "w", encoding="utf-8")
    """;

    PyObject *globals = PyDict_New();
    /* only UTF-8 support at this stage */
    PyDict_SetItemString(globals, "argv0", PyUnicode_FromString(argv[0]));
    PyRuntime_Initialize(runtime, init, globals);
    Py_DECREF(globals);

    /* now we've initialised, loading codecs will succeed if we can find them or fail if not,
     * so we'd have to do cleanup to avoid depending on them without the user being able to
     * avoid it... */

    PyEval_EvalString("open('file.txt', 'w', encoding='gb18030').close()");

    /* may as well reuse DECREF for consistency */
    Py_DECREF(runtime);
    Maybe it's a terrible idea? Honestly I'd be inclined to do other big changes at the same time (make PyObject opaque and interface driven, for example).

    My point is that if the goal is to "move the existing internals around" then that's all we'll ever achieve. If we can say "the goal is to make this example work" then we'll be able to do much more.

    @ericsnowcurrently (Member)

    On Wed, Feb 13, 2019 at 10:56 AM Steve Dower <[email protected]> wrote:

    Nick, Victor, Eric, (others?) - are you interested in having a virtual whiteboard session to brainstorm how the "perfect" initialization looks? And probably a follow-up to brainstorm how to get there without breaking the world? I don't think we're going to get to be in the same room anytime before the language summit, and it would be awesome to have something concrete to discuss there.

    Count me in. This is a pretty important topic and doing this would
    help accelerate our efforts by giving us a clearer common
    understanding and objective. FWIW, I plan on spending at least 5
    minutes of my 25 minute PyCon talk on our efforts to fix up the C-API,
    and this runtime initialization stuff is an important piece.

    @ericsnowcurrently
    Member

    On Wed, Feb 13, 2019 at 5:09 PM Steve Dower <[email protected]> wrote:

    This is why I'm keen to design the ideal *user* API first (that is, write the examples of how you would use it) and then figure out how we can make it fit.
    It's kind of the opposite approach from what you've been doing to adapt the existing code to suit particular needs.

    That makes sense. :)

    For example, imagine instead of all the PySet*() functions followed by Py_Initialize() you could do this:

    PyObject *runtime = PyRuntime_Create();
    

    FYI, we already have a _PyRuntimeState struct (see
    Include/internal/pycore_pystate.h) which is where I pulled in a lot of
    the static globals last year. Now there is one process-global
    _PyRuntime (created in Python/pylifecycle.c) in place of all those
    globals. Note that _PyRuntimeState is in parallel with
    PyInterpreterState, so not a PyObject.

    /* optional calls */
    PyRuntime_SetAllocators(runtime, &my_malloc, &my_realloc, &my_free);
    PyRuntime_SetHashSeed(runtime, 12345);
    

    Note that one motivation behind PEP-432 (and its config structs) is to
    keep all the config together. Having the one struct means you always
    clearly see what your options are. Another motivation is to keep the
    config (dense with public fields) separate from the actual run state
    (opaque). Having a bunch of config functions (and global variables in
    the status quo) means a lot more surface area to deal with when
    embedding, as opposed to 2 config structs + a few initialization
    functions (and a couple of helpers) like in PEP-432.

    I don't know that you consciously intended to move away from the dense
    config struct route, so I figured I'd be clear. :)

    /* sets this as the current runtime via a thread local */
    auto old_runtime = PyRuntime_Activate(runtime);
    assert(old_runtime == NULL);
    

    Hmm, there are two ways we could go with this: keep using TLS (or
    static global in the case of _PyRuntime) or update the C-API to
    require explicitly passing the context (e.g. runtime, interp, tstate,
    or some wrapper) into all the functions that need it. Of course,
    changing that would definitely need some kind of compatibility shim to
    avoid requiring massive changes to every extension out there, which
    would mean effectively 2 C-APIs mirroring each other. So sticking
    with TLS is simpler. Personally, I'd prefer going the explicit
    argument route.

    /* pretend triple quoting works in C for a minute ;) */
    const char *init_code = """
    

    [snip]
    """;

    PyObject *globals = PyDict_New();
    /* only UTF-8 support at this stage */
    PyDict_SetItemString(globals, "argv0", PyUnicode_FromString(argv[0]));
    PyRuntime_Initialize(runtime, init_code, globals);
    

    Nice. I like that this keeps the init code right by where it's used,
    while also making it much more concise and easier to follow (since
    it's Python code).

    PyEval_EvalString("open('file.txt', 'w', encoding='gb18030').close()");
    

    I definitely like the approach of directly embedding the Python code
    like this. :) Are there any downsides?

    Maybe it's a terrible idea?

    Nah, we definitely want to maximize simplicity and your example offers
    a good shift in that direction. :)

    Honestly I'd be inclined to do other big changes at the same time (make PyObject opaque and interface driven, for example).

    Definitely! Those aren't big blockers on cleaning up initialization
    though, are they?

    My point is that if the goal is to "move the existing internals around" then that's all we'll ever achieve. If we can say "the goal is to make this example work" then we'll be able to do much more.

    Yep. I suppose part of the problem is that the embedding use cases
    aren't understood (or even recognized) well enough.

    @ncoghlan
    Contributor

    Steve, you're describing the goals of PEP-432 - design the desired API, then write the code to implement it. So while Victor's goal was specifically to get PEP-540 implemented, mine was just to make it so working on the startup sequence was less awful (and in particular, to make it possible to rewrite getpath.c in Python at some point).

    Unfortunately, it turns out that redesigning a going-on-thirty-year-old startup sequence takes a while, as we first have to discover what all the global settings actually *are* :)

    https://www.python.org/dev/peps/pep-0432/#invocation-of-phases describes an older iteration of the draft API design that was reasonably accurate at the point where Eric merged the in-development refactoring as a private API (see https://bugs.python.org/issue22257 and https://www.python.org/dev/peps/pep-0432/#implementation-strategy for details).

    However, that initial change was basically just a skeleton - we didn't migrate many of the settings over to the new system at that point (although we did successfully split the import system initialization into two parts, so you can enable builtin and frozen imports without necessarily enabling external ones).

    The significant contribution that Victor then made was to actually start migrating settings into the new structure, adapting it as needed based on the goals of PEP-540.

    Eric updated quite a few more internal APIs as he worked on improving the subinterpreter support.

    Between us, we also made a number of improvements to https://docs.python.org/3/c-api/init.html based on what we learned in the process of making those changes.

    So I'm completely open to changing the details of the API that PEP-432 is proposing, but the main lesson we've learned from what we've done so far is that CPython's long history of embedding support *does* constrain what we can do in practice, so it's necessary to account for feasibility of implementation when considering what we want to offer.

    Ideally, the next step would be to update PEP-432 with a status report on what was learned in the development of Python 3.7 with the new configuration structures, and what the internal startup APIs actually look like now. Even though I reviewed quite a few of Victor's and Eric's PRs, even I don't have a clear overall picture of where we are now, and I suspect Victor and Eric are in a similar situation.

    @ncoghlan
    Contributor

    Note also that Eric and I haven't failed to agree with Victor on an API, as Victor hasn't actually written a concrete proposal *for* a public API (neither as a PR updating PEP-432, nor as a separate PEP).

    The current implementation does NOT follow the PEP as written, because _Py_CoreConfig ended up with all the settings in it, when it's supposed to be just the bare minimum needed to get the interpreter to a point where it can run Python code that only accesses builtin and frozen modules.

    @ncoghlan
    Contributor

    Since I haven't really written them down anywhere else, noting some items I'm aware of from the Python 3.7 internals work that haven't made their way back into the PEP-432 public API proposal yet:

    • If we only had to care about the pure embedding case, this would be a lot easier. We don't though: we also care about "CPython interpreter variants" that end up calling Py_Main, and hence respect all the CPython environment variables, command line arguments, and in-process global variables. So what Victor ended up having to implement was data structs for all three of those configuration sources, and then helper functions to write them into the consolidated config structs (as well as writing them back to the in-process global variables).

    • Keeping the Py_Initialize and Py_Main APIs working means that there are several API preconfiguration functions that need a way to auto-initialize the core runtime state with sensible defaults.

    • the current private implementation uses the PyCoreConfig/PyMainInterpreterConfig naming scheme. Based on some of Eric's work, the PEP currently suggests PyRuntimeConfig and PyMainInterpreterConfig, but I don't think any of us are especially in love with the latter name. Our inability to find a good name for it may also be a sign that it needs to be broken up into three distinct pieces (PySystemInterfaceConfig, PyCompilerConfig, PyMainModuleConfig)

    @vstinner
    Member

    I created bpo-36142: "Add a new _PyPreConfig step to Python initialization to setup memory allocator and encodings".

    @vstinner
    Member

    I wrote PEP-587 "Python Initialization Configuration", which has been accepted. It allows completely overriding the "Path Configuration". I'm not sure that it fully implements what was requested here, but it should now be easier to tune the Path Configuration. See:
    https://www.python.org/dev/peps/pep-0587/#multi-phase-initialization-private-provisional-api

    I implemented the PEP-587 in bpo-36763.

    @pyscripter
    Mannequin

    pyscripter mannequin commented Sep 20, 2019

    To Victor:
    So how does the implementation of PEP-587 help configure embedded Python with a venv? It would be a great help to provide some minimal instructions.

    @pyscripter
    Mannequin

    pyscripter mannequin commented Oct 17, 2019

    Just in case this will be of help to anyone, I found a way to use venvs in embedded python.

    • You first need to initialize the Python installation that is referred to as home in pyvenv.cfg.
    • Then you execute the following script:
    import sys
    sys.executable = r"Path to the python executable inside the venv"
    path = sys.path
    for i in range(len(path)-1, -1, -1):
        if path[i].find("site-packages") > 0:
            path.pop(i)
    import site
    site.main()
    del sys, path, i, site

    @zooba
    Member

    zooba commented Oct 17, 2019

    If you just want to be able to import modules from the venv, and you know the path to it, it's simpler to just do:

        import sys
        sys.path.append(r"path to venv\Lib\site-packages")

    Updating sys.executable is only necessary if you're going to use libraries that try to re-launch the interpreter, but any embedding application is going to have to do that anyway.

    @pyscripter
    Mannequin

    pyscripter mannequin commented Oct 17, 2019

    To Steve:

    I want the embedded venv to have the same sys.path as if you were running the venv's Python interpreter. So my method takes into account, for instance, the include-system-site-packages option in pyvenv.cfg. My method also sets sys.prefix in the same way as the venv's Python interpreter.
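    For reference, pyvenv.cfg is a simple `key = value` file, so reading the options this method depends on takes only a few lines. A minimal sketch (the `read_pyvenv_cfg` helper is hypothetical, not part of the stdlib; the key names are the ones venv actually writes):

```python
import os
import tempfile

def read_pyvenv_cfg(env_dir):
    """Parse the simple 'key = value' lines of a venv's pyvenv.cfg."""
    cfg = {}
    with open(os.path.join(env_dir, "pyvenv.cfg"), encoding="utf-8") as f:
        for line in f:
            if "=" in line:
                key, _, value = line.partition("=")
                cfg[key.strip().lower()] = value.strip()
    return cfg

# Example: write a throwaway pyvenv.cfg and read it back.
env_dir = tempfile.mkdtemp()
with open(os.path.join(env_dir, "pyvenv.cfg"), "w", encoding="utf-8") as f:
    f.write("home = /usr/local/bin\n")
    f.write("include-system-site-packages = false\n")
    f.write("version = 3.12.3\n")

cfg = read_pyvenv_cfg(env_dir)
use_system_site = cfg.get("include-system-site-packages", "false") == "true"
```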

    @LeslieGerman
    Mannequin

    LeslieGerman mannequin commented Feb 7, 2020

    I just want to say that sorting this issue (and PEP-0432) out would be great!
    I ran into this issue when embedding CPython in a Windows app that wants to use some pre-installed Python which is not part of the install package...
    So besides pyenv venvs, please keep Windows devs in mind, too!
    :)

    @zooba
    Member

    zooba commented Aug 6, 2024

    I feel that instead of trying to hijack the executable field here, a better approach would be allowing users to specify a pyvenv.cfg path when embedding

    The better approach is to just set the search paths you want to have, and leave the python.c-specific functionality to python.c.1 Venv initialization is not supported in any manner other than the activate script and running the regular Python executable, and I'd prefer to keep it that way.

    There are more than enough configuration options to initialise exactly the runtime you want.

    Footnotes

    1. I know it doesn't live in python.c, but it should. Poor architectural decisions in the past don't necessarily have to force us to grow the feature - we can still avoid that creep.

    @ncoghlan
    Contributor

    ncoghlan commented Aug 9, 2024

    There are more than enough configuration options to initialise exactly the runtime you want.

    @zooba In my use case, the runtime I want to embed is available as a virtual environment set up with standard package installation tooling, so having to reimplement the virtual environment sys.path derivation logic externally (where any deviation from CPython's behaviour would be considered a bug in the embedding application) would be annoying compared to instead telling the runtime:

    1. Use the same sys.path config that this Python executable would use
    2. Use this Python executable when running Python subprocesses (i.e. set it as sys.executable)

    The fact that setting (the init config equivalent of) PYTHONEXECUTABLE works feels like a genuinely good fit for my problem (primarily defining virtual environments for executing Python AI-related code via CPython, but also supporting direct embedding of those virtual environments rather than invoking their Python executable as a subprocess).

    Is there a known way to accomplish this pre 3.11? Specifically in 3.10? And specifically on macOS?

    @benh PYTHONEXECUTABLE was originally macOS-only and became cross-platform in Python 3.11, so it's worth a try. If it doesn't work, then I'd say the answer to your question is "No, Python 3.11 will be needed to get this behaviour".

    @zooba
    Member

    zooba commented Aug 9, 2024

    In my use case, the runtime I want to embed is available as a virtual environment set up with standard package installation tooling

    I think we're differing on a few terms/concepts here:

    • the runtime you want to embed is the base of the virtual environment, so either you know which one it is (good) or you're hoping that it's compatible at runtime (risky!)
    • the environment you want is just search paths, not a runtime

    I really don't want to encourage embedding of "whatever version of Python I found on disk". The only time that's going to lead to a good experience is when it's the system interpreter, which isn't "whatever" version. But taking an arbitrary install and doing anything other than launching python[x.y] as a child process is not going to serve you well long-term. It's 100x easier to support arbitrary versions in a Python script with IPC than by embedding - it's even 10x easier to support arbitrary versions that load a native module to move your work into the Python process. Embedding is hard enough without letting users decide to break your app for you.

    If you can ensure that your users are using the base runtime you expect, then if you really need the same search paths, I'd really suggest running the interpreter with -c print(sys.path) and reading that back into your embedding config. Cloning getpath.py is pretty much a non-starter (CPython isn't even consistent with itself, let alone external copies of that logic!), and "things that python.exe does" are not a supported part of the embedding interface (including venv calculations, but also parsing argv and the process environment for settings - these aren't even specified for python.exe apart from implementation).
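    The "run the interpreter and read its configuration back" approach above can be sketched like this; it queries the target interpreter as a subprocess and parses JSON rather than a bare `print(sys.path)`, so the output round-trips unambiguously (here the target is just `sys.executable`, so the sketch is runnable anywhere):

```python
import json
import subprocess
import sys

def query_search_paths(python_exe):
    """Ask a target interpreter for its sys.path and sys.executable."""
    code = ("import json, sys; "
            "print(json.dumps({'path': sys.path, 'executable': sys.executable}))")
    proc = subprocess.run([python_exe, "-c", code],
                          capture_output=True, text=True, check=True)
    return json.loads(proc.stdout)

# Demonstration: query the interpreter running this script.
info = query_search_paths(sys.executable)
```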

    @ncoghlan
    Contributor

    ncoghlan commented Aug 12, 2024

    I know which runtime I am embedding (the primary project is creating and shipping a fully integrated set of Python base runtimes and then virtual environments that use those runtimes).

    Further allowing for embedding applications that use the environments directly via the CPython C API is an alternative we're considering to running components in the environments up as subprocesses and communicating with them via FastAPI (essentially, we'd be adapting the existing C++ application server that is used for Typescript component embedding rather than shipping a separate Python-only application server implementation solely for the Python components and having to maintain two separate implementations of the local client application authentication and authorisation bits).

    At the very least, those embedding apps need to set sys.executable properly for the benefit of Python subprocess invocations, and the right executable path to use for that purpose is the one in the virtual environment being implicitly activated.

    While it's undocumented, that has turned out to also have the effect of getting sys.path set the same way as it would be when running Python in that virtual environment, including loading *.pth files and importing sitecustomize, which is exactly the outcome I want.

    I do think it would be a good idea to document a way to check for runtime compatibility before attempting to start the embedded interpreter when relying on setting PYTHONEXECUTABLE in this way. For example, run the following snippet in the nominally compatible executable and compare the resulting string to the string form of Py_VERSION_HEX & 0xFFFFFF00:

    $ python3 -c "import sys; major, minor, micro, *_ = sys.version_info; print(f'{major:#04x}{minor:02x}{micro:02x}00')"
    0x030c0300
    

    (the arcane failures you get when mixing and matching 3.11 venvs with a Python 3.12 runtime, and vice-versa, are cryptic enough that I agree we don't want to risk giving the impression that folks can point an embedding config at an arbitrary virtual environment and expect it to magically work even when the specificed interpreter's ABI is inconsistent with what the embedding application expects)

    @zooba
    Member

    zooba commented Aug 12, 2024

    At the very least, those embedding apps need to set sys.executable properly for the benefit of Python subprocess invocations

    Yeah, this is the intended use of setting the executable. It's expecting/relying on it also doing the pyvenv.cfg lookup that I don't want to commit to. Specifying the search path normally should also resolve *.pth files, unless you suppress import site. (I guess there's a reasonable argument that venv setup should be in site.main() too, though we definitely need to resolve the base runtime earlier so there's no real getting around it).

    I do think it would be a good idea to document a way to check for runtime compatibility before attempting to start the embedded interpreter

    Agreed. Do we not have a pre-init function that returns the hex version already? (I guess I've never worried too much because Windows doesn't have a versionless DLL that gives you non-stable ABI.)

    @ncoghlan
    Copy link
    Contributor

    For dynamic linking, the shared object is versioned, so Py_VERSION_HEX should be trustworthy as far as the second segment (the dynamic linker will outright ignore other versions), but might be wrong in the third field (since different maintenance releases still share an ABI). It's only the main binary that gets an entirely unversioned symlink.

    From a docs point if view, here's my suggestion for a path forward:

    1. Document that setting executable is cross-platform, becomes the value used for sys.executable, and is also used as one of the starting points for locating the standard library
    2. Encourage checking version compatibility between Py_VERSION_HEX & 0xFFFF0000 and python3 -c "import sys; major, minor, *_ = sys.version_info; print(f'{major:#04x}{minor:02x}0000')" to avoid cryptic failures due to attempts to load incompatible versions of standard library modules
    3. Document that setting sys.executable to a Python binary in a virtual environment (that is, with pyvenv.cfg next to it or in the immediate parent folder), without explicitly setting the base Python interpreter or the import search path, will currently result in the base executable and the import search path being configured as if that environment had been activated, but also document that this is an implementation detail that is subject to change at any time.
    4. Document that the future-proof way to set these fields based on a given Python executable is to run an information export script along the lines of the following:
    import sys
    major, minor, *_ = sys.version_info
    print(f'{major:#04x}{minor:02x}0000')  # Version for comparison with `Py_VERSION_HEX`
    print(sys._base_executable) # Value to pass in for base_executable
    for path_entry in sys.path:
        print(path_entry)    # Values to pass in as the import search path
    

    Given the necessary infrastructure in the embedding app to run the version check, the future proof code wouldn't be that much harder to write than just the version check.
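    As a sanity check of the arithmetic in steps 2 and 4, here is the comparison in pure Python; the `PY_VERSION_HEX` value below is a hard-coded stand-in for what patchlevel.h would define for CPython 3.12.3 final (in a real embedding application the left-hand side comes from the C header):

```python
# Stand-in for Py_VERSION_HEX as defined in patchlevel.h for 3.12.3 final.
PY_VERSION_HEX = 0x030C03F0

# Embedding side: keep only the major and minor fields.
embedded = f"{PY_VERSION_HEX & 0xFFFF0000:#010x}"

# Target side: what the suggested one-liner prints for a 3.12.x interpreter.
major, minor = 3, 12
target = f"{major:#04x}{minor:02x}0000"

compatible = embedded == target
```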

    @zooba
    Member

    zooba commented Aug 14, 2024

    It's only the main binary that gets an entirely unversioned symlink.

    I missed this detail (and it's not been brought up by anyone else who I've argued this with 😆 ). So you already need to know which version you're going to be loading.

    We make good enough guarantees for during a feature release that I'd be okay with labelling these as "may change in future versions without a deprecation period". I'm not sure that anything is future-proof though - the future proof way is to hard code the version you expect and then check again whenever you update to a newer one... most people don't consider that future proof!

    Perhaps this is best as a "how to" style page, rather than something that might be confused for a specification? These things come up occasionally, and they don't really have a good home (e.g. we recently changed matplotlib to statically link the C++ runtime to avoid conflicts with other modules, which is not "our" responsibility, but also not obvious how to do it).

    @FFY00
    Member

    FFY00 commented Dec 15, 2024

    I'm going back to the original post, and addressing the raised issues directly based on the current state of things, to understand if they are still valid.

    cc @GrahamDumpleton

    The primary mechanism is usually to search for a 'python' executable on PATH and use that as a starting point. From that it then back tracks up the file system from the bin directory to arrive at what would be the perceived equivalent of PYTHONHOME. The lib/pythonX.Y directory under that for the matching version X.Y of Python being initialised would then be used.

    Problems can often occur with the way this search is done though.

    For example, if someone is not using the system Python installation but has installed a different version of Python under /usr/local. At run time, the correct Python shared library would be getting loaded from /usr/local/lib, but because the 'python' executable is found from /usr/bin, it uses /usr as sys.prefix instead of /usr/local.

    The current code tries to determine base_prefix/base_exec_prefix by first searching the location of the libpython library loaded in the current process; if that fails, it then tries the location of the Python interpreter executable.

    cpython/Modules/getpath.py

    Lines 559 to 594 in 7900a85

    # First try to detect prefix by looking alongside our runtime library, if known
    if library and not prefix:
        library_dir = dirname(library)
        if ZIP_LANDMARK:
            if os_name == 'nt':
                # QUIRK: Windows does not search up for ZIP file
                if isfile(joinpath(library_dir, ZIP_LANDMARK)):
                    prefix = library_dir
            else:
                prefix = search_up(library_dir, ZIP_LANDMARK)
        if STDLIB_SUBDIR and STDLIB_LANDMARKS and not prefix:
            if any(isfile(joinpath(library_dir, f)) for f in STDLIB_LANDMARKS):
                prefix = library_dir
                if not stdlib_dir_was_set_in_config:
                    stdlib_dir = joinpath(prefix, STDLIB_SUBDIR)

    # Detect prefix by looking for zip file
    if ZIP_LANDMARK and executable_dir and not prefix:
        if os_name == 'nt':
            # QUIRK: Windows does not search up for ZIP file
            if isfile(joinpath(executable_dir, ZIP_LANDMARK)):
                prefix = executable_dir
        else:
            prefix = search_up(executable_dir, ZIP_LANDMARK)
        if prefix and not stdlib_dir_was_set_in_config:
            stdlib_dir = joinpath(prefix, STDLIB_SUBDIR)
            if not isdir(stdlib_dir):
                stdlib_dir = None

    # Detect prefix by searching from our executable location for the stdlib_dir
    if STDLIB_SUBDIR and STDLIB_LANDMARKS and executable_dir and not prefix:
        prefix = search_up(executable_dir, *STDLIB_LANDMARKS)
        if prefix and not stdlib_dir:
            stdlib_dir = joinpath(prefix, STDLIB_SUBDIR)

    But looking into the code, I found that this functionality is only available on Windows, or macOS builds using --enable-framework! I opened GH-127970.
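    For context, the `search_up` helper used in the quoted snippet walks parent directories until it finds one of the landmark files. A rough sketch of that behaviour (not the exact getpath.py implementation):

```python
import os
import tempfile

def search_up(start, *landmarks):
    """Walk up from start until a directory containing a landmark file is found."""
    d = os.path.abspath(start)
    while True:
        if any(os.path.isfile(os.path.join(d, name)) for name in landmarks):
            return d
        parent = os.path.dirname(d)
        if parent == d:  # reached the filesystem root without a match
            return None
        d = parent

# Example: place a fake stdlib landmark and search up from a nested directory.
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "a", "b"))
open(os.path.join(root, "os.py"), "w").close()
found = search_up(os.path.join(root, "a", "b"), "os.py")
```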

    For an embedded system the way this problem was overcome was for it to use Py_SetPythonHome() to forcibly override what should be used for PYTHONHOME so that the correct installation was found and used at runtime.

    Now this would work quite happily even for Python virtual environments constructed using 'virtualenv' allowing the embedded system to be run in that separate virtual environment distinct from the main Python installation it was created from.

    Although this works for Python virtual environments created using 'virtualenv', it doesn't work if the virtual environment was created using pyvenv.

    One can easily illustrate the problem without even using an embedded system.

    $ which python3.4
    /Library/Frameworks/Python.framework/Versions/3.4/bin/python3.4
    
    $ pyvenv-3.4 py34-pyvenv
    
    $ py34-pyvenv/bin/python
    Python 3.4.1 (v3.4.1:c0e311e010fc, May 18 2014, 00:54:21)
    [GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] on darwin
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import sys
    >>> sys.prefix
    '/private/tmp/py34-pyvenv'
    >>> sys.path
    ['', '/Library/Frameworks/Python.framework/Versions/3.4/lib/python34.zip', '/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4', '/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/plat-darwin', '/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/lib-dynload', '/private/tmp/py34-pyvenv/lib/python3.4/site-packages']
    
    $ PYTHONHOME=/tmp/py34-pyvenv python3.4
    Fatal Python error: Py_Initialize: unable to load the file system codec
    ImportError: No module named 'encodings'
    Abort trap: 6

    The Python home always refers to the base installation: it sets sys.base_prefix/sys.base_exec_prefix, so it is expected that pointing it at a venv-style virtual environment 1 would result in a bad sys.path.

    Setting PYTHONHOME is not a correct way to enable a venv-style (pyvenv.cfg-based) virtual environment. What you should do instead when embedding is to set PyConfig.program_name.

    Here's an example, as implemented in cocotb/cocotb#4293:

    https://github.com/cocotb/cocotb/blob/708c3f32c4387682827b81b5e8d94339bffdeefa/src/cocotb/share/lib/embed/gpi_embed.cpp#L138-L180

    For the record this issue is affecting Apache/mod_wsgi and right now the only workaround I have is to tell people that in addition to setting the configuration setting corresponding to PYTHONHOME, to use configuration settings to have the same effect as doing:

    Applying this to your use-case in Apache/mod_wsgi. You should only need to set program_name to the correct Python interpreter executable, as shown in the example above.

    Looking at your code, as far as I can tell, it seems your issues come from backwards compatibility, as you allow users to select which Python environment to use via a python-home option in WSGIDaemonProcess, or a WSGIPythonHome directive, which was okay on Python 2, but doesn't work in venv-style environments.

    https://github.com/GrahamDumpleton/mod_wsgi/blob/5b0522e87262ff8139a326e7cc69e0e3662b776e/src/server/wsgi_interp.c#L2271-L2463

    My recommendation would be to deprecate these Python 2-era options and replace them with options to specify the Python interpreter path instead.

    Footnotes

    1. venv refers to them as "lightweight", because they don't need to be a valid/full PYTHONHOME — they don't need to contain a full copy of the stdlib or headers — as opposed to Python 2-style virtual environments.

    FFY00 added a commit to FFY00/cpython that referenced this issue Dec 15, 2024
    FFY00 added a commit to FFY00/cpython that referenced this issue Dec 15, 2024
    @FFY00
    Member

    FFY00 commented Dec 15, 2024

    GH-127972 fixes using the loaded libpython to calculate base_prefix on most platforms, and GH-127974 makes sure we consider base_exec_prefix to be the same as base_prefix before we resort to searching the executable directory. This fixes most embedding applications, as users will no longer need to set program_name for base_prefix/base_exec_prefix to be calculated properly. Though, I would still like to remove the need for, or deprecate, some of getpath's fragile heuristics.

    Considering that, as long as the PRs go through, I believe the issues raised by the original post are adequately addressed.

    @ncoghlan, the history is a bit too extensive and dense for me to parse right now, would you be able to summarize your particular issues? Perhaps, also, note if you think they are addressed via the proposed changes?

    @ncoghlan
    Contributor

    @FFY00 This part is the same solution I found as well:

    You should only need to set program_name to the correct Python interpreter executable, as shown in the example above.

    However, that's also the behaviour that @zooba is reluctant to elevate from "This happens to work right now, in the current CPython implementation" to "This is the supported way to set up a venv-style environment in an embedding application that is either not linked to CPython, or is linked to the same base runtime as the venv is configured to use".

    Hence the proposed compromise in #66409 (comment), where we would document that it currently works, but be clear that it isn't guaranteed to keep working in the future and explain what to do instead (specifically, do a runtime query of the target environment and use it to perform a version compatibility check, and then appropriately configure the embedded runtime)

    @FFY00
    Member

    FFY00 commented Dec 17, 2024

    Thanks for the background.

    I agree with documenting the program_name workaround. For the proper fix, we can maybe add executable to the path configuration inputs. Right now, setting executable instead of program_name should also work, but executable is not documented as a path initialization input, so that's an implementation detail; setting program_name is kind-of supported.

    From https://docs.python.org/3/c-api/init_config.html#c.PyConfig.program_name:

    Program name used to initialize executable and in early error messages during Python initialization.

    @zooba
    Member

    zooba commented Dec 17, 2024

    Long term I'm keen to see a separation between "parameters for things we want" and "parameters for things that we use to infer other things in the standard python binary". Most of what's currently there falls into the latter category, but I really want to move the inference logic into python.c and just have initialization take actual settings. This makes things much easier for embedders, and gives those who want to build an alternate python binary a more flexible template in python.c (rather than the single Py_RunMain call).

    And I'm still convinced that pyvenv.cfg is for inferring settings (specifically, sys.base_* and sys.path, and a couple things in site).

    So I think the most correct workaround would be to document how to infer module search paths from the contents of a pyvenv.cfg and recommend copying that logic to fill out the settings directly. We shouldn't encourage any embedders to provide inputs to our unfortunate algorithm; they should all want to bypass that as thoroughly as possible, or else clone it so they can prove that their own code works.

    The middle ground we have now where embedders just have to poke a black box until it works is the worst of all worlds. Let's not ingrain it any further.

    @hvbtup

    hvbtup commented Dec 20, 2024

    I used the way @FFY00 initialized Python with 3.13 and it works well (just setting config.program_name to the python.exe of my virtual environment).

    One needs to have the original (not venv) python in the PATH environment variable to find python313.dll.

    And a question arises:

    After initialization, sys.executable is the python.exe of my virtual environment, not my executable (which is another *.exe file in a different directory).

    My exe loads and initializes a custom Python module. Now if I use the subprocess module to start another interpreter (using sys.executable) or if I use the multiprocessing module, these subprocess interpreters cannot use that module.

    This is not an issue for me.
    But if it was, would it be safe to set sys.executable after the initialization to the current executable?

    Part of this is how to find out the current executable on Windows or Linux, but I think that doesn't belong here.

    @picnixz picnixz removed the 3.12 bugs and security fixes label Dec 20, 2024
    @zooba
    Member

    zooba commented Dec 20, 2024

    One needs to have the original (not venv) python in the PATH environment variable to find python313.dll.

    I'd strongly recommend using AddDllDirectory instead (if you're not going to redistribute the runtime). Administrators can disable using PATH for DLL resolution, not to mention anyone can hijack your DLLs, so you have an extra variable to worry about when you rely on it.

    would it be safe to set sys.executable after the initialization to the current executable?

    I thought we had a configuration field for this already? program_name is just an input to our black box of calculating paths, but I'm pretty sure I would've exposed the real thing somewhere.

    If not, then yeah, it should be safe enough to just overwrite it. However, if someone just does subprocess.run(["python", ... then Windows is going to choose a python.exe in your application directory first. So if you have (the wrong) one there, it could still be a problem. But if all your code is using sys.executable then it'll be fine - it's just a variable.
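    A small demonstration that sys.executable is "just a variable": worker code that spawns through it uses whatever the attribute holds, while a bare `["python", ...]` is resolved by the OS search rules. The overwrite path below is hypothetical and therefore commented out:

    ```python
    import subprocess
    import sys

    # An embedding application could overwrite the attribute after init, e.g.:
    #     sys.executable = r"C:\MyApp\myapp.exe"   # hypothetical path
    # Code that launches workers through sys.executable then uses that binary.
    worker = subprocess.run(
        [sys.executable, "-c", "print('worker ok')"],
        capture_output=True, text=True, check=True,
    )
    print(worker.stdout.strip())  # -> worker ok
    ```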

    @ncoghlan
    Contributor

    For the "emulating a venv" use case, I'd recommend setting config.executable (which is then used to set sys.executable) over setting config.program_name.

    Setting program_name is for the case where the given name should NOT be invoked when launching Python subprocesses (usually because the embedding application is not itself a Python interpreter).

    @hvbtup

    hvbtup commented Dec 23, 2024

    Hmm, I think my last comment was misunderstood.
    First of all, while in earlier versions of my software I used the "Windows embeddable ZIP" distribution, I have ditched it now. It seemed too exotic. I only used it because venv environments didn't work well on Windows at that time. I am actually using venv in a normal way.

    The only exception to this is that my executable loads and initializes a custom 3rdParty Python module which needs a secret. And my executable has a different name so that it can easily be identified in the MS Windows task manager (or in ps on Linux).

    So, I am not "emulating" a venv.
    I expect an existing Python installation including pip and venv.
    I then use the venv module to create a virtual environment in .venv, use pip inside that virtual environment to install the required 3rdParty packages.
    After configuration, to start my program, I use my custom executable instead of the python.exe from .venv\scripts.
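    For reference, the same environment setup can be driven from Python via the standard venv module (the directory name is illustrative; `with_pip=False` keeps the sketch minimal, while the workflow described here would pass `with_pip=True` so pip is bootstrapped via ensurepip):

    ```python
    import venv

    # Equivalent of `python -m venv .venv`; pass with_pip=True to also
    # install pip into the new environment.
    builder = venv.EnvBuilder(with_pip=False)
    builder.create(".venv")
    ```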
    While python.exe finds python313.dll automagically, my program needs it on the path.

    Regarding AddDllDirectory, I tried adding a call to AddDllDirectory(...); with the path to my Python installation in my C main function before initializing python, but it doesn't even get there when that directory is not contained in PATH.

    Hmm... When venv creates the file scripts\python.exe in the .venv, does it do some DLL load magic?

    @hvbtup

    hvbtup commented Dec 23, 2024

    @zooba

    Could the reason for this difference between python.exe and my own executable be related to this (from #112984 (comment))?

    The venvlauncher.c is just launcher.c boiled down to just the venv parts, and none of the py.exe parts. The only change is that the executable embeds the expected Python executable name at compile time, as we can't just assume it'll be python[_d].exe anymore.

    And it seems somewhat strange that the venvlauncher.exe (which, if I understand correctly, is renamed to python.exe when the venv is created) needs to start a child process. This is different from how Python starts in a normal installation (not in a venv). Can you explain the reasoning behind this?

    @FFY00
    Member

    FFY00 commented Dec 23, 2024

    For the "emulating a venv" use case, I'd recommend setting config.executable (which is then used to set sys.executable) over setting config.program_name.

    I think I would disagree with this recommendation, as executable is not documented as a getpath input parameter. It is mostly used for internal testing. program_name, OTOH, is documented as one of the inputs. While we don't define the semantics of the path calculation very clearly, I think it's safer to rely on that than on executable. Unless there's some unwanted side effect from setting program_name, it should be preferred over executable.

    @zooba
    Member

    zooba commented Dec 24, 2024

    First of all, while in earlier versions of my software I used the "Windows embeddable ZIP" distribution, I ditched it now. It seemed too exotic. I only used it because venv environment didn't work well on Windows at that time. I am actually using venv in a normal way.

    You say "in my software" but also "using venv in a normal way", so I'm not clear whether you are making an application that includes a Python runtime (for which the embeddable ZIP is intended), or just trying to avoid running the installer on your own machine?

    While python.exe finds python313.dll automagically, my program needs it on the path.

    There's nothing magical - they're in the same directory, and Windows uses that as the biggest hint to find a dependency.

    Your executable either needs python313.dll in the same directory, or it needs its own launcher that doesn't directly depend on python313.dll, but is able to AddDllDirectory and then launch the executable that depends on it (the setting should be inherited). Alternatively, you could make the second executable a DLL and then LoadLibrary it, which will use the added DLL directory to resolve python313.dll and then you can call through your own interface. On second thought, this is probably the better approach.

    Ultimately, Python is designed for a Unix-style system where all references are embedded absolute paths and nothing can be relocated. On Windows, the closest way to emulate this is to include your own copy of the runtime in your application directory (and this is what the embeddable distro is for).

    It sounds like the best way to do what you seem to be doing (relying on someone else's install of Python) is to get them to provide the python.exe path, launch that with your own script (potentially in a venv created by its own venv module) and your own extension modules that know how to talk back to your application. It's a bit of a workaround, but trust me, it'll make things much easier to test and much more stable for you and your users. Trying to directly load an unknown Python runtime that someone else is managing is not a great idea - it's very easy for circumstances to break it and render your application unusable.

    And it seems somewhat strange that the venvlauncher.exe (which is renamed to python.exe if I understand correctly when the venv is created) needs to start a child process.

    Some ways to install Python don't allow directly loading its DLL, so launching a child process is the only way to do it. It would theoretically be possible to test this at runtime and only use the child process as a fallback (I have actually just been writing code that does this), but again, it's less reliable than using the proper interfaces.

    We also get to take proper advantage of Windows preferring the application directory to ensure that we load the correct dependencies, or else we may crash at runtime. Previously, we either got lucky, or we had to copy practically the entire runtime into the venv to make it work reliably. With a small, self-contained launcher, we don't have to worry about either.

    Out of interest, why do you find this strange? What were you expecting instead, and why?

    @hvbtup

    hvbtup commented Dec 27, 2024

    whether you are making an application that includes a Python runtime (for which the embeddable ZIP is intended), or just trying to avoid running the installer on your own machine?

    Neither.

    After installing Python and creating a venv with python -m venv venv and setting the VIRTUAL_ENV environment variable to the venv directory, the only intended differences between running my executable and running the python.exe from venv\scripts are that

    • my program needs to load a 3rdParty Python module (generated with SWIG) for which a license key is needed at runtime and the executable takes care of providing the license key
    • and the executable has its own name such that it can be clearly identified in the task manager.

    So, basically, my executable is "python.exe + a properly configured 3rd party module".

    There's nothing magical - they're in the same directory, ...

    No, they are not.
    The venv\scripts directory contains python.exe, but not python313.dll.
    And I can run that python.exe directly, with a clean PATH:

    set PATH=%WINDIR%\System32;%WINDIR%;%WINDIR%\System32\wbem
    C:\path\to\my_application\venv\Scripts>python.exe
    

    By looking into the source code of venvlauncher.c, I see that the magic happens there. As the comment at the top of the file says:

    This launcher looks for a nearby pyvenv.cfg to find the correct home directory, and then launches the original Python executable from it.

    So the reason the venv's python.exe creates a python.exe child process is exactly this DLL loading issue: to make sure that the python*.dll is found in the same directory as the executable.
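    That lookup can be approximated in Python as a sketch (the real launcher is C code with Windows-specific process creation; only the pyvenv.cfg `home` handling is shown here):

    ```python
    import os
    from pathlib import Path

    def find_base_interpreter(venv_dir):
        """Locate the base interpreter a venv points at, via pyvenv.cfg 'home'."""
        home = None
        for line in (Path(venv_dir) / "pyvenv.cfg").read_text(encoding="utf-8").splitlines():
            key, sep, value = line.partition("=")
            if sep and key.strip().lower() == "home":
                home = value.strip()
        if home is None:
            raise RuntimeError("no 'home' key in pyvenv.cfg")
        exe = "python.exe" if os.name == "nt" else "python"
        # Launching from `home` means python3xx.dll (or libpython) sits next
        # to the executable, which is why the launcher starts a child there.
        return Path(home) / exe
    ```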

    My executable code does not use this logic, which is quite complex and requires very good knowledge of the Windows process creation API.

    Duplicating this logic in my own code would be the worst solution.

    Even if that "find python home (including DLL) from a virtual environment by parsing pyvenv.cfg" were available as a part of the initialization API, this would probably still not help in my use-case, because launching the python.exe from that python home would lose the initialization of the 3rdParty module.

    I think adding the Python home to the PATH is the most reasonable solution for my use-case.
    Note that I control the PATH environment variable and use a clean PATH when I launch the executable.
    From a security point of view, it is the same as how venv with python.exe works: An attacker would need write access to the python installation or the application installation.

    Probably my use-case is so special that I'll have to live with this little inconvenience.

    Trying to directly load an unknown Python runtime that someone else is managing is not a great idea

    I can't resist saying: This is exactly what venv is doing.

    @zooba
    Member

    zooba commented Dec 27, 2024

    Trying to directly load an unknown Python runtime that someone else is managing is not a great idea

    I can't resist to say: This is exactly what venv is doing.

    We control both venv and Python. If you happen to make a venv point to a Python that we didn't provide, you'll hit issues.

    My executable code does not use this logic, which is quite complex and requires very good knowledge of the Windows process creation API.

    If it makes you feel any better, we've copy-pasted that code around more than a few times. It tries to handle many of the edge cases that people expect to Just Work, and mostly handles the most common ones (at least those presumed by POSIX developers, who expect it to work like exec).

    • my program needs to load a 3rdParty Python module (generated with SWIG) for which a license key is needed at runtime and the executable takes care of providing the license key

    I'm still intrigued by this. Not that I doubt it, but I've only come across one of these concepts before that couldn't be solved in some easier way.1 Can you provide any hint as to how this mechanism works? Or a link to docs, if it's documented somewhere? I'm interested.

    But as you say:

    Probably my use-case is so special that I'll have to live with this little inconvenience.

    Yeah, I'm afraid so. Unless there's a sudden spate of similar irresolvable requests, I'm not keen to make the venv resolution a supported API here. (Supported API meaning anyone can call it, the behaviour is consistent across all platforms, safe across most [mis]configurations, and guaranteed not to change without a proper deprecation cycle. The internal behaviour of venvlauncher and getpath are not supported APIs in this sense - only the external behaviour of launching Scripts/python.exe in an activated venv are guaranteed.)

    Footnotes

    1. That one is OpenSSL, which expects to GetProcAddress() on the running executable, which is fundamentally not a nice idea at all. But we haven't managed to get a patch contributed, and so we simply patch our own OpenSSL build to get the address from the _ssl module instead.

    FFY00 added a commit that referenced this issue Jan 8, 2025