|
8 | 8 |
|
9 | 9 | This is used to optimize dict and attribute lookups, among other things.
|
10 | 10 |
|
11 |
| -Python uses three different mechanisms to intern strings: |
| 11 | +Python uses two different mechanisms to intern strings: singletons and |
| 12 | +dynamic interning. |
12 | 13 |
|
13 |
| -- Singleton strings marked in C source with `_Py_STR` and `_Py_ID` macros. |
14 |
| - These are statically allocated, and collected using `make regen-global-objects` |
15 |
| - (`Tools/build/generate_global_objects.py`), which generates code |
16 |
| - for declaration, initialization and finalization. |
| 14 | +## Singletons |
17 | 15 |
|
18 |
| - The difference between the two kinds is not important. (A `_Py_ID` string is |
19 |
| - a valid C name, with which we can refer to it; a `_Py_STR` may e.g. contain |
20 |
| - non-identifier characters, so it needs a separate C-compatible name.) |
| 16 | +The 256 possible one-character latin-1 strings, which can be retrieved with |
| 17 | +`_Py_LATIN1_CHR(c)`, are stored in statically allocated arrays, |
| 18 | +`_PyRuntime.static_objects.strings.ascii` and |
| 19 | +`_PyRuntime.static_objects.strings.latin1`. |
21 | 20 |
|
22 |
| - The empty string is in this category (as `_Py_STR(empty)`). |
| 21 | +Longer singleton strings are marked in C source with `_Py_ID` (if the string |
| 22 | +is a valid C identifier fragment) or `_Py_STR` (if it needs a separate |
| 23 | +C-compatible name). |
| 24 | +These are also stored in statically allocated arrays. |
| 25 | +They are collected from CPython sources using `make regen-global-objects` |
| 26 | +(`Tools/build/generate_global_objects.py`), which generates code |
| 27 | +for declaration, initialization and finalization. |
23 | 28 |
|
24 |
| - These singletons are interned in a runtime-global lookup table, |
25 |
| - `_PyRuntime.cached_objects.interned_strings` (`INTERNED_STRINGS`), |
26 |
| - at runtime initialization. |
| 29 | +The empty string is one of the singletons: `_Py_STR(empty)`. |
27 | 30 |
|
28 |
| -- The 256 possible one-character latin-1 strings are singletons, |
29 |
| - which can be retrieved with `_Py_LATIN1_CHR(c)`, are stored in runtime-global |
30 |
| - arrays, `_PyRuntime.static_objects.strings.ascii` and |
31 |
| - `_PyRuntime.static_objects.strings.latin1`. |
| 31 | +The three sets of singletons (`_Py_LATIN1_CHR`, `_Py_ID`, `_Py_STR`) |
| 32 | +are disjoint. |
| 33 | +If you have such a singleton, it (and no other copy) will be interned. |
32 | 34 |
|
33 |
| - These are NOT interned at startup in the normal build. |
34 |
| - In the free-threaded build, they are; this avoids modifying the |
35 |
| - global lookup table after threads are started. |
| 35 | +These singletons are interned in a runtime-global lookup table, |
| 36 | +`_PyRuntime.cached_objects.interned_strings` (`INTERNED_STRINGS`), |
| 37 | +at runtime initialization; the table is then immutable until it's torn down |
| 38 | +at runtime finalization. |
| 39 | +It is shared across threads and interpreters without any synchronization. |
36 | 40 |
|
37 |
| - Interning a one-char latin-1 string will always intern the corresponding |
38 |
| - singleton. |
39 | 41 |
|
40 |
| -- All other strings are allocated dynamically, and have their |
41 |
| - `_PyUnicode_STATE(s).statically_allocated` flag set to zero. |
42 |
| - When interned, such strings are added to an interpreter-wide dict, |
43 |
| - `PyInterpreterState.cached_objects.interned_strings`. |
| 42 | +## Dynamically allocated strings |
44 | 43 |
|
45 |
| - The key and value of each entry in this dict reference the same object. |
| 44 | +All other strings are allocated dynamically, and have their |
| 45 | +`_PyUnicode_STATE(s).statically_allocated` flag set to zero. |
| 46 | +When interned, such strings are added to an interpreter-wide dict, |
| 47 | +`PyInterpreterState.cached_objects.interned_strings`. |
46 | 48 |
|
47 |
| -The three sets of singletons (`_Py_STR`, `_Py_ID`, `_Py_LATIN1_CHR`) |
48 |
| -are disjoint. |
49 |
| -If you have such a singleton, it (and no other copy) will be interned. |
| 49 | +The key and value of each entry in this dict reference the same object. |
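To illustrate the "key and value reference the same object" layout, here is a minimal Python model of an interning table. It is only a sketch: `intern_into` and `table` are hypothetical names, and a plain `dict` stands in for the interpreter-wide dict. CPython's actual implementation is written in C and is not shown here.

```python
def intern_into(table: dict, s: str) -> str:
    """Return the canonical copy of s, registering s itself if none exists."""
    # setdefault stores s as both key and value the first time an equal
    # string is seen; afterwards it returns the object stored earlier.
    return table.setdefault(s, s)


table = {}  # hypothetical stand-in for the interned-strings dict

# Build two equal strings at run time so they are distinct objects.
first = "".join(["dyn", "amic"])
second = "".join(["dyn", "amic"])

canonical = intern_into(table, first)   # first becomes the canonical copy
same = intern_into(table, second)       # second resolves to that same copy
```

In this model, interning an equal but distinct string never adds a second entry; the caller simply gets back the object that was registered first.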
50 | 50 |
|
51 | 51 |
|
52 | 52 | ## Immortality and reference counting
|
53 | 53 |
|
54 |
| -Invariant: Every immortal string is interned, *except* the one-char latin-1 |
55 |
| -singletons (which might but might not be interned). |
| 54 | +Invariant: Every immortal string is interned. |
56 | 55 |
|
57 | 56 | In practice, this means that you must not use `_Py_SetImmortal` on
|
58 | 57 | a string. (If you know it's already immortal, don't immortalize it;
|
@@ -115,8 +114,5 @@ The valid transitions between these states are:
|
115 | 114 | Using `_PyUnicode_InternStatic` on these is an error; the other cases
|
116 | 115 | don't change the state.
|
117 | 116 |
|
118 |
| -- One-char latin-1 singletons can be interned (0 -> 3) using any interning |
119 |
| - function; after that the functions don't change the state. |
120 |
| - |
121 |
| -- Other statically allocated strings are interned (0 -> 3) at runtime init; |
| 117 | +- Singletons are interned (0 -> 3) at runtime init; |
122 | 118 | after that, no interning function changes the state.
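The "interning again changes nothing" behaviour can be observed from Python through `sys.intern`. This is a small sketch that checks object identity only; it does not inspect the internal state numbers, and the variable names are illustrative.

```python
import sys

# Build the string at run time so it is a fresh, dynamically allocated object.
dynamic = "".join(["not", "_", "a", "_", "singleton"])

interned = sys.intern(dynamic)
again = sys.intern(interned)  # interning an already-interned string

# Equal one-character latin-1 strings resolve to one shared object
# once interned, however they were created.
one_char = sys.intern(chr(0xE9))
one_char_again = sys.intern("\xe9")
```

Here `again` is the very same object as `interned`, and both one-character results are the same object.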
|