This document describes the rationale behind the ABI choices made for using the bit-precise integral types defined in C2x. These are `_BitInt(N)` and `unsigned _BitInt(N)`. These are defined for integral `N`, and each `N` is a different type.
The proposal for these types can be found in the following link: https://www.open-std.org/jtc1/sc22/wg14/www/docs/n2763.pdf
As the rationale in that proposal mentions, some applications have uses for types of a specific bit-width. For example, when writing C code used to describe FPGA hardware, these specific bit-width types can lead to large performance and space savings.
From the perspective of the Arm ABI we have some trade-offs and decisions to make:
- We need to choose a representation for these objects in registers.
- We need to choose a representation, size and alignment of these objects in memory.
The main trade-offs we have identified in this case are:
- Performance of different C-level operations.
- Whether certain hardware-level atomic operations are possible.
- Size cost of storing values in memory.
- General familiarity of programmers with the representation.
Since this is a new type, there is large uncertainty about how programmers will use it in the future. Decisions we make here may also influence future usage. We must make trade-off decisions within this uncertainty. The below attempts to analyze possible use-cases to make our best guess as to how these types may be used when targeting Arm CPUs.
We believe there are two regimes for these types. The "small" regime where bit-precise types could be stored in a single general-purpose register, and the "large" regime where bit-precise types must span multiple general-purpose registers.
Here we discuss the use-cases for bit-precise integer types that we have identified or been alerted to so far.
A major motivating use-case for this new type is to aid writing C code which describes the desired behavior of an FPGA. Without the availability of the new `_BitInt` type, such C code would semantically have much wider types than necessary when performing operations, especially given that operations on small integral types promote their operands to `int`.
If these wider-than-necessary operations end up in the FPGA they would use many more logic gates than necessary. Using `_BitInt` allows programmers to write code which directly expresses what is needed. This can ensure the generated FPGA description saves space and has better performance.
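As a minimal sketch of the difference (the function names are ours, for illustration):

```c
/* With standard small types, C's integer promotions widen both operands
   to int, so the addition below is semantically (at least) 32 bits wide: */
unsigned char add_char(unsigned char a, unsigned char b) {
    return (unsigned char)(a + b);  /* 32-bit add, truncated back to 8 bits */
}

/* With a bit-precise type there is no promotion: the addition is defined
   at 6 bits, so an FPGA back end can emit a 6-bit adder directly. */
unsigned _BitInt(6) add6(unsigned _BitInt(6) a, unsigned _BitInt(6) b) {
    return a + b;                   /* 6-bit add, wraps modulo 64 */
}
```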
The notable thing about this use-case is that though the C code may be run on an Arm architecture for testing, the most critical use is when transferred to an FPGA (that is, not an Arm architecture).
However, if the operation that this FPGA performs becomes popular there may be a need to run the code directly on CPUs in the future.
The requirements on the Arm ABI from this use-case are relatively small, since the main focus is on running on an FPGA. We believe the use-case adds weight both to the need for performance and to programmer familiarity. This belief comes from the estimate that it may lead to bit-precise types being used in performance-critical code in the future, and that bit-precise types may be used on Arm architectures when testing FPGA descriptions (where ease of debugging can be prioritized).
Some image file-types use 24-bit color. The new `_BitInt` type may be used to hold such information.
As it stands we do not know of any specific reason to use a bit-precise integral type as opposed to a structure of three bytes for these data types.
If used for 24-bit color, we believe that the performance of standard arithmetic operations would not be critical. This is because each 24-bit pixel usually represents three 8-bit colors, so operations would be unlikely to be performed on the single value as a whole.
We also believe that if used for 24-bit color it would be helpful to specify a size and alignment scheme such that an array of `_BitInt(24)` is well packed.
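As a rough sketch (the sizes shown assume the Fundamental-Data-Type scheme this document eventually chooses):

```c
#include <stdint.h>

/* A 24-bit pixel as a struct of three bytes packs to exactly 3 bytes
   per array element: */
struct rgb { uint8_t r, g, b; };
_Static_assert(sizeof(struct rgb[4]) == 12, "3 bytes per pixel");

/* Under a byte-aligned scheme an array of _BitInt(24) could match that;
   under the scheme chosen later in this document each element instead
   rounds up to a 4-byte word: */
_Static_assert(sizeof(_BitInt(24)[4]) == 16, "4 bytes per element");
```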
Many networking protocols use packed structures in order to minimize data sent over the wire. In order to be perfectly packed the code will need to use bit-fields rather than bit-precise types for storage, since bit-precise types must be addressable and hence at least byte-aligned.
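A minimal sketch of the difference (field names are ours):

```c
/* Perfect packing needs bit-fields: these two 4-bit fields share a
   single byte (using an unsigned char container, as AAPCS permits): */
struct hdr_bitfield { unsigned char seq : 4, len : 4; };  /* 1 byte  */

/* Bit-precise members must be addressable and hence at least
   byte-aligned, so the same two fields occupy a byte each: */
struct hdr_bitint  { unsigned _BitInt(4) seq, len; };     /* 2 bytes */
```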
The incentive to use bit-precise integral types in networking code would be to best represent the operation being performed.
One negative of using bit-precise integral types for networking code would be that idioms like `if (x + y > max_representable)`, where `x` and `y` have been loaded from small bit-fields, would no longer be viable. We have seen such idioms for small values in networking code in the Linux kernel. These are intuitive to write, but if `x` and `y` were bit-precise types they would not work as expected.
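A minimal sketch of the failure mode (values are ours, for illustration):

```c
#include <stdio.h>

int main(void) {
    struct { unsigned int x : 4, y : 4; } h = { 12, 9 };

    /* Bit-field operands promote to int, so the sum is computed at int
       width and the range check behaves as intended (12 + 9 = 21 > 15): */
    if (h.x + h.y > 15)
        puts("overflow detected");  /* printed */

    unsigned _BitInt(4) x = 12, y = 9;

    /* Bit-precise operands do not promote: x + y wraps modulo 16 to 5,
       so the same check silently never fires: */
    if (x + y > 15)
        puts("overflow detected");  /* never printed */
}
```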
If used in code handling networking protocols, our estimate is that the arithmetic manipulation performed on such values will not be the main performance bottleneck. This estimate comes from the belief that networking is often IO-bound, and that small packed values in networking protocols tend to have limited arithmetic performed on them.
Hence we believe that ease of debugging of values in registers may be more critical than performance concerns in this use-case.
The fact that bit-precise types do not automatically promote to an `int` during operations could remove some casts which are necessary under C semantics but can obscure the intention of a user's code. One place this may help is in auto-vectorization, where the compiler must be able to see through intermediate casts in order to identify the operations being performed.
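A minimal sketch (function names are ours):

```c
#include <stdint.h>

/* With standard small types, C semantics insert an implicit widening to
   int and a truncating cast back on every operation, which the
   vectorizer must see through: */
void add_u8(uint8_t *d, const uint8_t *a, const uint8_t *b, int n) {
    for (int i = 0; i < n; i++)
        d[i] = (uint8_t)(a[i] + b[i]);  /* int-width add, then truncate */
}

/* With bit-precise types the addition is an 8-bit operation by
   definition, leaving no intermediate conversions to reason about: */
void add_u8bp(unsigned _BitInt(8) *d, const unsigned _BitInt(8) *a,
              const unsigned _BitInt(8) *b, int n) {
    for (int i = 0; i < n; i++)
        d[i] = a[i] + b[i];             /* 8-bit add, wraps modulo 256 */
}
```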
The incentive for this use-case is an increased likelihood of the compiler generating optimal autovectorized code.
One point which might imply less take-up of this use-case is that programmers willing to put in extra effort to ensure good vectorization of a loop have the option to use compiler intrinsics. This means that using bit-precise types would be a mid-range option providing less-guaranteed codegen improvement for less effort.
The ABI should not have much of an effect on this use-case directly, since the optimization would be done in the target-independent part of compilers and the eventual operations in auto-vectorized code would be acting on vector machine types.
That said, bit-precise types would also be used in the surrounding code. Given that in this use-case these types are adopted for performance reasons, it seems reasonable to guess that this concern around performance would apply to the surrounding code as well. Hence it seems that this use-case would benefit from prioritizing performance.
In this use-case the programmer would be converting a codebase using either 8-bit or 16-bit integers to a bit-precise type of the same size. Such a codebase may include calls to variadic functions (such as `printf`) in surrounding code. Variadic functions like this may be missed when changing types in a codebase, so it would be helpful if the machine representation of the bit-precise types passed matched that of the relevant standard integral types, in order to avoid extra difficulties during the conversion. The C semantics require that variadic arguments undergo standard integral promotions. While `int8_t` and the like undergo integral promotion, `_BitInt` does not. Hence this use-case would benefit from having the representation of `_BitInt(8)` in the PCS match that of `int`, and similarly for the 16-bit and unsigned variants (which implies having them sign- or zero-extended).
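A minimal sketch of the hazard (the function is ours, for illustration):

```c
#include <stdint.h>
#include <stdio.h>

void report(int8_t old_count, _BitInt(8) new_count) {
    /* int8_t undergoes integer promotion to int, so this is well defined: */
    printf("%d\n", old_count);

    /* _BitInt(8) does not promote, yet printf reads an int here. This
       only behaves as the (unconverted) code expects if the PCS passes
       _BitInt(8) sign-extended, matching the representation of an int: */
    printf("%d\n", new_count);
}
```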
Decisions which do not affect 8- and 16-bit types would not affect this use-case.
Many cryptography algorithms perform operations on large objects. It seems that using a `_BitInt(128)` or `_BitInt(256)` could express cryptographic algorithms more concisely.
For symmetric algorithms the existing block cipher and hash algorithms do not tend to operate on chunks this size as single integers. This seems like it will remain the case due to CPU limitations and a desire to understand the performance characteristics of written algorithms.
For asymmetric algorithms something like elliptic curve cryptography seems like it could gain readability from using the new bit-precise types. However there would likely be concern around whether code generated from using these types is guaranteed to use constant-time operations.
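A hypothetical sketch of the readability gain (names are ours; assumes `a, b < p < 2^255`):

```c
typedef unsigned _BitInt(256) u256;

/* Modular addition written directly on a single 256-bit integer.
   Whether the conditional subtraction is lowered to a constant-time
   select or to a branch is exactly the concern raised above. */
u256 mod_add(u256 a, u256 b, u256 p) {
    u256 s = a + b;               /* cannot wrap: a + b < 2p < 2^256 */
    return (s >= p) ? s - p : s;
}
```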
This use-case would only be using "large" bit-precise types. Moreover all relevant sizes are powers of two.
At the moment there exist some high-level languages which support arbitrary bit-width integers. Translating such languages to C would benefit from the new C type.
We do not know of any specific use-case within these languages other than for cryptography algorithms as above. Hence the trade-offs in this space are assumed to be based on the trade-offs from the cryptography use-case above.
We estimate the use of translating a more esoteric language to C to be less common than writing code directly in C. Hence the weighting of this use-case in our trade-offs is correspondingly lower than others.
We have heard of interest in using the new bit-precise integer types to implement transparent BigNum libraries in C.
Such a use-case unfortunately does not directly indicate what kind of code would use these types (for example, whether this code would be algorithmic or I/O bound). Given the mention of 512x512 matrices in the discussion where we heard of this use-case, we assume that in general such a library would be used by CPU-bound code.
Hence we assume that the main consideration here would be performance.
In our estimation, the C to FPGA use-case seems to be the most promising. We estimate that use in this space will account for the great majority of the use of this new type.
Uses for cryptography, networking, and in order to help the compiler optimize certain code seem like they are large enough to consider but not as widespread.
For the C to FPGA use case, the majority of the use is not expected to be seen on Arm Architectures. For helping the compiler optimize code we expect to only see bit-precise types with sizes matching that of standard integral types. Cryptographic uses are only expected on "large" sizes which are powers of two. Networking uses are likely to be using bit-fields for in-memory representations.
All use-cases would have concerns around performance and the familiarity of representations. There does not seem to be a clear choice to prefer one or the other.
These types must be at least byte-aligned so they are addressable, and their size must be at least rounded up to a byte boundary for `sizeof`.
For the "small" regime there are 2 obvious options:
- Byte alignment.
- Alignment and size "as if" stored in the next-largest Fundamental Data Type. (Where the Fundamental Data Types are defined in the relevant PCS documents).
Option A has the following benefits:
- Better packing in an array of `_BitInt(24)` than an array of `int32_t`. This is more relevant for bit-precise types than others, since these types have an aesthetic similarity to bit-fields and hence programmers might expect good packing.
Option B has the following benefits (the first and third following from the alignment being greater than or equal to the size of the object in memory):
- Avoiding a performance hit, since loads and stores of these "small" sized `_BitInt`s will not cross cache-line boundaries.
- Bit-precise types of the same size as standard integer types will have the same alignment and size in memory.
- Atomic loads and stores can be made on these objects.
In the use-cases we have identified above we did not notice any special need for tight packing. All of the use-cases we identified would benefit from better performance characteristics, and the use-case of helping the compiler optimize some code would benefit greatly from `_BitInt(8)` having the same alignment and size as an `int8_t`.
Hence for "small" sizes we are choosing to define a _BitInt(N)
size and
alignment according to the smallest Fundamental Data Type which has a bit-size
greater or equal to N
. Similar for unsigned
versions.
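On AArch64, where the integral Fundamental Data Types are 1, 2, 4 and 8 bytes, this rule gives for example:

```c
/* Each _BitInt(N) takes the size and alignment of the smallest
   Fundamental Data Type of at least N bits: */
_Static_assert(sizeof(_BitInt(7))  == 1 && _Alignof(_BitInt(7))  == 1, "byte");
_Static_assert(sizeof(_BitInt(24)) == 4 && _Alignof(_BitInt(24)) == 4, "word");
_Static_assert(sizeof(_BitInt(33)) == 8 && _Alignof(_BitInt(33)) == 8, "doubleword");
```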
For "large" sizes the only approach considered has been to treat these
bit-precise types as an array of M
sized chunks, for some M
.
There are two obvious choices for M
:
- Register sized.
- Double-register sized.
Option A has the following benefits:
- The alignment of a `_BitInt(128)` on AArch64 would match that of other architectures which have already defined their ABI. This could reduce surprises when writing portable code.
- Less space is used for half of the large values of `N`.
- Multiplications on a large `_BitInt(N)` can be performed using chunks of size `M`, which should result in a neater compiler implementation. For example, AArch64 has an `SMULH` instruction which could be used as part of a multiplication of an entire chunk.
Option B has the following benefits:
- On AArch32 a `_BitInt(64)` would have the same alignment and size as an `int64_t`, and on AArch64 a `_BitInt(128)` would have the same alignment and size as a `__int128`.
- Double-register sized integers match the largest Fundamental Data Types defined in the relevant PCS documents for both architectures. We believe that developers familiar with the Arm ABI would find this mapping less surprising and hence make fewer mistakes. This includes those working at FFI boundaries interfacing with the C ABI.
- Atomic operations would be allowed on types in the range between register and double-register sizes. This is due to the associated extra alignment allowing operations like `CASP` on AArch64 and `LDRD` on AArch32. Similarly, this would allow `LDP` and `STP` single-copy atomicity on architectures with the LSE2 extension.
The "large" size use-cases we have identified so far are of power-of-two sizes. These sizes would not benefit greatly from the positives of either of the options presented here, with the only difference being in the implementation of multiplication.
Our estimate is that the benefits of option B are more useful for sizes between register and double-register than those of option A. This is not considered a clear-cut choice, with the main point in favour of option A being a smaller difference from other architectures' psABI choices.
Other variants are available, such as choosing alignment and size based on register-sized chunks except for the special case of the double-register sized `_BitInt`. Though such variants can provide a good combination of the properties above, we judge the extra complexity of definition to carry an increased likelihood of mistakes when developers' code relies on ABI choices.
Based on the above reasoning, we would choose to define the size and alignment of `_BitInt(N > [register-size])` types by treating them "as if" they are an array of double-register sized Fundamental Data Types.
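On AArch64 (an 8-byte register, so a 16-byte chunk) this gives for example:

```c
/* "Large" values occupy an array of 16-byte chunks, so size rounds up
   to a multiple of 16 and alignment is that of one chunk: */
_Static_assert(sizeof(_BitInt(65))  == 16 && _Alignof(_BitInt(65))  == 16, "");
_Static_assert(sizeof(_BitInt(129)) == 32 && _Alignof(_BitInt(129)) == 16, "");
```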
There are two decisions around the representation of a "small" `_BitInt` that we have identified: (1) whether the required bits are stored at the least significant or most significant end of a register or region in memory; (2) whether the "remaining" bits are specified after rounding up to the size specified in Alignment and sizes. The choice of how "remaining" bits are specified ties in to the choice made for (1).
We have identified three viable options:
- Required bits stored in most significant end. Not-required bits are specified as zero at ABI boundaries.
- Required bits stored in least significant end. Not-required bits are unspecified at ABI boundaries.
- Required bits stored in least significant end. Not-required bits are specified as zero- or sign-extended.
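As a concrete illustration (the value is ours, for exposition), consider a `_BitInt(4)` holding the value -3, bit pattern `0b1101`, in a 64-bit register:
- Option A: the register holds `0xD000000000000000` (required bits at the most significant end, remaining bits zero).
- Option B: the low four bits hold `0xD` and bits 4-63 are unspecified.
- Option C: the register holds `0xFFFFFFFFFFFFFFFD` (sign-extended, since the type is signed).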
While it would be possible to impose different requirements for bit-precise integer types in memory and in registers, we believe that the combined negatives of that choice are reason enough not to look into the option. These negatives are that code would have to perform a transformation when loading and storing values, and that different representations in memory and registers are likely to cause programmer confusion.
Similarly, it would be possible to define a representation in registers that does something like specifying bits `[2-7]` of a `_BitInt(2)` but leaving bits `[8-63]` unspecified. This would seem to choose the worst of both worlds in terms of performance, since one must both ensure that "overflow" from an addition of `_BitInt(2)` types does not affect the specified bits, and ensure that the unspecified bits above bit number 7 do not affect multiplication or division operations. Hence we do not look at variations of this kind.
For option A there is an extra choice around how "large" values are stored. One could have the "padding" bits either in the least significant "chunk" or in the most significant "chunk". Having these padding bits in the least significant chunk would mean that something like a widening cast would require updating every "chunk" in memory, hence we assume large values under option A would be represented with the padding bits in the most significant chunk.
Option A has the following benefits:
- The operations `+,-,%,==,<=,>=,<,>,<<` all work without any extra instructions (which is more of the common operations than the other representations).
- For small values in memory, on AArch64, operations like `LDADD` and `LD{S,U}MAX` both work (assuming the relevant register operand is appropriately shifted).
It has the following negatives:
- This would be a less familiar representation to programmers. Especially the fact that a `_BitInt(8)` would not have the same representation in a register as a `char` could cause confusion (we imagine when debugging, or writing assembly code). This would likely be increased if other architectures that programmers may use have a more familiar representation.
- The operations `*,/`, saving and loading values to memory, and casting to another type would all require extra cost.
- The operations `+,-` on "large" values (greater than one register) would require an extra instruction to "normalize" the carry-bit.
- If used in calls to variadic functions which were written for standard integral types, this can give surprising results.
Option B has the following benefits:
- The operations `+,-,*,<<`, narrowing conversions, and loading/storing to memory would all naturally work.
- On AArch64 this would most likely match the expectation of developers, and small power-of-two sizes would have the same representation as standard types in registers. For example a `_BitInt(8)` would have the same representation as a `char` in registers.
- For small values in memory, the AArch64 `LDADD` operations work naturally.
It has the following negatives:
- The operations `/,%,==,<,>,<=,>=,>>` and widening conversions on operands coming from an ABI boundary would require masking the operands (see the sketch after this list).
- On AArch32 this could cause surprises to developers, given that on this architecture small Fundamental Data Types have their extra bits zero- or sign-extended. So a `char` would not have the same representation as a `_BitInt(8)` on this architecture.
- If used in calls to variadic functions which were written for standard integral types, this can give surprising results.
- The AArch64 `LD{S,U}MAX` operations would not work naturally on small values of this representation.
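A minimal sketch of the masking cost (the function is ours):

```c
/* Under option B the bits above bit 3 of an incoming unsigned _BitInt(4)
   are unspecified, so before a full-register compare the compiler must
   emit the equivalent of:
       and w0, w0, #0xf
       and w1, w1, #0xf
       cmp w0, w1
   whereas under option C the operands arrive already zero-extended. */
int is_less(unsigned _BitInt(4) x, unsigned _BitInt(4) y) {
    return x < y;
}
```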
Option C has the following benefits:
- The operations `==,<,<=,>=,>,>>`, widening conversions, and loading/storing to memory would all naturally work.
- On AArch32 this could match the expectation of developers, with a `_BitInt(8)` in a register matching the representation of a `char`.
- If used in variadic function calls, mismatches between `_BitInt` types and standard integral types would not cause as much of a problem.
- For small values in memory, the AArch64 `LD{S,U}MAX` operations work naturally.
It has the following negatives:
- The operations `+,-,*,<<` would all require masking at an ABI boundary.
- On AArch64 this would not match the expectation of developers, with `_BitInt(8)` not matching the representation of a `char`.
- The AArch64 `LDADD` operations would not work naturally.
Overall it seems that option A is more performant for operations on small values. However, when acting on "large" values (here defined as greater than the size of one register) it loses some of that benefit. Moving values to and from memory would also come at a cost for this representation. This is also likely to be the most surprising representation for developers on an Arm platform.
Between option B and option C there is not a great difference in performance characteristics. However, it should be noted that option C is the most natural extension of the AArch32 PCS rules for unspecified bits in a register containing a small Fundamental Data Type, while option B is the most natural extension of the similar rules in the AArch64 PCS. Another distinction between the two is that option C would mean that accidental misuse of a bit-precise type instead of a standard integral type should not cause problems, while option B could give strange values. This would be most visible with variadic functions.
As mentioned above, both performance concerns and a familiar representation are valuable in the use-cases that we have identified. This has made the decision non-obvious. We have chosen to favor representation familiarity.
Choosing between option C and option B is also non-obvious. It seems relatively clear to choose option C for AArch32. We choose option B for AArch64 to prefer that across most ABI boundaries a `char` and a `_BitInt(8)` have the same representation, but acknowledge that this could cause surprise to programmers when using variadic functions.