Rationale Document for ABI related to the C23 _BitInt type.

Preamble

Background

This document describes the rationale behind the ABI choices made for the bit-precise integral types defined in C23. These are _BitInt(N) and unsigned _BitInt(N). They are defined for integral N, and each N gives a distinct type.

The proposal for these types can be found in the following link: https://www.open-std.org/jtc1/sc22/wg14/www/docs/n2763.pdf

As the rationale in that proposal mentioned, some applications have uses for a specific bit-width type. In the case of writing C code which can be used to determine FPGA hardware these specific bit-width types can lead to large performance and space savings.

From the perspective of the Arm ABI we have some trade-offs and decisions to make:

  • We need to choose a representation for these objects in registers.
  • We need to choose a representation, size and alignment of these objects in memory.

The main trade-offs we have identified in this case are:

  • Performance of different C-level operations.
  • Whether certain hardware-level atomic operations are possible.
  • Size cost of storing values in memory.
  • General familiarity of programmers with the representation.

Since this is a new type, there is large uncertainty about how programmers will use it in the future. Decisions we make here may also influence future usage. We must make trade-off decisions within this uncertainty. The analysis below attempts to identify possible use-cases and make our best guess as to how these types may be used when targeting Arm CPUs.

Use-cases known of so far

We believe there are two regimes for these types. The "small" regime where bit-precise types could be stored in a single general-purpose register, and the "large" regime where bit-precise types must span multiple general-purpose registers.

Here we discuss the use-cases for bit-precise integer types that we have identified or been alerted to so far.

C code to describe FPGA behavior

A major motivating use-case for this new type is to aid writing C code which describes the desired behavior of an FPGA. Without the availability of the new _BitInt type such C code would semantically have much wider types than necessary when performing operations, especially given that operations on small integral types promote their operands to int.

If these wider-than-necessary operations end up in the FPGA they would use many more logic gates than necessary. Using _BitInt allows programmers to write code which directly expresses what is needed. This can ensure the generated FPGA description saves space and has better performance.

The notable thing about this use-case is that though the C code may be run on an Arm architecture for testing, the most critical use is when transferred to an FPGA (that is, not an Arm architecture).

However, if the operation that this FPGA performs becomes popular there may be a need to run the code directly on CPUs in the future.

The requirements on the Arm ABI from this use-case are relatively small since the main focus is around running on an FPGA. We believe the use-case adds weight to both the need for performance and familiarity of programmers. This belief comes from the estimate that this may lead to bit-precise types being used in performance critical code in the future, and that it may mean that bit-precise types are used on Arm architectures when testing FPGA descriptions (where ease of debugging can be prioritized).

24-bit Color

Some image file-types use 24-bit color. The new _BitInt type may be used to hold such information.

As it stands we do not know of any specific reason to use a bit-precise integral type as opposed to a structure of three bytes for these data types.

If used for 24-bit color we believe that the performance of standard arithmetic operations would not be critical. This is because each 24-bit pixel usually represents three 8-bit colors, so operations would be unlikely to be performed on the single value as a whole.

We also believe that if used for 24-bit color it would be helpful to specify a size and alignment scheme such that an array of _BitInt(24) is well packed.

Networking Protocols

Many networking protocols have packed structures in order to minimize the data sent over the wire. To be perfectly packed, such code will need to use bit-fields rather than bit-precise types for storage, since bit-precise types must be addressable and hence at least byte-aligned.

The incentives to use bit-precise integral types for networking code would be in order to maintain the best representation of the operation that is being performed.

One negative of using bit-precise integral types for networking code is that idioms like if (x + y > max_representable), where x and y have been loaded from small bit-fields, would no longer be viable. We have seen such idioms for small values in networking code in the Linux kernel. These are intuitive to write, but if x and y were bit-precise types they would not work as expected.

If used in code handling networking protocols, our estimate is that the arithmetic manipulation performed on such values will not be the main performance bottleneck in networking protocols. This estimation comes from the belief that networking is often IO bound, and that small packed values in networking protocols tend to have limited arithmetic performed on them.

Hence we believe that ease of debugging of values in registers may be more critical than performance concerns in this use-case.

To help the compiler optimize (possibly for auto vectorization)

The behavior that bit-precise types do not automatically promote to an int during operations could remove some casts which are necessary for C semantics but can obscure the intention of a users code. One place this may help is in auto vectorization, where the compiler must be able to see through intermediate casts in order to identify the operations being performed.

The incentive for this use-case is an increased likelihood of the compiler generating optimal autovectorized code.

One point which might imply less take-up of this use-case is that programmers willing to put in extra effort to ensure good vectorization of a loop have the option to use compiler intrinsics. This means that using bit-precise types would be a mid-range option providing less-guaranteed codegen improvement for less effort.

The ABI should not have much of an effect on this use-case directly, since the optimization would be done in the target-independent part of compilers and the eventual operations in auto-vectorized code would be acting on vector machine types.

That said, bit-precise types would also be used in the surrounding code. Given that in this use-case these types are adopted for performance reasons, it seems reasonable to guess that the concern around performance would apply to the surrounding code as well. Hence it seems that this use-case would benefit from prioritizing performance.

In this use-case the programmer would be converting a codebase from 8-bit or 16-bit integers to bit-precise types of the same size. Such a codebase may include calls to variadic functions (such as printf) in the surrounding code. Calls like these are easily missed when changing types across a codebase, so it would be helpful if the bit-precise types passed had the same machine representation as the corresponding standard integral types, avoiding extra difficulties during the conversion. C semantics require that variadic arguments undergo the default integral promotions; int8_t and the like are promoted, but _BitInt is not. Hence this use-case would benefit from the PCS representation of _BitInt(8) matching that of int, and similarly for the 16-bit and unsigned variants (which implies having them sign- or zero-extended).

Decisions which do not affect 8 and 16 bit types would not affect this use-case.

For representing cryptography algorithms

Many cryptography algorithms perform operations on large objects. It seems that using a _BitInt(128) or _BitInt(256) could express cryptographic algorithms more concisely.

For symmetric algorithms the existing block cipher and hash algorithms do not tend to operate on chunks this size as single integers. This seems like it will remain the case due to CPU limitations and a desire to understand the performance characteristics of written algorithms.

For asymmetric algorithms something like elliptic curve cryptography seems like it could gain readability from using the new bit-precise types. However there would likely be concern around whether code generated from using these types is guaranteed to use constant-time operations.

This use-case would only be using "large" bit-precise types. Moreover all relevant sizes are powers of two.

Translating some more esoteric languages to C

At the moment there exist some high-level languages which support arbitrary bit-width integers. Translating such languages to C would benefit from the new C type.

We do not know of any specific use-case within these languages other than for cryptography algorithms as above. Hence the trade-offs in this space are assumed to be based on the trade-offs from the cryptography use-case above.

We estimate the use of translating a more esoteric language to C to be less common than writing code directly in C. Hence the weighting of this use-case in our trade-offs is correspondingly lower than others.

Possible transparent BigNum libraries

We have heard of interest in using the new bit-precise integer types to implement transparent BigNum libraries in C.

Such a use-case unfortunately does not directly indicate what kind of code would use these types (for example, whether that code would be algorithmic or I/O bound). Given the mention of 512x512 matrices in the discussion where we heard of this use-case, we assume that in general such a library would be used in CPU-bound code.

Hence we assume that the main consideration here would be performance.

Summary of use-case trade-offs

In our estimation, the C to FPGA use case seems to be the most promising. We estimate that the use in this space will be a great majority of the use of this new type.

Uses for cryptography, networking, and in order to help the compiler optimize certain code seem like they are large enough to consider but not as widespread.

For the C to FPGA use case, the majority of the use is not expected to be seen on Arm Architectures. For helping the compiler optimize code we expect to only see bit-precise types with sizes matching that of standard integral types. Cryptographic uses are only expected on "large" sizes which are powers of two. Networking uses are likely to be using bit-fields for in-memory representations.

All use-cases would have concerns around performance and the familiarity of representations. There does not seem to be a clear choice to prefer one or the other.

Alignment and sizes

Options and their trade-offs

These types must be at least byte-aligned so they are addressable, and at least rounded to a byte boundary in size for sizeof.

"Small" regime

For the "small" regime there are 2 obvious options:

  A. Byte alignment.
  B. Alignment and size "as if" stored in the next-largest Fundamental Data Type. (The Fundamental Data Types are defined in the relevant PCS documents.)

Option A has the following benefits:

  • Better packing in an array of _BitInt(24) than an array of int32_t. This is more relevant for bit-precise types than others since these types have an aesthetic similarity to bit-fields and hence programmers might expect good packing.

Option B has the following benefits (all following from the alignment being greater than or equal to the size of the object in memory):

  • Avoid a performance hit since loads and stores of these "small" _BitInt values will not cross cache-line boundaries.
  • The representation of bit-precise types of the same size as standard integer types will have the same alignment and size in memory.
  • Atomic loads and stores can be made on these objects.

In the use-cases we have identified above we did not notice any special need for tight packing. All of the use-cases we identified would benefit from better performance characteristics, and the use-case to help the compiler in optimizing some code would benefit greatly from _BitInt(8) having the same alignment and size as an int8_t.

Hence for "small" sizes we choose to define the size and alignment of a _BitInt(N) according to the smallest Fundamental Data Type with a bit-size greater than or equal to N. Similarly for the unsigned versions.

"Large" regime

For "large" sizes the only approach considered has been to treat these bit-precise types as an array of M sized chunks, for some M.

There are two obvious choices for M:

  A. Register sized.
  B. Double-register sized.

Option A has the following benefits:

  • This would mean that the alignment of a _BitInt(128) on AArch64 matches that of other architectures which have already defined their ABI. This could reduce surprises when writing portable code.
  • Less space used for half of the large values of N.
  • Multiplications on large _BitInt(N) can be performed using chunks of size M, which should result in a neater compiler implementation. For example AArch64 has an SMULH instruction which could be used as part of a multiplication of an entire chunk.

Option B has the following benefits:

  • On AArch32 a _BitInt(64) would have the same alignment and size as an int64_t, and on AArch64 a _BitInt(128) would have the same alignment and size as a __int128.
  • Double-register sized integers match the largest Fundamental Data Types defined in the relevant PCS documents for both architectures. We believe that developers familiar with the Arm ABI would find this mapping less surprising and hence make fewer mistakes. This includes those working at FFI boundaries interfacing to the C ABI.
  • Would allow atomic operations on types in the range between register and double-register sizes. This is due to the associated extra alignment allowing operations like CASP on AArch64 and LDRD on AArch32. Similarly this would allow LDP and STP single-copy atomicity on architectures with the LSE2 extension.

The "large" size use-cases we have identified so far are of power-of-two sizes. These sizes would not benefit greatly from the positives of either of the options presented here, with the only difference being in the implementation of multiplication.

Our estimate is that the benefits of option B are more useful for sizes between register and double-register than those from option A. This is not considered a clear-cut choice, with the main point in favour of option A being a smaller difference from other architectures' psABI choices.

Other variants are available, such as choosing alignment and size based on register sized chunks except for the special case of the double-register sized _BitInt. Though such variants can provide a good combination of the properties above, we judge the extra complexity of definition to carry an increased likelihood of mistakes when developers' code relies on ABI choices.

Based on the above reasoning, we would choose to define the size and alignment of _BitInt(N > [register-size]) types by treating them "as if" they are an array of double-register sized Fundamental Data Types.

Representation in bits

There are two decisions around the representation of a "small" _BitInt that we have identified. (1) Whether required bits are stored in the least significant end or most significant end of a register or region in memory. (2) Whether the "remaining" bits are specified after rounding up to the size specified in Alignment and sizes. The choice of how "remaining" bits would be specified would tie in to the choice made for (1).

Options and their trade-offs

We have identified three viable options:

  A. Required bits stored in the most significant end. Not-required bits are specified as zero at ABI boundaries.
  B. Required bits stored in the least significant end. Not-required bits are unspecified at ABI boundaries.
  C. Required bits stored in the least significant end. Not-required bits are specified as zero- or sign-extended.

While it would be possible to make different requirements for bit-precise integer types in memory vs in registers, we believe that the combined negatives of the choice are reason enough to not look into the option. These negatives being that code would have to perform a transformation on loading and storing values, and that different representations in memory and registers is likely to cause programmer confusion.

Similarly, it would be possible to define a representation in registers that does something like specifying bits [2-7] of a _BitInt(2) but leaves bits [8-63] unspecified. This would seem to choose the worst of both worlds in terms of performance, since one must both ensure "overflow" from an addition of _BitInt(2) types does not affect the specified bits and ensure that the unspecified bits above bit number 7 do not affect multiplication or division operations. Hence we do not look at variations of this kind.

For option A there is an extra choice around how "large" values are stored. One could have the "padding" bits either in the least significant "chunk" or in the most significant "chunk". Having the padding bits in the least significant chunk would mean that something like a widening cast would require updating every "chunk" in memory, hence we assume large values under option A would be represented with the padding bits in the most significant chunk.

Option A has the following benefits:

  • Operations +,-,%,==,<=,>=,<,>,<< all work without any extra instructions (which covers more of the common operations than the other representations do).
  • For small values in memory, on AArch64, the operations like LDADD and LD{S,U}MAX both work (assuming the relevant register operand is appropriately shifted).

It has the following negatives:

  • This would be a less familiar representation to programmers. Especially the fact that a _BitInt(8) would not have the same representation in a register as a char could cause confusion (we imagine when debugging, or writing assembly code). This would likely be increased if other architectures that programmers may use have a more familiar representation.
  • Operations *,/, saving and loading values to memory, and casting to another type would all incur extra cost.
  • Operations +,- on "large" values (greater than one register) would require an extra instruction to "normalize" the carry-bit.
  • If used in calls to variadic functions which were written for standard integral types, this can give surprising results.

Option B has the following benefits:

  • Operations +,-,*,<<, narrowing conversions, and loading/storing to memory would all naturally work.
  • On AArch64 this would most likely match the expectation of developers, and small power-of-two sizes would have the same representation as standard types in registers. For example a _BitInt(8) would have the same representation as a char in registers.
  • For small values in memory, the AArch64 LDADD operations work naturally.

It has the following negatives:

  • Operations /,%,==,<,>,<=,>=,>> and widening conversions on operands coming from an ABI boundary would require masking the operands.
  • On AArch32 this could cause surprises to developers, given that on this architecture small Fundamental Data Types have their extra bits zero- or sign-extended. So a char would not have the same representation as a _BitInt(8) on this architecture.
  • If used in calls to variadic functions which were written for standard integral types, this can give surprising results.
  • The AArch64 LD{S,U}MAX operations would not work naturally on small values of this representation.

Option C has the following benefits:

  • Operations ==,<,<=,>=,>,>>, widening conversions, and loading/storing to memory would all naturally work.
  • On AArch32 this could match the expectation of developers, with a _BitInt(8) in a register matching the representation of a char.
  • If used in variadic function calls, mismatches between _BitInt types and standard integral types would not cause as much of a problem.
  • For small values in memory, the AArch64 LD{S,U}MAX operations work naturally.

It has the following negatives:

  • Operations +,-,*,<< would all cause the need for masking at an ABI boundary.
  • On AArch64 this would not match the expectation of developers, with _BitInt(8) not matching the representation of a char.
  • The AArch64 LDADD operations would not work naturally.

Summary, suggestion, and reasoning

Overall it seems that option A is more performant for operations on small values. However, when acting on "large" values (here defined as greater than the size of one register) it loses some of that benefit. Storing to and from memory would also come at a cost for this representation. This is also likely to be the most surprising representation for developers on an Arm platform.

Between option B and option C there is not a great difference in performance characteristics. However it should be noted that option C is the most natural extension of the AArch32 PCS rules for unspecified bits in a register containing a small Fundamental Data Type, while option B is the most natural extension of the similar rules in AArch64 PCS. Another distinction between the two is that option C would mean that accidental misuse of a bit-precise type instead of a standard integral type should not cause problems, while B could give strange values. This would be most visible with variadic functions.

As mentioned above, both performance concerns and a familiar representation are valuable in the use-cases that we have identified. This has made the decision non-obvious. We have chosen to favor representation familiarity.

Choosing between C and B is also non-obvious. It seems relatively clear to choose option C for AArch32. We choose option B for AArch64 to prefer that across most ABI boundaries a char and a _BitInt(8) have the same representation, but acknowledge that this could cause surprise to programmers when using variadic functions.