
Implement an ASM code generator / compiler #39

Closed
mratsim opened this issue Jun 7, 2020 · 0 comments · Fixed by #69
Labels
constant time ⏳ · performance 🏁

mratsim commented Jun 7, 2020

Unfortunately, for both performance and security reasons, it is important for generic cryptographic libraries to implement a code generator.

The most widely used code generators are:

Instead of complicating the build system, we can directly implement the code generator using Nim metaprogramming features.

This ensures that unused assembly is not compiled in.
This significantly simplifies the build system.

A proof-of-concept code generator for multiprecision addition is available here:

macro addCarryGen_u64(a, b: untyped, bits: static int): untyped =
  # Assembly for the first limb: a plain add, no carry in.
  var asmStmt = (block:
    "      movq %[b], %[tmp]\n" &
    "      addq %[tmp], %[a]\n"
  )

  let maxByteOffset = bits div 8
  const wsize = sizeof(uint64)

  when defined(gcc):
    # Subsequent limbs use adc to propagate the carry.
    for byteOffset in countup(wsize, maxByteOffset-1, wsize):
      asmStmt.add (block:
        "\n" &
        # movq 8+%[b], %[tmp]
        "      movq " & $byteOffset & "+%[b], %[tmp]\n" &
        # adcq %[tmp], 8+%[a]
        "      adcq %[tmp], " & $byteOffset & "+%[a]\n"
      )
  elif defined(clang):
    # Clang handles offsets into memory operands differently:
    # https://lists.llvm.org/pipermail/llvm-dev/2017-August/116202.html
    for byteOffset in countup(wsize, maxByteOffset-1, wsize):
      asmStmt.add (block:
        "\n" &
        # movq 8+%[b], %[tmp]
        "      movq " & $byteOffset & "%[b], %[tmp]\n" &
        # adcq %[tmp], 8+%[a]
        "      adcq %[tmp], " & $byteOffset & "%[a]\n"
      )

  # Output, input and clobber lists of the asm statement.
  let tmp = ident("tmp")
  asmStmt.add (block:
    ": [tmp] \"+r\" (`" & $tmp & "`), [a] \"+m\" (`" & $a & "->limbs[0]`)\n" &
    ": [b] \"m\"(`" & $b & "->limbs[0]`)\n" &
    ": \"cc\""
  )

  result = newStmtList()
  result.add quote do:
    var `tmp`{.noinit.}: uint64

  result.add nnkAsmStmt.newTree(
    newEmptyNode(),
    newLit asmStmt
  )

Security

General purpose compilers can and do rewrite code as long as any observable effect is preserved. Unfortunately, timing is not considered an observable effect, and as general purpose compilers and processor branch prediction both get smarter, compilers recognize and rewrite more and more initially branchless code into code with branches, potentially exposing secret data.

A typical example is conditional move, which is required to be constant-time any time secrets are involved (https://tools.ietf.org/html/draft-irtf-cfrg-hash-to-curve-08#section-4).
The paper "What you get is what you C: Controlling side effects in mainstream C compilers" (https://www.cl.cam.ac.uk/~rja14/Papers/whatyouc.pdf) exposes how compiler "improvements" are detrimental to cryptography.
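
For illustration, here is the kind of branchless selection such rewrites threaten; a minimal sketch (the function name is made up, not from the paper):

#include <stdint.h>

/* Branchless selection: `mask` is assumed to be all-ones or all-zeros.
   Nothing in the C standard prevents an optimizer from recognizing this
   pattern as a select and lowering it to a conditional branch on some
   targets, which would leak `mask` through timing. */
uint64_t select_u64(uint64_t mask, uint64_t a, uint64_t b) {
  return (mask & a) | (~mask & b);
}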

Another example is securely erasing secret data, which is often elided as an optimization.
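
A minimal sketch of the problem and of one common workaround (the GCC/Clang asm-barrier idiom; explicit_bzero or C11's memset_s are alternatives, and the function names here are made up):

#include <string.h>

/* Dead-store elimination: `key` is never read after the memset,
   so the compiler may delete the wipe, leaving the secret in memory. */
void handle_secret(void) {
  unsigned char key[32];
  /* ... derive and use key ... */
  memset(key, 0, sizeof key);  /* may be optimized away */
}

/* Workaround: an empty asm statement that claims to read the buffer,
   forcing the compiler to keep the preceding memset. */
void secure_wipe(void *p, size_t n) {
  memset(p, 0, n);
  __asm__ __volatile__("" : : "r"(p) : "memory");
}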

Those are not theoretical exploits: the article "When constant-time source may not save you" (https://research.kudelskisecurity.com/2017/01/16/when-constant-time-source-may-not-save-you/) explains an attack against Curve25519, a curve that was designed to be easy to implement in a constant-time manner.
The attack is due to an "optimization" in the MSVC compiler:

every code compiled in 32-bit with MSVC on 64-bit architectures will call llmul every time a 64-bit multiplication is executed.

Verification of Assembly

The generated assembly code needs special tooling for formal verification, different from the tooling used for the C code in #6.
Recently Microsoft Research introduced Vale:

Performance

Beyond security, compilers do not expose several primitives that are necessary for multiprecision arithmetic.

Add with carry, sub with borrow

The most egregious example is add with carry, which led the GMP team to implement everything in assembly, even though this is a basic need and almost all processors have an ADC instruction; some, like the 6502 from 30 years ago, only have ADC and no ADD.
See:


Some specific platforms might expose add with carry, for example x86, but even then the code generation might be extremely poor: https://gcc.godbolt.org/z/2h768y

#include <stdint.h>
#include <x86intrin.h>

void add256(uint64_t a[4], uint64_t b[4]){
  uint8_t carry = 0;
  for (int i = 0; i < 4; ++i)
    carry = _addcarry_u64(carry, a[i], b[i], &a[i]);
}

GCC

add256:
        movq    (%rsi), %rax
        addq    (%rdi), %rax
        setc    %dl
        movq    %rax, (%rdi)
        movq    8(%rdi), %rax
        addb    $-1, %dl
        adcq    8(%rsi), %rax
        setc    %dl
        movq    %rax, 8(%rdi)
        movq    16(%rdi), %rax
        addb    $-1, %dl
        adcq    16(%rsi), %rax
        setc    %dl
        movq    %rax, 16(%rdi)
        movq    24(%rsi), %rax
        addb    $-1, %dl
        adcq    %rax, 24(%rdi)
        ret

Clang

add256:
        movq    (%rsi), %rax
        addq    %rax, (%rdi)
        movq    8(%rsi), %rax
        adcq    %rax, 8(%rdi)
        movq    16(%rsi), %rax
        adcq    %rax, 16(%rdi)
        movq    24(%rsi), %rax
        adcq    %rax, 24(%rdi)
        retq

(Reported as fixed, but it seems it is not: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67317)

And there is no way to use ADC for ARM architectures with GCC.
Clang does offer __builtin_addcll, which may or may not work well by now, given that fixing add with carry for x86 took years. Alternatively, Clang introduced arbitrary-width integers a month ago, called ExtInt (http://blog.llvm.org/2020/04/the-new-clang-extint-feature-provides.html); it is unknown, however, whether the generated code is guaranteed to be constant-time.

See also: https://stackoverflow.com/questions/29029572/multi-word-addition-using-the-carry-flag/29212615
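
For reference, the portable fallback a generic library is left with reconstructs the carry from comparisons; a minimal sketch (the helper name is made up):

#include <stdint.h>

/* Portable full adder on 64-bit limbs: standard C exposes no
   add-with-carry primitive, so the carry is recovered by comparison.
   Compilers rarely fuse this back into a clean ADC chain. */
static uint64_t addcarry_u64_portable(uint64_t carry_in, uint64_t a,
                                      uint64_t b, uint64_t *sum) {
  uint64_t t = a + b;
  uint64_t carry = (uint64_t)(t < a);  /* carry out of a + b */
  t += carry_in;                       /* carry_in is 0 or 1 */
  carry += (uint64_t)(t < carry_in);   /* carry out of the increment */
  *sum = t;
  return carry;
}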

Conditional move

Unlike add with carry, which can be expressed but may lead to inefficient code, conditional move basically requires assembly (see the security section), as there is no builtin at all.
A constant-time conditional move based on xor-masking requires 4-5 instructions instead of just 1.
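
A minimal sketch of both approaches (x86-64, GCC/Clang inline-asm syntax; the function names are made up):

#include <stdint.h>

/* xor-masking conditional move: `ctl` must be exactly 0 or 1.
   Compiles to roughly neg/xor/and/xor, i.e. 4 instructions. */
static uint64_t ccopy_masked(uint64_t ctl, uint64_t x, uint64_t y) {
  uint64_t mask = (uint64_t)0 - ctl;  /* 0 or all-ones */
  return y ^ (mask & (x ^ y));        /* ctl ? x : y */
}

/* Inline asm forcing an actual CMOV instruction,
   so the compiler cannot reintroduce a branch. */
static uint64_t ccopy_cmov(uint64_t ctl, uint64_t x, uint64_t y) {
  __asm__("testq %[ctl], %[ctl]\n\t"
          "cmovne %[x], %[y]"         /* y = ctl ? x : y */
          : [y] "+r"(y)
          : [x] "r"(x), [ctl] "r"(ctl)
          : "cc");
  return y;
}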

MULX

On x86-64, the legacy multiplication instruction is bottlenecked by the fact that the result is always placed in the RAX and RDX registers, which means that multiplications cannot be interleaved to exploit instruction-level parallelism. MULX lifts this restriction: it reads one operand implicitly from RDX, writes the low and high halves to two freely chosen registers, and leaves the flags untouched.
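
A sketch using the BMI2 intrinsic _mulx_u64 (requires -mbmi2; the wrapper name is made up):

#include <stdint.h>
#include <immintrin.h>

/* MULX: 64x64 -> 128-bit multiply. One source is implicitly RDX, but
   the two destination registers are freely chosen and no flags are
   written, so several multiplications can be scheduled in parallel
   and interleaved with carry chains. */
static uint64_t mul_full(uint64_t a, uint64_t b, uint64_t *hi) {
  unsigned long long high;
  uint64_t lo = _mulx_u64(a, b, &high);
  *hi = high;
  return lo;
}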

ADCX/ADOX

On x86-64, the add-with-carry instruction is bottlenecked by the fact that there is a single carry flag, which means that additions with carry cannot be interleaved to exploit instruction-level parallelism.

Furthermore, GCC was not designed to track carry chains at all, and Clang can only track a single chain.

ADCX and ADOX enable two independent carry chains, using the carry flag and the overflow flag respectively.

https://gcc.gnu.org/legacy-ml/gcc-help/2017-08/msg00100.html

The compiler is not able to distinguish between OF and CF chains,
since both are represented as a different mode of a single flags
register. This is the limitation of the compiler.

https://bugs.llvm.org/show_bug.cgi?id=34249#c6

Resolving - the crash was fixed by rL310784 and we avoid ADOX/ADCX code generation now.

In MCL and Goff, combining MULX/ADCX and ADOX improves speed by about 20% on field multiplication.
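
A hedged sketch of the dual-chain pattern in GCC/Clang inline asm (x86-64 ADX, compile with -madx; not MCL's or Goff's actual code, and the final carry-outs are dropped for brevity):

#include <stdint.h>

/* a[0..1] += b[0..1] runs on the CF chain (ADCX) while
   c[0..1] += d[0..1] runs on the independent OF chain (ADOX). */
static void add2_dual(uint64_t a[2], const uint64_t b[2],
                      uint64_t c[2], const uint64_t d[2]) {
  uint64_t a0 = a[0], a1 = a[1], c0 = c[0], c1 = c[1];
  __asm__("xorl %%eax, %%eax\n\t"  /* zero rax, clear CF and OF */
          "adcx %[b0], %[a0]\n\t"  /* CF chain */
          "adox %[d0], %[c0]\n\t"  /* OF chain, runs in parallel */
          "adcx %[b1], %[a1]\n\t"
          "adox %[d1], %[c1]"
          : [a0] "+r"(a0), [a1] "+r"(a1), [c0] "+r"(c0), [c1] "+r"(c1)
          : [b0] "r"(b[0]), [b1] "r"(b[1]), [d0] "r"(d[0]), [d1] "r"(d[1])
          : "rax", "cc");
  a[0] = a0; a[1] = a1;
  c[0] = c0; c[1] = c1;
}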

See Intel papers:
