high throughput inference #663

Open
msaroufim opened this issue Aug 12, 2024 · 3 comments

Comments

msaroufim (Member) commented Aug 12, 2024

Was chatting with @Chillee about our plans in AO today, and he mentioned we should be focusing on a few concrete problems:

  1. Demonstrate compelling perf for fp8 gemm at a variety of batch sizes.
  2. Demonstrate compelling perf for weight only int8 gemm at a variety of batch sizes.
  3. Demonstrate compelling perf for weight only intX gemm at low batch sizes.
  4. Demonstrate compelling perf for weight intX, activation fp8 at a variety of batch sizes.

As a baseline, we could extend gpt-fast to work with bs=n without doing any KV cache management work and measure perf there; a rough sketch of such a batch-size sweep follows below. Copying the feedback as is, open to discussing more and adding more details as time progresses.

EDIT: gpt-fast already has a batched generation branch by Horace https://github.com/pytorch-labs/gpt-fast/tree/batched_generation
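A minimal sketch of the kind of batch-size sweep items 2 and the baseline suggestion imply, assuming torchao's `quantize_` / `int8_weight_only` API and timing a single linear layer rather than full gpt-fast generation (shapes and batch sizes here are illustrative, not from the issue):

```python
# Hypothetical bf16 vs weight-only int8 sweep over batch sizes on one nn.Linear.
# Assumes torchao's quantize_ / int8_weight_only entry points; shapes are illustrative.
import copy
import torch
from torch import nn
from torchao.quantization import quantize_, int8_weight_only

def bench_ms(mod, x, iters=50):
    # Warm up, then time with CUDA events; returns average ms per forward pass.
    for _ in range(5):
        mod(x)
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        mod(x)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters

K, N = 4096, 4096
base = nn.Linear(K, N, bias=False, dtype=torch.bfloat16, device="cuda")
quant = copy.deepcopy(base)
quantize_(quant, int8_weight_only())

for bs in (1, 8, 32, 128, 512):
    x = torch.randn(bs, K, dtype=torch.bfloat16, device="cuda")
    print(f"bs={bs:4d}  bf16={bench_ms(base, x):.3f} ms  int8-wo={bench_ms(quant, x):.3f} ms")
```

The same harness could be pointed at fp8 or intX configs to cover the other items, with end-to-end tokens/sec from gpt-fast's batched_generation branch as the more realistic measurement.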

msaroufim changed the title from "chilli feedback" to "high throughput inference" on Aug 12, 2024
msaroufim (Member, Author) commented:
@HDCharles on the int8 work
@vkuzo on fp8
@vayuda and @jerryzh168 on intx

jeromeku (Contributor) commented:
@msaroufim

Would be interesting to bench against something like QoQ, which implements W4A8KV4 (int8 GEMM) using a nested quantization scheme and neat kernel-level optimizations.

vkuzo (Contributor) commented Aug 13, 2024

Demonstrate compelling perf for fp8 gemm at a variety of batch sizes.

Note that I'm putting up a PR soon for a quick roofline estimator for float8 gemm + training-specific overhead, to see for which M, K, N float8 is faster than bfloat16; it would be easily extendable to inference at a later time.
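As a rough illustration of what such a roofline comparison computes (this is not the PR's code; the peak throughput and bandwidth numbers below are placeholder H100-class values):

```python
# Rough roofline estimate: gemm time ~= max(compute-bound time, memory-bound time).
# Peak TFLOPS / bandwidth are placeholders; a real estimator would parameterize per device
# and account for scaling/cast overhead, which is ignored here.

PEAK_BF16_TFLOPS = 989    # placeholder dense bf16 peak
PEAK_FP8_TFLOPS = 1979    # placeholder dense fp8 peak
PEAK_BW_GBPS = 3350       # placeholder HBM bandwidth

def gemm_time_s(M, K, N, bytes_per_elem, peak_tflops):
    flops = 2 * M * K * N
    mem_bytes = bytes_per_elem * (M * K + K * N + M * N)
    compute_s = flops / (peak_tflops * 1e12)
    memory_s = mem_bytes / (PEAK_BW_GBPS * 1e9)
    return max(compute_s, memory_s)

for M in (1, 16, 256, 4096):
    K = N = 8192
    bf16 = gemm_time_s(M, K, N, 2, PEAK_BF16_TFLOPS)
    fp8 = gemm_time_s(M, K, N, 1, PEAK_FP8_TFLOPS)
    print(f"M={M:5d}  estimated fp8 speedup over bf16 = {bf16 / fp8:.2f}x")
```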

Demonstrate compelling perf for weight intX, activation fp8 at a variety of batch sizes.

While this is technically possible, I'm not sure I understand the value; I would be interested to learn more.
