Highlights

TBE GPU

Added support for int64_t table indices and offsets in TBE inference
Improved TBE benchmark utilities with the introduction of the Embeddings Estimator and Generator (EEG)

TBE CPU

Added Fused8BitRowwiseQuantizedSBFloatToFloatOrHalf operator
Made FloatToFloat16 conversion 75x faster using SVE2 instructions
Added FP32 GEMM kernels

TBE SSD

Fixed OOM issues during initialization
Improved L1 and L2 flush

GenAI Ops

GenAI ops are now packaged separately in the FBGEMM GenAI package for easier build and installation
Various FP8 grouped GEMM optimizations
BF16I4 preshuffled grouped GEMM
BF16 stacked grouped GEMM
F8I4 grouped GEMM optimizations
Added nccl_alltoall function

ROCm

Added preliminary ROCm OSS build support for GenAI ops

Better Engineering

Added build support for CUDA 12.8
Introduced a set of utilities to harden CUDA kernel launches against a set of runtime errors

Software Requirements

FBGEMM_GPU v1.2.0 has been tested and is known to work on the following setups:

PyTorch: v2.7
CUDA: v11.8, 12.6, 12.8
Python: v3.9, 3.10, 3.11, 3.12, 3.13

It is recommended to prepare an isolated environment for installing and running FBGEMM_GPU (instructions here) and FBGEMM-GenAI (instructions here).
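The tested matrix above can be encoded as a quick pre-install sanity check. This is a minimal sketch and not part of FBGEMM: the helper name `check_environment` and the constants are hypothetical, and a version outside these sets is merely untested, not necessarily broken.

```python
import sys

# Versions copied from the tested setups listed above.
TESTED_PYTHON = {(3, 9), (3, 10), (3, 11), (3, 12), (3, 13)}
TESTED_CUDA = {"11.8", "12.6", "12.8"}
TESTED_PYTORCH = ["2", "7"]  # major, minor


def check_environment(python_version, torch_version=None, cuda_version=None):
    """Return a list of warnings; an empty list means everything matches.

    torch_version and cuda_version are optional strings such as
    "2.7.0+cu126" and "12.6". Pass None to skip a check (for example,
    cuda_version is None on CPU-only builds).
    """
    warnings = []
    if tuple(python_version[:2]) not in TESTED_PYTHON:
        warnings.append(f"Python {python_version[0]}.{python_version[1]} is untested")
    if torch_version is not None:
        # Drop any local build suffix ("+cu126") before comparing.
        major_minor = torch_version.split("+")[0].split(".")[:2]
        if major_minor != TESTED_PYTORCH:
            warnings.append(f"PyTorch {torch_version} is untested")
    if cuda_version is not None and cuda_version not in TESTED_CUDA:
        warnings.append(f"CUDA {cuda_version} is untested")
    return warnings


if __name__ == "__main__":
    # Only the interpreter version is checked here, so the script runs
    # even when PyTorch is not installed yet.
    for message in check_environment(sys.version_info):
        print(message)
```

When PyTorch is already installed, `torch.__version__` and `torch.version.cuda` can be passed as the second and third arguments.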

Availability

FBGEMM_GPU and FBGEMM GenAI can be fetched directly from PyPI:

# FBGEMM_GPU - CUDA (only the CUDA 12.6 variant is available)
pip install fbgemm-gpu==1.2.0

# FBGEMM_GPU - CPU
pip install fbgemm-gpu-cpu==1.2.0

# FBGEMM GenAI
pip install fbgemm-gpu-genai==1.2.0

Alternatively, the packages can be fetched from the PyTorch package index:

# FBGEMM_GPU - CUDA
pip install fbgemm-gpu==1.2.0 --index-url https://download.pytorch.org/whl/cu118/
pip install fbgemm-gpu==1.2.0 --index-url https://download.pytorch.org/whl/cu126/
pip install fbgemm-gpu==1.2.0 --index-url https://download.pytorch.org/whl/cu128/

# FBGEMM_GPU - CPU
pip install fbgemm-gpu==1.2.0 --index-url https://download.pytorch.org/whl/cpu

# FBGEMM GenAI 
pip install fbgemm-gpu==1.2.0 --index-url https://download.pytorch.org/whl/cpu
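For scripted installs, the FBGEMM_GPU index URLs listed above can be selected by variant. The sketch below is illustrative only: `pip_install_args` and `INDEX_URLS` are hypothetical names, the URLs are copied verbatim from the commands above, and the helper only builds the argument list without running pip.

```python
# Index URLs copied verbatim from the FBGEMM_GPU commands above.
INDEX_URLS = {
    "cu118": "https://download.pytorch.org/whl/cu118/",
    "cu126": "https://download.pytorch.org/whl/cu126/",
    "cu128": "https://download.pytorch.org/whl/cu128/",
    "cpu": "https://download.pytorch.org/whl/cpu",
}


def pip_install_args(variant, version="1.2.0"):
    """Build the pip argument list for one FBGEMM_GPU variant."""
    if variant not in INDEX_URLS:
        known = ", ".join(sorted(INDEX_URLS))
        raise ValueError(f"unknown variant {variant!r}; expected one of: {known}")
    return [
        "pip",
        "install",
        f"fbgemm-gpu=={version}",
        "--index-url",
        INDEX_URLS[variant],
    ]


if __name__ == "__main__":
    # Print the command instead of executing it.
    print(" ".join(pip_install_args("cu126")))
```

The returned list can be handed to `subprocess.run` unchanged, which avoids shell quoting issues.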

Changes

CPU

GEMM

[Improvement] Improve Fused8BitRowwiseQuantizedSBFloatToFloatOrHalfNeon by 5%-15% (#3860)
[New] Use enum to select floating point format in FbgemmEmbedding APIs (#3842)
[New] Add generic IEEE754 truncation code (#3820)
[New] Enable KleidiAI for FP32 (#3818)
[Improvement] Move float conversion functions from Types.h into new FloatConversion.h (#3760)
[Fix] Use kleidiAI on static builds (#3806)
[Fix] Fix KleidiAI FP16 (#3769)
[Improvement] Pull ARM's matrix transpose PR (#3660)
[New] Add NEON implementation of Fused8BitRowwiseQuantizedSBFloatToFloatOrHalf (#3707)
[Improvement] avoid extra copy in PackedGemmMatrixB constructor (#3691)
[Improvement] Remove FENV pragma (#3629)
[Improvement] Make FloatToFloat16 conversion 75x faster using SVE2 instructions (#3626)
[New] add a new constructor to PackedGemmMatrixB (#3598)
[New] Move FP32 kernels to OSS (#3568)
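Several entries above concern float-to-half conversion (for example #3626). To make the numeric effect concrete, the sketch below rounds a value to IEEE 754 binary16 using only Python's standard `struct` module. It does not call FBGEMM and says nothing about how the SVE2 or NEON kernels are implemented; the helper names are hypothetical.

```python
import struct


def to_float16_and_back(x):
    """Round x to IEEE 754 binary16 and return it as a Python float."""
    # The "e" format code is binary16; packing performs the rounding.
    return struct.unpack("<e", struct.pack("<e", x))[0]


def float16_bits(x):
    """Return the 16-bit pattern of x after conversion to binary16."""
    return struct.unpack("<H", struct.pack("<e", x))[0]


if __name__ == "__main__":
    for value in (1.0, 0.1, -2.0):
        print(value, to_float16_and_back(value), hex(float16_bits(value)))
```

Note that `struct.pack("<e", x)` raises `OverflowError` for finite values too large for binary16, so this helper is only suitable for illustrating in-range values.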

GenAI

GenAI Ops

[Improvement] Performance Optimization: Improved TileShape Configuration for Large Llama Shapes (#3790) (#3942)
[New] Add harness for comms benchmark (#3936)
[Improvement] Refactoring of NoPE (#3840)
[Improvement] support fp16 dtypes for input weight and bias (#3931)
[Fix] fix fp8 kv cache dequantize kernels (#3896)
[Improvement] scatter_add 0 size support (#3861)
[Improvement] Retuned CK GMM fp8/bf16 with perf fixes (#3851)
[Improvement] Enable groupwise scales for F8I4 Grouped Gemm (#3884)
[Fix] Fix empty input view. (#3880)
[New] FP8 Rowwise Dequant Kernel (#3873)
[New] torch.ops.fbgemm.gather_scale_dense_tokens for oss. (#3855)
[Improvement] Replace rms_norm as norm (#3841)
[Improvement] Move DeepGemm scale transpose to quantize (#3834)
[Improvement] follow up to reflect rowwise scale inputs for x, w in quantize_ops scripts (#3839)
[New] add rowwise scaling support (#3822)
[Improvement] update to tune for small ms and quantized gemv (#3712)
[New] Add Preshuffled FP8 x INT4 Grouped Gemm Kernel (#3800)
[New] FBGEMM Add Columnwise Weight Scaling to F8I4 GEMM (#3766)
[Improvement] update the sorting kernel for bf16 ck fmoe kernel (#3817)
[Fix] fix volatile synchronization with acquire/relax (#3728)
[Improvement] Force determinism by unswizzle (#3727)
[New] add fp8 kv nope (#3786)
[Improvement] move common op to vector utils (#3759)
[Improvement] Gather/Scatter. (#3743)
[Improvement] reduce scatter supports last dim (#3726)
[Improvement] Add custom reduce scatter to llama_comms (#3730)
[New] Adds shapes information to enable torch.compile. (#3724)
[Improvement] avoid propagation of NaN (#3723)
[New] torch.ops.fbgemm.scatter_add_along_first_dim. (#3720)
[New] torch.ops.fbgemm.gather_along_first_dim. (#3719)
[New] Paged Attention Support (#3698)
[New] custom reduce scatter (#3686)
[Fix] Recover custom collective test (#3687)
[Improvement] update sweep_utils.py to test more precision gemv kernel (#3678)
[New] add fp8fp8 fast_gemv_quantized (#3677)
[New] add mixed precision fp8 fast_gemv_quantized kernel (#3675)
[Improvement] adjust interface (#3669)
[Improvement] CK MoE: cherry-pick #1808 (#3609)
[Improvement] fix llm shapes in quantize bench and add ldm shapes (#3611)
[Improvement] Return if no data to allreduce (#3586)
[Improvement] llm decode shapes fp8 rowwise gemm tuning (#3565)
[Improvement] Make zero_start_index_M optional for dynamic BF16 Grouped Gemm (#3553)
[New] Add nccl_alltoall function (#3551)
[New] Add fused_moe kernel to ck_extension (#3518)

GEMM

[Improvement] Update cutlass version to 3.8V2 (#3772)
[Improvement] Update Cutlass to V3.8-2 (#3767)
[Improvement] fp8_gemm (non_persistent): adding optimal configs for 8k & 16k shapes (#3764)
[New] new tuning for fp8 rowwise (#3756)
[Improvement] Add DeepGEMM blockwise GEMM in quantize bench (#3746)
[Improvement] Enable DeepGEMM in quantize bench (#3745)
[Improvement] reduce overhead for f8f8bf16_rowwise_grouped_dynamic on amd (#3742)
[Improvement] Performance Optimization: Optimized TileShape Configuration for f8 (#3617) (#3735)
[Improvement] Performance Optimization: Optimized TileShape Configuration for bf16 and Mixed Formats (#3591) (#3710)
[Improvement] adding an option to skip zeroing output tensor for f8f8bf16_rowwise_grouped_dynamic (#3685)
[Improvement] Update CK (#3701)
[Fix] Fix CUDA kernel index data type in deeplearning/fbgemm/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/bf16bf16bf16_grouped.cu +10 (#3844)
[New] Make F8I4 grouped GEMM process M_sizes with INT32 (#3853)
[Improvement] Skip empty groups in FP8 Stacked Gemm (#3862)
[New] Enable preshuffled mixed dtype Cutlass Gemm (#3722)
[Improvement] [CUTLASS] Minor Cutlass change to fix CI (#3779)
[Improvement] Clean up cutlass FP8 Grouped Gemm Kernel Setup (#3864)
[New] Modernize bf16 cutlass grouped gemm (#3889)
[Improvement] [CUTLASS] Include new cutlass support for groupwise mixed dtype grouped gemm. (#3885)
[New] Add DEEPGEMM Masked API. (#3949)
[Improvement] Use Int64 Indexing in Grouped Gemm (#3930)
[Improvement] Add correctness testing for shuffled mixed dtype GEMMs. (#3932)
[New] BF16I4 Preshuffled Grouped Gemm (#3917)
[New] Preshuffled BF16I4 Gemm Kernel (#3913)
[New] Enable rowwise scaling for DeepGemm (#3874)
[New] bf16 stacked group gemm (#3888)
[New] F8I4 Grouped Gemm Optimization for Sparse M (#3854)

FP8

[Fix] FBGEMM fp8 ck GEMM fix for irregular GEMM shapes (#3894)
[Fix] fix stacked version fp8 rowwise group gemm registration in quantize_bench (#3902)
[Fix] A hotfix for FBGEMM fp8 rowwise with irregular gemm sizes (#3883)
[Improvement] Transpose FP8 GEMM inputs for better tuning (#3866)
[New] Enable FP8 Triton dequantized block-wise kernel (#3788)
[Improvement] Refactor stacked version of FP8 Grouped Gemm for reduced overhead (#3699)
[Improvement] changing config for fp8 gemm (#3668)
[Improvement] Add option to disable fast_accumulation for fp8 gemm. (#3714)
[New] Add cublas FP8 tensorwise GEMM in fbgemm quantize bench (#3693)
[Improvement] write_k_back for fp8 ROPE (#3679)
[Improvement] Moves utility functions into a standalone file. (#3671)
[Fix] Fix f8f8bf16_lite quantize op input in quantize_and_compute (#3667)
[Improvement] Optimize zero fill (#3666)
[Improvement] FP8 Grouped Gemm Optimization (#3655)
[New] Add sweep_utils.py script to tune heuristics (#3656)
[Improvement] Loosen unit test atol/rtol tolerances to eliminate unit test flakiness (#3664)
[New] Port oss f16_fast_gemv into fbcode (#3610)
[New] fp8 rowwise regular gemm tuning for llm new shapes (#3654)
[Improvement] k_norm in rope for fp8 kv cache (#3633)
[Improvement] Fix zero_start_index_M argument for triton rowwise quantize (#3639)
[Fix] Fix handling of dynamic FP8 grouped gemm on Nvidia (#3616)
[Improvement] Improve FP8 grouped GEMM perf via tileshape and cooperative (#3653)
[Improvement] Refactor FP8 grouped GEMM with dynamic and static versions (#3561)
[New] Support FP8 grouped GEMM with rowwise scaling (#3560)
[Fix] [CUTLASS] Use custom copy of cutlass to enable FP8 Grouped Gemm. (#3649)
[Fix] kv_dq zero initialization to avoid NaNs from FA3 (#3632)
[Improvement] amd fp8 rowwise batched gemm tuning (#3624)
[Improvement] Improve handling for FP8 grouped gemm without zero_start_index_M (#3615)
[New] amd fp8 rowwise gemm prefill shape tuning (#3607)
[New] Enable fast FP8 GEMM for memory bound (resubmit) (#3608)
[Improvement] Make zero_start_index_M optional for dynamic FP8 grouped gemm on AMD (#3604)
[Improvement] Enable fast FP8 GEMM for memory bound (#3577)
[Improvement] more fp8 tuning for decode and not need to pad (#3576)

Triton

[Improvement] Uses FastAccum=True by default for Triton GroupedGEMM. (#3919)
[Improvement] Handle 0 inputs for gmm (#3901)
[New] Triton GroupedGEMM. WS. (#3912)
[Improvement] No recompilation caused by varying sequence lengths. (#3903)
[Improvement] Enable bufferops for non-persistent fp8 rowwise GEMM (#3898)
[Improvement] Makes use_fast_accum configurable. (#3829)
[Fix] Fix triton group gemm for tp4 (#3762)
[Improvement] Reduce tuning. (#3754)
[Improvement] [fbgemm_gpu] Upgrade Triton to latest (#3736)
[New] GroupedGEMM for AMD. (#3729)
[Improvement] GroupedGEMM interface takes m_sizes instead of m_offsets. (#3696)
[Fix] Numerical Fix. (#3688)
[New] Adds Triton based GroupedGEMM implementation. (#3674)
[Improvement] Add optional zero_start_index_M argument to triton fp8 rowwise quantization (#3628)
[Improvement] Make the scale match the shape of quantized value with N-D tensors (#3396)
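One of the Triton changes above switches the GroupedGEMM interface from `m_offsets` to `m_sizes` (#3696). The sketch below shows how the two representations relate, assuming `m_offsets` holds the cumulative end row of each group. That convention and the helper names are illustrative assumptions, not taken from the FBGEMM source.

```python
from itertools import accumulate


def sizes_to_offsets(m_sizes):
    """Convert per-group row counts into cumulative end offsets."""
    return list(accumulate(m_sizes))


def offsets_to_sizes(m_offsets):
    """Convert cumulative end offsets back into per-group row counts."""
    # Each group's start is the previous group's end; the first starts at 0.
    starts = [0] + list(m_offsets[:-1])
    return [end - start for start, end in zip(starts, m_offsets)]


if __name__ == "__main__":
    m_sizes = [3, 0, 5]  # three groups; the middle one is empty
    m_offsets = sizes_to_offsets(m_sizes)
    print(m_offsets)
    print(offsets_to_sizes(m_offsets))
```

Sizes make empty groups explicit as zeros, whereas in the offset form an empty group appears as a repeated value.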

TBE

TBE GPU

[Improvement] Fix flaky TBE unit tests (#3938)
[Fix] Fix get_infos_metadata meta dispatch (#3946)
[Improvement] Change set_learning_rate_tensor (#3945)
[Improvement] Cleanups to StochasticRoundingRNGState (#3922)
[New] Unifying TBE API using List (Frontend) - reland (#3821)
[Improvement] Add tests for bounds_check_indices v2 (#3920)
[Improvement] Use bounds_check_indices v2 on ROCm (#3916)
[Fix] Partial revert D70855331 (#3925)
[Fix] Add a workaround for stochastic rounding for AMD GPUs (#3908)
[New] AdagradW (fbgemm frontend) (#3850)
[New] AdagradW (fbgemm backend) (#3827)
[Fix] Fix IMA in TBE grad indices kernel for int32 indices (#3877)
[Improvement] Use PackedAccessor64 for index_remappings in pruned_array_lookup (#3870)
[Improvement] Add overflow_safe_int_t for addressing the int overflow problem (#3875)
[Improvement] Add torch.jit.script to unit tests (#3869)
[Improvement] Replace LR access with wrapper (#3849)
[Fix] Fix CUDA kernel index data type in deeplearning/fbgemm/fbgemm_gpu/bench/verify_fp16_stochastic_benchmark.cu +10 (#3845)
[Improvement] Allow FBGEMM_TBE_BOUNDS_CHECK_MODE to take effect when using mode 4,5,6 (#3838)
[Improvement] replace device param with bounds_check_warning of inputs_to_device function (#3831)
[Improvement] Packed bag parameters tuning (#3805)
[Improvement] Symintify max_B and max_D (#3807)
[Improvement] make lazy init tunable (#3811)
[Improvement] Log feature gate statuses in TBE init (#3792)
[Fix] Backout Unifying TBE API using List (Frontend) (#3803)
[Improvement] Migrate TBE benchmark utilities over to TBE, pt 5b (#3802)
[Improvement] Migrate TBE benchmark utilities over to TBE, pt 4 (#3794)
[Fix] fix bounds check v2 mode with vbe input (#3758)
[Improvement] Migrate TBE benchmark utilities over to TBE, pt 3 (#3785)
[Fix] Fix prev_iter (#3784)
[Improvement] Migrate TBE benchmark utilities over to TBE, pt 2 (#3783)
[Improvement] Enable int32_t support for reshape_vbe_offsets (#3782)
[Improvement] Migrate TBE benchmark utilities over to TBE (#3781)
[New] Unifying TBE API using List (Frontend) (#3711)
[Improvement] Implement inference bag packing along D (#3541) (#3771)
[Improvement] Add test for VBE CPU (#3778)
[Improvement] Change the TBE bounds check to match the TBE implementation. (#3773)
[Improvement] Implement generate_vbe_metadata cpu (#3715)
[New] Compute info_B_num_bits from T to make it a constant (#3748)
[Improvement] Move execute_backward_adagrad into a class (#3744)
[Improvement] Add more helper methods for TBE benchmarking (#3747)
[Improvement] Add barrier to test regression hypothesis (#3741)
[Improvement] annotate tensors in schema for PT2 interface (#3738)
[Fix] Create cpu iterator irrespective of optimizer choice (#3689)
[Fix] Fix the TBE cache_precision to fp32 when on ROCm (#3672)
[New] Unifying TBE API using List (Backend) (#3563)
[New] Updating split_table_batched_embeddings_ops_training.py (#3613)
[New] Support INT4 Dequant onto GPU for Seq INT TBE look up (#3584)
[Fix] Fix calling numel on symbolic shapes issue (#3621)
[Fix] fix pre_iter fp32 inaccuracy issue (#3623)
[Improvement] directly pass update_util as int flag without syncing iter (#3602)
[New] Basic TBE input dump framework (#3593)
[New] Add support for int32_t indices in TBE training (2K/N) (#3583)
[New] Add support for int32_t indices in TBE training (2H/N) (#3539)
[Misc] Enable v2 forward test for ROCm (#3573)
[Fix] Fix bug in ROCm optimized forward pass (#3599)
[New] Add support for int32_t indices in TBE training (2I/N) (#3556)
[New] Add support for int32_t indices in TBE training (2F/N) (#3376)
[Fix] Back out "Optimized backward pass for ROCm devices (pt 2)" (#3587)
[Fix] Revert D65620886 (#3582)
[New] Add support for int32_t indices in TBE training (2G/N) (#3377)
[New] Add new optimizer state row_counter for Adam [Frontend] (#3558)
[New] Add support for int32_t indices in TBE training (2E/N) (#3375)
[Improvement] Do not call scalar_type (#3394)
[Fix] Remove torch.jit.script (#3562)
[New] Add support for int32_t indices in TBE training (2D/N) (#3374)
[New] Add support for int32_t indices in TBE training (3/N) (#3372)
[New] Add support for int32_t indices in TBE training (2B/N) (#3371)
[Improvement] Optimized backward pass for ROCm devices (pt 2) (#3511)

TBE SSD

[Improvement] Reduce bulk init time and fix OOM (#3828)
[Improvement] Engage KVTensor aware checkpoint load paths. (#3718)
[Fix] Uncomment accidentally commented-out unit test (#3716)
[Improvement] sync wait before L1 and L2 flush (#3709)
[Improvement] return right away if keys is empty (#3658)
[Improvement] Add a couple more APIs to KVTensorWrapper to bring parity with torch::Tensor (#3645)
[Improvement] Move embedding_rocksdb_wrapper to its own header. (#3622)
[Fix] Fix autodeps for torch/custom_class.h and use it in kv_tensor_wrapper_cpu (#3600)
[Improvement] put KVTensorWrapper in its own header (#3575)

Other Ops

Inplace Ops

[Fix] Fix CUDA kernel index data type in deeplearning/fbgemm/fbgemm_gpu/src/embedding_inplace_ops/embedding_inplace_update.cu +10 (#3846)

Permute Ops

[New] support permute_multi_embedding_function on torch.export (#3897)
[Improvement] do not call permute on empty tensor (#3705)

Quantize Ops

[New] implement packed quantize row / dequantize row API (#3915)
[New] Add small M support (#3682)
[Improvement] Provide helper functions for int4 quantization (#3775)
[Improvement] test fp8fp8bf16/bf16fp8bf16_fast_gemv is torch compileable (#3809)
[Improvement] Eliminate MemCpyDtoH overhead for quantized fast_gemv kernel (#3725)
[Improvement] Add abstract impl for Fused8BitRowwiseQuantizedToFloatOrHalf et al. (#715) (#3640)
[New] Fold ops registration code (#3634)
[Fix] Fix type to assert group size (#3601)

Sparse Ops

[Improvement] nested dispatching of segment_csr on cpu/gpu (#3881)
[Improvement] Fix faketensor error when dev_weights is undefined. (#3755)
[Improvement] Set default value to null (#3732)
[New] Support histogram_binning_calibration for export (#3657)
[Fix] fix data type of block_bucketize_pos in block_bucketize_sparse_features (#3589)
[Fix] Fix specialization issue in keyed_jagged_index_select_dim1_forward_cuda (#3578)

SLL Ops

[Improvement] Re-organize SLL ops, pt 9 (#3665)
[Improvement] Re-organize SLL ops, pt 8 (#3663)
[Improvement] Re-organize SLL ops, pt 7 (#3650)
[Improvement] Re-organize SLL ops, pt 6 (#3647)
[Improvement] Re-organize SLL ops, pt 5 (#3646)
[Improvement] Re-organize SLL ops, pt 4 (#3644)
[Improvement] Re-organize SLL ops, pt 3 (#3652)
[Improvement] Re-organize SLL ops, pt 2 (#3643)
[Improvement] Re-organize SLL ops, pt 1 (#3642)
[Improvement] Fold ops registration code, pt 3 (#3641)
[Improvement] Fold ops registration code, pt 2 (#3635)

Benchmarks

[Fix] [fbgemm_gpu] Fix CPU benchmark scripts (#3941)
[New] Enable multi-processing in CPU TBE micro-benchmarks (#3753)
[Improvement] Improve VBE benchmark (#3867)
[Improvement] Clean up stochastic rounding benchmarks (#3876)
[Fix] Fix EEG indices estimator op (#3852)
[New] Expose EEG indices estimation to Python (#3836)
[Improvement] Support MTIA for device and device-with-spec (#3832)
[Improvement] fix benchmark logging after script reorganization (#3819)
[Improvement] Cleanups for the EEG-based TBE benchmark CLI, pt 2 (#3815)
[New] [fbgemm_gpu] Add benchmark workflows (#3713)
[Improvement] Add cache-precision arg to TBE device_with_spec bench (#3791)
[New] Migrate EEG-based TBE benchmark code to OSS (#3780)
[New] Migrate TBE EEG Python code to OSS (#3774)
[New] Migrate TBE EEG C++ code to OSS (#3765)
[New] Add TBEDataConfig to Python side (#3739)
[Improvement] Add Quantize benchmark (#3706)
[Improvement] Adding iterations to benchmark script (#3708)
[Improvement] Small modifications to quantize_bench script (#3684)
[Improvement] Add option to set cache precision in TBE benchmark (#3659)
[Improvement] Add tracing option to quantize bench (#3651)
[Improvement] Add preprocess stage to quantize bench operators (#3648)
[Improvement] Pre-convert indices/offsets in TBE bench (#3595)
[Improvement] Allow reusing input data in TBE benchmark (#3594)
[Improvement] Profile with kineto and warmup for more accurate benchmarking (#3580) (#3585)

Better Engineering

Builds

[Improvement] [fbgemm_gpu] Reduce OSS build sizes for non-GenAI FBGEMM_GPU (#3948)
[New] [fbgemm_gpu] Add Scripts for Generating Release Reports (#3676)
[New] [AMD] Add CK to dependencies to enable AMD build. (#3929)
[Fix] [fbgemm_gpu] Fix Nova package labeling for GenAI (#3933)
[Fix] [fbgemm_gpu] Update Nova jobs (#3890)
[Improvement] update hipify_torch (#3918)
[Improvement] [fbgemm_gpu] Fix setup scripts for OSS ROCm (#3909)
[Fix] [fbgemm_gpu] Fix undefined symbol error (#3900)
[Improvement] Add FB python sources into genai CMakeLists.txt (#3886)
[Improvement] [fbgemm_gpu] Update CMakeLists.txt for experimental/genai (#3872)
[Improvement] [fbgemm_gpu] Increase timeout for Nova jobs (#3871)
[New] [fbgemm_gpu] Update Nova CI configuration to support B200 (#3868)
[Fix] [fbgemm_gpu] Fix bash line to work with macOS builds (#3863)
[Improvement] Add option to set build parallelism in OSS workflows (#3859)
[New] [fbgemm_gpu] Support newer CUDA architectures in OSS (#3848)
[Improvement] [fbgemm_gpu] Increase CUDA test timeout (#3810)
[Fix] [fbgemm] Fix compilation issues with GCC 14.1 (#3804)
[Improvement] [fbgemm_gpu] Limit the number of ROCm hardware targets (#3797)
[Improvement] Fix clang vla warning (#2736)
[Fix] [fbgemm_gpu] Fix PT2 wrapper registrations (#3721)
[Fix] [fbgemm_gpu] Fix python docs not being visible (#3717)
[New] [fbgemm_gpu] Nova job update to support building against CUDA 12.8 (#3704)
[New] [fbgemm_gpu] Add CUDA 12.8 build support (#3700)
[Improvement] [fbgemm_gpu] Test genai op registration (#3692)
[Improvement] [fbgemm_gpu] Break down fbgemm_gpu_tbe_training_backward module further, pt 3 (#3694)
[Improvement] [fbgemm_gpu] Save built docs as GHA artifact (#3695)
[Improvement] Rename sources to avoid internal build issue (#3697)
[Improvement] [fbgemm_gpu] Break down CMake module further, pt 2 (#3681)
[Fix] Fix linting CI error introduced in D69213404 (#3683)
[Improvement] [fbgemm_gpu] Break down CMake module further (#3673)
[New] [fbgemm_gpu] GitHub PR scraper (#3661)
[New] [fbgemm_gpu] Add macro support for NVCC and HIPCC specific flags (#3636)
[Improvement] add AMD specific includes in cuda_prelude.h (#3614)
[Fix] [fbgemm_gpu] Fix CMakeLists.txt for experimental/gemm (#3592)
[Improvement] [fbgemm_gpu] Upgrade GitHub actions (#3581)
[Improvement] add patchelf as a required package in fbgemm_gpu/requirements.txt (#3574)
[New] [fbgemm_gpu] Add build support for AMD MI300 (#3566)
[Improvement] [fbgemm_gpu] Expand test timeout for ROCm pip install workflow (#3557)
[Improvement] [fbgemm_gpu] Update triton version for OSS (#3555)
[Fix] [fbgemm_gpu] Fix versioning scheme for ROCm releases (#3554)
[Misc] [fbgemm_gpu] Properly disable TBE SSD tests in OSS (#3548)

Documentation

[New] [fbgemm_gpu] Add docs for GenAI package (#3905)
[Improvement] [fbgemm_gpu] Update docs (#3891)
[Misc] Add comments to TBE inference PackedMode (#3789)
[Misc] [fbgemm_gpu] Update ROCm and CUDA versions in docs (#3569)
[New] [fbgemm_gpu] Add documentation for Feature Gates (#3740)
[Improvement] Remove erroneous comment (#3733)
[Fix] [fbgemm_gpu] Minor doc fix (#3618)

Utils

[New] Better kernel launch utilities (#3914)
[Improvement] Add ability to save and load data into HostDeviceBufferPair (#3899)
[New] Add abstractions for writing out data (flesh out D71147675, pt 1) (#3856)
[New] Add feature gate for HIP-based backward kernel (#3835)
[Misc] Optionally use env vars for config lookup (#3795)
[Fix] Updates and fixes to tensor_accessor.h (#3571)

FBGEMM v1.2.0 Release Notes
