Highlights
TBE GPU
- Added support for int64_t table indices and offsets in TBE inference
- Improved TBE benchmark utilities with the introduction of the Embeddings Estimator and Generator (EEG)
TBE CPU
- Added Fused8BitRowwiseQuantizedSBFloatToFloatOrHalf operator
- Make FloatToFloat16 conversion 75x faster using SVE2 instructions
- Added FP32 GEMM kernels
TBE SSD
- Fix OOM issues during init
- Improvements to L1 and L2 flush
Gen AI Ops
- GenAI ops are now separately packaged into FBGEMM GenAI package for easier build and installation
- Various FP8 grouped GEMM optimizations
- BF16I4 preshuffled grouped GEMM
- BF16 stacked grouped GEMM
- F8I4 grouped GEMM optimizations
- Added nccl_alltoall function
ROCm
- Added preliminary ROCm OSS build support for GenAI ops
Better Engineering
- Added build support for CUDA 12.8
- Introduced a set of utilities to harden CUDA kernel launches against a set of runtime errors
Software Requirements
FBGEMM_GPU v1.2.0 has been tested and is known to work on the following setups:
- PyTorch: v2.7
- CUDA: v11.8, 12.6, 12.8
- Python: v3.9, 3.10, 3.11, 3.12, 3.13
It is recommended to prepare an isolated environment for installing and running FBGEMM_GPU (instructions here) and FBGEMM-GenAI (instructions here).
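As a rough illustration of the support matrix above, the following sketch checks a set of version strings against the tested combinations. The function and constant names are hypothetical and are not part of FBGEMM_GPU; only the version numbers come from these notes.

```python
from typing import Optional

# Tested matrix transcribed from the release notes above.
TESTED_PYTORCH = "2.7"
TESTED_CUDA = ("11.8", "12.6", "12.8")
TESTED_PYTHON = ("3.9", "3.10", "3.11", "3.12", "3.13")


def major_minor(version: str) -> str:
    """Reduce a version string such as '2.7.0+cu126' to 'major.minor'."""
    core = version.split("+", 1)[0]  # drop a local suffix like '+cu126'
    return ".".join(core.split(".")[:2])


def is_tested_setup(python: str, pytorch: str, cuda: Optional[str] = None) -> bool:
    """Return True if the versions fall inside the tested matrix.

    Passing cuda=None stands for a CPU-only installation (an assumption
    of this sketch, not a rule stated in the release notes).
    """
    if major_minor(python) not in TESTED_PYTHON:
        return False
    if major_minor(pytorch) != TESTED_PYTORCH:
        return False
    return cuda is None or major_minor(cuda) in TESTED_CUDA
```

In a live environment the three inputs could be taken from platform.python_version(), torch.__version__, and torch.version.cuda. A False result only means the combination is outside the tested set, not that it cannot work.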
Availability
FBGEMM_GPU and FBGEMM GenAI can be fetched directly from PyPI:
# FBGEMM_GPU - CUDA (only the CUDA 12.6 variant is available)
pip install fbgemm-gpu==1.2.0
# FBGEMM_GPU - CPU
pip install fbgemm-gpu-cpu==1.2.0
# FBGEMM GenAI
pip install fbgemm-gpu-genai==1.2.0
Alternatively, it can be fetched from PyTorch PIP:
# FBGEMM_GPU - CUDA
pip install fbgemm-gpu==1.2.0 --index-url https://download.pytorch.org/whl/cu118/
pip install fbgemm-gpu==1.2.0 --index-url https://download.pytorch.org/whl/cu126/
pip install fbgemm-gpu==1.2.0 --index-url https://download.pytorch.org/whl/cu128/
# FBGEMM_GPU - CPU
pip install fbgemm-gpu==1.2.0 --index-url https://download.pytorch.org/whl/cpu
# FBGEMM GenAI
pip install fbgemm-gpu==1.2.0 --index-url https://download.pytorch.org/whl/cpu
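For scripted installs, the PyTorch PIP commands for FBGEMM_GPU above follow one pattern: the package pin plus an index URL that ends in the build variant. A minimal sketch of a helper that assembles the argument list is shown below; the helper name is hypothetical, and appending a trailing slash to every variant is an assumption of the sketch.

```python
# Base index URL and variants as they appear in the commands above.
PYTORCH_INDEX = "https://download.pytorch.org/whl"
VARIANTS = ("cu118", "cu126", "cu128", "cpu")


def pip_install_args(version: str, variant: str) -> list:
    """Build the argument list for installing fbgemm-gpu from PyTorch PIP."""
    if variant not in VARIANTS:
        raise ValueError(f"unknown variant {variant!r}; expected one of {VARIANTS}")
    return [
        "pip",
        "install",
        f"fbgemm-gpu=={version}",
        "--index-url",
        f"{PYTORCH_INDEX}/{variant}/",
    ]
```

The returned list can be passed to subprocess.run, which avoids shell quoting issues compared with building a single command string.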
Changes
CPU
GEMM
- [Improvement] Improve Fused8BitRowwiseQuantizedSBFloatToFloatOrHalfNeon by 5%-15% (#3860)
- [New] Use enum to select floating point format in FbgemmEmbedding APIs (#3842)
- [New] Add generic IEEE754 truncation code (#3820)
- [New] Enable KleidiAI for FP32 (#3818)
- [Improvement] Move float conversion functions from Types.h into new FloatConversion.h (#3760)
- [Fix] Use kleidiAI on static builds (#3806)
- [Fix] Fix KleidiAI FP16 (#3769)
- [Improvement] Pull ARM's matrix transpose PR (#3660)
- [New] Add NEON implementation of Fused8BitRowwiseQuantizedSBFloatToFloatOrHalf (#3707)
- [Improvement] avoid extra copy in PackedGemmMatrixB constructor (#3691)
- [Improvement] Remove FENV pragma (#3629)
- [Improvement] Make FloatToFloat16 conversion 75x faster using SVE2 instructions (#3626)
- [New] add a new constructor to PackedGemmMatrixB (#3598)
- [New] Move FP32 kernels to OSS (#3568)
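Several entries above concern floating point format conversion, including the generic IEEE754 truncation code. The sketch below shows the general idea of truncation on one concrete pair of formats, FP32 to a 16-bit pattern with the bfloat16 layout, by keeping the top 16 bits. It is an illustration only, not FBGEMM's implementation; the choice of bfloat16 and all function names are assumptions.

```python
import struct


def float32_bits(x: float) -> int:
    """Return the IEEE754 binary32 bit pattern of x as an unsigned int."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def bits_to_float32(bits: int) -> float:
    """Interpret a 32-bit pattern as an IEEE754 binary32 value."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def truncate_to_bfloat16(x: float) -> int:
    """Keep the sign, the 8 exponent bits, and the top 7 mantissa bits.

    Dropping the low 16 bits rounds toward zero; no round-to-nearest
    step is applied in this sketch.
    """
    return float32_bits(x) >> 16


def bfloat16_to_float32(bits16: int) -> float:
    """Widen a bfloat16 pattern back to FP32 by zero-filling the low bits."""
    return bits_to_float32((bits16 & 0xFFFF) << 16)
```

Because sign and exponent are untouched, truncation preserves the dynamic range and loses only mantissa precision. For example, 3.14159 comes back as 3.140625 after a round trip.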
GenAI
GenAI Ops
- [Improvement] Performance Optimization: Improved TileShape Configuration for Large Llama Shapes (#3790) (#3942)
- [New] Add harness for comms benchmark (#3936)
- [Improvement] Refactoring of NoPE (#3840)
- [Improvement] support fp16 dtypes for input weight and bias (#3931)
- [Fix] fix fp8 kv cache dequantize kernels (#3896)
- [Improvement] scatter_add 0 size support (#3861)
- [Improvement] Retuned CK GMM fp8/bf16 with perf fixes (#3851)
- [Improvement] Enable groupwise scales for F8I4 Grouped Gemm (#3884)
- [Fix] Fix empty input view. (#3880)
- [New] FP8 Rowwise Dequant Kernel (#3873)
- [New] torch.ops.fbgemm.gather_scale_dense_tokens for oss. (#3855)
- [Improvement] Replace rms_norm as norm (#3841)
- [Improvement] Move DeepGemm scale transpose to quantize (#3834)
- [Improvement] follow up to reflect rowwise scale inputs for x, w in quantize_ops scripts (#3839)
- [New] add rowwise scaling support (#3822)
- [Improvement] update to tune for small m values and quantized gemv (#3712)
- [New] Add Preshuffled FP8 x INT4 Grouped Gemm Kernel (#3800)
- [New] FBGEMM Add Columnwise Weight Scaling to F8I4 GEMM (#3766)
- [Improvement] update the sorting kernel for bf16 ck fmoe kernel (#3817)
- [Fix] fix volatile synchronization with acquire/relax (#3728)
- [Improvement] Force determinism by unswizzle (#3727)
- [New] add fp8 kv nope (#3786)
- [Improvement] move common op to vector utils (#3759)
- [Improvement] Gather/Scatter. (#3743)
- [Improvement] reduce scatter supports last dim (#3726)
- [Improvement] Add custom reduce scatter to llama_comms (#3730)
- [New] Adds shapes information to enable torch.compile. (#3724)
- [Improvement] avoid propagation of NaN (#3723)
- [New] torch.ops.fbgemm.scatter_add_along_first_dim. (#3720)
- [New] torch.ops.fbgemm.gather_along_first_dim. (#3719)
- [New] Paged Attention Support (#3698)
- [New] custom reduce scatter (#3686)
- [Fix] Recover custom collective test (#3687)
- [Improvement] update sweep_utils.py to test more precision gemv kernel (#3678)
- [New] add fp8fp8 fast_gemv_quantized (#3677)
- [New] add mixed precision fp8 fast_gemv_quantized kernel (#3675)
- [Improvement] adjust interface (#3669)
- [Improvement] CK MoE: cherry-pick #1808 (#3609)
- [Improvement] fix llm shapes in quantize bench and add ldm shapes (#3611)
- [Improvement] Return if no data to allreduce (#3586)
- [Improvement] llm decode shapes fp8 rowwise gemm tuning (#3565)
- [Improvement] Make zero_start_index_M optional for dynamic BF16 Grouped Gemm (#3553)
- [New] Add nccl_alltoall function (#3551)
- [New] Add fused_moe kernel to ck_extension (#3518)
GEMM
- [Improvement] Update cutlass version to 3.8V2 (#3772)
- [Improvement] Update Cutlass to V3.8-2 (#3767)
- [Improvement] fp8_gemm (non_persistent): adding optimal configs for 8k & 16k shapes (#3764)
- [New] new tuning for fp8 rowwise (#3756)
- [Improvement] Add DeepGEMM blockwise GEMM in quantize bench (#3746)
- [Improvement] Enable DeepGEMM in quantize bench (#3745)
- [Improvement] reduce overhead for f8f8bf16_rowwise_grouped_dynamic on amd (#3742)
- [Improvement] Performance Optimization: Optimized TileShape Configuration for f8 (#3617) (#3735)
- [Improvement] Performance Optimization: Optimized TileShape Configuration for bf16 and Mixed Formats (#3591) (#3710)
- [Improvement] adding an option to skip zeroing output tensor for f8f8bf16_rowwise_grouped_dynamic (#3685)
- [Improvement] Update CK (#3701)
- [Fix] Fix CUDA kernel index data type in deeplearning/fbgemm/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/bf16bf16bf16_grouped.cu +10 (#3844)
- [New] Make F8I4 grouped GEMM process M_sizes with INT32 (#3853)
- [Improvement] Skip empty groups in FP8 Stacked Gemm (#3862)
- [New] Enable preshuffled mixed dtype Cutlass Gemm (#3722)
- [Improvement] [CUTLASS] Minor Cutlass change to fix CI (#3779)
- [Improvement] Clean up cutlass FP8 Grouped Gemm Kernel Setup (#3864)
- [New] Modernize bf16 cutlass grouped gemm (#3889)
- [Improvement] [CUTLASS] Include new cutlass support for groupwise mixed dtype grouped gemm. (#3885)
- [New] Add DEEPGEMM Masked API. (#3949)
- [Improvement] Use Int64 Indexing in Grouped Gemm (#3930)
- [Improvement] Add correctness testing for shuffled mixed dtype GEMMs. (#3932)
- [New] BF16I4 Preshuffled Grouped Gemm (#3917)
- [New] Preshuffled BF16I4 Gemm Kernel (#3913)
- [New] Enable rowwise scaling for DeepGemm (#3874)
- [New] bf16 stacked group gemm (#3888)
- [New] F8I4 Grouped Gemm Optimization for Sparse M (#3854)
FP8
- [Fix] FBGEMM fp8 ck GEMM fix for irregular GEMM shapes (#3894)
- [Fix] fix stacked version fp8 rowwise group gemm registration in quantize_bench (#3902)
- [Fix] A hotfix for FBGEMM fp8 rowwise with irregular gemm sizes (#3883)
- [Improvement] Transpose FP8 GEMM inputs for better tuning (#3866)
- [New] Enable FP8 Triton dequantized block-wise kernel (#3788)
- [Improvement] Refactor stacked version of FP8 Grouped Gemm for reduced overhead (#3699)
- [Improvement] changing config for fp8 gemm (#3668)
- [Improvement] Add option to disable fast_accumulation for fp8 gemm. (#3714)
- [New] Add cublas FP8 tensorwise GEMM in fbgemm quantize bench (#3693)
- [Improvement] write_k_back for fp8 ROPE (#3679)
- [Improvement] Moves utility functions into a standalone file. (#3671)
- [Fix] Fix f8f8bf16_lite quantize op input in quantize_and_compute (#3667)
- [Improvement] Optimize zero fill (#3666)
- [Improvement] FP8 Grouped Gemm Optimization (#3655)
- [New] Add sweep_utils.py script to tune heuristics (#3656)
- [Improvement] Loosen unit test atol and rtol tolerances to eliminate unit test flakiness (#3664)
- [New] Port oss f16_fast_gemv into fbcode (#3610)
- [New] fp8 rowwise regular gemm tuning for llm new shapes (#3654)
- [Improvement] k_norm in rope for fp8 kv cache (#3633)
- [Improvement] Fix zero_start_index_M argument for triton rowwise quantize (#3639)
- [Fix] Fix handling of dynamic FP8 grouped gemm on Nvidia (#3616)
- [Improvement] Improve FP8 grouped GEMM perf via tileshape and cooperative (#3653)
- [Improvement] Refactor FP8 grouped GEMM with dynamic and static versions (#3561)
- [New] Support FP8 grouped GEMM with rowwise scaling (#3560)
- [Fix] [CUTLASS] Use custom copy of cutlass to enable FP8 Grouped Gemm. (#3649)
- [Fix] kv_dq zero initialization to avoid NaNs from FA3 (#3632)
- [Improvement] amd fp8 rowwise batched gemm tuning (#3624)
- [Improvement] Improve handling for FP8 grouped gemm without zero_start_index_M (#3615)
- [New] amd fp8 rowwise gemm prefill shape tuning (#3607)
- [New] Enable fast FP8 GEMM for memory bound (resubmit) (#3608)
- [Improvement] Make zero_start_index_M optional for dynamic FP8 grouped gemm on AMD (#3604)
- [Improvement] Enable fast FP8 GEMM for memory bound (#3577)
- [Improvement] more fp8 tuning for decode and not need to pad (#3576)
Triton
- [Improvement] Uses FastAccum=True by default for Triton GroupedGEMM. (#3919)
- [Improvement] Handle 0 inputs for gmm (#3901)
- [New] Triton GroupedGEMM. WS. (#3912)
- [Improvement] No recompilation caused by varying sequence lengths. (#3903)
- [Improvement] Enable bufferops for non-persistent fp8 rowwise GEMM (#3898)
- [Improvement] Makes use_fast_accum configurable. (#3829)
- [Fix] Fix triton group gemm for tp4 (#3762)
- [Improvement] Reduce tuning. (#3754)
- [Improvement] [fbgemm_gpu] Upgrade Triton to latest (#3736)
- [New] GroupedGEMM for AMD. (#3729)
- [Improvement] GroupedGEMM interface takes m_sizes instead of m_offsets. (#3696)
- [Fix] Numerical Fix. (#3688)
- [New] Adds Triton based GroupedGEMM implementation. (#3674)
- [Improvement] Add optional zero_start_index_M argument to triton fp8 rowwise quantization (#3628)
- [Improvement] Make the scale match the shape of quantized value with N-D tensors (#3396)
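One entry above notes that the Triton GroupedGEMM interface takes m_sizes instead of m_offsets. The two describe the same partition of the stacked M dimension, which the sketch below makes concrete. It assumes offsets are a running sum with a leading 0 (so G groups give G + 1 offsets); the actual convention used by the kernel is not stated in these notes, and the helper names are hypothetical.

```python
from itertools import accumulate


def sizes_to_offsets(m_sizes):
    """Convert per-group row counts to boundaries: [0, s0, s0 + s1, ...]."""
    return [0] + list(accumulate(m_sizes))


def offsets_to_sizes(m_offsets):
    """Recover per-group row counts from consecutive boundaries."""
    return [end - start for start, end in zip(m_offsets, m_offsets[1:])]


def split_rows(rows, m_sizes):
    """Slice a stacked list of rows into one chunk per group."""
    offsets = sizes_to_offsets(m_sizes)
    if offsets[-1] != len(rows):
        raise ValueError("m_sizes must sum to the number of rows")
    return [rows[start:end] for start, end in zip(offsets, offsets[1:])]
```

Passing sizes lets a caller express empty groups directly as a 0 entry, and the offsets can still be derived with a single prefix sum when a kernel needs them.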
TBE
TBE GPU
- [Improvement] Fix flaky TBE unit tests (#3938)
- [Fix] Fix get_infos_metadata meta dispatch (#3946)
- [Improvement] Change set_learning_rate_tensor (#3945)
- [Improvement] Cleanups to StochasticRoundingRNGState (#3922)
- [New] Unifying TBE API using List (Frontend) - reland (#3821)
- [Improvement] Add tests for bounds_check_indices v2 (#3920)
- [Improvement] Use bounds_check_indices v2 on ROCm (#3916)
- [Fix] Partial revert D70855331 (#3925)
- [Fix] Add a workaround for stochastic rounding for AMD GPUs (#3908)
- [New] AdagradW (fbgemm frontend) (#3850)
- [New] AdagradW (fbgemm backend) (#3827)
- [Fix] Fix IMA in TBE grad indices kernel for int32 indices (#3877)
- [Improvement] Use PackedAccessor64 for index_remappings in pruned_array_lookup (#3870)
- [Improvement] Add overflow_safe_int_t for addressing the int overflow problem (#3875)
- [Improvement] Add torch.jit.script to unit tests (#3869)
- [Improvement] Replace LR access with wrapper (#3849)
- [Fix] Fix CUDA kernel index data type in deeplearning/fbgemm/fbgemm_gpu/bench/verify_fp16_stochastic_benchmark.cu +10 (#3845)
- [Improvement] Allow FBGEMM_TBE_BOUNDS_CHECK_MODE to take effect when using mode 4,5,6 (#3838)
- [Improvement] replace device param with bounds_check_warning of inputs_to_device function (#3831)
- [Improvement] Packed bag parameters tuning (#3805)
- [Improvement] Symintify max_B and max_D (#3807)
- [Improvement] make lazy init tunable (#3811)
- [Improvement] Log feature gate statuses in TBE init (#3792)
- [Fix] Backout Unifying TBE API using List (Frontend) (#3803)
- [Improvement] Migrate TBE benchmark utilities over to TBE, pt 5b (#3802)
- [Improvement] Migrate TBE benchmark utilities over to TBE, pt 4 (#3794)
- [Fix] fix bounds check v2 mode with vbe input (#3758)
- [Improvement] Migrate TBE benchmark utilities over to TBE, pt 3 (#3785)
- [Fix] Fix prev_iter (#3784)
- [Improvement] Migrate TBE benchmark utilities over to TBE, pt 2 (#3783)
- [Improvement] Enable int32_t support for reshape_vbe_offsets (#3782)
- [Improvement] Migrate TBE benchmark utilities over to TBE (#3781)
- [New] Unifying TBE API using List (Frontend) (#3711)
- [Improvement] Implement inference bag packing along D (#3541) (#3771)
- [Improvement] Add test for VBE CPU (#3778)
- [Improvement] Change the TBE bounds check to match the TBE implementation. (#3773)
- [Improvement] Implement generate_vbe_metadata cpu (#3715)
- [New] Compute info_B_num_bits from T to make it a constant (#3748)
- [Improvement] Move execute_backward_adagrad into a class (#3744)
- [Improvement] Add more helper methods for TBE benchmarking (#3747)
- [Improvement] Add barrier to test regression hypothesis (#3741)
- [Improvement] annotate tensors in schema for PT2 interface (#3738)
- [Fix] Create cpu iterator irrespective of optimizer choice (#3689)
- [Fix] Fix the TBE cache_precision to fp32 when on ROCm (#3672)
- [New] Unifying TBE API using List (Backend) (#3563)
- [New] Updating split_table_batched_embeddings_ops_training.py (#3613)
- [New] Support INT4 Dequant onto GPU for Seq INT TBE look up (#3584)
- [Fix] Fix calling numel on symbolic shapes issue (#3621)
- [Fix] fix pre_iter fp32 inaccuracy issue (#3623)
- [Improvement] directly pass update_util as int flag without syncing iter (#3602)
- [New] Basic TBE input dump framework (#3593)
- [New] Add support for int32_t indices in TBE training (2K/N) (#3583)
- [New] Add support for int32_t indices in TBE training (2H/N) (#3539)
- [Misc] Enable v2 forward test for ROCm (#3573)
- [Fix] Fix bug in ROCm optimized forward pass (#3599)
- [New] Add support for int32_t indices in TBE training (2I/N) (#3556)
- [New] Add support for int32_t indices in TBE training (2F/N) (#3376)
- [Fix] Back out "Optimized backward pass for ROCm devices (pt 2)" (#3587)
- [Fix] Revert D65620886 (#3582)
- [New] Add support for int32_t indices in TBE training (2G/N) (#3377)
- [New] Add new optimizer state row_counter for Adam [Frontend] (#3558)
- [New] Add support for int32_t indices in TBE training (2E/N) (#3375)
- [Improvement] Do not call scalar_type (#3394)
- [Fix] Remove torch.jit.script (#3562)
- [New] Add support for int32_t indices in TBE training (2D/N) (#3374)
- [New] Add support for int32_t indices in TBE training (3/N) (#3372)
- [New] Add support for int32_t indices in TBE training (2B/N) (#3371)
- [Improvement] Optimized backward pass for ROCm devices (pt 2) (#3511)
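Many TBE entries above deal with the indices and offsets inputs and with bounds checking them. As a plain reference for what those inputs mean, the sketch below performs a sum-pooled lookup on a single embedding table and validates the inputs first. It is a simplified model written for illustration, not the FBGEMM kernel or its bounds_check_indices operator; every name is hypothetical, and raising an exception is only one possible way to handle a violation.

```python
def check_bounds(indices, offsets, num_rows):
    """Validate a bag layout before any table row is read."""
    if offsets[0] != 0 or offsets[-1] != len(indices):
        raise ValueError("offsets must start at 0 and end at len(indices)")
    if any(end < start for start, end in zip(offsets, offsets[1:])):
        raise ValueError("offsets must be non-decreasing")
    for idx in indices:
        if not 0 <= idx < num_rows:
            raise IndexError(f"index {idx} is outside [0, {num_rows})")


def pooled_lookup(table, indices, offsets):
    """Sum the rows of each bag; bag b covers indices[offsets[b]:offsets[b + 1]]."""
    check_bounds(indices, offsets, len(table))
    dim = len(table[0])
    pooled = []
    for start, end in zip(offsets, offsets[1:]):
        bag = [0.0] * dim  # an empty bag pools to the zero vector
        for idx in indices[start:end]:
            bag = [acc + value for acc, value in zip(bag, table[idx])]
        pooled.append(bag)
    return pooled
```

With table = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]], indices = [0, 2, 1] and offsets = [0, 2, 2, 3], the three bags pool to [6.0, 8.0], [0.0, 0.0] and [3.0, 4.0]. The index element type matters because every lookup carries one index: a 32-bit index takes half the memory of a 64-bit one whenever the row ids fit.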
TBE SSD
- [Improvement] Reduce bulk init time and fix OOM (#3828)
- [Improvement] Engage KVTensor aware checkpoint load paths. (#3718)
- [Fix] uncomment accidentally commented out unittest (#3716)
- [Improvement] sync wait before L1 and L2 flush (#3709)
- [Improvement] return right away if keys is empty (#3658)
- [Improvement] Add a couple more APIs to KVTensorWrapper to bring parity with torch::Tensor (#3645)
- [Improvement] Move embedding_rocksdb_wrapper to its own header. (#3622)
- [Fix] Fix autodeps for torch/custom_class.h and use it in kv_tensor_wrapper_cpu (#3600)
- [Improvement] put KVTensorWrapper in its own header (#3575)
Other Ops
Inplace Ops
- [Fix] Fix CUDA kernel index data type in deeplearning/fbgemm/fbgemm_gpu/src/embedding_inplace_ops/embedding_inplace_update.cu +10 (#3846)
Permute Ops
- [New] support permute_multi_embedding_function on torch.export (#3897)
- [Improvement] do not call permute on empty tensor (#3705)
Quantize Ops
- [New] implement packed quantize row / dequantize row API (#3915)
- [New] Add small M support (#3682)
- [Improvement] Provide helper functions for int4 quantization (#3775)
- [Improvement] test fp8fp8bf16/bf16fp8bf16_fast_gemv is torch compileable (#3809)
- [Improvement] Eliminate MemCpyDtoH overhead for quantized fast_gemv kernel (#3725)
- [Improvement] Add abstract impl for Fused8BitRowwiseQuantizedToFloatOrHalf et al. (#715) (#3640)
- [New] Fold ops registration code (#3634)
- [Fix] Fix type to assert group size (#3601)
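Several entries in these notes mention the Fused8BitRowwiseQuantized family of operators. The sketch below illustrates the general idea of fused 8-bit rowwise quantization: each row is stored as one byte per element followed by that row's own scale and bias. The layout used here (uint8 payload, then FP32 scale, then FP32 bias) and the function names are assumptions made for illustration and should not be read as the exact FBGEMM storage format.

```python
import struct


def quantize_row(row):
    """Pack one row as uint8 values followed by FP32 scale and bias."""
    lo, hi = min(row), max(row)
    # A constant row would give a zero scale; fall back to 1.0 in that case.
    scale = (hi - lo) / 255.0 or 1.0
    quantized = bytes(min(255, round((x - lo) / scale)) for x in row)
    return quantized + struct.pack("<ff", scale, lo)


def dequantize_row(packed):
    """Recover approximate values as q * scale + bias."""
    count = len(packed) - 8  # the last 8 bytes hold scale and bias
    scale, bias = struct.unpack("<ff", packed[count:])
    return [q * scale + bias for q in packed[:count]]
```

Keeping scale and bias next to the payload means a lookup can dequantize a row with a single contiguous read. The reconstruction error per element is about half a quantization step, that is (max - min) / 510 for the row.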
Sparse Ops
- [Improvement] nested dispatching of segment_csr on cpu/gpu (#3881)
- [Improvement] Fix faketensor error when dev_weights is undefined. (#3755)
- [Improvement] Set default value to null (#3732)
- [New] Support histogram_binning_calibration for export (#3657)
- [Fix] fix data type of block_bucketize_pos in block_bucketize_sparse_features (#3589)
- [Fix] Fix specialization issue in keyed_jagged_index_select_dim1_forward_cuda (#3578)
SLL Ops
- [Improvement] Re-organize SLL ops, pt 9 (#3665)
- [Improvement] Re-organize SLL ops, pt 8 (#3663)
- [Improvement] Re-organize SLL ops, pt 7 (#3650)
- [Improvement] Re-organize SLL ops, pt 6 (#3647)
- [Improvement] Re-organize SLL ops, pt 5 (#3646)
- [Improvement] Re-organize SLL ops, pt 4 (#3644)
- [Improvement] Re-organize SLL ops, pt 3 (#3652)
- [Improvement] Re-organize SLL ops, pt 2 (#3643)
- [Improvement] Re-organize SLL ops, pt 1 (#3642)
- [Improvement] Fold ops registration code, pt 3 (#3641)
- [Improvement] Fold ops registration code, pt 2 (#3635)
Benchmarks
- [Fix] [fbgemm_gpu] Fix CPU benchmark scripts (#3941)
- [New] Enable multi-processing in CPU TBE micro-benchmarks (#3753)
- [Improvement] Improve VBE benchmark (#3867)
- [Improvement] Clean up stochastic rounding benchmarks (#3876)
- [Fix] Fix EEG indices estimator op (#3852)
- [New] Expose EEG indices estimation to Python (#3836)
- [Improvement] Support MTIA for device and device-with-spec (#3832)
- [Improvement] fix benchmark logging after script reorganization (#3819)
- [Improvement] Cleanups for the EEG-based TBE benchmark CLI, pt 2 (#3815)
- [New] [fbgemm_gpu] Add benchmark workflows (#3713)
- [Improvement] Add cache-precision arg to TBE device_with_spec bench (#3791)
- [New] Migrate EEG-based TBE benchmark code to OSS (#3780)
- [New] Migrate TBE EEG Python code to OSS (#3774)
- [New] Migrate TBE EEG C++ code to OSS (#3765)
- [New] Add TBEDataConfig to Python side (#3739)
- [Improvement] Add Quantize benchmark (#3706)
- [Improvement] Adding iterations to benchmark script (#3708)
- [Improvement] Small modifications to quantize_bench script (#3684)
- [Improvement] Add option to set cache precision in TBE benchmark (#3659)
- [Improvement] Add tracing option to quantize bench (#3651)
- [Improvement] Add preprocess stage to quantize bench operators (#3648)
- [Improvement] Pre-convert indices/offsets in TBE bench (#3595)
- [Improvement] Allow reusing input data in TBE benchmark (#3594)
- [Improvement] Profile with kineto and warmup for more accurate benchmarking (#3580) (#3585)
Better Engineering
Builds
- [Improvement] [fbgemm_gpu] Reduce OSS build sizes for non-GenAI FBGEMM_GPU (#3948)
- [New] [fbgemm_gpu] Add Scripts for Generating Release Reports (#3676)
- [New] [AMD] Add CK to dependencies to enable AMD build. (#3929)
- [Fix] [fbgemm_gpu] Fix Nova package labeling for GenAI (#3933)
- [Fix] [fbgemm_gpu] Update Nova jobs (#3890)
- [Improvement] update hipify_torch (#3918)
- [Improvement] [fbgemm_gpu] Fix setup scripts for OSS ROCm (#3909)
- [Fix] [fbgemm_gpu] Fix undefined symbol error (#3900)
- [Improvement] Add FB python sources into genai CMakeLists.txt (#3886)
- [Improvement] [fbgemm_gpu] Update CMakeLists.txt for experimental/genai (#3872)
- [Improvement] [fbgemm_gpu] Increase timeout for Nova jobs (#3871)
- [New] [fbgemm_gpu] Update Nova CI configuration to support B200 (#3868)
- [Fix] [fbgemm_gpu] Fix bash line to work with macOS builds (#3863)
- [Improvement] Add option to set build parallelism in OSS workflows (#3859)
- [New] [fbgemm_gpu] Support newer CUDA architectures in OSS (#3848)
- [Improvement] [fbgemm_gpu] Increase CUDA test timeout (#3810)
- [Fix] [fbgemm] Fix compilation issues with GCC 14.1 (#3804)
- [Improvement] [fbgemm_gpu] Limit the number of ROCm hardware targets (#3797)
- [Improvement] Fix clang vla warning (#2736)
- [Fix] [fbgemm_gpu] Fix PT2 wrapper registrations (#3721)
- [Fix] [fbgemm_gpu] Fix python docs not being visible (#3717)
- [New] [fbgemm_gpu] Nova job update to support building against CUDA 12.8 (#3704)
- [New] [fbgemm_gpu] Add CUDA 12.8 build support (#3700)
- [Improvement] [fbgemm_gpu] Test genai op registration (#3692)
- [Improvement] [fbgemm_gpu] Break down fbgemm_gpu_tbe_training_backward module further, pt 3 (#3694)
- [Improvement] [fbgemm_gpu] Save built docs as GHA artifact (#3695)
- [Improvement] Rename sources to avoid internal build issue (#3697)
- [Improvement] [fbgemm_gpu] Break down CMake module further, pt 2 (#3681)
- [Fix] Fix linting CI error introduced in D69213404 (#3683)
- [Improvement] [fbgemm_gpu] Break down CMake module further (#3673)
- [New] [fbgemm_gpu] GitHub PR scraper (#3661)
- [New] [fbgemm_gpu] Add macro support for NVCC and HIPCC specific flags (#3636)
- [Improvement] add AMD specific includes in cuda_prelude.h (#3614)
- [Fix] [fbgemm_gpu] Fix CMakeLists.txt for experimental/gemm (#3592)
- [Improvement] [fbgemm_gpu] Upgrade GitHub actions (#3581)
- [Improvement] add patchelf as a required package in fbgemm_gpu/requirements.txt (#3574)
- [New] [fbgemm_gpu] Add build support for AMD MI300 (#3566)
- [Improvement] [fbgemm_gpu] Expand test timeout for ROCm pip install workflow (#3557)
- [Improvement] [fbgemm_gpu] Update triton version for OSS (#3555)
- [Fix] [fbgemm_gpu] Fix versioning scheme for ROCm releases (#3554)
- [Misc] [fbgemm_gpu] Properly disable TBE SSD tests in OSS (#3548)
Documentation
- [New] [fbgemm_gpu] Add docs for GenAI package (#3905)
- [Improvement] [fbgemm_gpu] Update docs (#3891)
- [Misc] Add comments to TBE inference PackedMode (#3789)
- [Misc] [fbgemm_gpu] Update ROCm and CUDA versions in docs (#3569)
- [New] [fbgemm_gpu] Add documentation for Feature Gates (#3740)
- [Improvement] Remove erroneous comment (#3733)
- [Fix] [fbgemm_gpu] Minor doc fix (#3618)
Utils
- [New] Better kernel launch utilities (#3914)
- [Improvement] Add ability to save and load data into HostDeviceBufferPair (#3899)
- [New] Add abstractions for writing out data (flesh out D71147675, pt 1) (#3856)
- [New] Add feature gate for HIP-based backward kernel (#3835)
- [Misc] Optionally use env vars for config lookup (#3795)
- [Fix] Updates and fixes to tensor_accessor.h (#3571)