
FBGEMM v1.2.0 Release Notes

Released by @q10 on 27 Apr 08:31

Highlights

TBE GPU

  • Added support for int64_t table indices and offsets in TBE inference (see the sketch after this list)
  • Improved TBE benchmark utilities with the introduction of the Embeddings Estimator and Generator (EEG)
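
The int64 path can be exercised through the existing inference TBE module. Below is a minimal sketch (not part of these notes); the class and enum names come from the fbgemm_gpu inference API, but the exact constructor arguments and defaults may differ between versions:

import torch
from fbgemm_gpu.split_embedding_configs import SparseType
from fbgemm_gpu.split_table_batched_embeddings_ops_common import EmbeddingLocation
from fbgemm_gpu.split_table_batched_embeddings_ops_inference import (
    IntNBitTableBatchedEmbeddingBagsCodegen,
)

# One table: 1000 rows, embedding dim 64, INT8 row-wise quantized weights on the GPU.
tbe = IntNBitTableBatchedEmbeddingBagsCodegen(
    embedding_specs=[("table_0", 1000, 64, SparseType.INT8, EmbeddingLocation.DEVICE)],
    device=torch.device("cuda"),
)
tbe.fill_random_weights()

# Indices and offsets can now be passed as int64 instead of int32.
indices = torch.tensor([1, 5, 7, 42], dtype=torch.int64, device="cuda")
offsets = torch.tensor([0, 2, 4], dtype=torch.int64, device="cuda")  # two bags
pooled = tbe(indices, offsets)  # pooled embeddings of shape [2, 64]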

TBE CPU

  • Added the Fused8BitRowwiseQuantizedSBFloatToFloatOrHalf operator (round-trip sketch after this list)
  • Made FloatToFloat16 conversion 75x faster using SVE2 instructions
  • Added FP32 GEMM kernels
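
The new operator dequantizes fused 8-bit row-wise data (a uint8 payload with a float scale and bias per row) to FP32 or FP16 on CPU. Below is a minimal round-trip sketch using the long-standing quantize/dequantize operators registered under torch.ops.fbgemm; the operator names are drawn from the existing operator set rather than from these notes:

import torch
import fbgemm_gpu  # noqa: F401  (importing registers the torch.ops.fbgemm operators)

x = torch.randn(4, 128, dtype=torch.float32)

# Each row is stored as uint8 values with a float32 scale and bias appended to the row.
q = torch.ops.fbgemm.FloatToFused8BitRowwiseQuantized(x)

# Dequantize back to float32; the new CPU kernels target this dequantization family.
x_deq = torch.ops.fbgemm.Fused8BitRowwiseQuantizedToFloat(q)
print((x - x_deq).abs().max())  # small row-wise quantization error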

TBE SSD

  • Fixed OOM issues during initialization
  • Improvements to L1 and L2 flush

GenAI Ops

  • GenAI ops are now packaged separately into the FBGEMM GenAI package for easier builds and installation (usage sketch after this list)
  • Various FP8 grouped GEMM optimizations
  • BF16I4 preshuffled grouped GEMM
  • BF16 stacked grouped GEMM
  • F8I4 grouped GEMM optimizations
  • Added nccl_alltoall function
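
For the FP8 GEMM paths, the usual flow is a row-wise quantization step followed by a quantized GEMM call. The sketch below assumes the f8f8bf16_rowwise operator exposed by the GenAI quantize ops (the name and signature are assumptions, and the hand-rolled quantize helper is illustrative only):

import torch
import fbgemm_gpu.experimental.gen_ai  # noqa: F401  (registers the GenAI torch.ops.fbgemm operators)

def fp8_rowwise_quantize(t):
    # Per-row scale so each row's max magnitude maps onto the FP8 e4m3 range.
    t = t.float()
    scale = t.abs().amax(dim=1).clamp(min=1e-12) / torch.finfo(torch.float8_e4m3fn).max
    return (t / scale[:, None]).to(torch.float8_e4m3fn), scale

x = torch.randn(16, 512, device="cuda", dtype=torch.bfloat16)   # activations [M, K]
w = torch.randn(256, 512, device="cuda", dtype=torch.bfloat16)  # weights [N, K]

xq, x_scale = fp8_rowwise_quantize(x)
wq, w_scale = fp8_rowwise_quantize(w)

# Row-wise scaled FP8 GEMM producing a BF16 [M, N] output.
y = torch.ops.fbgemm.f8f8bf16_rowwise(xq, wq, x_scale, w_scale)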

ROCm

  • Added preliminary ROCm OSS build support for GenAI ops

Better Engineering

  • Added build support for CUDA 12.8
  • Introduced utilities to harden CUDA kernel launches against common classes of runtime errors

Software Requirements

FBGEMM_GPU v1.2.0 has been tested and is known to work on the following setups:

  • PyTorch: v2.7
  • CUDA: v11.8, 12.6, 12.8
  • Python: v3.9, 3.10, 3.11, 3.12, 3.13

It is recommended to prepare an isolated environment for installing and running FBGEMM_GPU (instructions here) and FBGEMM-GenAI (instructions here).

Availability

FBGEMM_GPU and FBGEMM GenAI can be fetched directly from PyPI:

# FBGEMM_GPU - CUDA (only the CUDA 12.6 variant is available)
pip install fbgemm-gpu==1.2.0

# FBGEMM_GPU - CPU
pip install fbgemm-gpu-cpu==1.2.0

# FBGEMM GenAI
pip install fbgemm-gpu-genai==1.2.0

Alternatively, they can be fetched from PyTorch PIP:

# FBGEMM_GPU - CUDA
pip install fbgemm-gpu==1.2.0 --index-url https://download.pytorch.org/whl/cu118/
pip install fbgemm-gpu==1.2.0 --index-url https://download.pytorch.org/whl/cu126/
pip install fbgemm-gpu==1.2.0 --index-url https://download.pytorch.org/whl/cu128/

# FBGEMM_GPU - CPU
pip install fbgemm-gpu==1.2.0 --index-url https://download.pytorch.org/whl/cpu

# FBGEMM GenAI
pip install fbgemm-gpu-genai==1.2.0 --index-url https://download.pytorch.org/whl/cu126/
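
To verify which variant was picked up after installation, a quick check (a minimal sketch; the __version__ attribute is assumed to be populated, as in recent fbgemm_gpu releases):

import torch
import fbgemm_gpu  # importing registers the torch.ops.fbgemm operators

print(fbgemm_gpu.__version__)  # expected: 1.2.0 for this release
print(torch.version.cuda)      # CUDA version of the installed PyTorch (None for CPU-only builds)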

Changes

CPU

GEMM

  • [Improvement] Improve Fused8BitRowwiseQuantizedSBFloatToFloatOrHalfNeon by 5%-15% (#3860)
  • [New] Use enum to select floating point format in FbgemmEmbedding APIs (#3842)
  • [New] Add generic IEEE754 truncation code (#3820)
  • [New] Enable KleidiAI for FP32 (#3818)
  • [Improvement] Move float conversion functions from Types.h into new FloatConversion.h (#3760)
  • [Fix] Use kleidiAI on static builds (#3806)
  • [Fix] Fix KleidiAI FP16 (#3769)
  • [Improvement] Pull ARM's matrix transpose PR (#3660)
  • [New] Add NEON implementation of Fused8BitRowwiseQuantizedSBFloatToFloatOrHalf (#3707)
  • [Improvement] avoid extra copy in PackedGemmMatrixB constructor (#3691)
  • [Improvement] Remove FENV pragma (#3629)
  • [Improvement] Make FloatToFloat16 conversion 75x faster using SVE2 instructions (#3626)
  • [New] add a new constructor to PackedGemmMatrixB (#3598)
  • [New] Move FP32 kernels to OSS (#3568)

GenAI

GenAI Ops

  • [Improvement] Performance Optimization: Improved TileShape Configuration for Large Llama Shapes (#3790) (#3942)
  • [New] Add harness for comms benchmark (#3936)
  • [Improvement] Refactoring of NoPE (#3840)
  • [Improvement] support fp16 dtypes for input weight and bias (#3931)
  • [Fix] fix fp8 kv cache dequantize kernels (#3896)
  • [Improvement] scatter_add 0 size support (#3861)
  • [Improvement] Retuned CK GMM fp8/bf16 with perf fixes (#3851)
  • [Improvement] Enable groupwise scales for F8I4 Grouped Gemm (#3884)
  • [Fix] Fix empty input view. (#3880)
  • [New] FP8 Rowwise Dequant Kernel (#3873)
  • [New] torch.ops.fbgemm.gather_scale_dense_tokens for oss. (#3855)
  • [Improvement] Replace rms_norm as norm (#3841)
  • [Improvement] Move DeepGemm scale transpose to quantize (#3834)
  • [Improvement] follow up to reflect rowwise scale inputs for x, w in quantize_ops scripts (#3839)
  • [New] add rowwise scaling support (#3822)
  • [Improvement] update to tune for small ms and quantized gemv (#3712)
  • [New] Add Preshuffled FP8 x INT4 Grouped Gemm Kernel (#3800)
  • [New] FBGEMM Add Columnwise Weight Scaling to F8I4 GEMM (#3766)
  • [Improvement] update the sorting kernel for bf16 ck fmoe kernel (#3817)
  • [Fix] fix volatile synchronization with acquire/relax (#3728)
  • [Improvement] Force determinism by unswizzle (#3727)
  • [New] add fp8 kv nope (#3786)
  • [Improvement] move common op to vector utils (#3759)
  • [Improvement] Gather/Scatter. (#3743)
  • [Improvement] reduce scatter supports last dim (#3726)
  • [Improvement] Add custom reduce scatter to llama_comms (#3730)
  • [New] Adds shapes information to enable torch.compile. (#3724)
  • [Improvement] avoid propagation of NaN (#3723)
  • [New] torch.ops.fbgemm.scatter_add_along_first_dim.. (#3720)
  • [New] torch.ops.fbgemm.gather_along_first_dim. (#3719)
  • [New] Paged Attention Support (#3698)
  • [New] custom reduce scatter (#3686)
  • [Fix] Recover custom collective test (#3687)
  • [Improvement] update sweep_utils.py to test more precision gemv kernel (#3678)
  • [New] add fp8fp8 fast_gemv_quantized (#3677)
  • [New] add mixed precision fp8 fast_gemv_quantized kernel (#3675)
  • [Improvement] adjust interface (#3669)
  • [Improvement] CK MoE: cherry-pick #1808 (#3609)
  • [Improvement] fix llm shapes in quantize bench and add ldm shapes (#3611)
  • [Improvement] Return if no data to allreduce (#3586)
  • [Improvement] llm decode shapes fp8 rowwise gemm tuning (#3565)
  • [Improvement] Make zero_start_index_M optional for dynamic BF16 Grouped Gemm (#3553)
  • [New] Add nccl_alltoall function (#3551)
  • [New] Add fused_moe kernel to ck_extension (#3518)

GEMM

  • [Improvement] Update cutlass version to 3.8V2 (#3772)
  • [Improvement] Update Cutlass to V3.8-2 (#3767)
  • [Improvement] fp8_gemm (non_persistent): adding optimal configs for 8k & 16k shapes (#3764)
  • [New] new tuning for fp8 rowwise (#3756)
  • [Improvement] Add DeepGEMM blockwise GEMM in quantize bench (#3746)
  • [Improvement] Enable DeepGEMM in quantize bench (#3745)
  • [Improvement] reduce overhead for f8f8bf16_rowwise_grouped_dynamic on amd (#3742)
  • [Improvement] Performance Optimization: Optimized TileShape Configuration for f8 (#3617) (#3735)
  • [Improvement] Performance Optimization: Optimized TileShape Configuration for bf16 and Mixed Formats (#3591) (#3710)
  • [Improvement] adding an option to skip zeroing output tensor for f8f8bf16_rowwise_grouped_dynamic (#3685)
  • [Improvement] Update CK (#3701)
  • [Fix] Fix CUDA kernel index data type in deeplearning/fbgemm/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/bf16bf16bf16_grouped.cu +10 (#3844)
  • [New] Make F8I4 grouped GEMM process M_sizes with INT32 (#3853)
  • [Improvement] Skip empty groups in FP8 Stacked Gemm (#3862)
  • [New] Enable preshuffled mixed dtype Cutlass Gemm (#3722)
  • [Improvement] [CUTLASS] Minor Cutlass change to fix CI (#3779)
  • [Improvement] Clean up cutlass FP8 Grouped Gemm Kernel Setup (#3864)
  • [New] Modernize bf16 cutlass grouped gemm (#3889)
  • [Improvement] [CUTLASS] Include new cutlass support for groupwise mixed dtype grouped gemm. (#3885)
  • [New] Add DEEPGEMM Masked API. (#3949)
  • [Improvement] Use Int64 Indexing in Grouped Gemm (#3930)
  • [Improvement] Add correctness testing for shuffled mixed dtype GEMMs. (#3932)
  • [New] BF16I4 Preshuffled Grouped Gemm (#3917)
  • [New] Preshuffled BF16I4 Gemm Kernel (#3913)
  • [New] Enable rowwise scaling for DeepGemm (#3874)
  • [New] bf16 stacked group gemm (#3888)
  • [New] F8I4 Grouped Gemm Optimization for Sparse M (#3854)

FP8

  • [Fix] FBGEMM fp8 ck GEMM fix for irregular GEMM shapes (#3894)
  • [Fix] fix stacked version fp8 rowwise group gemm registration in quantize_bench (#3902)
  • [Fix] A hotfix for FBGEMM fp8 rowwise with irregular gemm sizes (#3883)
  • [Improvement] Transpose FP8 GEMM inputs for better tuning (#3866)
  • [New] Enable FP8 Triton dequantized block-wise kernel (#3788)
  • [Improvement] Refactor stacked version of FP8 Grouped Gemm for reduced overhead (#3699)
  • [Improvement] changing config for fp8 gemm (#3668)
  • [Improvement] Add option to disable fast_accumulation for fp8 gemm. (#3714)
  • [New] Add cublas FP8 tensorwise GEMM in fbgemm quantize bench (#3693)
  • [Improvement] write_k_back for fp8 ROPE (#3679)
  • [Improvement] Moves utility functions into a standalone file. (#3671)
  • [Fix] Fix f8f8bf16_lite quantize op input in quantize_and_compute (#3667)
  • [Improvement] Optimize zero fill (#3666)
  • [Improvement] FP8 Grouped Gemm Optimization (#3655)
  • [New] Add sweep_utils.py script to tune heuristics (#3656)
  • [Improvement] Loosen unit test atol/rtol tolerances to eliminate UT flakiness (#3664)
  • [New] Port oss f16_fast_gemv into fbcode (#3610)
  • [New] fp8 rowwise regular gemm tuning for llm new shapes (#3654)
  • [Improvement] k_norm in rope for fp8 kv cache (#3633)
  • [Improvement] Fix zero_start_index_M argument for triton rowwise quantize (#3639)
  • [Fix] Fix handling of dynamic FP8 grouped gemm on Nvidia (#3616)
  • [Improvement] Improve FP8 grouped GEMM perf via tileshape and cooperative (#3653)
  • [Improvement] Refactor FP8 grouped GEMM with dynamic and static versions (#3561)
  • [New] Support FP8 grouped GEMM with rowwise scaling (#3560)
  • [Fix] [CUTLASS] Use custom copy of cutlass to enable FP8 Grouped Gemm. (#3649)
  • [Fix] kv_dq zero initialization to avoid NaNs from FA3 (#3632)
  • [Improvement] amd fp8 rowwise batched gemm tuning (#3624)
  • [Improvement] Improve handling for FP8 grouped gemm without zero_start_index_M (#3615)
  • [New] amd fp8 rowwise gemm prefill shape tuning (#3607)
  • [New] Enable fast FP8 GEMM for memory bound (resubmit) (#3608)
  • [Improvement] Make zero_start_index_M optional for dynamic FP8 grouped gemm on AMD (#3604)
  • [Improvement] Enable fast FP8 GEMM for memory bound (#3577)
  • [Improvement] more fp8 tuning for decode and not need to pad (#3576)

Triton

  • [Improvement] Uses FastAccum=True by default for Triton GroupedGEMM. (#3919)
  • [Improvement] Handle 0 inputs for gmm (#3901)
  • [New] Triton GroupedGEMM. WS. (#3912)
  • [Improvement] No recompilation caused by varying sequence lengths. (#3903)
  • [Improvement] Enable bufferops for non-persistent fp8 rowwise GEMM (#3898)
  • [Improvement] Makes use_fast_accum configurable. (#3829)
  • [Fix] Fix triton group gemm for tp4 (#3762)
  • [Improvement] Reduce tuning. (#3754)
  • [Improvement] [fbgemm_gpu] Upgrade Triton to latest (#3736)
  • [New] GroupedGEMM for AMD. (#3729)
  • [Improvement] GroupedGEMM interface takes m_sizes instead of m_offsets. (#3696)
  • [Fix] Numerical Fix. (#3688)
  • [New] Adds Triton based GroupedGEMM implementation. (#3674)
  • [Improvement] Add optional zero_start_index_M argument to triton fp8 rowwise quantization (#3628)
  • [Improvement] Make the scale match the shape of quantized value with N-D tensors (#3396)

TBE

TBE GPU

  • [Improvement] Fix flaky TBE unit tests (#3938)
  • [Fix] Fix get_infos_metadata meta dispatch (#3946)
  • [Improvement] Change set_learning_rate_tensor (#3945)
  • [Improvement] Cleanups to StochasticRoundingRNGState (#3922)
  • [New] Unifying TBE API using List (Frontend) - reland (#3821)
  • [Improvement] Add tests for bounds_check_indices v2 (#3920)
  • [Improvement] Use bounds_check_indices v2 on ROCm (#3916)
  • [Fix] Partial revert D70855331 (#3925)
  • [Fix] Add a workaround for stochastic rounding for AMD GPUs (#3908)
  • [New] AdagradW (fbgemm frontend) (#3850)
  • [New] AdagradW (fbgemm backend) (#3827)
  • [Fix] Fix IMA in TBE grad indices kernel for int32 indices (#3877)
  • [Improvement] Use PackedAccessor64 for index_remappings in pruned_array_lookup (#3870)
  • [Improvement] Add overflow_safe_int_t for addressing the int overflow problem (#3875)
  • [Improvement] Add torch.jit.script to unit tests (#3869)
  • [Improvement] Replace LR access with wrapper (#3849)
  • [Fix] Fix CUDA kernel index data type in deeplearning/fbgemm/fbgemm_gpu/bench/verify_fp16_stochastic_benchmark.cu +10 (#3845)
  • [Improvement] Allow FBGEMM_TBE_BOUNDS_CHECK_MODE to take effect when using mode 4,5,6 (#3838)
  • [Improvement] replace device param with bounds_check_warning of inputs_to_device function (#3831)
  • [Improvement] Packed bag parameters tuning (#3805)
  • [Improvement] Symintify max_B and max_D (#3807)
  • [Improvement] make lazy init tunable (#3811)
  • [Improvement] Log feature gate statuses in TBE init (#3792)
  • [Fix] Backout Unifying TBE API using List (Frontend) (#3803)
  • [Improvement] Migrate TBE benchmark utilities over to TBE, pt 5b (#3802)
  • [Improvement] Migrate TBE benchmark utilities over to TBE, pt 4 (#3794)
  • [Fix] fix bounds check v2 mode with vbe input (#3758)
  • [Improvement] Migrate TBE benchmark utilities over to TBE, pt 3 (#3785)
  • [Fix] Fix prev_iter (#3784)
  • [Improvement] Migrate TBE benchmark utilities over to TBE, pt 2 (#3783)
  • [Improvement] Enable int32_t support for reshape_vbe_offsets (#3782)
  • [Improvement] Migrate TBE benchmark utilities over to TBE (#3781)
  • [New] Unifying TBE API using List (Frontend) (#3711)
  • [Improvement] Implement inference bag packing along D (#3541) (#3771)
  • [Improvement] Add test for VBE CPU (#3778)
  • [Improvement] Change the TBE bounds check to match the TBE implementation. (#3773)
  • [Improvement] Implement generate_vbe_metadata cpu (#3715)
  • [New] Compute info_B_num_bits from T to make it a constant (#3748)
  • [Improvement] Move execute_backward_adagrad into a class (#3744)
  • [Improvement] Add more helper methods for TBE benchmarking (#3747)
  • [Improvement] Add barrier to test regression hypothesis (#3741)
  • [Improvement] annotate tensors in schema for PT2 interface (#3738)
  • [Fix] Create cpu iterator irrespective of optimizer choice (#3689)
  • [Fix] Fix the TBE cache_precision to fp32 when on ROCm (#3672)
  • [New] Unifying TBE API using List (Backend) (#3563)
  • [New] Updating split_table_batched_embeddings_ops_training.py (#3613)
  • [New] Support INT4 Dequant onto GPU for Seq INT TBE look up (#3584)
  • [Fix] Fix calling numel on symbolic shapes issue (#3621)
  • [Fix] fix pre_iter fp32 inaccuracy issue (#3623)
  • [Improvement] directly pass update_util as int flag without syncing iter (#3602)
  • [New] Basic TBE input dump framework (#3593)
  • [New] Add support for int32_t indices in TBE training (2K/N) (#3583)
  • [New] Add support for int32_t indices in TBE training (2H/N) (#3539)
  • [Misc] Enable v2 forward test for ROCm (#3573)
  • [Fix] Fix bug in ROCm optimized forward pass (#3599)
  • [New] Add support for int32_t indices in TBE training (2I/N) (#3556)
  • [New] Add support for int32_t indices in TBE training (2F/N) (#3376)
  • [Fix] Back out "Optimized backward pass for ROCm devices (pt 2)" (#3587)
  • [Fix] Revert D65620886 (#3582)
  • [New] Add support for int32_t indices in TBE training (2G/N) (#3377)
  • [New] Add new optimizer state row_counter for Adam [Frontend] (#3558)
  • [New] Add support for int32_t indices in TBE training (2E/N) (#3375)
  • [Improvement] Do not call scalar_type (#3394)
  • [Fix] Remove torch.jit.script (#3562)
  • [New] Add support for int32_t indices in TBE training (2D/N) (#3374)
  • [New] Add support for int32_t indices in TBE training (3/N) (#3372)
  • [New] Add support for int32_t indices in TBE training (2B/N) (#3371)
  • [Improvement] Optimized backward pass for ROCm devices (pt 2) (#3511)

TBE SSD

  • [Improvement] Reduce bulk init time and fix OOM (#3828)
  • [Improvement] Engage KVTensor aware checkpoint load paths. (#3718)
  • [Fix] Uncomment accidentally commented-out unit test (#3716)
  • [Improvement] sync wait before L1 and L2 flush (#3709)
  • [Improvement] return right away if keys is empty (#3658)
  • [Improvement] Add a couple more APIs to KVTensorWrapper to bring parity with torch::Tensor (#3645)
  • [Improvement] Move embedding_rocksdb_wrapper to its own header. (#3622)
  • [Fix] Fix autodeps for torch/custom_class.h and use it in kv_tensor_wrapper_cpu (#3600)
  • [Improvement] put KVTensorWrapper in its own header (#3575)

Other Ops

Inplace Ops

  • [Fix] Fix CUDA kernel index data type in deeplearning/fbgemm/fbgemm_gpu/src/embedding_inplace_ops/embedding_inplace_update.cu +10 (#3846)

Permute Ops

  • [New] support permute_multi_embedding_function on torch.export (#3897)
  • [Improvement] do not call permute on empty tensor (#3705)

Quantize Ops

  • [New] implement packed quantize row / dequantize row API (#3915)
  • [New] Add small M support (#3682)
  • [Improvement] Provide helper functions for int4 quantization (#3775)
  • [Improvement] test fp8fp8bf16/bf16fp8bf16_fast_gemv is torch compileable (#3809)
  • [Improvement] Eliminate MemCpyDtoH overhead for quantized fast_gemv kernel (#3725)
  • [Improvement] Add abstract impl for Fused8BitRowwiseQuantizedToFloatOrHalf et al. (#715) (#3640)
  • [New] Fold ops registration code (#3634)
  • [Fix] Fix type to assert group size (#3601)

Sparse Ops

  • [Improvement] nested dispatching of segment_csr on cpu/gpu (#3881)
  • [Improvement] Fix faketensor error when dev_weights is undefined. (#3755)
  • [Improvement] Set default value to null (#3732)
  • [New] Support histogram_binning_calibration for export (#3657)
  • [Fix] fix data type of block_bucketize_pos in block_bucketize_sparse_features (#3589)
  • [Fix] Fix specialization issue in keyed_jagged_index_select_dim1_forward_cuda (#3578)

SLL Ops

  • [Improvement] Re-organize SLL ops, pt 9 (#3665)
  • [Improvement] Re-organize SLL ops, pt 8 (#3663)
  • [Improvement] Re-organize SLL ops, pt 7 (#3650)
  • [Improvement] Re-organize SLL ops, pt 6 (#3647)
  • [Improvement] Re-organize SLL ops, pt 5 (#3646)
  • [Improvement] Re-organize SLL ops, pt 4 (#3644)
  • [Improvement] Re-organize SLL ops, pt 3 (#3652)
  • [Improvement] Re-organize SLL ops, pt 2 (#3643)
  • [Improvement] Re-organize SLL ops, pt 1 (#3642)
  • [Improvement] Fold ops registration code, pt 3 (#3641)
  • [Improvement] Fold ops registration code, pt 2 (#3635)

Benchmarks

  • [Fix] [fbgemm_gpu] Fix CPU benchmark scripts (#3941)
  • [New] Enable multi-processing in CPU TBE micro-benchmarks (#3753)
  • [Improvement] Improve VBE benchmark (#3867)
  • [Improvement] Clean up stochastic rounding benchmarks (#3876)
  • [Fix] Fix EEG indices estimator op (#3852)
  • [New] Expose EEG indices estimation to Python (#3836)
  • [Improvement] Support MTIA for device and device-with-spec (#3832)
  • [Improvement] fix benchmark logging after script reorganization (#3819)
  • [Improvement] Cleanups for the EEG-based TBE benchmark CLI, pt 2 (#3815)
  • [New] [fbgemm_gpu] Add benchmark workflows (#3713)
  • [Improvement] Add cache-precision arg to TBE device_with_spec bench (#3791)
  • [New] Migrate EEG-based TBE benchmark code to OSS (#3780)
  • [New] Migrate TBE EEG Python code to OSS (#3774)
  • [New] Migrate TBE EEG C++ code to OSS (#3765)
  • [New] Add TBEDataConfig to Python side (#3739)
  • [Improvement] Add Quantize benchmark (#3706)
  • [Improvement] Adding iterations to benchmark script (#3708)
  • [Improvement] Small modifications to quantize_bench script (#3684)
  • [Improvement] Add option to set cache precision in TBE benchmark (#3659)
  • [Improvement] Add tracing option to quantize bench (#3651)
  • [Improvement] Add preprocess stage to quantize bench operators (#3648)
  • [Improvement] Pre-convert indices/offsets in TBE bench (#3595)
  • [Improvement] Allow reusing input data in TBE benchmark (#3594)
  • [Improvement] Profile with kineto and warmup for more accurate benchmarking (#3580) (#3585)

Better Engineering

Builds

  • [Improvement] [fbgemm_gpu] Reduce OSS build sizes for non-GenAI FBGEMM_GPU (#3948)
  • [New] [fbgemm_gpu] Add Scripts for Generating Release Reports (#3676)
  • [New] [AMD] Add CK to dependencies to enable AMD build. (#3929)
  • [Fix] [fbgemm_gpu] Fix Nova package labeling for GenAI (#3933)
  • [Fix] [fbgemm_gpu] Update Nova jobs (#3890)
  • [Improvement] update hipify_torch (#3918)
  • [Improvement] [fbgemm_gpu] Fix setup scripts for OSS ROCm (#3909)
  • [Fix] [fbgemm_gpu] Fix undefined symbol error (#3900)
  • [Improvement] Add FB python sources into genai CMakeLists.txt (#3886)
  • [Improvement] [fbgemm_gpu] Update CMakeLists.txt for experimental/genai (#3872)
  • [Improvement] [fbgemm_gpu] Increase timeout for Nova jobs (#3871)
  • [New] [fbgemm_gpu] Update Nova CI configuration to support B200 (#3868)
  • [Fix] [fbgemm_gpu] Fix bash line to work with macOS builds (#3863)
  • [Improvement] Add option to set build parallelism in OSS workflows (#3859)
  • [New] [fbgemm_gpu] Support newer CUDA architectures in OSS (#3848)
  • [Improvement] [fbgemm_gpu] Increase CUDA test timeout (#3810)
  • [Fix] [fbgemm] Fix compilation issues with GCC 14.1 (#3804)
  • [Improvement] [fbgemm_gpu] Limit the number of ROCm hardware targets (#3797)
  • [Improvement] Fix clang vla warning (#2736)
  • [Fix] [fbgemm_gpu] Fix PT2 wrapper registrations (#3721)
  • [Fix] [fbgemm_gpu] Fix python docs not being visible (#3717)
  • [New] [fbgemm_gpu] Nova job update to support building against CUDA 12.8 (#3704)
  • [New] [fbgemm_gpu] Add CUDA 12.8 build support (#3700)
  • [Improvement] [fbgemm_gpu] Test genai op registration (#3692)
  • [Improvement] [fbgemm_gpu] Break down fbgemm_gpu_tbe_training_backward module further, pt 3 (#3694)
  • [Improvement] [fbgemm_gpu] Save built docs as GHA artifact (#3695)
  • [Improvement] Rename sources to avoid internal build issue (#3697)
  • [Improvement] [fbgemm_gpu] Break down CMake module further, pt 2 (#3681)
  • [Fix] Fix linting CI error introduced in D69213404 (#3683)
  • [Improvement] [fbgemm_gpu] Break down CMake module further (#3673)
  • [New] [fbgemm_gpu] GitHub PR scraper (#3661)
  • [New] [fbgemm_gpu] Add macro support for NVCC and HIPCC specific flags (#3636)
  • [Improvement] add AMD specific includes in cuda_prelude.h (#3614)
  • [Fix] [fbgemm_gpu] Fix CMakeLists.txt for experimental/gemm (#3592)
  • [Improvement] [fbgemm_gpu] Upgrade GitHub actions (#3581)
  • [Improvement] add patchelf as a required package in fbgemm_gpu/requirements.txt (#3574)
  • [New] [fbgemm_gpu] Add build support for AMD MI300 (#3566)
  • [Improvement] [fbgemm_gpu] Expand test timeout for ROCm pip install workflow (#3557)
  • [Improvement] [fbgemm_gpu] Update triton version for OSS (#3555)
  • [Fix] [fbgemm_gpu] Fix versioning scheme for ROCm releases (#3554)
  • [Misc] [fbgemm_gpu] Properly disable TBE SSD tests in OSS (#3548)

Documentation

  • [New] [fbgemm_gpu] Add docs for GenAI package (#3905)
  • [Improvement] [fbgemm_gpu] Update docs (#3891)
  • [Misc] Add comments to TBE inference PackedMode (#3789)
  • [Misc] [fbgemm_gpu] Update ROCm and CUDA versions in docs (#3569)
  • [New] [fbgemm_gpu] Add documentation for Feature Gates (#3740)
  • [Improvement] Remove erroneous comment (#3733)
  • [Fix] [fbgemm_gpu] Minor doc fix (#3618)

Utils

  • [New] Better kernel launch utilities (#3914)
  • [Improvement] Add ability to save and load data into HostDeviceBufferPair (#3899)
  • [New] Add abstractions for writing out data (flesh out D71147675, pt 1) (#3856)
  • [New] Add feature gate for HIP-based backward kernel (#3835)
  • [Misc] Optionally use env vars for config lookup (#3795)
  • [Fix] Updates and fixes to tensor_accessor.h (#3571)