It is Day 3 of DeepSeek Open Source Week, and the company has made its third release: DeepGEMM. Following FlashMLA on the first day and DeepEP on the second, DeepSeek continues to make substantial open-source contributions to the AI community. DeepGEMM is an FP8 library that speeds up General Matrix Multiplications (GEMMs), a core operation in AI models. Its strong FP8 performance makes it especially valuable for training and inference of DeepSeek’s V3 and R1 models.
What Makes DeepGEMM Special?
Most AI models rely heavily on General Matrix Multiplications (GEMMs). The faster and more efficient these calculations are, the quicker and cheaper it is to train and run AI models. DeepGEMM optimizes these operations, making AI training and inference much more efficient.
DeepGEMM borrows some ideas from CUTLASS and CuTe, two well-known NVIDIA tools, but it doesn’t rely too heavily on their complex template structures. That makes it easier to read and modify, which is great if you’re a developer wanting to learn from it.
Key Features of DeepGEMM
1. Optimized for Hopper Architecture
Currently, it exclusively supports NVIDIA Hopper tensor cores. To address the challenges posed by imprecise FP8 tensor core accumulation, it employs advanced CUDA-core two-level accumulation techniques.
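To see why a second accumulation level helps, here is a small pure-Python sketch. It is a toy model, not DeepGEMM’s kernel: the 11-bit accumulator width, the chunk size of 128, and the helper names are all assumptions chosen to make the effect visible. A narrow accumulator stops growing once the running sum gets large, while promoting short partial sums into a wider accumulator keeps the result exact.

```python
import math

def round_to_bits(x: float, bits: int) -> float:
    """Round x to `bits` significant binary digits (round-half-to-even)."""
    if x == 0.0:
        return 0.0
    mantissa, exponent = math.frexp(x)   # x = mantissa * 2**exponent
    return math.ldexp(round(mantissa * (1 << bits)), exponent - bits)

def single_level_sum(products, acc_bits):
    """Accumulate every product in one limited-precision register."""
    acc = 0.0
    for p in products:
        acc = round_to_bits(acc + p, acc_bits)
    return acc

def two_level_sum(products, acc_bits, chunk):
    """Accumulate short chunks in limited precision, then promote each
    partial sum into a full-precision accumulator."""
    total = 0.0   # stands in for the higher-precision accumulator
    for start in range(0, len(products), chunk):
        total += single_level_sum(products[start:start + chunk], acc_bits)
    return total

products = [1.0] * 4096   # toy data: the true sum is 4096
ACC_BITS = 11             # hypothetical accumulator width, for illustration only
print(single_level_sum(products, ACC_BITS))     # 2048.0
print(two_level_sum(products, ACC_BITS, 128))   # 4096.0
```

The single-level sum stalls at 2048 because adding 1 to 2048 no longer fits in 11 significant bits and rounds back down. Real tensor-core accumulation behaves differently in detail, but the remedy has the same shape: keep the low-precision runs short and do the long sum at higher precision.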
2. Blazing-Fast Performance
DeepGEMM achieves over 1350 FP8 TFLOPS on Hopper GPUs. It is engineered for speed and efficiency, enabling faster computations for complex models.
3. Compact Yet Powerful
The core logic consists of approximately 300 lines of code, making it accessible for developers and researchers who are looking to understand or implement FP8 matrix multiplication.
4. Just-In-Time (JIT) Compiled
Written in CUDA, it doesn’t need compilation during installation. Instead, it compiles all kernels at runtime using a lightweight Just-In-Time (JIT) module.
5. Supports Both Dense and MoE Models
DeepGEMM isn’t just for one type of AI model. It works well with both standard dense models and the more specialized Mixture of Experts (MoE) models. That means it’s flexible enough to handle different kinds of workloads without slowing down.
How DeepGEMM Works Under the Hood
It provides different ways to handle matrix multiplications. Here are the key functions:
1. Normal GEMMs for Dense Models
Use deep_gemm.gemm_fp8_fp8_bf16_nt for basic FP8 matrix multiplications.
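The `_nt` suffix in the function name refers to the operand layout: the left-hand matrix is not transposed and the right-hand matrix is supplied transposed. The pure-Python reference below only illustrates that convention. The function name `gemm_nt` is made up for this sketch, and it uses ordinary floats rather than FP8 inputs with scaling factors.

```python
def gemm_nt(lhs, rhs):
    """Reference for the NT layout: lhs is M x K, rhs is N x K (already
    transposed), and the result is the M x N product lhs @ rhs^T."""
    k = len(lhs[0])
    assert all(len(row) == k for row in lhs + rhs), "K dimensions must match"
    return [[sum(a * b for a, b in zip(lhs_row, rhs_row)) for rhs_row in rhs]
            for lhs_row in lhs]

lhs = [[1.0, 2.0, 3.0],
       [4.0, 5.0, 6.0]]    # M=2, K=3
rhs = [[1.0, 0.0, 0.0],
       [0.0, 1.0, 1.0]]    # N=2, K=3
print(gemm_nt(lhs, rhs))   # [[1.0, 5.0], [4.0, 11.0]]
```

Because both operands are stored with K as the innermost dimension, every output element is a dot product of two contiguous rows.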
2. Grouped GEMMs for MoE Models
- Contiguous Layout: When MoE experts process different numbers of tokens, DeepGEMM combines them into one tensor for efficiency. The function m_grouped_gemm_fp8_fp8_bf16_nt_contiguous is designed for this purpose.
- Masked Layout: For real-time inference, DeepGEMM uses a mask to skip unnecessary calculations, improving speed. The function m_grouped_gemm_fp8_fp8_bf16_nt_masked is designed for this purpose.
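The two layouts can be pictured with a little bookkeeping code. This is a sketch of the idea only, not the library’s data structures: the helper names, the alignment of 4, and the use of -1 to tag padding rows are assumptions made for illustration.

```python
def build_contiguous_layout(tokens_per_expert, alignment):
    """Return (row_to_expert, total_rows) for a contiguous grouped layout.
    Each expert's segment is padded up to a multiple of `alignment`;
    padding rows are tagged -1 (a convention chosen for this sketch)."""
    row_to_expert = []
    for expert, count in enumerate(tokens_per_expert):
        padded = -(-count // alignment) * alignment   # round up to a multiple
        row_to_expert += [expert] * count + [-1] * (padded - count)
    return row_to_expert, len(row_to_expert)

def masked_rows(max_rows, masked_m):
    """For a masked layout, list the (expert, row) pairs that get computed:
    every expert owns `max_rows` slots, but only the first masked_m[e] are valid."""
    return [(e, r) for e, valid in enumerate(masked_m)
            for r in range(min(valid, max_rows))]

# Contiguous: experts received 3, 5 and 0 tokens; segments are padded to 4 rows.
print(build_contiguous_layout([3, 5, 0], alignment=4))
# ([0, 0, 0, -1, 1, 1, 1, 1, 1, -1, -1, -1], 12)

# Masked: 4 slots per expert, with 2, 0 and 1 valid rows respectively.
print(masked_rows(4, [2, 0, 1]))   # [(0, 0), (0, 1), (2, 0)]
```

The contiguous form suits cases where token counts per expert are known up front, since everything is packed into a single tall matrix. The masked form keeps a fixed-size slot per expert, which is handy when the counts are only known at run time.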
Performance Benchmarks of DeepGEMM
DeepGEMM has been tested on hardware like the H800 with NVCC 12.8, and the results are impressive.
1. Normal GEMMs
DeepGEMM shows remarkable performance for normal GEMMs in dense models.
- For a common AI workload (matrix size 128×7168×16384), DeepGEMM achieves 645 TFLOPS, a 1.4x speedup over CUTLASS-based implementations.
- For even bigger matrices (4096×7168×16384), it reaches 1358 TFLOPS, offering a 1.2x speedup over other expert-tuned libraries.
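To put these numbers in perspective, a GEMM of size M×N×K performs 2·M·N·K floating-point operations (one multiply and one add per term). The short script below turns the quoted throughput figures into approximate kernel times. The helper names are illustrative, and the timing is simple arithmetic on the published figures, not a measurement.

```python
def gemm_flops(m, n, k):
    """Floating-point operations in an M x N x K GEMM (multiply + add per term)."""
    return 2 * m * n * k

def kernel_time_us(m, n, k, tflops):
    """Time in microseconds implied by a sustained throughput given in TFLOPS."""
    return gemm_flops(m, n, k) / (tflops * 1e12) * 1e6

for m, n, k, tflops in [(128, 7168, 16384, 645), (4096, 7168, 16384, 1358)]:
    print(f"{m}x{n}x{k}: {gemm_flops(m, n, k):,} FLOPs, "
          f"~{kernel_time_us(m, n, k, tflops):.0f} us at {tflops} TFLOPS")
```

By this estimate the smaller problem takes roughly 47 microseconds and the larger one roughly 708 microseconds per call.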
2. Grouped GEMMs
DeepGEMM excels with both contiguous and masked layouts for Mixture-of-Experts models.
- In the contiguous layout, DeepGEMM achieves a 1.2x speedup for 4 groups with 8192 M per group, 4096 N, and 7168 K, reaching 1297 TFLOPS.
- For masked layouts, it demonstrates a consistent 1.2x speedup across various configurations, showing the library’s versatility in handling different MoE layouts.
The Secret Powers Behind DeepGEMM
DeepGEMM’s impressive speed and efficiency come from a bunch of smart design choices. Let’s break it down in simple terms:
1. Fine-Grained Scaling Approach
DeepGEMM follows a method inspired by DeepSeek-V3 that helps keep calculations precise while squeezing out the best possible performance from FP8 operations.
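The core idea of fine-grained scaling is to give each small group of values its own scaling factor, so that one outlier does not crush the precision of everything else. Here is a minimal sketch under stated assumptions: the group size of 128 and the helper names are chosen for illustration, 448 is the largest finite value in the FP8 E4M3 format, and the actual rounding to FP8 is left out.

```python
FP8_E4M3_MAX = 448.0   # largest finite value representable in FP8 E4M3
GROUP = 128            # assumed group size for this sketch

def scale_per_group(row, group=GROUP):
    """Split a row into groups and scale each so its largest magnitude maps to
    FP8_E4M3_MAX. Returns (scaled_values, scales); no FP8 rounding is applied."""
    scaled, scales = [], []
    for start in range(0, len(row), group):
        chunk = row[start:start + group]
        amax = max(abs(v) for v in chunk)
        scale = amax / FP8_E4M3_MAX if amax > 0 else 1.0
        scales.append(scale)
        scaled.extend(v / scale for v in chunk)
    return scaled, scales

def dequantize(scaled, scales, group=GROUP):
    """Undo the scaling by multiplying each value by its group's scale."""
    return [v * scales[i // group] for i, v in enumerate(scaled)]

# One group of tiny values next to one group of large values.
row = [0.001 * (i % 7 - 3) for i in range(128)] + [50.0 * (i % 5 - 2) for i in range(128)]
scaled, scales = scale_per_group(row)
print(scales)   # two very different scales, one per group
```

With a single scale for the whole row, the tiny values in the first group would be pushed toward zero once cast to FP8. With one scale per group, both groups use the full FP8 range.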
2. Hopper TMA Features
Moving data around efficiently is key to making things run fast. DeepGEMM carefully controls how information flows between different levels of memory. The Hopper architecture brings in the TMA (Tensor Memory Accelerator), a hardware unit that moves data asynchronously between global and shared memory.
DeepGEMM taps into this to quickly load and store matrices and handle multicast operations. The result? Less waiting, more computing.
3. Persistent Warp-Specialization
GPUs process data in groups of threads called warps. DeepGEMM makes sure each warp is fine-tuned for specific tasks, such as fetching data while computing at the same time. This keeps the GPU busy and avoids wasted time waiting around for data to be ready.
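A CPU-side analogy may help here. In the sketch below, one thread plays the role of the warps that load data and another plays the role of the warps that compute, with a small bounded queue standing in for the shared-memory pipeline stages. This is only an analogy written with Python threads (all names are made up for the example); it is not how the CUDA kernel is written, and Python threads do not give true parallel speedups.

```python
import queue
import threading

def run_pipeline(tiles, stages=2):
    """One producer 'loads' tiles into a bounded buffer while one consumer
    'computes' on them, so loading the next tile can overlap with computing."""
    buffer = queue.Queue(maxsize=stages)   # plays the role of pipeline stages
    results = []

    def producer():
        for tile in tiles:
            buffer.put(list(tile))         # stand-in for an asynchronous copy
        buffer.put(None)                   # sentinel: no more tiles

    def consumer():
        while (tile := buffer.get()) is not None:
            results.append(sum(x * x for x in tile))   # stand-in for the math

    threads = [threading.Thread(target=producer),
               threading.Thread(target=consumer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

print(run_pipeline([[1, 2], [3, 4], [5, 6]]))   # [5, 25, 61]
```

The bounded queue is the important part: the producer can run at most `stages` tiles ahead, which mirrors how a fixed number of buffers is recycled between the loading and computing roles.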
4. Fully JIT Design
Instead of pre-compiling everything ahead of time, DeepGEMM uses Just-In-Time (JIT) compilation. Kernels are compiled at runtime for the problems actually encountered, so values such as matrix shapes and block sizes can be treated as compile-time constants, which gives the compiler more room to optimize.
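The general pattern behind a JIT design like this is "generate specialized code once per configuration, then cache it". The Python sketch below imitates that pattern with `exec` and `functools.lru_cache`. It is an analogy only: `build_kernel` is a hypothetical name, and the real library generates and compiles CUDA code rather than Python.

```python
import functools

@functools.lru_cache(maxsize=None)
def build_kernel(n: int, k: int):
    """Generate and compile a routine with N and K baked in as constants.
    lru_cache ensures each (n, k) pair is compiled only once."""
    source = (
        "def kernel(lhs_row, rhs):\n"
        f"    return [sum(lhs_row[j] * rhs[i][j] for j in range({k}))"
        f" for i in range({n})]\n"
    )
    namespace = {}
    exec(source, namespace)   # stand-in for invoking a compiler at runtime
    return namespace["kernel"]

kernel = build_kernel(2, 3)
print(kernel([1, 2, 3], [[1, 0, 0], [0, 1, 1]]))   # [1, 5]
build_kernel(2, 3)                                  # cache hit, nothing is recompiled
print(build_kernel.cache_info().misses)             # 1
```

The first call for a new shape pays the compilation cost; later calls with the same shape reuse the cached kernel.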
5. Unaligned Block Sizes
Most systems work best with block sizes that are powers of 2, but that can sometimes leave resources underused. DeepGEMM sidesteps this by allowing block sizes that don’t strictly follow that rule, making sure all parts of the GPU are working as efficiently as possible.
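A quick calculation shows why an odd-looking block size can help. Assume a GPU with 132 streaming multiprocessors (SMs) and an output matrix of 256×7168; both numbers, along with the helper name, are assumptions for this example. Each output tile is handled by one thread block, so the number of tiles determines how many SMs have work.

```python
import math

def blocks_launched(m, n, block_m, block_n):
    """Number of output tiles (thread blocks) needed to cover an M x N result."""
    return math.ceil(m / block_m) * math.ceil(n / block_n)

NUM_SMS = 132       # SM count assumed for the GPU in this example
M, N = 256, 7168    # example output shape

for block_n in (128, 112):
    tiles = blocks_launched(M, N, 128, block_n)
    print(f"BLOCK_N={block_n}: {tiles} tiles -> "
          f"{min(tiles, NUM_SMS)} of {NUM_SMS} SMs busy")
# BLOCK_N=128: 112 tiles -> 112 of 132 SMs busy
# BLOCK_N=112: 128 tiles -> 128 of 132 SMs busy
```

With the power-of-2 block width, 20 SMs sit idle. Dropping the width to 112 produces 128 tiles and puts 16 more SMs to work.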
6. FFMA SASS Interleaving
This might sound complex, but it’s basically a way of tweaking the instructions the GPU follows to improve performance. By adjusting specific bits in the compiled code, DeepGEMM boosts warp-level parallelism and keeps instructions flowing smoothly.
7. Performance Across Matrix Sizes
Some libraries work great for specific matrix sizes but struggle with others. DeepGEMM keeps up strong performance no matter what size matrix you throw at it, making it a reliable choice for all kinds of AI applications.
How to Get Started With DeepGEMM
Getting started with DeepGEMM requires meeting certain system requirements and following a straightforward installation process.
You need:
- Hopper architecture GPUs with sm_90a support
- Python 3.8 or above
- CUDA 12.3 or above (12.8 or above recommended for optimal performance)
- PyTorch 2.1 or above
- CUTLASS 3.6 or above (can be cloned via Git submodule)
Installation Process
The installation process involves a few simple steps:
1. Clone the repository with submodules:
git clone --recursive git@github.com:deepseek-ai/DeepGEMM.git
2. Create symbolic links for third-party include directories:
python setup.py develop
3. Test JIT compilation:
python tests/test_jit.py
4. Test all GEMM implementations (normal, contiguous-grouped, and masked-grouped):
python tests/test_core.py
5. Install the library:
python setup.py install
After installation, you can import deep_gemm in your Python project and start leveraging its capabilities. DeepGEMM comes with several helpful utility functions. You can use them to get the most out of it. Check the repository for details!
Real-World Benefits of DeepGEMM in AI Development
1. Faster Training Times
Because DeepGEMM is optimized for FP8 operations, it can speed up the training of massive AI models, especially those using DeepSeek’s V3 architecture. Faster training means quicker results, lower costs, and more efficient research cycles.
2. MoE Model Deployment
DeepGEMM supports MoE-grouped GEMMs, making it an excellent choice for running complex MoE models in real-world applications. Its masked layout functionality helps keep inference latency low, which is crucial when deploying AI systems that need to respond quickly.
3. More Efficient Resource Usage
The optimized performance of DeepGEMM means you can get more work done with the same hardware resources, maximizing the return on investment in expensive GPU infrastructure.
4. Developer-Friendly Design
Despite all its optimizations, DeepGEMM isn’t overly complicated. Its clean and understandable codebase makes it easier to integrate into existing projects without endless debugging and troubleshooting.
5. No-Hassle Deployment
Because DeepGEMM uses JIT compilation, it eliminates the need for complicated build processes. This makes it easy to set up and run in different environments without extra headaches.
Wrapping Up
DeepGEMM stands as a powerful addition to the AI development toolkit, offering exceptional performance, clean implementation, and versatile functionality. As part of DeepSeek’s Open Source Week initiative, this reflects the company’s commitment to advancing AI technology through collaborative innovation. You can check out the project and start experimenting on GitHub.
For anyone working with LLMs, especially those built on DeepSeek’s architectures, DeepGEMM delivers a rock-solid foundation for efficient matrix operations. Whether you’re running dense models or MoE architectures, its optimizations ensure that GPUs are used to their fullest potential, making AI development faster and more efficient than ever.