DeepSeek AI just dropped "DeepEP" on the second day of its Open Source Week event. DeepEP is a communication library that helps AI models, especially Mixture-of-Experts (MoE) models, run faster and more efficiently. The announcement comes right after DeepSeek's first-day release of FlashMLA, an efficient MLA decoding kernel designed for Hopper GPUs. Training and running massive AI models can be painfully slow, and DeepEP tackles that by improving how information moves inside those systems.
DeepEP, The Ultimate Expert-Parallel Library for MoE Models
What’s a Mixture-of-Experts (MoE) Model?
Imagine you’re in a classroom with four different teachers: a math teacher, an art teacher, a science teacher, and a music teacher. If you had a question about solving equations, it would be silly to ask all four teachers, right? You’d just ask the math teacher. That’s exactly how MoE models work!
Traditional dense AI models ask every "teacher" in their system to help answer every question. MoE models are smarter: they have a special "router" that sends each question to just the right expert, so a question about science goes straight to the science expert. This makes these AI systems much more efficient, saving time and computing power.
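The routing idea above can be sketched in a few lines of plain Python. This is a toy illustration of top-k expert selection, not DeepSeek's actual router:

```python
import math

def softmax(scores):
    """Turn raw router scores into probabilities."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route(token_scores, top_k=2):
    """Pick the top_k experts for one token from its router scores."""
    probs = softmax(token_scores)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return ranked[:top_k]  # indices of the chosen experts

# One token's affinity for 4 experts (math, art, science, music)
chosen = route([2.0, -1.0, 1.5, 0.1], top_k=1)
print(chosen)  # [0] -- the "math teacher" handles this token
```

Real MoE layers route every token this way, then must physically move each token's data to the GPU holding its chosen expert, which is exactly the communication step DeepEP accelerates.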
Why DeepEP is a Game Changer
MoE models are powerful, but they come with a major challenge—moving data between different experts quickly. These models run on multiple GPUs, sometimes even across different servers, and they need to send information back and forth in milliseconds. That’s where DeepEP comes in.
1. Faster Communication Between GPUs
DeepEP is a communication library designed specifically to solve this problem. It includes custom-built GPU kernels that speed up how information moves between experts, making AI training and inference (the process of running a trained AI model) significantly more efficient.
2. Two Different Speed Modes for AI Processing
DeepEP is designed to handle both:
i. High-Throughput Mode
This is for tasks that need to process large amounts of data quickly, like training AI models.
ii. Low-Latency Mode
This is for real-time tasks, like when you’re talking to an AI chatbot and expect an instant response.
By switching between these two, DeepEP ensures that AI systems perform well no matter what they’re doing.
3. Works with Low-Precision Math
AI doesn't always need perfect precision. Rounding numbers slightly can make things much faster without changing the final result. DeepEP supports FP8, an 8-bit floating-point format that cuts memory and bandwidth use while keeping accuracy good enough for training and inference.
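To see why fewer bits cost a little precision, here is a toy model of FP8-style rounding in plain Python. The e4m3 format keeps only 3 explicit mantissa bits; this sketch models the mantissa rounding only and ignores exponent range limits and special values:

```python
import math

def quantize_fp8_toy(x, mantissa_bits=3):
    """Toy FP8 rounding: keep only a few mantissa bits (ignores exponent clamping)."""
    if x == 0.0:
        return 0.0
    m, e = math.frexp(x)              # x = m * 2**e, with 0.5 <= |m| < 1
    steps = 2 ** (mantissa_bits + 1)  # representable mantissa positions
    return round(m * steps) / steps * 2.0 ** e

print(quantize_fp8_toy(0.3))   # 0.3125 -- close, but not exact
print(quantize_fp8_toy(1.0))   # 1.0    -- powers of two survive intact
```

The small per-value error is usually tolerable in neural networks, and halving the bytes per number roughly halves the data DeepEP has to push over NVLink and RDMA.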
How DeepEP Works Under the Hood
1. Optimized Data Routing
Moving data efficiently between different GPUs is key. DeepEP has special techniques that improve communication inside a server (using NVLink) and between servers (using RDMA).
2. Streaming Multiprocessor Control
DeepEP lets developers decide how many of the GPU's streaming multiprocessors (SMs) are assigned to communication versus computation. This helps prevent one task from starving the others.
3. Overlapping Communication and Computation
Normally, AI models have to stop computing while waiting for data to transfer. DeepEP fixes this by allowing computation to continue while data is still being sent, keeping the whole system running smoothly.
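The same overlap pattern can be illustrated with ordinary Python threads. This is only a toy stand-in: the sleep plays the role of an NVLink/RDMA transfer, whereas DeepEP achieves the overlap with GPU-side mechanisms rather than host threads:

```python
import threading
import time

def fetch(i, buf):
    time.sleep(0.01)      # stands in for a background NVLink/RDMA transfer
    buf[0] = i * 10

def compute(x):
    return x + 1          # stands in for the expert's computation

buf = [None]
fetch(0, buf)             # the first chunk must be fetched up front
outputs = []
for i in range(3):
    data = buf[0]
    nxt = threading.Thread(target=fetch, args=(i + 1, buf))
    nxt.start()                     # next transfer runs in the background...
    outputs.append(compute(data))   # ...while we compute on the current chunk
    nxt.join()

print(outputs)  # [1, 11, 21]
```

Because each transfer is hidden behind the computation on the previous chunk, the GPU (here, the main thread) almost never sits idle waiting for data.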
Performance Evaluation of DeepEP
To see just how well DeepEP performs, its developers tested it on H800 GPUs paired with high-speed RDMA network cards. The results were impressive:
1. Intranode Performance
With eight experts, DeepEP reached 158 GB/s over NVLink, close to the practical ceiling for moving data inside a single server.
2. Internode Performance
Across servers, DeepEP sustained 43-47 GB/s over RDMA (Remote Direct Memory Access, a networking technology that lets machines read and write each other's memory directly).
3. For Real-Time AI Responses
DeepEP's low-latency kernels handled expert dispatch in just 163 microseconds, far faster than the blink of an eye.
Even when tested with massive AI models using 256 different experts, DeepEP still performed smoothly, showing that it can handle large-scale AI workloads without slowing down.
How to Get Started With DeepSeek DeepEP
DeepEP requires specific hardware and software to work at full capacity. Here’s what’s needed:
- Hopper GPUs (DeepEP currently supports high-end GPUs, but more options may be added later)
- Python 3.8+
- CUDA 12.3+ (for running AI workloads on GPUs)
- PyTorch 2.1+
- NVLink connections between GPUs
- RDMA network for connecting multiple computers
DeepEP also needs a modified version of NVSHMEM, which has to be installed separately. The setup guide in the repository explains how to get everything running.
Making the Most of DeepEP
To get the best performance out of DeepSeek DeepEP, two network setup techniques are worth knowing: traffic isolation and smart routing.
1. Traffic Isolation
DeepEP supports traffic isolation through InfiniBand Virtual Lanes (VL), allowing different types of workloads to operate without interference.
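In practice, the virtual lane is selected through NVSHMEM's environment configuration. A hedged sketch follows; the variable name `NVSHMEM_IB_SL` comes from NVSHMEM's documented environment variables, but confirm the exact setup against the DeepEP repository before relying on it:

```shell
# Assumption: assign different InfiniBand service levels (mapped to virtual
# lanes by the fabric) to different workload types so they don't interfere.
export NVSHMEM_IB_SL=1   # e.g. one SL for processes running high-throughput kernels
# ...and set a different SL in the processes running low-latency kernels.
```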
2. Smart Routing
With this, the system can adapt to busy or quiet network conditions:
- When networks are busy, DeepEP uses “adaptive routing” to find the least crowded path.
- When networks are quiet, it sticks to “static routing” for maximum speed.
This flexibility helps DeepEP work well in all kinds of situations. The repository provides clear guidance on when to enable adaptive routing.
Why DeepEP Matters for the Future of AI
DeepEP is a big step forward in making super-smart AI systems available to more people and companies. Here’s why it’s important:
1. Faster AI model training
DeepEP makes it easier to build and train MoE models, so AI researchers can train massive models more quickly.
2. Better real-time AI performance
AI-powered applications, like chatbots and recommendation systems, can respond much faster. DeepEP has low-latency kernels that are designed to minimize delays during the inference phase.
3. Lower cost of AI operations
Companies can run large AI models without spending as much on hardware.
4. Supports bigger AI models
Developers can build more powerful AI without worrying about slowing things down.
Secret Trick Exposed by DeepSeek Team
The DeepEP team also revealed a clever trick that squeezes extra speed out of Hopper GPUs: an out-of-document PTX instruction, ld.global.nc.L1::no_allocate.L2::256B, that appears in no official manual yet makes memory reads noticeably faster. Its behavior is technically undefined, but the team reports it works reliably on Hopper. Pretty sneaky!
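Because the instruction's behavior is undefined on untested platforms, the repository reportedly provides an opt-out at build time. A hedged sketch; the flag name is as described in the DeepEP README, so verify it against the current repo before use:

```shell
# If the kernels misbehave on your hardware, build with the aggressive
# PTX instruction disabled (flag name per the DeepEP README).
DISABLE_AGGRESSIVE_PTX_INSTRS=1 python setup.py install
```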
Wrapping Up
DeepSeek isn't stopping here. They've got three more releases coming, each designed to make AI training and performance even better. By sharing these tools with the world, they're helping developers build smarter, faster AI models without having to reinvent the wheel. For now, DeepEP is a game-changer for companies building big, smart AI systems.
So, if you’re interested in the upcoming updates about the DeepSeek Open Source Week, stay tuned with us.