We just witnessed something incredible: one of the largest open-source language models flexing its muscles on Apple Silicon. We’re talking about the massive DeepSeek-V3 on M4 Mac, specifically the 671 billion parameter model running on a cluster of 8 M4 Pro Mac Minis with 64GB of RAM each – that’s a whopping 512GB of combined memory!
This isn’t just about bragging rights. It opens up new possibilities for researchers, developers, and anyone interested in pushing the boundaries of AI. Let’s dive into the details and see why DeepSeek-V3 on M4 Mac is such a big deal.
Table of contents
- The Results Are In: DeepSeek V3 671B Performance on the M4 Mac Mini Cluster
- Why So Fast? Understanding the DeepSeek-V3 on M4 Mac Performance Advantage
- Exploring Key Considerations: Power, Cost, and Alternative Setups for Running DeepSeek-V3
- Conclusion: The Future of LLM Inference on Apple Silicon with DeepSeek-V3 on M4 Mac
The Results Are In: DeepSeek V3 671B Performance on the M4 Mac Mini Cluster
You want the numbers, right? Here’s how the DeepSeek-V3 on M4 Mac cluster performed, compared to some other well-known models:

The immediate takeaway? DeepSeek-V3 on M4 Mac, despite its immense size, isn’t just running – it’s running surprisingly well. The Time-To-First-Token (TTFT) is impressively low, and the Tokens-Per-Second (TPS) is solid.
But the real head-turner is this: DeepSeek V3, with 671B parameters, running faster than Llama 70B on the same M4 Mac setup? Yes, you read that correctly. Let’s break down why.
Why So Fast? Understanding the DeepSeek-V3 on M4 Mac Performance Advantage
To understand this surprising result, we need to take a step back and look at how Large Language Models (LLMs) work during inference – the process of generating text. Think of it as the model “thinking” and producing its output.
While we’re excited to share these initial findings about DeepSeek-V3 on M4 Mac, the full story of how the software behind this setup distributes the model across the cluster is a bit more complex. For now, let’s focus on the big picture of why DeepSeek-V3 on M4 Mac performs so well.
LLM Inference Explained: A Systems Perspective on Running Large Models
Imagine an LLM as a giant recipe book filled with billions of ingredients (parameters). When you ask it a question, it needs to find the right ingredients and combine them in the correct order to give you an answer (generated text).
At its heart, an LLM is a massive collection of these parameters, billions of numbers that determine how the model behaves. LLMs are “autoregressive,” meaning they generate text one token (a word or part of a word) at a time, and each token depends on the previous ones.
For each token generated, the model performs a lot of calculations using these parameters. These calculations happen on powerful processors, typically GPUs, which are designed for this kind of heavy lifting.
Here’s the crucial point: for standard (dense) LLMs, generating each token requires accessing all of those billions of parameters. Think of it as needing to flip through the entire recipe book for each word you write.
So, what happens for each token?
- Load the Model Parameters: The model’s instructions (parameters) need to be loaded onto the processor.
- Perform Calculations: The processor performs mathematical operations using these parameters.
- Sample the Next Token: Based on the calculations, the model chooses the next word or part of a word.
- Repeat: This process repeats, feeding the newly generated token back into the model to generate the next one.
Steps 1 and 2 are the most time-consuming, so let’s focus on them. How quickly we can load the parameters and perform calculations determines how fast the model can generate text.
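To make these four steps concrete, here’s a minimal, self-contained Python sketch of the decoding loop. Everything about the “model” is a toy stand-in (random weights, a made-up scoring rule, greedy sampling), and the function names are illustrative rather than the API of any real inference framework – the point is only the shape of the loop and the fact that every step touches the full parameter set.

```python
import random

# Toy illustration of the four steps above. The "model" here is fake;
# the function names are illustrative, not any real framework's API.
VOCAB = ["the", "cat", "sat", "on", "a", "mat", "."]

def load_parameters(n_params: int = 10_000) -> list[float]:
    # Step 1: load the parameters (a real model has billions of weights).
    return [random.random() for _ in range(n_params)]

def forward_pass(params: list[float], tokens: list[str]) -> list[float]:
    # Step 2: compute over the parameters. This touches *every* parameter,
    # just as a dense LLM does for every token it generates.
    acc = sum(params) + len(tokens)
    return [(acc * (i + 1)) % 1.0 for i in range(len(VOCAB))]

def sample(logits: list[float]) -> str:
    # Step 3: pick the next token (greedy choice here for simplicity).
    return VOCAB[logits.index(max(logits))]

def generate(prompt: str, max_new_tokens: int = 8) -> str:
    params = load_parameters()
    tokens = prompt.split()
    for _ in range(max_new_tokens):
        # Step 4: repeat, feeding each new token back into the model.
        tokens.append(sample(forward_pass(params, tokens)))
    return " ".join(tokens)

print(generate("the cat"))
```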
Memory Bandwidth vs. Compute: The Bottlenecks in LLM Inference
There are two main things that can slow down this process:
- Memory Bandwidth: How fast can we move those billions of parameters from memory to the processor? Think of this as the width of the highway delivering the ingredients. If the highway is narrow, it takes longer to get everything there.
- Compute: How fast can the processor perform the calculations once it has the parameters? This is like how quickly the chef can chop, mix, and cook the ingredients.
Whether inference is limited by memory bandwidth or compute depends on the relationship between these two factors. We can express this relationship using a ratio: C / M
Where:
- C (Compute Rate): How many parameters can the processor work on per second? This is calculated as: FLOPS/second ÷ (FLOPS/parameter)
  - FLOPS/second: The total number of floating-point operations the processor can perform per second (its raw processing power).
  - FLOPS/parameter: The number of floating-point operations needed for each parameter.
- M (Memory Transfer Rate): How many parameters can we move to the processor per second? This is calculated as: Memory bandwidth ÷ (Bytes/parameter)
  - Memory bandwidth: How much data can be moved from memory to the processor each second.
  - Bytes/parameter: How much memory each parameter takes up (this depends on the model’s precision, like 4-bit in this example).
If C / M > 1, we’re limited by memory bandwidth – the highway is too narrow. If C / M < 1, we’re limited by compute – the chef isn’t fast enough, even with all the ingredients ready.
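As a rough, back-of-the-envelope illustration of this ratio, the snippet below plugs in the M4 Max figures quoted later in this article (546GB/s of bandwidth, roughly 34 TFLOPS of FP16 compute) and assumes 2 FLOPs per parameter per token and 0.5 bytes per parameter (4-bit weights). Those last two numbers are assumptions for the sake of the example, not measurements.

```python
# Back-of-the-envelope C/M check at batch size 1.
# Hardware figures are the M4 Max numbers quoted later in this article;
# FLOPs/parameter and bytes/parameter are illustrative assumptions.
flops_per_second = 34e12      # ~34 TFLOPS FP16
memory_bandwidth = 546e9      # 546 GB/s
flops_per_parameter = 2       # assumed: one multiply-accumulate per weight
bytes_per_parameter = 0.5     # assumed: 4-bit quantized weights

C = flops_per_second / flops_per_parameter   # parameters the chip can compute on per second
M = memory_bandwidth / bytes_per_parameter   # parameters the chip can pull from memory per second

print(f"C = {C:.2e} params/s, M = {M:.2e} params/s, C/M ≈ {C / M:.0f}")
# C/M comes out well above 1, so single-request decoding is memory-bandwidth bound.
```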
Interestingly, this relationship changes depending on how many requests the model is processing at once (the batch size). For generating one sequence at a time (batch size = 1), like in the tests with DeepSeek-V3 on M4 Mac, inference is often limited by memory bandwidth.
Apple Silicon’s Secret Weapon: Unified Memory and High Bandwidth for DeepSeek-V3 on M4 Mac
This is where Apple Silicon shines. It’s particularly good at running LLMs with a batch size of 1, like when you’re having a conversation with an AI. Why? Two key reasons:
- Unified Memory: Apple Silicon uses a “unified memory” architecture. Imagine the processor and the memory living on the same package, with incredibly fast connections between them. This lets the GPU access the machine’s entire pool of memory at very high speed, rather than being confined to a separate slab of VRAM. It’s like having all the ingredients right next to the chef.
- High Memory Bandwidth to FLOPS Ratio: The ratio of memory bandwidth to processing power is very high in Apple Silicon, especially in the latest M4 chips. For example, the M4 Max has a memory bandwidth of 546GB/s and roughly 34 TFLOPS of processing power (FP16). This translates to a ratio of approximately 8.02. In comparison, an NVIDIA RTX 4090 has a ratio of around 1.52.
This means Apple Silicon is exceptionally good at quickly feeding the processor with the data it needs for single requests, making it surprisingly efficient for running large models like DeepSeek-V3 on M4 Mac when you’re generating one response at a time.
Mixture-of-Experts (MoE) Models: The Key to DeepSeek V3’s Efficiency
Now, let’s bring Mixture-of-Experts (MoE) models into the picture. This is the architecture used by DeepSeek V3 671B, and it’s crucial for understanding its performance on the M4 Mac Mini cluster.
Think of an MoE model as having multiple specialized “expert” models within it. For each input, only a small subset of these experts is activated to process the information.
So, while DeepSeek V3 has a massive 671 billion parameters, it doesn’t use all of them for every token it generates. It only activates a small group of experts. The catch is that the model still needs all of the parameters readily available, because it doesn’t know in advance which experts will be needed.
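To make the “only a few experts run” idea concrete, here’s a minimal top-k gating sketch in Python with NumPy. It is not DeepSeek V3’s actual router (the real model uses far more experts and a more sophisticated gating scheme); it simply shows a small gate scoring every expert while only the top-scoring few are actually computed for a given token.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy MoE layer: n_experts sets of weights, but only top_k of them run per token.
n_experts, top_k, d_model = 8, 2, 16
gate_w = rng.normal(size=(d_model, n_experts))                     # router ("gate") weights
experts = [rng.normal(size=(d_model, d_model)) for _ in range(n_experts)]

def moe_forward(x: np.ndarray) -> np.ndarray:
    scores = x @ gate_w                          # score every expert for this token
    chosen = np.argsort(scores)[-top_k:]         # keep only the top_k experts
    weights = np.exp(scores[chosen])
    weights /= weights.sum()                     # softmax over the chosen experts
    # Only the chosen experts' weights are read and multiplied here;
    # the other experts stay in memory, untouched, for this token.
    return sum(w * (x @ experts[i]) for w, i in zip(weights, chosen))

token_embedding = rng.normal(size=d_model)
out = moe_forward(token_embedding)
print(f"ran {top_k} of {n_experts} experts, output shape {out.shape}")
```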
DeepSeek-V3 on M4 Mac: Why This Setup Works So Well for MoE Models
This is where the combination of DeepSeek V3 and the M4 Mac Mini cluster really shines:
- Ample Memory: The 512GB of combined memory in the M4 Mac Mini cluster allows us to load all 671 billion parameters of DeepSeek V3. All the “experts” are ready and waiting.
- Efficient Inference: Because Apple Silicon is so good at quickly accessing data, the model can efficiently load the parameters needed for the activated experts.
In the case of DeepSeek V3, while it has 671 billion parameters in total, it only activates around 37 billion of them to generate a single token. Compare this to a dense model like Llama 70B, which uses all 70 billion parameters for every token. As long as we can keep all the parameters in memory, DeepSeek-V3 on M4 Mac can generate a single response faster because it only performs calculations on a small subset of its total parameters.
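For a rough, roofline-style sanity check (not a benchmark), the sketch below estimates the memory-bandwidth ceiling on single-request decoding speed. It reuses the 546GB/s M4 Max figure from earlier, assumes 4-bit weights, and ignores networking, KV-cache reads, and every other real-world overhead a multi-node cluster adds.

```python
# Rough upper bound on decoding speed when memory bandwidth is the bottleneck:
#   tokens/s  <=  bandwidth / (bytes of weights read per token)
# Assumptions: 4-bit weights (0.5 bytes/param), single request, no other overheads.
bandwidth = 546e9            # bytes/s (M4 Max figure quoted earlier; a real cluster adds overhead)
bytes_per_param = 0.5        # assumed 4-bit quantization

def max_tokens_per_second(active_params: float) -> float:
    return bandwidth / (active_params * bytes_per_param)

print(f"DeepSeek V3 (~37B active params): {max_tokens_per_second(37e9):.1f} tok/s ceiling")
print(f"Llama 70B   (70B active params):  {max_tokens_per_second(70e9):.1f} tok/s ceiling")
```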
Exploring Key Considerations: Power, Cost, and Alternative Setups for Running DeepSeek-V3
Power Consumption:
The impressive performance of DeepSeek-V3 on an M4 Mac cluster naturally leads to some important questions about the practicalities of running such powerful models. One immediate consideration is power consumption. Running large AI models can be energy-intensive, and understanding the power requirements is crucial for planning and budgeting.
In this setup, the cluster of eight Mac Minis has a maximum power draw of around 1120W and a minimum idle draw of about 40W. Of course, this doesn’t account for the power needed for networking or any client devices involved. It’s worth noting that this level of power consumption is relatively modest compared to some high-end GPU-based systems often used for similar tasks.
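For a quick sense of what that power draw means in practice, the sketch below turns the 1120W peak / 40W idle figures into a rough monthly electricity bill. The $0.15/kWh price and the 8-hours-at-full-load duty cycle are purely illustrative assumptions, not numbers from the article.

```python
# Illustrative monthly electricity cost for the eight-Mini cluster.
# Assumed (not from the article): $0.15/kWh, 8 hours/day at full load, idle otherwise.
peak_watts, idle_watts = 1120, 40
price_per_kwh = 0.15
hours_at_peak_per_day = 8

daily_kwh = (peak_watts * hours_at_peak_per_day +
             idle_watts * (24 - hours_at_peak_per_day)) / 1000
monthly_cost = daily_kwh * 30 * price_per_kwh
print(f"~{daily_kwh:.1f} kWh/day, roughly ${monthly_cost:.0f}/month under these assumptions")
```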
Efficiency:
Another key area of interest is the cost and performance comparison with alternative hardware. Many are curious about how a Mac cluster stacks up against GPU-based setups, like those using NVIDIA 3090s. For models that fit comfortably within the VRAM of a 3090, those setups can be very effective. However, for larger models like DeepSeek V3, which exceed the memory capacity of a single 3090, the Mac cluster demonstrates a compelling level of performance.
The discussion around cost-effectiveness is multi-faceted. While the initial investment for a Mac cluster needs to be considered, the second-hand market for GPUs like the 3090 offers another angle, providing potentially attractive price-to-performance ratios. Furthermore, the ongoing cost of electricity plays a significant role, especially in regions with higher energy prices. For individual users or smaller-scale deployments, the lower power consumption of the Mac setup can be a considerable advantage over time. However, it’s also important to acknowledge that the physical design and form factor of Mac Minis might not be ideal for traditional data center environments.
Older But Beefier Hardware:
Beyond dedicated clusters, there’s also the question of utilizing existing or more affordable hardware. The possibility of running large language models like DeepSeek V3 on used servers with substantial amounts of RAM is an interesting one. Systems with hundreds of gigabytes of RAM, paired with older server CPUs, could potentially house these large models.

The primary limitation in such setups would likely be the processing power of the CPU. While the model might fit in memory, the speed at which computations can be performed would likely be significantly lower compared to GPU-accelerated systems. This could result in slower token generation speeds. However, for specific use cases where real-time responsiveness isn’t paramount, exploring these more budget-friendly hardware options could be a worthwhile endeavor.
Conclusion: The Future of LLM Inference on Apple Silicon with DeepSeek-V3 on M4 Mac
Running DeepSeek-V3 on M4 Mac is more than just a technical achievement. It signifies a shift in how we can approach large language models. The unified memory architecture and the impressive memory bandwidth of Apple Silicon make it a surprisingly capable platform for running massive MoE models.
While GPU clusters remain powerful, the DeepSeek-V3 on M4 Mac example highlights the potential of Apple’s hardware, especially for research, development, and potentially even edge deployments where power efficiency and ease of use are important.
This opens the door for more individuals and smaller teams to experiment with cutting-edge AI models without requiring massive and power-hungry server infrastructure.