DeepSeek-V3 on M4 Mac: Blazing Fast Inference on Apple Silicon

We just witnessed something incredible: the largest open-source language model flexing its muscles on Apple Silicon. We’re talking about the massive DeepSeek-V3 on M4 Mac, specifically the 671 billion parameter model running on a cluster of 8 M4 Pro Mac Minis with 64GB of RAM each – that’s a whopping 512GB of combined memory!

This isn’t just about bragging rights. It opens up new possibilities for researchers, developers, and anyone interested in pushing the boundaries of AI. Let’s dive into the details and see why DeepSeek-V3 on M4 Mac is such a big deal.

The Results Are In: DeepSeek V3 671B Performance on the M4 Mac Mini Cluster

You want the numbers, right? Here’s how the DeepSeek-V3 on M4 Mac cluster performed, compared to some other well-known models:

(Chart: DeepSeek-V3 671B benchmark results on the M4 Mac Mini cluster, compared with other models)

The immediate takeaway? DeepSeek-V3 on M4 Mac, despite its immense size, isn’t just running – it’s running surprisingly well. The Time-To-First-Token (TTFT) is impressively low, and the Tokens-Per-Second (TPS) is solid.

But the real head-turner is this: DeepSeek, with 671 billion parameters, runs faster than Llama 70B on the same M4 Mac setup. Yes, you read that correctly. Let’s break down why.

Why So Fast? Understanding the DeepSeek-V3 on M4 Mac Performance Advantage

To understand this surprising result, we need to take a step back and look at how Large Language Models (LLMs) work during inference – the process of generating text. Think of it as the model “thinking” and producing its output.

While we’re excited to share these initial findings about DeepSeek-V3 on M4 Mac, the full story of how the software behind this setup distributes the model across machines is more complex. For now, let’s focus on the big picture of why DeepSeek-V3 on M4 Mac performs so well.

LLM Inference Explained: A Systems Perspective on Running Large Models

Imagine an LLM as a giant recipe book filled with billions of ingredients (parameters). When you ask it a question, it needs to find the right ingredients and combine them in the correct order to give you an answer (generated text).

At its heart, an LLM is a massive collection of these parameters, billions of numbers that determine how the model behaves. LLMs are “autoregressive,” meaning they generate text one token (a word or part of a word) at a time, and each token depends on the previous ones.

For each token generated, the model performs a lot of calculations using these parameters. These calculations happen on powerful processors, typically GPUs, which are designed for this kind of heavy lifting.

Here’s the crucial point: for standard LLMs, generating each token requires accessing all of those billions of parameters. Think of it as needing to flip through the entire recipe book for each word you write.

So, what happens for each token?

  1. Load the Model Parameters: The model’s instructions (parameters) need to be loaded onto the processor.
  2. Perform Calculations: The processor performs mathematical operations using these parameters.
  3. Sample the Next Token: Based on the calculations, the model chooses the next word or part of a word.
  4. Repeat: This process repeats, feeding the newly generated token back into the model to generate the next one.

Steps 1 and 2 are the most time-consuming, so let’s focus on them. How quickly we can load the parameters and perform calculations determines how fast the model can generate text.
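
To make those four steps concrete, here is a minimal, illustrative sketch of the decode loop in Python. The “model” here is just a toy that returns random scores; it stands in for the real forward pass, which is where the parameter loading and the math actually happen:

```python
import numpy as np

# Toy stand-ins so the loop runs end to end; a real LLM replaces these.
VOCAB_SIZE = 100
rng = np.random.default_rng(0)

def forward(tokens):
    # Steps 1 + 2: in a real model this reads the weights (the memory-bound part)
    # and does the matrix math (the compute-bound part). Here: random logits.
    return rng.normal(size=VOCAB_SIZE)

def sample(logits):
    # Step 3: choose the next token (greedy sampling for simplicity).
    return int(np.argmax(logits))

def generate(prompt_tokens, max_new_tokens=16):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        logits = forward(tokens)        # steps 1 + 2
        tokens.append(sample(logits))   # steps 3 + 4: sample, feed back, repeat
    return tokens

print(generate([1, 2, 3]))
```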

Memory Bandwidth vs. Compute: The Bottlenecks in LLM Inference

There are two main things that can slow down this process:

  • Memory Bandwidth: How fast can we move those billions of parameters from memory to the processor? Think of this as the width of the highway delivering the ingredients. If the highway is narrow, it takes longer to get everything there.
  • Compute: How fast can the processor perform the calculations once it has the parameters? This is like how quickly the chef can chop, mix, and cook the ingredients.

Whether inference is limited by memory bandwidth or compute depends on the relationship between these two factors. We can express this relationship using a ratio: C / M

Where:

  • C (Compute Rate): How many parameters can the processor work on per second? This is calculated as: FLOPS/second ÷ (FLOPS/parameter)
    • FLOPS/second: The total number of floating-point operations the processor can do per second (its raw processing power).
    • FLOPS/parameter: The number of floating-point operations needed for each parameter.
  • M (Memory Transfer Rate): How many parameters can we move to the processor per second? This is calculated as: Memory bandwidth ÷ (Bytes/parameter)
    • Memory bandwidth: How much data can be moved from memory to the processor each second.
    • Bytes/parameter: How much memory each parameter takes up (this depends on the model’s precision; a 4-bit quantized model, for example, needs only half a byte per parameter).

If C / M > 1, we’re limited by memory bandwidth – the highway is too narrow. If C / M < 1, we’re limited by compute – the chef isn’t fast enough, even with all the ingredients ready.

Interestingly, this relationship changes depending on how many requests the model is processing at once (the batch size). For generating one sequence at a time (batch size = 1), like in the tests with DeepSeek-V3 on M4 Mac, inference is often limited by memory bandwidth.

Apple Silicon’s Secret Weapon: Unified Memory and High Bandwidth for DeepSeek-V3 on M4 Mac

This is where Apple Silicon shines. It’s particularly good at running LLMs with a batch size of 1, like when you’re having a conversation with an AI. Why? Two key reasons:

  1. Unified Memory: Apple Silicon uses a “unified memory” architecture. Imagine the processor and the memory living in the same package, with incredibly fast connections between them. This lets the GPU access the machine’s entire memory pool (up to 192GB on the largest Apple Silicon chips) at very high speed. It’s like having all the ingredients right next to the chef.
  2. High Memory Bandwidth to FLOPS Ratio: The ratio of memory bandwidth to processing power is very high in Apple Silicon, especially in the latest M4 chips. For example, the M4 Max has a memory bandwidth of 546GB/s and roughly 34 TFLOPS of processing power (FP16). This translates to a ratio of approximately 8.02. In comparison, an NVIDIA RTX 4090 has a ratio of around 1.52.

This means Apple Silicon is exceptionally good at quickly feeding the processor with the data it needs for single requests, making it surprisingly efficient for running large models like DeepSeek-V3 on M4 Mac when you’re generating one response at a time.
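
To put numbers on this, here is the C / M ratio from the previous section worked through with the M4 Max figures quoted above. The per-parameter costs are my ballpark assumptions (roughly 2 FLOPs per parameter per generated token, and 4-bit weights at half a byte each), not measured values; note that C / M is a different quantity from the raw bandwidth-to-FLOPS figures quoted above.

```python
# Figures quoted above for the M4 Max, plus two assumed per-parameter costs.
FLOPS_PER_SECOND = 34e12   # ~34 TFLOPS (FP16), as quoted above
MEM_BANDWIDTH = 546e9      # 546 GB/s, as quoted above
FLOPS_PER_PARAM = 2        # assumption: ~1 multiply-add per weight per token
BYTES_PER_PARAM = 0.5      # assumption: 4-bit quantized weights

C = FLOPS_PER_SECOND / FLOPS_PER_PARAM  # params/s the chip can compute on
M = MEM_BANDWIDTH / BYTES_PER_PARAM     # params/s the memory system can deliver

print(f"C = {C:.2e} params/s, M = {M:.2e} params/s, C/M = {C / M:.1f}")
# C/M comes out well above 1, so batch-size-1 decoding is memory-bandwidth bound.
```

Running the same arithmetic for a typical discrete GPU gives an even larger C / M, which is why, at batch size 1, generous memory bandwidth matters more than raw FLOPS.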

Mixture-of-Experts (MoE) Models: The Key to DeepSeek V3’s Efficiency

Now, let’s bring Mixture-of-Experts (MoE) models into the picture. This is the architecture used by DeepSeek V3 671B, and it’s crucial to understanding its performance on the DeepSeek-V3 on M4 Mac cluster.

Think of an MoE model as having multiple specialized “expert” models within it. For each input, only a small subset of these experts is activated to process the information.

So, while DeepSeek-V3 on M4 Mac has a massive 671 billion parameters, it doesn’t use all of them for every token it generates; it only activates a small group of experts. The catch is that the model still needs all of its parameters readily available in memory, because it doesn’t know in advance which experts a given token will need.
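
Here is a tiny, illustrative top-k routing layer in NumPy to show the idea. The expert count, the number of experts activated per token, and the hidden size are arbitrary toy values rather than DeepSeek V3’s real configuration, and the real router is considerably more sophisticated:

```python
import numpy as np

rng = np.random.default_rng(0)

NUM_EXPERTS = 8   # arbitrary illustrative value, not DeepSeek V3's real count
TOP_K = 2         # experts activated per token (also illustrative)
HIDDEN = 16

# Every expert's weights must already be resident in memory...
experts = [rng.normal(size=(HIDDEN, HIDDEN)) for _ in range(NUM_EXPERTS)]
router = rng.normal(size=(HIDDEN, NUM_EXPERTS))

def moe_layer(x):
    """Route a single token's hidden state through its top-k experts."""
    scores = x @ router                      # the router scores every expert
    top = np.argsort(scores)[-TOP_K:]        # pick the k best experts
    weights = np.exp(scores[top]) / np.exp(scores[top]).sum()  # softmax over the chosen few
    # ...but only the chosen experts' weights are actually read for this token.
    return sum(w * (x @ experts[i]) for w, i in zip(weights, top))

token_hidden_state = rng.normal(size=HIDDEN)
print(moe_layer(token_hidden_state).shape)   # (16,)
```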

DeepSeek-V3 on M4 Mac: Why This Setup Works So Well for MoE Models

This is where the combination of DeepSeek-V3 on M4 Mac really shines:

  • Ample Memory: The 512GB of combined memory in the M4 Mac Mini cluster allows us to load all 671 billion parameters of DeepSeek V3. All the “experts” are ready and waiting.
  • Efficient Inference: Because Apple Silicon is so good at quickly accessing data, the model can efficiently load the parameters needed for the activated experts.

In the case of DeepSeek V3, while it has 671 billion parameters, it might only use around 37 billion for generating a single token. Compare this to a dense model like Llama 70B, which uses all 70 billion parameters for every token. As long as we can keep all the parameters in memory, DeepSeek-V3 on M4 Mac can generate a single response faster because it’s only doing calculations on a smaller subset of its total parameters.
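
A rough back-of-the-envelope calculation shows why the active-parameter count matters so much. Assume 4-bit weights (half a byte per parameter) and that decoding is limited purely by how fast the active weights can be streamed from memory; the bandwidth figure is the M4 Max number quoted earlier, used only for illustration. These are theoretical ceilings, not the benchmark numbers from the chart above:

```python
BYTES_PER_PARAM = 0.5   # assumption: 4-bit quantized weights
MEM_BANDWIDTH = 546e9   # M4 Max figure quoted earlier, for illustration only

def bandwidth_bound_tokens_per_sec(active_params):
    # Every generated token must stream the active weights from memory once.
    bytes_per_token = active_params * BYTES_PER_PARAM
    return MEM_BANDWIDTH / bytes_per_token

print(f"MoE, ~37B active params: {bandwidth_bound_tokens_per_sec(37e9):.1f} tok/s ceiling")
print(f"Dense 70B model:         {bandwidth_bound_tokens_per_sec(70e9):.1f} tok/s ceiling")
```

The MoE model’s ceiling comes out nearly twice as high, simply because far fewer bytes have to move per token.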

Exploring Key Considerations: Power, Cost, and Alternative Setups for Running DeepSeek-V3

Power Consumption:

The impressive performance of DeepSeek-V3 on an M4 Mac cluster naturally leads to some important questions about the practicalities of running such powerful models. One immediate consideration is power consumption. Running large AI models can be energy-intensive, and understanding the power requirements is crucial for planning and budgeting.

In this setup, the cluster of eight Mac Minis has a maximum power draw of around 1120W and an idle draw of about 40W. Of course, this doesn’t account for the power needed for networking or any client devices involved. Notably, this level of power consumption is relatively modest compared to the high-end GPU-based systems often used for similar tasks.
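
For budgeting, a quick calculation with the draw figures above gives a feel for running costs; the electricity price is an assumed example rate, so substitute your local tariff:

```python
PRICE_PER_KWH = 0.30      # assumed example rate in USD, not a quoted figure
HOURS_PER_MONTH = 24 * 30

def monthly_cost(watts):
    return watts / 1000 * HOURS_PER_MONTH * PRICE_PER_KWH

print(f"Flat out at ~1120W: ${monthly_cost(1120):.0f}/month")  # about $242
print(f"Idle at ~40W:       ${monthly_cost(40):.0f}/month")    # about $9
```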

Cost and Efficiency:

Another key area of interest is the cost and performance comparison with alternative hardware. Many are curious about how a Mac cluster stacks up against GPU-based setups, like those using NVIDIA 3090s. For models that fit comfortably within the VRAM of a 3090, those setups can be very effective. However, for larger models like DeepSeek V3, which exceed the memory capacity of a single 3090, the Mac cluster demonstrates a compelling level of performance.

The discussion around cost-effectiveness is multi-faceted. While the initial investment for a Mac cluster needs to be considered, the second-hand market for GPUs like the 3090 offers another angle, providing potentially attractive price-to-performance ratios. Furthermore, the ongoing cost of electricity plays a significant role, especially in regions with higher energy prices. For individual users or smaller-scale deployments, the lower power consumption of the Mac setup can be a considerable advantage over time. However, it’s also important to acknowledge that the physical design and form factor of Mac Minis might not be ideal for traditional data center environments.

Older But Beefier Hardware:

Beyond dedicated clusters, there’s also the question of utilizing existing or more affordable hardware. The possibility of running large language models like DeepSeek V3 on used servers with substantial amounts of RAM is an interesting one. Systems with hundreds of gigabytes of RAM, paired with older server CPUs, could potentially house these large models.

(Image: an old server system with 256GB of RAM)

The primary limitation in such setups would likely be the processing power of the CPU. While the model might fit in memory, the speed at which computations can be performed would likely be significantly lower compared to GPU-accelerated systems. This could result in slower token generation speeds. However, for specific use cases where real-time responsiveness isn’t paramount, exploring these more budget-friendly hardware options could be a worthwhile endeavor.

Conclusion: The Future of LLM Inference on Apple Silicon with DeepSeek-V3 on M4 Mac

Running DeepSeek-V3 on M4 Mac is more than just a technical achievement. It signifies a shift in how we can approach large language models. The unified memory architecture and the impressive memory bandwidth of Apple Silicon make it a surprisingly capable platform for running massive MoE models.

While GPU clusters remain powerful, the DeepSeek-V3 on M4 Mac example highlights the potential of Apple’s hardware, especially for research, development, and potentially even edge deployments where power efficiency and ease of use are important.

This opens the door for more individuals and smaller teams to experiment with cutting-edge AI models without requiring massive and power-hungry server infrastructure.

Latest From Us

YouTube’s SHOCKING New AI Tool: It Knows Your Kid’s Age!

Big changes are coming to YouTube’s approach to protecting young users. The platform is taking a significant step toward Child Safety and Teen Safety online by implementing new technologies and policies aimed at creating a safer digital environment. Central to this initiative is the use of Age-Estimation Tech and Age Verification methods, which will help ensure that content and features are appropriate for different age groups. These advancements reflect a broader commitment to YouTube Safety, leveraging cutting-edge AI Technology to better identify and manage age-appropriate content for children and teens.

Key Takeaways:

  • YouTube is using AI-powered age-estimation technology to identify teen users in the US.
  • Once a user is identified as a teen, protections like disabling personalized ads and limiting repetitive viewing of certain content are automatically enabled.
  • Users incorrectly identified as minors can verify their age using a credit card, government ID, or selfie.
  • This rollout comes amid increased government scrutiny of social media platforms and state laws regarding minors’ use of social media.

YouTube is rolling out age-estimation technology in the U.S. to better identify teen users and apply more appropriate safety measures. This new system uses a variety of signals to determine a user’s age, irrespective of the birthday information provided during account creation. This marks a significant shift in YouTube’s approach to age verification, moving beyond relying solely on self-reported data.

Enhanced Protections for Teen Users

Once YouTube’s system identifies a user as a teen, several protections automatically kick in. These include disabling personalized advertising, a feature designed to minimize exposure to potentially inappropriate content tailored to specific user profiles. Powered by Age-Estimation Tech, the system can more accurately determine user age and apply the appropriate safety measures to enhance Teen Safety. Additional safeguards are put in place to limit repetitive viewing of potentially harmful content, such as videos that may trigger body image issues or promote social aggression. This builds on measures introduced in 2023 to curb repeated viewing of such videos.

Further bolstering these measures, digital well-being tools are also automatically enabled. These features, under development since 2018, include screen time limits and bedtime reminders, empowering teens and their parents to manage online time more effectively. While these protections existed before, they were only applied to users who actively verified their age as teenagers.

Addressing Incorrect Age Estimations

If the system incorrectly flags an adult as a minor, the user can verify their age with a credit card, government ID, or selfie. Age-Estimation Tech also plays a role in inferring a user’s age when formal verification isn’t provided: only users confirmed to be 18 or older, either through the verification process or through the AI system’s inference, will be able to access age-restricted content, keeping inappropriate material away from teens.

The Power of Machine Learning

The age-estimation technology at the heart of this initiative leverages the power of machine learning and AI Technology. This sophisticated system analyzes various user data points to make accurate age predictions. Though YouTube hasn’t shared specifics, its advanced approach will better identify underage users.

The phased rollout is a key element of YouTube’s strategy. YouTube will first deploy the technology to a small group of U.S. users for close monitoring and refinement before expanding it more broadly. This careful approach minimizes potential disruptions and allows for prompt adjustments based on real-world usage data.

A Broader Context: Government Scrutiny and State Laws

YouTube’s initiative comes at a time of heightened scrutiny of social media platforms. In the United States, several states have already enacted, or are considering, legislation to regulate minors’ online activity. Many of these laws mandate age verification or parental consent. This regulatory pressure underscores the importance of YouTube’s proactive use of AI Technology to ensure age-appropriate content and user safety.

The UK now requires age verification under the 2023 Online Safety Act, reflecting evolving child safety laws. These developments highlight the growing global consensus on the need for stronger online safety measures for young users.

YouTube’s Ongoing Commitment to Safety

YouTube’s latest efforts build upon previous initiatives aimed at creating a safer platform for children and teens. YouTube Kids and supervised accounts show YouTube’s ongoing commitment to child safety. The new age-estimation technology represents a significant advancement in these efforts, leveraging cutting-edge technology to enhance user protection.

These changes reflect a broader trend in the tech industry towards proactive measures to protect children online. The debate over age verification and child safety underscores the issue’s complexity for tech companies. YouTube’s commitment to developing and implementing these technologies is a step in the right direction toward addressing these critical concerns.

Conclusion

YouTube’s rollout of age-estimation technology signifies a substantial step toward creating a safer online environment for teenage users. YouTube is using advanced machine learning and safety features to better protect young users amid changing regulations. The continued development of this technology is crucial to the future of online safety for minors.

Sources consulted for this article include androidheadlines.com and cbsnews.com.

Mind-Blowing Stories This Week That Will Leave You Speechless! (30th July 2025)

Prepare to be amazed! This week’s news cycle is overflowing with incredible developments, from a groundbreaking AI lawsuit to astonishing advancements in Artificial Intelligence and AI development. Get ready to dive into a world of unexpected twists and turns, where AI safety takes center stage, data centers power innovation, and technology pushes boundaries while human ingenuity shines.

Key Takeaways:

  • A lawsuit against OpenAI highlights crucial ethical questions surrounding AI development.
  • A near-collision between an airliner and a bomber prompts discussion on AI’s role in aviation safety.
  • Tests revealing AI’s weaknesses compared to humans offer insights into the path towards Artificial General Intelligence.
  • China’s investment in undersea data centers reveals innovative approaches to supporting AI growth.

These popular stories highlight the incredible pace of change and the impact of technology, especially within the realm of artificial intelligence. The following articles offer fascinating insights into the complex and rapidly evolving world of AI, its potential, and its challenges. This collection of popular stories showcases both the promise and the peril of this transformative technology.

One of the most talked-about stories involves a lawsuit against OpenAI, the creator of ChatGPT. Tamlyn Hunt’s decision to sue the company has sent shockwaves through the tech industry. The details surrounding the lawsuit are still emerging, but it promises to be a landmark case with implications for the future of AI development, AI safety, and regulation. According to Why I’m Suing OpenAI, the Creator of ChatGPT, the case highlights important ethical concerns.

Could AI Have Prevented a Near-Disaster?

In a chilling near-miss, a SkyWest airliner narrowly avoided a collision with a B-52 bomber. This incident has sparked a crucial discussion: could AI have played a role in preventing this catastrophic accident? Experts are now exploring the potential of AI-powered systems to enhance air traffic control and prevent future near-misses. Could AI Have Prevented SkyWest Airliner’s Near Collision with a B-52 Bomber? explores this possibility.

The Tests that AI Fails—and Humans Ace

Artificial General Intelligence (AGI) remains a holy grail in the field of AI. Researchers are constantly seeking ways to create machines with human-like intelligence. A fascinating study sheds light on tests that AI systems consistently fail while humans excel, highlighting critical AI safety challenges on the path to achieving true AGI. The findings suggest that these tests could hold a key to unlocking the secrets of AGI. Tests that AIs Often Fail and Humans Ace Could Pave the Way for Artificial General Intelligence delves into this promising area of research.

China’s Undersea Data Centers Powering AI Growth

China is making significant strides in AI development, employing innovative strategies to meet the enormous data demands of this rapidly growing field. The country is investing heavily in undersea data centers, providing a unique and efficient approach to storing and processing vast quantities of information. China Powers AI Boom with Undersea Data Centers details this ambitious undertaking.

The Ethical Considerations of AI Development

As AI technology continues its rapid advancement, ethical concerns become increasingly paramount. The lawsuit against OpenAI highlights the need for robust ethical guidelines and regulations to prevent misuse and mitigate potential harm. Ongoing discussions regarding AI bias, transparency, and accountability are crucial to ensure responsible innovation.

The Future of AI: Promise and Peril

The future of AI remains unwritten, full of both immense potential and significant risks. While AI has the capacity to revolutionize numerous aspects of our lives, careful consideration must be given to its ethical implications, potential misuse, and unintended consequences. The stories highlighted above provide a glimpse into the complexities and challenges that lie ahead.

From legal battles to technological breakthroughs, the world of AI is constantly evolving. These popular stories underscore the importance of staying informed and engaging in thoughtful discussions about the future of this transformative technology. The responsible development and implementation of AI will be crucial for navigating the opportunities and challenges that lie ahead.

These popular stories serve as a reminder that the advancements in AI are not just technological feats, but carry significant social, ethical, and legal implications that demand careful consideration.

Conclusion

The stories discussed above showcase the dynamic and multifaceted nature of the AI landscape. They highlight the incredible advancements in the field, as well as the crucial need for responsible development and thoughtful consideration of the ethical and societal implications of AI. As AI technology continues its rapid evolution, it’s imperative to remain engaged, informed, and actively involved in shaping its future.

🤯 My 2.5-Year-Old Laptop Now Codes Space Invaders Thanks to THIS!

Imagine this: a relatively old laptop, a cutting-edge AI model, and a classic arcade game resurrected in code. Thanks to advances in AI coding, powerful AI models, and large language models, code generation has become more accessible than ever. Using tools like MLX, this seemingly sci-fi scenario is now reality, as showcased in a recent blog post by Simon Willison.

Key Takeaways:

  • Simon Willison used GLM-4.5 Air, a large language model, to generate a Space Invaders game in JavaScript.
  • The code was generated and ran successfully on a 2.5-year-old MacBook Pro M2.
  • This demonstrates the increasing power and accessibility of AI-powered code generation tools.
  • The experiment highlights the rapid progress in AI’s ability to create functional, complex code.

Simon Willison’s blog showcased using the smaller GLM-4.5 Air large language model to generate a fully functional Space Invaders game in JavaScript. This wasn’t theoretical: it ran flawlessly on a 2.5-year-old MacBook Pro M2, showcasing how rapidly AI coding has advanced. According to Simon Willison’s blog post, the process was surprisingly straightforward.

The Power of GLM-4.5 Air

The core of this achievement lies in the GLM-4.5 Air model, an open-source (MIT licensed) creation from Z.ai in China. While the full model is massive (106 billion parameters, 205.78GB on Hugging Face), Ivan Fioravanti created a 44GB 3-bit quantized version optimized for MLX. This smaller size makes it accessible to users with more modest hardware, including many personal laptops. As detailed on Simon Willison’s weblog, this quantization significantly reduces the resource requirements without sacrificing performance.

Space Invaders: AI Coding and Code Generation in Action

Willison used the MLX library and a simple command-line prompt to instruct the model to write the HTML and JavaScript for Space Invaders. The model generated the complete code, ready to run without any manual adjustments. The source code is available on GitHub, allowing anyone to explore the generated code and replicate the experiment.
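
For readers who want to try something similar, here is a minimal sketch using the mlx-lm package’s Python API. It is not a reproduction of Willison’s exact command; the model identifier is a placeholder for whichever MLX-quantized GLM-4.5 Air checkpoint you actually download, and a chat-tuned model may also want its chat template applied to the prompt:

```python
# Minimal sketch: generating code with mlx-lm on Apple Silicon.
# Requires `pip install mlx-lm`; the model name below is a placeholder.
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/GLM-4.5-Air-3bit")  # placeholder checkpoint

prompt = "Write a complete HTML page with inline JavaScript implementing Space Invaders."
output = generate(model, tokenizer, prompt=prompt, max_tokens=4096, verbose=True)

with open("space_invaders.html", "w") as f:
    f.write(output)
```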

Beyond Space Invaders: The Broader Implications

While a Space Invaders game might seem trivial, the implications of this experiment are far-reaching. It demonstrates the ability of sophisticated AI models to generate functional, complex code from simple prompts. This has significant potential across various fields, from software development to game creation.

Hardware Requirements and Performance

The fact that this was achieved on a 2.5-year-old laptop underscores the accessibility of these powerful AI tools. Willison notes that the model consumed a significant portion of his laptop’s RAM (around 48GB at peak, leaving only about 16GB for everything else), yet the speed was still impressive once the model had loaded. This suggests a future where sophisticated AI coding is available to a broader range of users and devices.

The Future of AI-Powered Code Generation and AI Coding

The trend of AI models focusing on code generation is undeniably gaining momentum. Willison reflects on how the capabilities of these models have improved remarkably over the past two years. His series on LLMs on personal devices charts this progress, showcasing how much has been achieved in a short time. The increasing power and accessibility of models like GLM-4.5 Air suggest an exciting future for AI-assisted and potentially AI-driven software development.

Conclusion

Simon Willison’s experiment showcases the extraordinary progress in AI-powered code generation. The ability to generate functional code, even for relatively complex projects like Space Invaders, on a consumer-grade laptop using readily-available tools is remarkable. This technology demonstrates great potential for the future of software development, and we can expect to see even more impressive advancements in the years to come.
