AI image generation is evolving quickly, and diffusion models are leading the way. Their computational demands, however, can be a significant bottleneck. Nunchaku v0.1.4 addresses this by accelerating diffusion model inference. It is built around SVDQuant, a quantization technique that cuts memory usage and boosts speed while maintaining image quality.

Table of contents
- What is Nunchaku?
- The Challenge: Scaling Diffusion Models
- What is SVDQuant and How Does It Work?
- Nunchaku Performance and Quality: The Best of Both Worlds
- Nunchaku Image Quality: Maintaining Visual Fidelity
- Installation and Usage
- Wheels (for Linux and Windows WSL):
- Build from Source:
- Conclusion: The Future of Efficient Diffusion Models With Nunchaku
What is Nunchaku?
Nunchaku is an inference engine designed to optimize and accelerate diffusion models. The latest version, v0.1.4, uses a technique called SVDQuant to achieve its performance gains. The name evokes the speed and efficiency of the martial arts weapon, reflecting the project's focus on fast, streamlined inference. You can find the inference engine on GitHub at https://github.com/mit-han-lab/nunchaku.
The Challenge: Scaling Diffusion Models
Diffusion models such as FLUX.1 and PixArt-Σ can generate highly detailed and realistic images. As these models scale up in size (reaching billions of parameters), the computational requirements become immense. This leads to:
- High memory usage.
- Longer processing times.
- Difficulty deploying on resource-constrained devices (like laptops).
Traditional approaches to model optimization, like quantization, have focused primarily on reducing the size of model weights. However, diffusion models are often compute-bound. A weight-only scheme (such as W4A16) still converts the weights back to 16 bits for the arithmetic, so shrinking the weights alone saves memory without translating into significant speed improvements.
What is SVDQuant and How Does It Work?
SVDQuant (Singular Value Decomposition Quantization) is a post-training quantization technique that addresses the limitations of previous methods. It achieves this by:
- Quantizing both weights and activations: Unlike methods that focus solely on weights, SVDQuant aggressively quantizes both weights and activations to 4 bits. This is crucial for achieving real speedups in compute-bound models.
- Absorbing Outliers with a Low-Rank Branch: A major challenge with 4-bit quantization is the presence of outliers in the values of weights and activations. SVDQuant uses Singular Value Decomposition (SVD) to decompose the weight matrix into a low-rank component and a residual.
  - The low-rank component, run at 16-bit precision, captures the most significant information (including the outliers).
  - The residual component can then be safely quantized to 4 bits.
- Kernel Fusion with Nunchaku: The Nunchaku inference engine is co-designed with SVDQuant. It fuses the operations of the low-rank branch with the 4-bit quantization kernels, minimizing memory access and kernel call overhead. This is key for realizing practical speedups.
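The decomposition described above can be illustrated with a small NumPy sketch. It is a simplified illustration rather than Nunchaku's actual implementation: the function names (`quantize_int4`, `svd_split`) are hypothetical, the data is synthetic, a single per-tensor scale is assumed, float64 stands in for the 16-bit branch, and quantization is only simulated (values are rounded to 4-bit levels and rescaled) instead of being run through fused low-precision kernels.

```python
import numpy as np

def quantize_int4(x):
    """Simulated symmetric 4-bit quantization with one scale per tensor.

    Values are rounded to integer levels in [-7, 7] and rescaled back,
    so the result shows the rounding error without real 4-bit storage.
    """
    scale = np.abs(x).max() / 7.0
    if scale == 0:
        return x.copy()
    return np.clip(np.round(x / scale), -7, 7) * scale

def svd_split(W, rank):
    """Split W into a rank-`rank` product L1 @ L2 plus a residual R."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    L1 = U[:, :rank] * S[:rank]   # scale each kept column by its singular value
    L2 = Vt[:rank, :]
    R = W - L1 @ L2               # what the low-rank branch did not capture
    return L1, L2, R

rng = np.random.default_rng(0)

# Synthetic weight matrix: small values plus two outlier channels (assumption).
W = rng.normal(0.0, 0.02, size=(64, 64))
W[:, [3, 17]] *= 50.0
X = rng.normal(0.0, 1.0, size=(8, 64))   # synthetic activations

Y_ref = X @ W                                   # full-precision reference

# Naive W4A4: quantize activations and the whole weight matrix.
Y_naive = quantize_int4(X) @ quantize_int4(W)

# Low-rank branch kept in high precision, residual quantized to 4 bits.
L1, L2, R = svd_split(W, rank=4)
Y_svd = X @ L1 @ L2 + quantize_int4(X) @ quantize_int4(R)

err_naive = np.linalg.norm(Y_ref - Y_naive) / np.linalg.norm(Y_ref)
err_svd = np.linalg.norm(Y_ref - Y_svd) / np.linalg.norm(Y_ref)
print(f"relative error, naive 4-bit:        {err_naive:.4f}")
print(f"relative error, low-rank + 4-bit:   {err_svd:.4f}")
```

Because the outlier channels dominate the largest singular values, the low-rank branch absorbs them and the residual has a much narrower value range, which is why the same 4-bit grid represents it more accurately in this toy setting.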
Nunchaku Performance and Quality: The Best of Both Worlds
The results of SVDQuant and Nunchaku are impressive:
- 3.6x Smaller Model: The model size of the 12B FLUX.1 model is reduced by 3.6x compared to the BF16 model.
- Up to 10.1x Speedup: On a 16GB laptop 4090 GPU, Nunchaku achieves up to a 10.1x speedup over the 16-bit model by eliminating CPU offloading, and it runs 3x faster than the NF4 W4A16 baseline.
- 3.5x Memory Usage Reduction: Nunchaku reduces inference memory usage by 3.5x compared to the 16-bit model.
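As a rough sanity check on the model-size figure, the arithmetic can be sketched as follows. This is a back-of-the-envelope estimate, not a measurement: it assumes exactly 12 billion parameters, 2 bytes per BF16 weight, and 0.5 bytes per 4-bit weight. Attributing the gap between the ideal 4x and the reported 3.6x to the 16-bit low-rank branch, quantization scales, and any layers left unquantized is an assumption made here for illustration.

```python
# Back-of-the-envelope model size estimate (all inputs are assumptions).
params = 12e9                      # "12B" taken as exactly 12 billion weights

bf16_gb = params * 2 / 1e9         # 2 bytes per weight in BF16
int4_gb = params * 0.5 / 1e9       # 0.5 bytes per weight at 4 bits

ideal_reduction = bf16_gb / int4_gb        # upper bound if every weight were 4-bit
reported_reduction = 3.6                   # figure quoted in this article

implied_gb = bf16_gb / reported_reduction  # size implied by the reported figure
overhead_gb = implied_gb - int4_gb         # extra storage beyond pure 4-bit weights

print(f"BF16 size:          {bf16_gb:.1f} GB")
print(f"Pure 4-bit size:    {int4_gb:.1f} GB (ideal {ideal_reduction:.1f}x)")
print(f"Implied size:       {implied_gb:.2f} GB at {reported_reduction}x")
print(f"Implied overhead:   {overhead_gb:.2f} GB")
```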
Nunchaku Image Quality: Maintaining Visual Fidelity
Despite the aggressive 4-bit quantization, SVDQuant maintains excellent image quality:
- Outperforms Baselines: The 4-bit models quantized with SVDQuant outperform NF4 W4A16 baselines on FLUX.1 models, exhibiting better text alignment and closer resemblance to the 16-bit models.
- Superior to W4A8: On PixArt-Σ and SDXL-Turbo, SVDQuant's 4-bit results surpass the visual quality of other W4A4 and even W4A8 baselines.
- LoRA Compatibility: SVDQuant seamlessly integrates with LoRA (Low-Rank Adaptation), allowing for style customization without requantization.

Installation and Usage
Getting started with Nunchaku is straightforward. The installation process varies slightly depending on your operating system and environment:
Wheels (for Linux and Windows WSL):
The easiest way to install is using pre-built wheels:
pip install torch==2.6 torchvision==0.21 torchaudio==2.6 # Example: Install PyTorch 2.6
pip install https://huggingface.co/mit-han-lab/nunchaku/resolve/main/nunchaku-0.1.4+torch2.6-cp311-cp311-linux_x86_64.whl # Example: For Python 3.11 and PyTorch 2.6
Make sure you have PyTorch ≥ 2.5 installed, and replace the wheel URL above with the one from Hugging Face that matches your Python and PyTorch versions.
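To pick the matching wheel, it can help to see how the example filename is put together. The helper below is a hypothetical sketch, not part of Nunchaku: it simply reproduces the naming pattern of the single example URL above (`nunchaku-<version>+torch<major.minor>-cp<XY>-cp<XY>-<platform>.whl`). Whether a wheel actually exists for a given combination has to be checked on the Hugging Face page.

```python
import sys

def wheel_filename(nunchaku_version, torch_version,
                   py_major=sys.version_info.major,
                   py_minor=sys.version_info.minor,
                   platform_tag="linux_x86_64"):
    """Build a wheel filename following the pattern of the example above.

    Hypothetical helper: the pattern is inferred from one example URL,
    so other version combinations may not be published.
    """
    py_tag = f"cp{py_major}{py_minor}"
    # Keep only major.minor and drop local suffixes such as "+cu124".
    torch_mm = ".".join(torch_version.split("+")[0].split(".")[:2])
    return (f"nunchaku-{nunchaku_version}+torch{torch_mm}"
            f"-{py_tag}-{py_tag}-{platform_tag}.whl")

# Reproduces the filename used in the example command (Python 3.11, PyTorch 2.6).
print(wheel_filename("0.1.4", "2.6.0", 3, 11))
```

On a machine with PyTorch installed, `torch.__version__` can be passed as `torch_version`, and the Python version defaults to the running interpreter.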
Build from Source:
For more control or specific configurations, you can build from source:
conda create -n nunchaku python=3.11
conda activate nunchaku
pip install torch torchvision torchaudio
pip install ninja wheel diffusers transformers accelerate sentencepiece protobuf huggingface_hub
pip install peft opencv-python gradio spaces GPUtil # For gradio demos
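# Optional: for NVFP4 on Blackwell GPUs, install the PyTorch nightly build instead
pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu128
# Make sure g++ >= 11 is available, for example by installing it through conda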
conda install -c conda-forge gxx=11 gcc=11
git clone https://github.com/mit-han-lab/nunchaku.git
cd nunchaku
git submodule init
git submodule update
pip install -e . --no-build-isolation
- Ensure your CUDA version is ≥ 12.2 on Linux and ≥ 12.6 on Windows.
- Currently supports NVIDIA GPUs with architectures sm_86 (Ampere: RTX 3090, A6000), sm_89 (Ada: RTX 4090), and sm_80 (A100).
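The two requirements above can be checked before building. The sketch below encodes only the versions and architectures listed in this article, and its helper names are hypothetical. Note that `torch.version.cuda` reports the CUDA version PyTorch was built against, which may differ from the CUDA toolkit used to compile the extension, so treat the result as a hint rather than a guarantee.

```python
import sys

# Architectures listed in this article; other releases may differ.
SUPPORTED_CAPABILITIES = {
    (8, 0): "sm_80 (A100)",
    (8, 6): "sm_86 (Ampere: RTX 3090, A6000)",
    (8, 9): "sm_89 (Ada: RTX 4090)",
}

def parse_version(text):
    """Turn a string like '12.6' into (12, 6) so versions compare numerically."""
    return tuple(int(part) for part in text.split(".")[:2])

def cuda_ok(cuda_version, platform=sys.platform):
    """CUDA >= 12.6 on Windows, >= 12.2 elsewhere (per the notes above)."""
    minimum = (12, 6) if platform.startswith("win") else (12, 2)
    return parse_version(cuda_version) >= minimum

def gpu_ok(capability):
    """True if the (major, minor) compute capability is in the listed set."""
    return tuple(capability) in SUPPORTED_CAPABILITIES

def check_local_machine():
    """Report on the local GPU; requires PyTorch, imported lazily."""
    import torch
    if not torch.cuda.is_available():
        return "No CUDA device is visible to PyTorch."
    capability = torch.cuda.get_device_capability(0)
    return {
        "gpu": torch.cuda.get_device_name(0),
        "capability": capability,
        "gpu_ok": gpu_ok(capability),
        "cuda": torch.version.cuda,
        "cuda_ok": cuda_ok(torch.version.cuda),
    }
```

Calling `check_local_machine()` on the target machine returns a small report; the pure helpers `cuda_ok` and `gpu_ok` work without PyTorch installed.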
Conclusion: The Future of Efficient Diffusion Models With Nunchaku
Nunchaku v0.1.4, powered by SVDQuant, is a significant step toward making large diffusion models more accessible and efficient. By quantizing both weights and activations to 4 bits and fusing the low-rank branch into the quantized kernels, Nunchaku delivers substantial memory savings and speedups while preserving image quality. This opens up new possibilities for deploying these models on a wider range of devices and for accelerating research in the field. The project demonstrates that high-performance, low-precision inference is achievable even for demanding AI models.