Running Large Language Models (LLMs) locally is an exciting frontier, but limited GPU VRAM is often a frustrating bottleneck. Many users running GGUF-format models cannot offload every model layer to the GPU, and generation speed suffers as a result. But what if there were a smarter way to manage your precious VRAM and more than double your generation speed? The answer lies in GGUF tensor offloading, a more granular approach than offloading entire layers.
This technique is particularly game-changing if you’re currently leaving some layers on the CPU because you can’t fit the whole model onto your GPU. If you’re already offloading all layers, you’re in a good spot! However, this method might still allow you to run even larger, more capable models at speeds you previously thought unachievable. Let’s dive into how you can significantly enhance your model’s performance.

Table of contents
- The Old Way: Why Offloading Entire GGUF Layers Can Be Inefficient
- The Breakthrough: Strategic GGUF Tensor Offloading for Maximum Speed
- Are You a Candidate for This GGUF Tensor Offloading Speed Boost?
- How to Implement Selective Tensor Offloading with llama.cpp and koboldcpp
- Why Isn’t Selective GGUF Tensor Offloading the Default?
- The “Wow” Moment: Witnessing Double the GGUF Speed
- The Future: Smarter GGUF Offloading Automation
- Conclusion: Unleash Your GGUF Model’s True Potential with Tensor Offloading
The Old Way: Why Offloading Entire GGUF Layers Can Be Inefficient
Traditionally, when using tools like llama.cpp and its derivatives such as koboldcpp, users offload entire layers of a GGUF model to the GPU. Transformer layers, the building blocks of these models, are complex. They consist of various components, including attention tensors, feed-forward network (FFN) tensors, gates, and output tensors.
The challenge here is that not all parts of a layer benefit equally from running on the GPU, nor do they consume VRAM uniformly. Attention tensors, for instance, are computationally intensive in a way that GPUs excel at due to their parallel processing capabilities, and they are generally smaller in size. On the other hand, FFN tensors are often very large and involve more straightforward matrix multiplication. While GPUs can do this, CPUs are also quite capable, and these FFN tensors can be the biggest VRAM consumers within a layer.
When VRAM is scarce, and you can only offload, say, 59 out of 65 layers, it means some layers (including their large FFN tensors) are on the GPU, consuming valuable space, while other entire layers (including their attention mechanisms) might be relegated to the CPU, slowing down the process. This is where GGUF tensor offloading offers a more refined solution.
The Breakthrough: Strategic GGUF Tensor Offloading for Maximum Speed
Instead of treating layers as indivisible units for offloading, we can be more selective. The core idea is to identify the largest, most VRAM-hungry tensors within each layer – typically the FFN tensors – and specifically instruct the system to keep those tensors on the CPU.

Why does this help? By moving these voluminous FFN tensors to the CPU, you free up a substantial amount of GPU VRAM. This newly available VRAM can then be used to offload all the other, more GPU-critical parts of every layer, especially the attention mechanisms. The result, as demonstrated by one user, can be a staggering jump in performance – from 3.95 tokens per second (TPS) to 10.61 TPS with their QwQ merge model. This is a more than 2.5x increase in generation speed using the same amount of VRAM, simply by being smarter about what gets offloaded. This precise GGUF tensor offloading is key.
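The arithmetic behind that figure is easy to check. The sketch below uses only the two TPS readings reported by the user; the function name is purely illustrative.

```python
def speedup(baseline_tps: float, tuned_tps: float) -> tuple[float, float]:
    """Return (speed ratio, percent increase) for two throughput readings."""
    ratio = tuned_tps / baseline_tps
    return ratio, (ratio - 1.0) * 100.0


# The two readings quoted above: whole-layer offload vs. selective tensor offload.
ratio, pct = speedup(3.95, 10.61)
print(f"{ratio:.2f}x the baseline speed, a {pct:.0f}% increase")
```

In other words, generation runs at roughly 2.7 times the baseline speed, which corresponds to an increase of a little under 170%.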
Are You a Candidate for This GGUF Tensor Offloading Speed Boost?
This innovative GGUF tensor offloading technique offers the most significant benefits to users who are currently unable to offload all their model layers to the GPU due to VRAM constraints. If you’re in this situation, you’re likely leaving performance on the table.
Even if you can already offload all layers, this method might open doors to running larger, more powerful GGUF models that you previously deemed too slow or VRAM-intensive for your setup. By strategically keeping certain FFN tensors on the CPU, you might find that those bigger models become surprisingly usable.
How to Implement Selective Tensor Offloading with llama.cpp and koboldcpp
The magic happens through specific command-line flags available in llama.cpp and koboldcpp. The key is the --overridetensors flag in koboldcpp or the -ot flag in llama.cpp. These allow you to use regular expressions (regex) to define which specific tensors should be assigned to the CPU.
Let’s look at a real-world example that achieved the 10.61 TPS:
python ~/koboldcpp/koboldcpp.py --threads 10 --usecublas --contextsize 40960 --flashattention --port 5000 --model ~/Downloads/MODELNAME.gguf --gpulayers 65 --quantkv 1 --overridetensors "\.[13579]\.ffn_up|\.[1-3][13579]\.ffn_up=CPU"
In this command:
- --gpulayers 65: Attempts to offload all 65 layers.
- --overridetensors "\.[13579]\.ffn_up|\.[1-3][13579]\.ffn_up=CPU": This is the crucial part. The regex targets ffn_up tensors.
- \.[13579]\.ffn_up: Targets ffn_up tensors in layers 1, 3, 5, 7, and 9.
- \.[1-3][13579]\.ffn_up: Targets ffn_up tensors in odd-numbered layers from 11 to 39 (11, 13, ..., 37, 39).
- =CPU: Assigns the matched tensors to the CPU.
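To see exactly which layers that regex selects, you can test it against tensor names before launching the model. This is a minimal sketch with two assumptions: tensor names follow the blk.N.ffn_up.weight form shown on Hugging Face model pages, and the pattern is applied as a substring search rather than a full match.

```python
import re

# Pattern copied from the example command, without the trailing "=CPU" target.
PATTERN = re.compile(r"\.[13579]\.ffn_up|\.[1-3][13579]\.ffn_up")


def matched_layers(n_layers: int = 65) -> list[int]:
    """Return the layer indices whose ffn_up tensor name matches PATTERN."""
    # Assumption: names look like "blk.<layer>.ffn_up.weight".
    return [
        i for i in range(n_layers)
        if PATTERN.search(f"blk.{i}.ffn_up.weight")
    ]


layers = matched_layers()
print(len(layers), layers)
```

Under those assumptions the pattern selects 20 of the 65 layers (every odd layer from 1 to 39), so the ffn_up tensors of the remaining 45 layers stay on the GPU.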
Contrast this with the baseline approach of only offloading entire layers until VRAM is full:
python ~/koboldcpp/koboldcpp.py --threads 6 --usecublas --contextsize 40960 --flashattention --port 5000 --model ~/Downloads/MODELNAME.gguf --gpulayers 59 --quantkv 1
This setup, with only 59 layers offloaded, yielded just 3.95 TPS. The difference is remarkable, highlighting the power of GGUF tensor offloading.
Step-by-Step: Identifying Which Tensors to Keep on CPU
To effectively use GGUF tensor offloading, you need to know which tensors are the best candidates for CPU assignment.
Inspect Your GGUF Model’s Structure:
You need to see the tensor names, their sizes, and their quantization within your specific GGUF model. You can often find this information on the model card on Hugging Face (if it provides detailed file info). This information can also be retrieved by using tools that can inspect GGUF metadata. For example, looking at the QwQ-32B.Q3_K_M.gguf model on Hugging Face, you might find tensor details like:
| Tensor | Size | Quantization |
|---|---|---|
| blk.1.ffn_down.weight | [27 648, 5 120] | Q5_K |
| blk.1.ffn_gate.weight | [5 120, 27 648] | Q3_K |
| blk.1.ffn_norm.weight | [5 120] | F32 |
| blk.1.ffn_up.weight | [5 120, 27 648] | Q3_K |
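If you would rather inspect a local file than rely on a model page, one option is the gguf Python package that accompanies llama.cpp. The sketch below assumes that package is installed (pip install gguf) and that its GGUFReader exposes name, shape, tensor_type and n_bytes on each tensor; the TensorInfo and largest helpers are hypothetical names introduced here for illustration.

```python
from typing import Iterable, NamedTuple


class TensorInfo(NamedTuple):
    name: str
    shape: tuple[int, ...]
    quant: str
    n_bytes: int


def read_tensor_info(path: str) -> list[TensorInfo]:
    """List the tensors stored in a GGUF file."""
    # Assumption: the third-party `gguf` package is installed. It is imported
    # lazily so the helper below can be used without it.
    from gguf import GGUFReader

    reader = GGUFReader(path)
    return [
        TensorInfo(
            t.name,
            tuple(int(d) for d in t.shape),
            t.tensor_type.name,
            int(t.n_bytes),
        )
        for t in reader.tensors
    ]


def largest(tensors: Iterable[TensorInfo], top: int = 10) -> list[TensorInfo]:
    """Return the `top` biggest tensors by stored size."""
    return sorted(tensors, key=lambda t: t.n_bytes, reverse=True)[:top]


# Example use (the path is a placeholder):
# for t in largest(read_tensor_info("MODELNAME.gguf")):
#     print(f"{t.name:32s} {t.quant:6s} {t.n_bytes / 2**20:8.1f} MiB")
```

Sorting by size makes the biggest VRAM consumers, and therefore the most promising CPU candidates, appear first.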
Target Large FFN Tensors:
The tensors typically named ffn_down.weight, ffn_up.weight, and ffn_gate.weight are part of the Feed-Forward Network and are usually the largest ones. These are prime candidates for CPU offloading.
Consider Quantization:
Tensors stored at higher-bit quantization (e.g., Q5_K, Q6_K, Q8_0) are larger than tensors with the same dimensions at lower-bit quantization (e.g., Q3_K, IQ4_XS). Keeping a Q5_K FFN tensor on the CPU therefore frees more VRAM than keeping a Q3_K tensor of the same shape there. In the example above, moving blk.1.ffn_down.weight (Q5_K) to the CPU would save more VRAM than moving blk.1.ffn_up.weight (Q3_K).
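A rough size estimate makes the difference concrete. The sketch below assumes the block layouts used by ggml for these quantization types (weights per block and bytes per block); those numbers are not taken from the model page, so treat the result as an approximation and check it against your own file.

```python
import math

# Assumed ggml block layouts: (weights per block, bytes per block).
BLOCK_LAYOUT = {
    "Q3_K": (256, 110),
    "Q4_K": (256, 144),
    "Q5_K": (256, 176),
    "Q6_K": (256, 210),
    "Q8_0": (32, 34),
}


def tensor_bytes(shape: tuple[int, ...], quant: str) -> int:
    """Estimate the stored size of a quantized tensor in bytes."""
    weights_per_block, bytes_per_block = BLOCK_LAYOUT[quant]
    n_weights = math.prod(shape)
    return n_weights // weights_per_block * bytes_per_block


# Dimensions of the FFN tensors in the table above.
for quant in ("Q3_K", "Q5_K"):
    size = tensor_bytes((5120, 27648), quant)
    print(f"{quant}: {size / 2**20:.1f} MiB")
```

With these assumptions, a single FFN tensor of this shape takes about 58 MiB at Q3_K and about 93 MiB at Q5_K, so each Q5_K tensor kept on the CPU frees roughly 35 MiB more VRAM.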
Craft Your Regex:
Use the --overridetensors flag with a regex pattern to match the FFN tensors you want to keep on the CPU. The user’s example targeted ffn_up tensors in alternating layers. From there, you can experiment with other layer ranges or with the other FFN tensors (ffn_down, ffn_gate), measuring tokens per second after each change.
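Writing these regexes by hand gets error-prone once layer numbers reach two digits. As one way to experiment, the sketch below builds an override string from an explicit list of layers and FFN tensor names. The build_override and matches helpers are hypothetical, not part of koboldcpp or llama.cpp, and the patterns they produce are illustrations rather than the user’s original settings.

```python
import re


def build_override(layers, parts, target="CPU"):
    """Build a 'REGEX=TARGET' string for the given layers and FFN parts.

    layers: iterable of layer indices, e.g. range(0, 8, 2)
    parts:  FFN tensor suffixes, e.g. ["up", "down", "gate"]
    """
    layer_alt = "|".join(str(i) for i in sorted(set(layers)))
    part_alt = "|".join(parts)
    return rf"blk\.({layer_alt})\.ffn_({part_alt})={target}"


def matches(override: str, tensor_name: str) -> bool:
    """Check a tensor name against the regex half of an override string."""
    # Assumption: the regex is applied as a substring search.
    pattern, _, _target = override.rpartition("=")
    return re.search(pattern, tensor_name) is not None


override = build_override(range(0, 8, 2), ["up", "down"])
print(override)  # blk\.(0|2|4|6)\.ffn_(up|down)=CPU
print(matches(override, "blk.2.ffn_down.weight"))
```

Listing layers explicitly is more verbose than a character class, but it avoids accidental matches and makes it obvious which tensors end up on the CPU.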
A Note on CPU Threads
When using this method, remember to optimize your CPU thread count. A good starting point is to set --threads to one less than your total number of physical CPU cores. For instance, on a 12-core CPU, --threads 11 is a reasonable setting.
Why Isn’t Selective GGUF Tensor Offloading the Default?
This is a great question. The user who shared this experience wondered the same: “Why is this not standard?” Perhaps it’s due to the added complexity of configuration, or because the simpler layer-based offloading is easier to implement and understand initially.
However, given the significant performance gains observed, there’s a strong case for this to become more mainstream. Ideally, tools like llama.cpp could evolve to automatically identify and selectively offload these large, CPU-efficient FFN tensors based on available VRAM and tensor characteristics, making GGUF tensor offloading more accessible to everyone.
The “Wow” Moment: Witnessing Double the GGUF Speed
The excitement is palpable when a technical tweak yields such dramatic improvements. Going from nearly 4 TPS to over 10 TPS transforms the usability of a local LLM. The core principle is elegant: offload everything possible to your GPU, except for those very large FFN tensors that are VRAM hogs and can perform adequately on the CPU. This strategic GGUF tensor offloading is the key.

If you’ve been grappling with VRAM limitations and sluggish GGUF model speeds, this method is absolutely worth exploring.
The Future: Smarter GGUF Offloading Automation
The dream scenario, as voiced by the user, is for llama.cpp and similar inference engines to incorporate intelligent, automatic tensor-level offloading. Imagine the software analyzing your GGUF model and your system’s VRAM, then optimally distributing tensors between CPU and GPU without requiring complex manual regex configurations. This would be a huge step forward in user-friendliness and out-of-the-box performance for many.
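To give a feel for what such automation might involve, here is a deliberately naive sketch: it keeps the largest FFN tensors on the CPU until everything else fits a VRAM budget. This is a hypothetical illustration, not how llama.cpp or koboldcpp work, and it ignores the KV cache, compute buffers, and the fact that tensors differ in how much they gain from the GPU.

```python
def pick_cpu_tensors(tensors: dict[str, int], vram_budget: int) -> list[str]:
    """Greedily choose FFN tensors to keep on the CPU.

    tensors maps tensor name -> size in bytes; vram_budget is in bytes.
    Returns the names to keep on the CPU, largest first.
    """
    remaining = sum(tensors.values())
    # Only FFN weight tensors are candidates; the small norm tensors are skipped.
    candidates = sorted(
        (name for name in tensors if ".ffn_" in name and "norm" not in name),
        key=lambda name: tensors[name],
        reverse=True,
    )
    on_cpu = []
    for name in candidates:
        if remaining <= vram_budget:
            break
        on_cpu.append(name)
        remaining -= tensors[name]
    return on_cpu


# Toy sizes in arbitrary units, not measurements from a real model.
demo = {
    "blk.0.attn_q.weight": 20,
    "blk.0.ffn_up.weight": 60,
    "blk.0.ffn_down.weight": 95,
    "blk.1.attn_q.weight": 20,
    "blk.1.ffn_up.weight": 60,
    "blk.1.ffn_down.weight": 95,
}
print(pick_cpu_tensors(demo, vram_budget=200))
```

A real implementation would also need to weigh CPU-side throughput and memory bandwidth, which is part of why manual tuning and benchmarking remain necessary today.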
Conclusion: Unleash Your GGUF Model’s True Potential with Tensor Offloading
If your GGUF model inference is slower than you’d like because VRAM constraints prevent full GPU offload, the answer might not be a hardware upgrade but a smarter offloading strategy. By selectively keeping large Feed-Forward Network (FFN) tensors on the CPU using the --overridetensors flag in koboldcpp or the -ot flag in llama.cpp, you can free up critical GPU VRAM. This allows more, or even all, of the GPU-intensive parts of your model to run on the graphics card, potentially more than doubling generation speed.
Don’t let VRAM limitations dictate your GGUF model’s speed. Experiment with GGUF tensor offloading, examine your model’s structure, and craft those regex overrides. You might be surprised by the performance you can unlock! Share your findings and help make this powerful optimization technique common knowledge.