
ParaAttention Speeds Up HunyuanVideo Inference with Context Parallelism and First Block Cache

AI video generation has advanced remarkably in recent years. Among the many models available, HunyuanVideo, developed by Tencent, stands out for its capabilities. The challenge lies in its inference speed, which can hinder practical deployment. Enter ParaAttention, a library designed to accelerate inference for HunyuanVideo and similar models such as CogVideoX, Mochi, and FLUX. Let’s delve into how ParaAttention optimizes inference speed for HunyuanVideo.

How ParaAttention Works

ParaAttention speeds up model inference through Context Parallelism and First Block Cache. By leveraging these techniques, it significantly reduces the computation required to generate video frames, enabling quicker and more efficient video production without sacrificing quality. It also employs additional optimizations, including torch.compile and FP8 dynamic quantization, to push performance further.

The Need for Speed in Video Generation

As the demand for real-time video applications increases, so does the necessity for faster inference speeds. Traditional models often fall short in meeting the requirements of applications that rely on instant video generation. ParaAttention addresses this gap by providing a toolkit that allows developers to optimize their existing models, ensuring that inference is swift and efficient.

Key Features of ParaAttention

ParaAttention’s architecture is designed with several key features that contribute to its effectiveness:

1. Context Parallelism

Context Parallelism (CP) is a method that allows the parallel processing of neural network activations across multiple GPUs. This technique enhances the performance of models by partitioning input tensors along the sequence dimension, enabling faster computation and improved efficiency.
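To make the idea concrete, here is a minimal toy sketch (not ParaAttention’s implementation) of partitioning an activation tensor of shape (batch, sequence, hidden) along its sequence dimension, so that each GPU processes only its own shard:

import torch

# Toy activation tensor: (batch, sequence, hidden).
x = torch.randn(1, 1024, 128)

world_size = 4  # number of GPUs in a hypothetical mesh

# Partition along the sequence dimension (dim=1). In a real setup each
# shard would live on a different device, and ranks would exchange
# key/value shards during attention (e.g. ring- or Ulysses-style).
shards = torch.chunk(x, world_size, dim=1)

for rank, shard in enumerate(shards):
    print(f"rank {rank}: shard shape {tuple(shard.shape)}")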

2. First Block Cache

The First Block Cache (FBCache) serves as a dynamic caching mechanism that reduces redundant computations during inference. By utilizing the residual output of the first transformer block as a cache indicator, FBCache allows the model to skip computations when the output differences are minimal, resulting in significant speed improvements.
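The core decision can be sketched as follows. This is a simplified illustration of the idea, not the library’s actual code; the helper name and default threshold are assumptions for the example:

import torch

def can_use_cache(first_block_residual: torch.Tensor,
                  prev_residual: torch.Tensor,
                  threshold: float = 0.035) -> bool:
    # Compare the first transformer block's residual with the one from
    # the previous denoising step. If the relative change is below the
    # threshold, the remaining blocks are assumed to produce a nearly
    # identical output, and their cached residual is reused instead.
    diff = (first_block_residual - prev_residual).abs().mean()
    return (diff / prev_residual.abs().mean()).item() < threshold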

3. Torch Compile Integration

Integrating torch.compile into the inference pipeline allows for further optimizations by enabling the backend compiler to enhance performance through effective graph optimization. This integration ensures that heavy computations are captured in a single graph, maximizing the opportunities for optimization.
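As a minimal illustration with a toy module (standing in for the much larger video transformer):

import torch
import torch.nn as nn

# A small stand-in module; compiling captures its forward pass as a
# single graph that the inductor backend can fuse and autotune.
model = nn.Sequential(nn.Linear(64, 64), nn.GELU(), nn.Linear(64, 64))
compiled = torch.compile(model, mode="max-autotune-no-cudagraphs")

out = compiled(torch.randn(8, 64))  # the first call triggers compilation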

4. FP8 Dynamic Quantization

Utilizing FP8 dynamic quantization helps reduce memory usage and increase inference speed by allowing the model to operate with 8-bit precision. This optimization is particularly effective on NVIDIA GPUs, enabling the use of Tensor Cores for improved performance.
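Conceptually, dynamic quantization derives the scale from the tensor’s values at runtime before casting to 8 bits. A hand-rolled sketch of the idea (the torchao code shown later handles this internally, per layer and on the GPU):

import torch

x = torch.randn(4, 64)

# Compute a per-tensor scale from the runtime values, cast into an
# 8-bit float format (e4m3), then dequantize for subsequent ops.
scale = x.abs().max() / torch.finfo(torch.float8_e4m3fn).max
x_fp8 = (x / scale).to(torch.float8_e4m3fn)
x_restored = x_fp8.to(torch.float32) * scale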

Supported Models

ParaAttention is designed to work with several popular AI video generators, including:

  • HunyuanVideo
  • FLUX
  • Mochi
  • CogVideoX

Each of these models can benefit from the advanced features offered by ParaAttention, enabling faster inference without sacrificing quality. 
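ParaAttention’s cache adapter dispatches on the pipeline class, so the same entry point covers each of these models. For example, attaching the cache to a FLUX pipeline might look like this (a sketch relying on the adapter’s defaults; thresholds are typically tuned per model):

import torch
from diffusers import FluxPipeline
from para_attn.first_block_cache.diffusers_adapters import apply_cache_on_pipe

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    torch_dtype=torch.bfloat16,
).to("cuda")

# The same call used for HunyuanVideo below, relying on the default
# residual-diff threshold.
apply_cache_on_pipe(pipe)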

Setting Up HunyuanVideo with ParaAttention

To leverage ParaAttention for optimizing HunyuanVideo, follow these steps:

Step 1: Install Required Libraries

Ensure that you have the latest versions of the necessary libraries installed. This includes ParaAttention and diffusers, which provide the framework for video generation.

pip3 install -U diffusers

pip3 install -U para-attn

Step 2: Load the Model

Begin by importing the necessary modules and loading the HunyuanVideo model. The following code snippet demonstrates this process:

import time
import torch
from diffusers import HunyuanVideoPipeline, HunyuanVideoTransformer3DModel
from diffusers.utils import export_to_video

model_id = "tencent/HunyuanVideo"
# Load the transformer in bfloat16; the pinned revision points at the
# Hub PR that provides diffusers-format weights.
transformer = HunyuanVideoTransformer3DModel.from_pretrained(
    model_id,
    subfolder="transformer",
    torch_dtype=torch.bfloat16,
    revision="refs/pr/18",
)
# The rest of the pipeline (text encoder, VAE) runs in float16.
pipe = HunyuanVideoPipeline.from_pretrained(
    model_id,
    transformer=transformer,
    torch_dtype=torch.float16,
    revision="refs/pr/18",
).to("cuda")

Step 3: Implement First Block Cache

To enable the First Block Cache, apply the caching mechanism to the pipeline:

from para_attn.first_block_cache.diffusers_adapters import apply_cache_on_pipe

apply_cache_on_pipe(pipe, residual_diff_threshold=0.0)  # 0.0 disables skipping

Step 4: Run Inference

Once the model is set up, run inference to generate video frames. The following code snippet illustrates how to generate frames and save the output:

begin = time.time()
output = pipe(
    prompt="A cat walks on the grass, realistic",
    height=720,
    width=1280,
    num_frames=129,
    num_inference_steps=30,
).frames[0]
end = time.time()
print(f"Time: {end - begin:.2f}s")

print("Saving video to hunyuan_video.mp4")
export_to_video(output, "hunyuan_video.mp4", fps=15)

This is the baseline setup: with residual_diff_threshold=0.0 the cache never skips any computation, so the timing from this run serves as the reference measurement.

Optimizing HunyuanVideo Inference Speed with ParaAttention

1. Applying First Block Cache

By caching the outputs of transformer blocks, ParaAttention enables the reuse of previous computations, resulting in faster inference. To apply this optimization, set the appropriate threshold value in the caching function:

apply_cache_on_pipe(pipe, residual_diff_threshold=0.035)

This adjustment lets the model skip the redundant transformer computation within a denoising step whenever the output difference falls below the specified threshold, substantially cutting inference time; in the benchmark below it yields a 1.59x speedup.

HunyuanVideo with FBCache

The First Block Cache is very effective at speeding up inference while introducing nearly no quality loss in the generated video.

2. Dynamic Quantization for Enhanced Performance

Dynamic quantization can further enhance inference speed. By quantizing both the activations and weights of the model to FP8, developers can significantly reduce memory usage while improving processing speed. First, install the required packages:

pip3 install -U torch torchao

We also pass the transformer to torch.compile so the backend can generate and select the best kernels for inference. Compilation can take some time on the first run.

import time
import torch
from diffusers import HunyuanVideoPipeline, HunyuanVideoTransformer3DModel
from diffusers.utils import export_to_video

model_id = "tencent/HunyuanVideo"
transformer = HunyuanVideoTransformer3DModel.from_pretrained(
    model_id,
    subfolder="transformer",
    torch_dtype=torch.bfloat16,
    revision="refs/pr/18",
)
pipe = HunyuanVideoPipeline.from_pretrained(
    model_id,
    transformer=transformer,
    torch_dtype=torch.float16,
    revision="refs/pr/18",
).to("cuda")

from para_attn.first_block_cache.diffusers_adapters import apply_cache_on_pipe

# Enable FBCache with its default residual-diff threshold.
apply_cache_on_pipe(pipe)

from torchao.quantization import quantize_, float8_dynamic_activation_float8_weight, float8_weight_only

# Weight-only FP8 for the text encoder; dynamic activation-and-weight
# FP8 for the transformer, which dominates inference cost.
quantize_(pipe.text_encoder, float8_weight_only())
quantize_(pipe.transformer, float8_dynamic_activation_float8_weight())
pipe.transformer = torch.compile(
    pipe.transformer, mode="max-autotune-no-cudagraphs",
)

This setup helps the model run efficiently at high resolutions and reduces the risk of out-of-memory (OOM) errors.
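Because compilation and autotuning happen on the first call, it is common to run a short warmup pass before timing, so the tuned kernels are already in place. A sketch with illustrative settings (shapes should match the real run to avoid recompilation):

# Warmup: triggers torch.compile's graph capture and autotuning once,
# so a subsequent timed run measures only steady-state inference.
pipe(
    prompt="warmup",
    height=720,
    width=1280,
    num_frames=129,
    num_inference_steps=1,
)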

3. Parallelizing Inference with Multiple GPUs

Context parallelism distributes inference across multiple GPUs for maximum performance. First, initialize the distributed process group and apply ParaAttention’s context-parallel adapter to the pipeline:

import torch
import torch.distributed as dist

from diffusers import HunyuanVideoPipeline, HunyuanVideoTransformer3DModel
from diffusers.utils import export_to_video

# torchrun launches one process per GPU and assigns each a rank.
dist.init_process_group()
torch.cuda.set_device(dist.get_rank())

# ... load the pipeline and apply the cache as in the previous steps ...

from para_attn.context_parallel import init_context_parallel_mesh
from para_attn.context_parallel.diffusers_adapters import parallelize_pipe

# Shard the transformer's sequence dimension across all visible GPUs.
parallelize_pipe(pipe, mesh=init_context_parallel_mesh(pipe.device.type))
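Under torchrun every rank executes the whole script, so the generated video is typically written from rank 0 only. A sketch of the common pattern:

output = pipe(
    prompt="A cat walks on the grass, realistic",
    height=720,
    width=1280,
    num_frames=129,
    num_inference_steps=30,
).frames[0]

# Only rank 0 writes the file; all ranks then tear down the process group.
if dist.get_rank() == 0:
    export_to_video(output, "hunyuan_video.mp4", fps=15)

dist.destroy_process_group()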

Launch the script with torchrun, one process per GPU, to parallelize HunyuanVideo inference across NVIDIA L20 GPUs. For 2 GPUs:

torchrun --nproc_per_node=2 run_hunyuan_video.py

For 4 GPUs:

torchrun --nproc_per_node=4 run_hunyuan_video.py

For 8 GPUs:

torchrun --nproc_per_node=8 run_hunyuan_video.py

These commands distribute inference across multiple GPUs, drastically reducing processing time.

Evaluating Performance Improvements

1. Baseline inference on a single NVIDIA L20 GPU: generating 129 frames at 720p with 30 inference steps takes 3626.33 seconds.

2. Applying First Block Cache (FBCache): the same workload takes 2271.06 seconds, a 1.59x speedup over the baseline.

3. Parallelizing inference with Context Parallelism:

  • 2 NVIDIA L20 GPUs with FBCache: 1132.90 seconds, a 3.20x speedup over the baseline
  • 4 NVIDIA L20 GPUs with FBCache: 718.15 seconds, a 5.05x speedup over the baseline
  • 8 NVIDIA L20 GPUs with FBCache: 649.23 seconds, a 5.58x speedup over the baseline

4. Combining Optimizations

FBCache + Context Parallelism + torch.compile + FP8 Dynamic Quantization

Together, these techniques achieve the fastest HunyuanVideo inference, with up to a 5.58x speedup on NVIDIA L20 GPUs compared to the baseline.

These figures indicate a significant enhancement in inference speed, showcasing the effectiveness of the techniques employed by ParaAttention.

Practical Applications

ParaAttention is particularly beneficial in scenarios where inference speed is paramount. Applications in fields such as video processing, natural language processing, and image generation can greatly benefit from the accelerated performance provided by this library. The advancements brought forth by ParaAttention also have far-reaching implications across multiple industries. In sectors such as entertainment, gaming, and education, the ability to generate high-quality video content in real-time can revolutionize how content is created and consumed. Furthermore, as AI continues to integrate into everyday applications, the demand for efficient models will only increase, making ParaAttention an essential tool for developers.

The Future of AI Video Generation with ParaAttention

ParaAttention’s comprehensive optimization suite empowers creators, researchers, and developers to unlock the true potential of HunyuanVideo and other generative AI video models. By delivering fast inference speeds, even on lower-VRAM devices, ParaAttention paves the way for real-time applications, seamless deployment, and the continued evolution of the generative AI landscape. Whether you’re working on video generation, image generation, or another generative AI task, ParaAttention’s easy-to-use interface and powerful optimization techniques can help you achieve substantial performance gains. For more information, visit the ParaAttention GitHub repository.
