
Conversation

AlanPonnachan (Contributor) commented Nov 29, 2025

What does this PR do?

This PR adds support for MagCache (Magnitude-aware Cache), a training-free inference acceleration method for diffusion models, specifically targeting Transformer-based architectures like Flux.

This implementation follows the ModelHook pattern (similar to FirstBlockCache) to integrate seamlessly into Diffusers.

Key features:

  • MagCacheConfig: Configuration class to control threshold, retention ratio, and skipping limits.
  • Calibration Mode: Adds a calibrate=True flag. When enabled, the hook runs full inference and calculates/prints the magnitude ratios for the specific model and scheduler. This makes MagCache compatible with any transformer model (e.g., Hunyuan, Wan, SD3), not just Flux.
  • Strict Validation: To ensure correctness across different models, mag_ratios must be explicitly provided in the config (or calibration enabled).
  • Flux Support: Includes pre-computed FLUX_MAG_RATIOS as a constant for convenience, derived from the official implementation.
  • Mechanism: The hook calculates the accumulated error of the residual magnitude at each step. If the error is below the defined threshold, it skips the computation of the transformer blocks and approximates the output using the residual from the previous step (see the sketch below).

Fixes #12697 (Magcache Support).
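
To make the mechanism concrete, here is a minimal sketch of the skip decision. The function and field names (maybe_skip_step, accumulated_error, residual_cache, max_skips, retention_ratio) are illustrative assumptions and do not necessarily match the identifiers used in this PR:

 # Illustrative sketch of the MagCache skip decision, not the exact hook code.
 def maybe_skip_step(step, num_steps, hidden_states, state, config):
     # Never skip the early steps covered by the retention ratio.
     if step < int(config.retention_ratio * num_steps):
         return None  # caller runs the transformer blocks as usual

     # Estimate the error introduced by reusing the cached residual.
     state.accumulated_ratio *= config.mag_ratios[step]
     state.accumulated_error += abs(1.0 - state.accumulated_ratio)
     state.accumulated_skips += 1

     if state.accumulated_error < config.threshold and state.accumulated_skips <= config.max_skips:
         # Skip the blocks: approximate the output with the cached residual.
         return hidden_states + state.residual_cache

     # Error too large (or too many consecutive skips): reset and compute normally.
     state.accumulated_ratio = 1.0
     state.accumulated_error = 0.0
     state.accumulated_skips = 0
     return None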

Before submitting

Who can review?

@sayakpaul

sayakpaul requested a review from DN6 on November 29, 2025 at 06:19.
sayakpaul (Member) commented:

@leffff could you review as well if possible?

leffff (Contributor) commented Dec 2, 2025

Hi @AlanPonnachan @sayakpaul
The thing with MagCache is that it requires precomputing magnitudes. @AlanPonnachan has done it for Flux, but how will this work for other models?

leffff (Contributor) commented Dec 4, 2025

@AlanPonnachan ?

AlanPonnachan (Contributor Author) commented:

@leffff, thank you for your review.

To address this, I am implementing a Calibration Mode.

My plan is to add a calibrate=True flag to MagCacheConfig. When enabled:

  1. The pipeline runs full inference (no skipping).
  2. The hook calculates the residual magnitude ratios at every step.
  3. At the end of inference, it logs/prints the resulting array of ratios.

Users can then simply run one calibration pass for their specific model/scheduler, copy the output ratios, and pass them into MagCacheConfig(mag_ratios=[...]) for optimized inference. This makes the implementation completely model-agnostic.

I am working on this update now and will push the changes shortly!
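
As a rough illustration of step 2 above, the per-step ratio could be computed like this (a minimal sketch; the helper name and the mean-absolute-value norm are assumptions on my side, not necessarily the final implementation):

 import torch

 # Sketch: record the magnitude ratio of the current residual vs. the previous one.
 def record_mag_ratio(output, hidden_states, prev_residual, mag_ratios):
     # The residual is what the transformer blocks added on top of their input.
     residual = output - hidden_states

     if prev_residual is None:
         # The first step has nothing to compare against; use 1.0 by convention.
         mag_ratios.append(1.0)
     else:
         # Ratio of mean residual magnitudes between consecutive steps.
         ratio = residual.abs().mean() / prev_residual.abs().mean()
         mag_ratios.append(ratio.item())

     return residual  # becomes prev_residual for the next step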

leffff (Contributor) commented Dec 4, 2025


Sounds great!
I am not a Diffusers maintainer, but I believe adding such a calibration step will indeed make this universal (it is similar to compiling). After this update, I believe this will be completely usable!

sayakpaul (Member) commented:

Thanks for the thoughtful discussions here @AlanPonnachan and @leffff! I will leave my two cents below:

  • The calibration steps outlined in Add support for Magcache  #12744 (comment) are great! What if we ship a utility to just log/print those values so that users can pass them to the MagCacheConfig? We could provide that utility script either from scripts or from src/diffusers/utils. I think this will be more explicit and enforce some kind of user awareness.
  • If the mag_ratios are supposed to be checkpoint-dependent, I think we should always enforce passing mag_ratios from the config and, when they are not provided, raise a sensible error message that instructs the user on how to derive mag_ratios.

Ccing @DN6 to get his thoughts here, too.

sayakpaul added the performance label on Dec 4, 2025.
AlanPonnachan (Contributor Author) commented:

Thanks @sayakpaul and @leffff for the feedback!

I have updated the PR to address these points. Instead of a standalone utility script, I integrated the calibration logic directly into the hook configuration for better usability:

  1. Strict Enforcement: mag_ratios is now mandatory. If not provided (and calibrate=False), a ValueError is raised with instructions on how to derive them (sketched after this list).
  2. Calibration Mode: I added a calibrate=True flag to MagCacheConfig. When enabled, the hooks run full inference (no skipping) and log/print the calculated magnitude ratios at the end. This allows users to easily generate ratios for any model/scheduler combination using their existing pipeline code.
  3. Flux Convenience: I kept FLUX_MAG_RATIOS as a constant for convenience, but the user must now explicitly import and pass it to the config.
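
For illustration, the validation behaves roughly like the following sketch (the class name MagCacheConfigSketch and the exact error wording are placeholders, not the actual code in this PR):

 from dataclasses import dataclass
 from typing import List, Optional

 # Sketch of the strict-validation idea; field names mirror the described config.
 @dataclass
 class MagCacheConfigSketch:
     mag_ratios: Optional[List[float]] = None
     calibrate: bool = False

     def __post_init__(self):
         if self.mag_ratios is None and not self.calibrate:
             raise ValueError(
                 "`mag_ratios` must be provided because they depend on the model and scheduler. "
                 "Run once with `calibrate=True` to log the ratios for your pipeline, "
                 "then pass them via `MagCacheConfig(mag_ratios=[...])`."
             )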

Ready for review!

leffff (Contributor) commented Dec 4, 2025

Looks great! Could you please provide a usage example:

  1. Import and load a specific model
  2. Inference
  3. Calibrate
  4. Inference with MagCache

And please provide generations.

To be sure it works, please provide generations for SD3.5 Medium, Flux, and Wan T2V 2.1 1.3B. Since caching is suitable for all tasks, could we also try Kandinsky 5.0 Video Pro I2V (kandinskylab/Kandinsky-5.0-I2V-Pro-sft-5s-Diffusers)?

AlanPonnachan (Contributor Author) commented Dec 7, 2025

@leffff

1. Usage Example

 import torch
 from diffusers import FluxPipeline
 from diffusers.hooks import MagCacheConfig, apply_mag_cache

 pipe = FluxPipeline.from_pretrained("black-forest-labs/FLUX.1-schnell", torch_dtype=torch.bfloat16).to("cuda")

 # CALIBRATION STEP
 config = MagCacheConfig(calibrate=True, num_inference_steps=4)
 apply_mag_cache(pipe.transformer, config)
 pipe("A cat playing chess", num_inference_steps=4)
 # Logs: [1.0, 1.37, 0.97, 0.87]

 # INFERENCE STEP
 config = MagCacheConfig(mag_ratios=[1.0, 1.37, 0.97, 0.87], num_inference_steps=4)
 apply_mag_cache(pipe.transformer, config) 
 pipe("A cat playing chess", num_inference_steps=4)

2. Benchmark Results

I validated the implementation on Flux, SD 3.5, and Wan 2.1 using a T4 Colab environment.

Flux.1-Schnell
  • Baseline time: ~10m 31s
  • MagCache time: ~7m 55s
  • Speedup: ~1.33x
  • Generated ratios: [1.0, 1.371991753578186, 0.9733748435974121, 0.8640348315238953]
  • Notes: Full generation successful.

SD 3.5 Medium
  • Baseline time: ~4m 46s / ~4m 51s (two runs)
  • MagCache time: ~1m 36s (threshold = 0.15) / ~2m 43s (threshold = 0.03)
  • Speedup: ~3.0x (threshold = 0.15) / ~1.79x (threshold = 0.03)
  • Generated ratios (threshold = 0.15): [1.0, 1.0182535648345947, 1.0475366115570068, 1.0192866325378418, 1.007051706314087, 1.013611078262329, 1.0057004690170288, 1.0053653717041016, 0.9967299699783325, 0.9996473789215088, 0.9947380423545837, 0.9942205548286438, 0.9788764715194702, 0.9873758554458618, 0.9801908731460571, 0.9658506512641907, 0.9565740823745728, 0.9469784498214722, 0.9258849620819092, 1.3470091819763184]
  • Generated ratios (threshold = 0.03): [1.0, 1.0172510147094727, 1.0381698608398438, 1.0167241096496582, 1.0070651769638062, 1.0107033252716064, 1.0043275356292725, 1.0044840574264526, 0.9945924282073975, 0.9993497133255005, 0.9941253662109375, 0.9904510974884033, 0.9783601760864258, 0.9845271110534668, 0.9771078824996948, 0.9657461047172546, 0.9529474973678589, 0.9403719305992126, 0.9110836982727051, 1.3032703399658203]
  • Notes: Validated hooks without the T5 encoder (RAM limit).

Wan 2.1 (1.3B)
  • Baseline time: ~22s
  • MagCache time: ~1s
  • Speedup: ~22x
  • Generated ratios: [1.0, 0.9901599884033203, 0.9980327486991882, 1.001886248588562, 1.0045758485794067, 1.0067006349563599, 1.0093395709991455, 1.0129660367965698, 1.0191177129745483, 1.0308380126953125]
  • Notes: Validated hooks with dummy embeddings (RAM limit).

Kandinsky 5.0
  • Baseline / MagCache / Speedup: N/A
  • Notes: Added visual_transformer_blocks support, but the run hit disk limits. Logic matches Wan/Flux.

3. Generations

Attached below are the outputs for the successful runs.

Flux (Baseline):
flux_baseline

Flux (MagCache):
flux_magcache

SD 3.5 (Baseline):
sd35_baseline

SD 3.5 (MagCache -- threshold = 0.15):
sd35_magcache

SD 3.5 (Baseline):
sd35_baseline (1)

SD 3.5 (MagCache -- threshold = 0.03):
sd35_magcache (1)

AlanPonnachan (Contributor Author) commented:

Here is the Colab notebook used to generate the benchmarks above. It includes the full setup, memory optimizations (sequential offloading/dummy embeds), and the execution logs:

magcache inference
