Introduction
How do you align generative AI with human preferences? This question has driven remarkable progress in both large language models and text-to-image generation. The formula is well-established: train on massive web-scale data, then align the model through curated datasets and reinforcement learning from human feedback (RLHF)[1], [2], [3]. Systems like FLUX[5] and Stable Diffusion 3[4] follow exactly this recipe.
The Problem
But this paradigm has a cost. Post-hoc alignment discards informative "low-quality" data[6], complicates training with an additional optimization stage, and often overfits to a single reward — leading to mode collapse, reduced diversity, or degraded semantic fidelity.
Our Solution: MIRO
Our answer is MIRO (MultI-Reward cOnditioning)—a framework that integrates multiple reward signals directly into the pretraining objective for text-to-image generation. Similar to previous work[6], we condition the generative model on a vector of reward scores per text-image pair. The rewards span aesthetics, user preference, semantic correspondence, visual reasoning, and domain-specific correctness. This way, the model learns an explicit mapping from desired reward levels to visual characteristics—right from the beginning.
This simple change has powerful consequences: it preserves the full spectrum of data quality instead of filtering it out, turns alignment into a controllable variable at inference time, and by providing rich supervision at scale, accelerates convergence and improves sample efficiency.
Key Benefits
- MIRO eliminates separate fine-tuning or RL stages by integrating reward alignment directly into pretraining. Rich supervision at scale enables convergence up to 19× faster than regular pretraining.
- Unlike pipelines that discard "low-quality" data, MIRO trains across the entire reward spectrum. This reduces mode collapse and yields representations that generalize across all quality levels.
- Users can dial individual rewards up or down at inference time to achieve precise trade-offs—boost aesthetics without collapsing alignment, or prioritize compositional correctness for complex prompts.
- Single-objective optimization often leads to reward hacking[14] where models exploit specific metrics. MIRO's multi-dimensional conditioning naturally prevents this by balancing multiple objectives simultaneously.
Our Contributions
- We propose MIRO: reward-conditioned pretraining that integrates multiple rewards directly during training, eliminating the need for post-hoc alignment.
- State-of-the-art performance: Our small 350M-parameter model trained on just 16M images achieves top scores on GenEval[11] and user-preference metrics[10], [8], [9], outperforming much larger models like FLUX-dev[5] (12B parameters) trained for much longer.
- Unprecedented efficiency: MIRO converges up to 19× faster than regular training and achieves comparable quality with 370× less inference compute than FLUX[5].
Method
Our method consists of three key components that work together to enable efficient, controllable text-to-image generation:
- Dataset Augmentation: enrich the pretraining dataset with reward annotations across multiple quality dimensions.
- Multi-Reward Conditioned Training: modify the flow matching objective to incorporate reward signals directly into the generative process.
- Reward-Guided Inference: enable fine-grained control over generation quality through explicit reward conditioning during sampling.
Problem Formulation
Let \(\mathcal{D} = \{(x^{(i)}, c^{(i)})\}_{i=1}^{M}\) be a large-scale pretraining dataset where \(x^{(i)} \in \mathbb{R}^{H \times W \times 3}\) represents an image and \(c^{(i)} \in \mathcal{T}\) represents the corresponding text condition (e.g., caption, prompt). Traditional pretraining learns a generative model \(p_\theta(x|c)\) that captures the joint distribution of images and text without explicit quality control.
In contrast, we consider a set of \(N\) reward models \(\mathcal{R} = \{r_1, r_2, \ldots, r_N\}\) where each \(r_j: \mathbb{R}^{H \times W \times 3} \times \mathcal{T} \rightarrow \mathbb{R}\) evaluates different aspects of image quality. Our goal is to learn a conditional generative model \(p_\theta(x|c, \mathbf{s})\) where \(\mathbf{s} = [s_1, s_2, \ldots, s_N]\) represents the desired reward levels, enabling controllable generation across multiple quality dimensions.
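To make the reward-model interface concrete, here is a minimal Python sketch of the signature assumed above; the `RewardSuite` wrapper and the dummy scorer are illustrative names, not part of any released code.

```python
from typing import Callable, List

import torch

# A reward model r_j maps an (image, caption) pair to a scalar score.
RewardModel = Callable[[torch.Tensor, str], float]

def dummy_aesthetic_score(image: torch.Tensor, caption: str) -> float:
    """Hypothetical stand-in for a real scorer (aesthetics, preference, CLIP, ...)."""
    return float(image.mean())  # placeholder logic only

class RewardSuite:
    """Bundles N reward models into one vector-valued evaluator r(x, c) -> s."""

    def __init__(self, reward_models: List[RewardModel]):
        self.reward_models = reward_models

    def __call__(self, image: torch.Tensor, caption: str) -> torch.Tensor:
        # Returns s = [s_1, ..., s_N] for a single (image, caption) pair.
        return torch.tensor([r(image, caption) for r in self.reward_models])
```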
Dataset Augmentation with Reward Scores
The first step of MIRO involves augmenting the pretraining dataset with comprehensive reward annotations. For each sample \((x^{(i)}, c^{(i)}) \in \mathcal{D}\), we compute reward scores across all \(N\) reward models:

\[
s_j^{(i)} = r_j\big(x^{(i)}, c^{(i)}\big), \qquad j = 1, \ldots, N.
\]
This process transforms our dataset into an enriched version \(\tilde{\mathcal{D}} = \{(x^{(i)}, c^{(i)}, \mathbf{s}^{(i)})\}_{i=1}^{M}\) where \(\mathbf{s}^{(i)} = [s_1^{(i)}, s_2^{(i)}, \ldots, s_N^{(i)}]\) contains the multi-dimensional quality assessment for each sample.
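As an illustration, the sketch below annotates each (image, caption) pair with its \(N\)-dimensional score vector and discretizes each score into one of \(B\) bins, matching the \(B\) used for reward normalization in the glossary further down; the helper names are ours, and how the bin edges are chosen (e.g. per-reward percentiles over the dataset) is left open here.

```python
import torch

def bin_rewards(scores: torch.Tensor, edges: torch.Tensor) -> torch.Tensor:
    """Discretize raw scores into integer bins in {0, ..., B-1}.

    scores: shape (N,), raw outputs of the N reward models.
    edges:  shape (N, B-1), per-reward bin edges.
    """
    return torch.stack(
        [torch.bucketize(scores[j], edges[j]) for j in range(scores.shape[0])]
    )

def augment_dataset(dataset, reward_suite, edges):
    """Yield (image, caption, binned reward vector) triples, i.e. the enriched dataset."""
    for image, caption in dataset:
        raw = reward_suite(image, caption)        # s^(i) in R^N
        yield image, caption, bin_rewards(raw, edges)
```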
MIRO Training Pipeline: Images and captions are evaluated by multiple reward models, producing a score vector ŝ. The noised image, caption, and scores condition the denoiser, teaching it to map reward levels to visual characteristics.
Multi-Reward Conditioned Flow Matching
Having augmented our dataset with reward scores, we now incorporate these signals into the generative model architecture. We build upon flow matching[12], a powerful framework for training continuous normalizing flows that has shown excellent performance in high-resolution image generation.
Training Objective. Following the standard flow matching formulation, we sample noise \(\epsilon \sim \mathcal{N}(0, I)\) and time \(t \sim \mathcal{U}(0, 1)\), then compute the noisy sample \(x_t = (1-t)x + t\epsilon\). The multi-reward flow matching loss becomes:

\[
\mathcal{L}_{\mathrm{MIRO}}(\theta) = \mathbb{E}_{(x, c, \mathbf{s}) \sim \tilde{\mathcal{D}},\, \epsilon \sim \mathcal{N}(0, I),\, t \sim \mathcal{U}(0, 1)} \Big[ \big\lVert v_\theta(x_t, c, \mathbf{s}) - (\epsilon - x) \big\rVert^2 \Big].
\]
This objective trains the model to predict the difference between the noise and the clean image, conditioned on both the text prompt and the desired quality levels. The model learns to associate different reward levels with corresponding visual characteristics, enabling reward-aware generation.
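For concreteness, here is a minimal PyTorch-style sketch of one training step under this objective, assuming a denoiser `v_theta(x_t, t, caption_emb, s)` that takes the binned reward vector as an extra conditioning input; how the rewards are embedded inside the network is an architectural detail not shown here.

```python
import torch
import torch.nn.functional as F

def miro_training_step(v_theta, x, caption_emb, s, optimizer):
    """One multi-reward flow matching training step.

    x:            clean images, shape (batch, C, H, W)
    caption_emb:  text conditioning (e.g. caption encoder features)
    s:            binned reward vectors, shape (batch, N), values in {0, ..., B-1}
    """
    noise = torch.randn_like(x)                   # epsilon ~ N(0, I)
    t = torch.rand(x.shape[0], device=x.device)   # t ~ U(0, 1)
    t_ = t.view(-1, 1, 1, 1)
    x_t = (1.0 - t_) * x + t_ * noise             # noisy sample x_t = (1-t)x + t*eps

    target = noise - x                            # flow matching velocity (eps - x)
    pred = v_theta(x_t, t, caption_emb, s)        # reward-conditioned prediction

    loss = F.mse_loss(pred, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```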
Training Dynamics. During training, the model observes the full spectrum of quality levels for each reward dimension. This exposure allows it to learn the relationship between reward values and visual features, from low-quality samples that may exhibit artifacts or poor composition to high-quality samples with superior aesthetics and text alignment. The model learns the entire reward landscape rather than overfitting to a single objective.
Reward-Guided Inference
At inference time, MIRO provides unprecedented control over the generation process through explicit reward conditioning. We offer three complementary sampling strategies:
- Maximize all rewards simultaneously for best overall quality
- Amplify quality through positive/negative reward contrasts
- Fine-tune individual reward dimensions on demand
1. High-Quality Generation
For generating high-quality samples, we condition the model on maximum reward values across all \(N\) dimensions: \(\hat{\mathbf{s}}_{\mathrm{max}} = [B-1, B-1, \ldots, B-1]\), where \(B\) is the number of bins used for reward normalization (see the glossary below). This instructs the model to generate samples that maximize all reward objectives simultaneously.
2. Multi-Reward Classifier-Free Guidance
We extend classifier-free guidance to the multi-reward setting by leveraging the reward conditioning mechanism. Following the Coherence-Aware CFG approach[6], we introduce a positive and a negative reward target, denoted \(\hat{\mathbf{s}}^{+}\) and \(\hat{\mathbf{s}}^{-}\), which can be chosen by the user for controllability.
Glossary
- \(x_t=(1-t)x + t\epsilon\): noised sample
- \(\hat{\mathbf{s}}\): reward targets (Aesthetic, Pick, CLIP, HPSv2, ImageReward, ...)
- \(\omega\): guidance scale
- \(\hat{\mathbf{s}}^{+}\)/\(\hat{\mathbf{s}}^{-}\): positive/negative reward targets
- \(B\): number of bins for reward normalization
MIRO Inference Pipeline: A prompt, along with positive (s+) and negative (s-) reward targets, are input to the MIRO model. Starting from noise, the model iteratively denoises the image, guided by the reward targets, to produce the final high-quality output.
By amplifying the guidance direction \(v_\theta(x_t, c, \hat{\mathbf{s}}^{+}) - v_\theta(x_t, c, \hat{\mathbf{s}}^{-})\) with the guidance scale \(\omega\), we push generated samples toward parts of the distribution characterized by superior aesthetic quality, text alignment, and other desired attributes. As in the weak-guidance framework[13], where a bad version of the model guides the good one, guidance here is provided by the contrast between high-reward and low-reward conditioning.
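Putting the two inference modes together, here is a hedged sketch of the guided velocity and a plain Euler sampler conditioned on \(\hat{\mathbf{s}}_{\mathrm{max}}\); the `v_theta` interface matches the training sketch above, and the guidance scale and step count are illustrative defaults rather than recommended settings.

```python
import torch

def guided_velocity(v_theta, x_t, t, caption_emb, s_pos, s_neg, omega):
    """Multi-reward CFG: v = v(s-) + omega * (v(s+) - v(s-))."""
    v_pos = v_theta(x_t, t, caption_emb, s_pos)
    v_neg = v_theta(x_t, t, caption_emb, s_neg)
    return v_neg + omega * (v_pos - v_neg)

@torch.no_grad()
def sample(v_theta, caption_emb, shape, n_rewards, n_bins, omega=3.0, steps=50):
    """Draw a sample while steering toward maximal rewards (s+ = s_max, s- = 0)."""
    device = caption_emb.device
    s_pos = torch.full((shape[0], n_rewards), n_bins - 1, device=device)   # s_max
    s_neg = torch.zeros((shape[0], n_rewards), dtype=torch.long, device=device)

    x = torch.randn(shape, device=device)             # pure noise at t = 1
    ts = torch.linspace(1.0, 0.0, steps + 1, device=device)
    for i in range(steps):
        t = ts[i].expand(shape[0])
        v = guided_velocity(v_theta, x, t, caption_emb, s_pos, s_neg, omega)
        x = x + (ts[i + 1] - ts[i]) * v               # Euler step toward t = 0
    return x
```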
3. Flexible Reward Trade-offs
A key advantage of MIRO is the ability to specify custom reward targets at inference time. Users can set \(\hat{\mathbf{s}}_{\mathrm{custom}} = [\hat{s}_1, \hat{s}_2, \ldots, \hat{s}_N]\) where each component represents the desired level for that reward dimension.
Interactive Reward Control: As we adjust the reward weights in real-time, the generated output changes to reflect the new quality targets. The histogram on the left shows the current reward configuration, while the MIRO model processes these weights to generate the corresponding output on the right. Notice how different reward combinations lead to distinct visual characteristics.
Example use cases:
- Dial up aesthetics without collapsing text alignment
- Prioritize compositional correctness for complex prompts
- Balance multiple objectives according to application needs
- Explore the quality-alignment frontier interactively
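As a small illustration of these use cases, a custom target vector can be assembled from per-reward levels; the reward ordering, helper name, and bin count below are placeholders, with unspecified rewards defaulting to their maximum bin.

```python
import torch

# Illustrative ordering only; the true reward indices are model-specific.
REWARDS = ["aesthetic", "pick", "clip", "hpsv2", "image_reward"]

def make_target(levels: dict, n_bins: int) -> torch.Tensor:
    """Build s_custom from per-reward levels in [0, 1]; unspecified rewards stay at max."""
    return torch.tensor(
        [round(levels.get(name, 1.0) * (n_bins - 1)) for name in REWARDS]
    )

# Example: keep text alignment at its maximum while slightly relaxing HPSv2.
s_custom = make_target({"aesthetic": 1.0, "clip": 1.0, "hpsv2": 0.75}, n_bins=8)
```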
Training efficiency
MIRO demonstrates significant training efficiency gains compared to traditional pretraining. The charts below show convergence curves across different reward metrics. This dramatic acceleration stems from the additional supervisory signal provided by reward conditioning—by teaching the model to associate reward levels with visual characteristics from the start, we provide rich supervision at every training step.
- Rich supervision at every step: Multi-reward conditioning adds informative gradients throughout training, enabling the model to learn quality-aware generation much faster than discovering these associations implicitly.
- Consistent acceleration: The efficiency gains are consistent across all reward dimensions, demonstrating that MIRO's benefits generalize beyond any single metric.
Qualitative evidence of accelerated convergence
The progression visualizations below provide compelling qualitative evidence of MIRO's training acceleration. Use the slider to step through training checkpoints and observe how quickly MIRO learns compared to the baseline:
- "Tiger in a tuxedo": MIRO establishes proper compositional layout (tiger wearing formal attire) and generates a visually appealing result within 50k training steps, while the baseline requires 200k steps to reach comparable quality.
- "Mad scientist panda": MIRO rapidly converges to aesthetically pleasing results with recognizable characters and correct attributes. The baseline model fails to generate a recognizable panda until 400k steps.
These qualitative improvements directly complement our quantitative findings above, demonstrating that MIRO's multi-reward conditioning doesn't just improve metrics—it enables fundamentally faster learning of complex compositional concepts and visual aesthetics.
Synergizing with test-time scaling
Test-time scaling—generating multiple samples and selecting the best one—has emerged as a popular method to improve reward performance[15]. We demonstrate that MIRO achieves superior sample efficiency compared to baseline models when combined with this technique. The charts below show performance across varying sample counts (1 to 128 samples, displayed on a log-2 scale).
Quantifying inference-time efficiency improvements: The efficiency gains are particularly striking for specific metrics. For ImageReward, MIRO with 8 samples matches the performance of the baseline with 128 samples, representing a 16× efficiency improvement. For PickScore, MIRO achieves equivalent performance with only 4 samples compared to the baseline's 128 samples, demonstrating a remarkable 32× efficiency gain.
- Single-sample dominance: For Aesthetic Score and HPSv2, MIRO's single sample surpasses what the baseline achieves even with 128 samples—this dramatic efficiency gain highlights MIRO's ability to generate high-quality samples without requiring extensive test-time computation.
- Consistent advantages: MIRO consistently outperforms the baseline across all reward metrics, establishing it as not only a superior training approach but also a more efficient inference-time method.
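For reference, the best-of-N selection used in these test-time scaling comparisons can be sketched as follows; `generate` and `score` are placeholders for a sampler and a reward model (e.g. ImageReward), not names from the released code.

```python
def best_of_n(generate, score, prompt, n_samples: int):
    """Test-time scaling baseline: draw n candidates, keep the highest-scoring one.

    generate(prompt) -> image        (placeholder for a sampler)
    score(image, prompt) -> float    (placeholder for a reward model)
    """
    candidates = [generate(prompt) for _ in range(n_samples)]
    scores = [score(img, prompt) for img in candidates]
    best = max(range(n_samples), key=lambda i: scores[i])
    return candidates[best], scores[best]
```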
Balanced performance across all metrics
MIRO outperforms single-reward approaches
We evaluate three training configurations on the CC12M+LA6 dataset: (1) a baseline model trained without reward conditioning, (2) single-reward models conditioned on individual rewards (similar to Coherence Aware Diffusion[6] but using our reward suite instead of CLIP score), and (3) MIRO conditioned on all seven rewards simultaneously. The radar plots below show results across AestheticScore, PickScore, ImageReward, HPSv2, and JINA CLIP score, plus OpenAI CLIP score as an out-of-distribution metric not used during training.
- Multi-objective optimization: By optimizing across multiple complementary objectives simultaneously, MIRO avoids the overfitting that occurs when models focus exclusively on a single reward signal.
- Balanced gains: MIRO consistently outperforms all baselines across aesthetic and preference metrics, demonstrating the effectiveness of multi-reward conditioning.
Select a comparison model above to visualize how MIRO performs against different baselines. Notice how single-reward models excel on their target metric but often underperform on others, while MIRO maintains strong performance across all dimensions.
Enhancing compositional understanding
Beyond optimizing for specific reward metrics, MIRO demonstrates significant improvements in text-image alignment as measured by GenEval. This enhancement is particularly pronounced in challenging compositional reasoning tasks:
- 29 → 38 (+31% improvement)
- 55 → 68 (+24% improvement)
- 49 → 55 (+12% improvement)
These results demonstrate that MIRO's multi-reward conditioning enables better understanding of complex spatial relationships, object interactions, and numerical concepts—achieving balanced optimization that excels across diverse evaluation criteria while maintaining strong performance on individual metrics.
MIRO and synthetic captions
Synthetic captioning has emerged as the go-to method for improving text-image alignment in generative models, offering the advantage of retaining all training data without filtering based on caption quality. We evaluate MIRO using a mixture of 50% synthetic and 50% real captions.
MIRO outperforms synthetic captioning alone
Our results demonstrate that MIRO without synthetic captions achieves comparable GenEval performance to baseline models trained with synthetic captions. More importantly, MIRO without synthetic captions significantly outperforms the synthetic caption baseline across reward metrics.
Combining MIRO with synthetic captions
Combining MIRO with synthetic captions yields the strongest overall performance. While maintaining equivalent aesthetic quality to MIRO without synthetic captions, this combined approach achieves a remarkable GenEval score of 68, substantially improving over the synthetic caption baseline of 57 (+19%). The improvements are consistent across all compositional reasoning metrics:
- 30 → 46 (+53% improvement)
- 43 → 52 (+21% improvement)
- 58 → 73 (+26% improvement)
- 44 → 61 (+39% improvement)
- 93 → 97 (+4% improvement)
- 57 → 68 (+19% improvement, overall GenEval)
These comprehensive gains across all compositional aspects demonstrate that MIRO effectively benefits from synthetic captions for text-image alignment, achieving superior compositional understanding while preserving aesthetic quality.
Detailed comparison: The charts below show GenEval categories (left) and aesthetic metrics (right) for Baseline/MIRO with real captions and Synth Baseline/Synth MIRO with 50/50 synthetic captions.
Flexible reward trade-offs at inference
Reward weighting exposes controllable trade-offs
Our test-time scaling results show that selecting samples by Aesthetic Score can reduce GenEval performance, indicating a trade-off between aesthetic quality and semantic alignment. By steering the reward vector at inference, users can choose where to land on the aesthetics ↔ alignment frontier—a key practical advantage of training on reward vectors rather than aligning after the fact.
- Weight 0.625 → GenEval 75: maximizes compositional understanding.
- 1 sample = 128 samples: a single weighted sample matches the best test-time-scaling results.
- Real-time exploration: use the slider to see all metrics change live.
Visualizing per-reward controllability
The images below show how MIRO responds when we isolate each reward dimension. For each column, we set the positive target to maximize all rewards and the negative target to zero out one specific reward while keeping others high. This cancels the shared direction and isolates that reward's unique visual characteristics. Notice how each reward emphasizes different aspects: aesthetics affects overall visual appeal, CLIP strengthens text-image correspondence, HPSv2 influences composition, and so on.
Per-reward column captions: overall visual appeal and artistic quality; text-image semantic correspondence; composition and user preference; human preference alignment.
Pairwise reward exploration
We use multi-reward classifier-free guidance to isolate the trade-off between two specific rewards while keeping all others anchored at their maximum values. Given two selected rewards A and B, we set the positive target to maximum for all rewards: \(\hat{\mathbf{s}}^{+} = [B-1, \ldots, B-1]\), with \(B-1\) denoting the maximum bin index. For the negative target, we also set all rewards to maximum except for the two rewards being interpolated:

\[
\hat{s}^{-}_j =
\begin{cases}
(1-t)\,(B-1) & \text{if } j = A,\\
t\,(B-1) & \text{if } j = B,\\
B-1 & \text{otherwise,}
\end{cases}
\]
where \(t \in [0,1]\) is the interpolation parameter. When \(t=0\), the guidance direction \(v_\theta(x_t, c, \hat{\mathbf{s}}^{+}) - v_\theta(x_t, c, \hat{\mathbf{s}}^{-})\) emphasizes reward B while suppressing reward A; when \(t=1\), the opposite occurs. This formulation cancels out the shared quality direction (since all other rewards remain constant in both \(\hat{\mathbf{s}}^{+}\) and \(\hat{\mathbf{s}}^{-}\)) and isolates the specific visual characteristics associated with the trade-off between rewards A and B.
Use the dropdowns below to select two rewards, then adjust the slider to vary \(t\) from 0 to 1. The interface displays one of 32 generated images spanning this spectrum, allowing you to explore how the model navigates different quality trade-offs.
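In code, the negative target above can be assembled as in this sketch, which follows the same bin convention as the earlier sampling snippet; the helper name and the bin count in the example call are ours, while the seven rewards match the reward suite described earlier.

```python
import torch

def pairwise_negative_target(n_rewards: int, n_bins: int, idx_a: int, idx_b: int,
                             t: float) -> torch.Tensor:
    """Negative target for the A/B trade-off: every reward at its maximum bin
    except the two being interpolated."""
    s_neg = torch.full((n_rewards,), n_bins - 1)
    s_neg[idx_a] = round((1.0 - t) * (n_bins - 1))   # t = 0: reward A at max (cancelled)
    s_neg[idx_b] = round(t * (n_bins - 1))           # t = 0: reward B at zero (emphasized)
    return s_neg

# t = 0 emphasizes reward B; t = 1 emphasizes reward A.
s_neg = pairwise_negative_target(n_rewards=7, n_bins=8, idx_a=0, idx_b=3, t=0.25)
```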
Comparison to State-of-the-Art Models
We evaluate MIRO against state-of-the-art text-to-image models including FLUX-dev, Stable Diffusion variants, PixArt, Sana, and SDXL. The animated charts below visualize performance on GenEval (compositional understanding) and ImageReward (user preference), with models sorted from lowest to highest scores. MIRO models are highlighted in orange.
GenEval: Exceptional training efficiency
MIRO achieves a GenEval score of 68 with synthetic captions, already outperforming FLUX-dev (12B parameters) which scores 67. With optimized inference-time reward weighting, MIRO reaches 75—setting a new state-of-the-art while requiring dramatically less computation: 4.16 TFLOPs vs 1540 TFLOPs for FLUX-dev, a remarkable 370× efficiency improvement. This demonstrates that MIRO's multi-reward conditioning enables compact models to surpass much larger architectures.
Setting new benchmarks for compositional reasoning
- Position: 46 (+35% over previous SOTA, SD3-Medium at 34)
- Color Attribution: 52 (+11% over previous SOTA, FLUX-dev at 47)
Beyond overall performance, MIRO excels on challenging compositional metrics that have historically been difficult for text-to-image models. On the Position metric, MIRO achieves a score of 46, improving upon the previous state-of-the-art of 34 (SD3-Medium) by 35%. For Color Attribution, MIRO advances from FLUX-dev's previous best of 47 to 52 (+11%). These improvements highlight MIRO's superior understanding of complex spatial relationships and object attributes.
The chart below compares MIRO's GenEval performance against all SOTA baselines, with models sorted from lowest to highest scores. MIRO models are highlighted in orange.
User preference: Scalable efficiency with test-time optimization
On PartiPrompts, MIRO consistently outperforms larger models across multiple reward metrics. When optimizing for ImageReward with 128-sample inference scaling, MIRO achieves a state-of-the-art score of 1.61 compared to FLUX-dev's 1.19 and Sana-1.6B's 1.23.
Remarkably, even with this 128-sample inference scaling strategy, MIRO maintains a 3× efficiency advantage over FLUX-dev (532 TFLOPs vs 1540 TFLOPs) while achieving superior performance across all metrics. For Aesthetic Score optimization, MIRO reaches 6.81 compared to FLUX-dev's 6.56.
The chart below shows ImageReward scores across all models, demonstrating MIRO's superior user preference alignment. Models are sorted from lowest to highest, with MIRO variants in orange.
Cross-metric generalization through multi-reward conditioning
A key advantage of MIRO is its ability to generalize across metrics without explicit optimization. For instance, when optimizing for HPSv2, MIRO achieves an ImageReward score of 1.35, outperforming models specifically trained for that metric. This cross-metric robustness demonstrates that multi-reward conditioning naturally learns generalizable quality representations rather than exploiting individual metric idiosyncrasies.
Computational efficiency comparison
The charts below visualize the dramatic difference in model size and computational requirements using logarithmic scales to better show the magnitude of MIRO's efficiency gains. MIRO's compact 350M-parameter architecture requires 33× fewer parameters than FLUX-dev and 370× less compute for inference, while achieving superior performance on both metrics above.
Note: in the efficiency charts above, MIRO models are shown in green and baselines in slate-gray; lower values are better.
Conclusion
We presented MIRO (MultI-Reward cOnditioning), a simple yet powerful pretraining framework that integrates alignment directly into training rather than treating it as a post-hoc stage. By conditioning on a vector of reward scores, MIRO learns \(p(x\mid c, \mathbf{s})\) and exposes reward targets as controllable inputs, disentangling content from quality and offering precise, interpretable control at inference time.
What We Achieved
- MIRO converges up to 19× faster than regular pretraining on aesthetic metrics, accelerating development cycles dramatically.
- Achieves comparable quality with 370× less inference compute than FLUX-dev, making high-quality generation accessible.
- Outperforms FLUX-dev (12B params) on compositional understanding with just 350M parameters—a new efficiency benchmark.
- Simultaneously optimizes aesthetics, user preference, semantic alignment, and more—preventing reward hacking.
- Fine-grained control over quality trade-offs at inference time without retraining or model collapse.
- Preserves the full spectrum of training data instead of discarding "low-quality" samples, maximizing learning.
Key Takeaways
- MIRO eliminates the need for separate fine-tuning or RL stages, simplifying the training pipeline while achieving superior results.
- By optimizing multiple rewards simultaneously, MIRO prevents reward hacking and mode collapse that plague single-objective approaches.
- Despite being much smaller (350M vs 12B params), MIRO surpasses FLUX-dev on GenEval and PartiPrompts at a fraction of the computational cost.
- Users can dial individual rewards up or down at generation time, achieving precise trade-offs without expensive test-time search or retraining.
Looking Forward
We believe MIRO opens a new direction for leveraging reward models in generative AI. Rather than treating alignment as a correction mechanism applied after the fact, integrating rewards from the beginning enables models that are faster to train, more controllable, and more efficient to deploy.
Beyond Text-to-Image Generation
This paradigm shift—from post-hoc alignment to reward-conditioned pretraining—could extend to other domains where multiple quality dimensions matter: large language models (balancing helpfulness, harmlessness, and accuracy), video generation (temporal consistency, motion quality, aesthetics), 3D synthesis (geometric accuracy, visual realism, physical plausibility), and audio generation (fidelity, naturalness, clarity).
Personalized Reward Spaces
An exciting future direction is discovering a basis of fundamental reward dimensions that could represent any user preference. Just as colors can be composed from RGB primaries, could we find a minimal set of reward "basis vectors" from which any personalized reward emerges as a linear combination? MIRO's multi-reward conditioning framework could then enable users to define custom quality trade-offs on the fly—dialing in their unique preferences without requiring new models or expensive retraining. This would transform alignment from a one-size-fits-all solution into a personalized, interpretable control surface over generation quality.
Acknowledgements
This work was supported by ANR project TOSAI ANR-20-IADJ-0009, and was granted access to the HPC resources of IDRIS under the allocation 2024-A0171014246 made by GENCI.
We would like to thank Alyosha Efros, Tero Karras, and Luca Eyring for their helpful comments, and Yuanzhi Zhu and Xi Wang for proofreading.
Citation
If you find MIRO useful for your research, please consider citing our paper:
@article{dufour2025miro,
  title   = {MIRO: Multi-Reward Conditioned Pretraining for Text-to-Image Generation},
  author  = {Dufour, Nicolas and Degeorge, Lucas and Ghosh, Arijit and Kalogeiton, Vicky and Picard, David},
  journal = {arXiv preprint arXiv:XXXX.XXXXX},
  year    = {2025}
}
References
- Christiano, P. F., Leike, J., Brown, T., Martic, M., Legg, S., & Amodei, D. (2017). Deep reinforcement learning from human preferences. Advances in Neural Information Processing Systems, 30.
- Rafailov, R., Sharma, A., Mitchell, E., Ermon, S., Manning, C. D., & Finn, C. (2023). Direct Preference Optimization: Your Language Model is Secretly a Reward Model. NeurIPS 2023.
- Fan, Y., Watkins, O., Du, Y., Liu, H., Ryu, M., Boutilier, C., Abbeel, P., Ghavamzadeh, M., Lee, K., & Lee, K. (2023). DPOK: Reinforcement Learning for Fine-tuning Text-to-Image Diffusion Models. NeurIPS 2023.
- Esser, P., Kulal, S., Blattmann, A., Entezari, R., Müller, J., Saini, H., ... & Rombach, R. (2024). Scaling Rectified Flow Transformers for High-Resolution Image Synthesis. ICML 2024.
- Black Forest Labs. (2024). FLUX. https://github.com/black-forest-labs/flux
- Dufour, N., Besnier, V., Kalogeiton, V., & Picard, D. (2024). Don't drop your samples! Coherence-aware training benefits Conditional diffusion. CVPR 2024.
- Schuhmann, C., Beaumont, R., Vencu, R., Gordon, C., Wightman, R., Cherti, M., ... & Jitsev, J. (2022). LAION-5B: An Open Large-scale Dataset for Training Next Generation Image-Text Models. NeurIPS 2022.
- Wu, X., Hao, Y., Sun, K., Chen, Y., Zhu, F., Zhao, R., & Li, H. (2023). Human Preference Score v2: A Solid Benchmark for Evaluating Human Preferences of Text-to-Image Synthesis. arXiv preprint.
- Kirstain, Y., Polyak, A., Singer, U., Matiana, S., Penna, J., & Levy, O. (2023). Pick-a-Pic: An Open Dataset of User Preferences for Text-to-Image Generation. NeurIPS 2023.
- Xu, J., Liu, X., Wu, Y., Tong, Y., Li, Q., Ding, M., Tang, J., & Dong, Y. (2023). ImageReward: Learning and Leveraging Human Preferences for Text-to-Image Generation. NeurIPS 2023.
- Ghosh, D., Hajishirzi, H., & Schmidt, L. (2023). GenEval: An object-focused framework for evaluating text-to-image alignment. NeurIPS 2023.
- Lipman, Y., Chen, R. T. Q., Ben-Hamu, H., Nickel, M., & Le, M. (2023). Flow Matching for Generative Modeling. ICLR 2023.
- Karras, T., Aittala, M., Kynkäänniemi, T., Lehtinen, J., Aila, T., & Laine, S. (2024). Guiding a diffusion model with a bad version of itself. NeurIPS 2024.
- Luo, Y., Hu, T., Luo, W., Kawaguchi, K., & Tang, J. (2025). Reward-Instruct: A Reward-Centric Approach to Fast Photo-Realistic Image Generation. arXiv preprint.
- Ma, N., Tong, S., Jia, H., Hu, H., Su, Y., Zhang, M., Yang, X., Li, Y., Jaakkola, T., Jia, X., & Xie, S. (2025). Inference-time scaling for diffusion models beyond scaling denoising steps. CVPR 2025.