MIRO: MultI-Reward cOnditioned pretraining improves T2I quality and efficiency

Nicolas Dufour1,2, Lucas Degeorge*,1,2,3, Arijit Ghosh*,1, Vicky Kalogeiton†,2, David Picard†,1

1LIGM, ENPC, IP Paris, CNRS, UGE 2LIX, École Polytechnique, IP Paris 3AMIAD

TL;DR

Train once, align many rewards. Faster convergence, controllable trade-offs, and strong performance.

19×
Training speedup for Aesthetic
6.3×
Training speedup for HPSv2
3.5×
Training speedup for PickScore
3.7×
Training speedup for ImageReward
75
Overall GenEval Score
370×
Cheaper inference cost than FLUX
34×
Fewer parameters than FLUX
MIRO efficiently generates images with high aesthetic quality and strong text-image alignment across diverse prompts.

Introduction

How do you align generative AI with human preferences? This question has driven remarkable progress in both large language models and text-to-image generation. The formula is well-established: train on massive web-scale data, then align the model through curated datasets and reinforcement learning from human feedback (RLHF)[1], [2], [3]. Systems like FLUX[5] and Stable Diffusion 3[4] follow exactly this recipe.

The Problem

But this paradigm has a cost. Post-hoc alignment discards informative "low-quality" data[6], complicates training with an additional optimization stage, and often overfits to a single reward — leading to mode collapse, reduced diversity, or degraded semantic fidelity.

Our Question: Rather than correcting a pre-trained text-to-image model after the fact, can we teach it how to trade off multiple rewards from the beginning?

Our Solution: MIRO

Our answer is MIRO (MultI-Reward cOnditioning)—a framework that integrates multiple reward signals directly into the pretraining objective for text-to-image generation. Similar to previous work[6], we condition the generative model on a vector of reward scores per text-image pair. The rewards span aesthetics, user preference, semantic correspondence, visual reasoning, and domain-specific correctness. This way, the model learns an explicit mapping from desired reward levels to visual characteristics—right from the beginning.

This simple change has powerful consequences: it preserves the full spectrum of data quality instead of filtering it out, turns alignment into a controllable variable at inference time, and by providing rich supervision at scale, accelerates convergence and improves sample efficiency.

Key Benefits

Training Efficiency

MIRO eliminates separate fine-tuning or RL stages by integrating reward alignment directly into pretraining. Rich supervision at scale enables convergence up to 19× faster than regular pretraining.

📊
Full-Spectrum Data Utilization

Unlike pipelines that discard "low-quality" data, MIRO trains across the entire reward spectrum. This reduces mode collapse and yields representations that generalize across all quality levels.

🎛️
Controllable Alignment

Users can dial individual rewards up or down at inference time to achieve precise trade-offs—boost aesthetics without collapsing alignment, or prioritize compositional correctness for complex prompts.

🛡️
Reward Hacking Prevention

Single-objective optimization often leads to reward hacking[14] where models exploit specific metrics. MIRO's multi-dimensional conditioning naturally prevents this by balancing multiple objectives simultaneously.

Our Contributions

  1. We propose MIRO: reward-conditioned pretraining that integrates multiple rewards directly during training, eliminating the need for post-hoc alignment.
  2. State-of-the-art performance: Our small 350M-parameter model trained on just 16M images achieves top scores on GenEval[11] and user-preference metrics[10], [8], [9], outperforming much larger models like FLUX-dev[5] (12B parameters) trained for much longer.
  3. Unprecedented efficiency: MIRO converges up to 19× faster than regular training and achieves comparable quality with 370× less inference compute than FLUX[5].

Method

Our method consists of three key components that work together to enable efficient, controllable text-to-image generation:

1

Dataset Augmentation

Enrich the pretraining dataset with reward annotations across multiple quality dimensions

2

Multi-Reward Conditioned Training

Modify the flow matching objective to incorporate reward signals directly into the generative process

3

Reward-Guided Inference

Enable fine-grained control over generation quality through explicit reward conditioning during sampling

Problem Formulation

Let \(\mathcal{D} = \{(x^{(i)}, c^{(i)})\}_{i=1}^{M}\) be a large-scale pretraining dataset where \(x^{(i)} \in \mathbb{R}^{H \times W \times 3}\) represents an image and \(c^{(i)} \in \mathcal{T}\) represents the corresponding text condition (e.g., caption, prompt). Traditional pretraining learns a generative model \(p_\theta(x|c)\) that captures the conditional distribution of images given text, without any explicit notion of quality.

In contrast, we consider a set of \(N\) reward models \(\mathcal{R} = \{r_1, r_2, \ldots, r_N\}\) where each \(r_j: \mathbb{R}^{H \times W \times 3} \times \mathcal{T} \rightarrow \mathbb{R}\) evaluates different aspects of image quality. Our goal is to learn a conditional generative model \(p_\theta(x|c, \mathbf{s})\) where \(\mathbf{s} = [s_1, s_2, \ldots, s_N]\) represents the desired reward levels, enabling controllable generation across multiple quality dimensions.

Dataset Augmentation with Reward Scores

The first step of MIRO involves augmenting the pretraining dataset with comprehensive reward annotations. For each sample \((x^{(i)}, c^{(i)}) \in \mathcal{D}\), we compute reward scores across all \(N\) reward models:

\[ s_j^{(i)} = r_j(x^{(i)}, c^{(i)}) \quad \forall j \in \{1, 2, \ldots, N\} \]

This process transforms our dataset into an enriched version \(\tilde{\mathcal{D}} = \{(x^{(i)}, c^{(i)}, \mathbf{s}^{(i)})\}_{i=1}^{M}\) where \(\mathbf{s}^{(i)} = [s_1^{(i)}, s_2^{(i)}, \ldots, s_N^{(i)}]\) contains the multi-dimensional quality assessment for each sample.

Score Normalization and Binning: Raw reward scores often exhibit different scales and distributions across reward models. We employ a uniform binning strategy into \(B\) bins that ensures balanced representation across quality levels, allowing the model to see the full spectrum of qualities during training.
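To make the annotation and binning step concrete, here is a minimal sketch. It assumes reward models are plain callables `r(image, caption) -> float` and interprets the balanced binning as quantile (equal-mass) binning; the bin count `B = 8` and all function names are illustrative, not the paper's implementation.

```python
# Hypothetical sketch: annotate each (image, caption) pair with N reward scores,
# then quantile-bin each reward into B discrete levels so every bin is roughly
# equally populated (one way to realize balanced representation across quality levels).
import numpy as np

B = 8  # number of bins per reward (assumption; the paper's exact B may differ)

def quantile_bin(scores: np.ndarray, num_bins: int = B) -> np.ndarray:
    """Map raw scores to integer bins 0..num_bins-1 with approximately equal counts per bin."""
    edges = np.quantile(scores, np.linspace(0.0, 1.0, num_bins + 1)[1:-1])
    return np.digitize(scores, edges)

def annotate_dataset(samples, reward_models):
    """samples: list of (image, caption); reward_models: list of callables r(image, caption) -> float."""
    raw = np.array([[r(img, cap) for r in reward_models] for img, cap in samples])          # (M, N) raw scores
    binned = np.stack([quantile_bin(raw[:, j]) for j in range(raw.shape[1])], axis=1)       # (M, N) integer bins
    return [(img, cap, s) for (img, cap), s in zip(samples, binned)]

# Toy usage with stand-in reward models (real ones would be AestheticScore, PickScore, HPSv2, ...).
if __name__ == "__main__":
    fake_samples = [(np.zeros((64, 64, 3)), f"caption {i}") for i in range(100)]
    fake_rewards = [lambda img, cap, k=k: float(np.random.rand() + k) for k in range(3)]
    enriched = annotate_dataset(fake_samples, fake_rewards)
    print(enriched[0][2])  # e.g. array([4, 1, 7]): one bin index per reward model
```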

MIRO Training Pipeline: Images and captions are evaluated by multiple reward models, producing a score vector ŝ. The noised image, caption, and scores condition the denoiser, teaching it to map reward levels to visual characteristics.

Multi-Reward Conditioned Flow Matching

Having augmented our dataset with reward scores, we now incorporate these signals into the generative model architecture. We build upon flow matching[12], a powerful framework for training continuous normalizing flows that has shown excellent performance in high-resolution image generation.

Training Objective. Following the standard flow matching formulation, we sample noise \(\epsilon \sim \mathcal{N}(0, I)\) and time \(t \sim \mathcal{U}(0, 1)\), then compute the noisy sample \(x_t = (1-t)x + t\epsilon\). The multi-reward flow matching loss becomes:

\[ \mathcal{L} = \mathbb{E}_{(x,c,\hat{\mathbf{s}}) \sim \tilde{\mathcal{D}}, \epsilon \sim \mathcal{N}(0,I), t \sim \mathcal{U}(0,1)}\left[\left\|v_\theta(x_t, c, \hat{\mathbf{s}}) - (\epsilon-x)\right\|_2^2\right] \]

This objective trains the model to predict the difference between the noise and the clean image, conditioned on both the text prompt and the desired quality levels. The model learns to associate different reward levels with corresponding visual characteristics, enabling reward-aware generation.
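As a reference point, here is a minimal PyTorch sketch of this loss, assuming a hypothetical velocity network `model(x_t, t, cap_emb, s_bins)` that embeds the integer reward bins; tensor shapes and the exact conditioning interface are illustrative.

```python
# Minimal sketch of the multi-reward flow-matching loss above (shapes are illustrative).
import torch

def miro_fm_loss(model, x, cap_emb, s_bins):
    """x: (batch, C, H, W) clean images/latents; cap_emb: caption embeddings; s_bins: (batch, N) integer reward bins."""
    bsz = x.shape[0]
    t = torch.rand(bsz, device=x.device)                  # t ~ U(0, 1), one value per sample
    t_ = t.view(-1, 1, 1, 1)                              # broadcast over channel/spatial dims
    eps = torch.randn_like(x)                             # eps ~ N(0, I)
    x_t = (1.0 - t_) * x + t_ * eps                       # noised sample on the linear path
    target = eps - x                                      # flow-matching velocity target
    pred = model(x_t, t, cap_emb, s_bins)                 # reward-conditioned velocity prediction
    return torch.mean((pred - target) ** 2)               # MSE over the batch
```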

Training Dynamics. During training, the model observes the full spectrum of quality levels for each reward dimension. This exposure allows it to learn the relationship between reward values and visual features, from low-quality samples that may exhibit artifacts or poor composition to high-quality samples with superior aesthetics and text alignment. The model learns the entire reward landscape rather than overfitting to a single objective.

Reward-Guided Inference

At inference time, MIRO provides unprecedented control over the generation process through explicit reward conditioning. We offer three complementary sampling strategies:

🎯
High-Quality Mode

Maximize all rewards simultaneously for best overall quality

🚀
Classifier-Free Guidance

Amplify quality through positive/negative reward contrasts

🎛️
Custom Trade-offs

Fine-tune individual reward dimensions on demand

1. High-Quality Generation

For generating high-quality samples, we condition the model on maximum reward values across all \(N\) dimensions: \(\hat{\mathbf{s}}_{\mathrm{max}} = [B-1, B-1, \ldots, B-1]\). This instructs the model to generate samples that maximize all reward objectives simultaneously.

2. Multi-Reward Classifier-Free Guidance

We extend classifier-free guidance to the multi-reward setting by leveraging the reward conditioning mechanism. Following the Coherence-Aware CFG approach[6], we introduce a positive and a negative reward target, denoted \(\hat{\mathbf{s}}^{+}\) and \(\hat{\mathbf{s}}^{-}\), which can be chosen by the user for controllability.

Default Configuration: We use \(\hat{\mathbf{s}}^{+}=\hat{\mathbf{s}}_{\mathrm{max}}=[B-1,\ldots,B-1]\) (all rewards high) and \(\hat{\mathbf{s}}^{-}=\hat{\mathbf{s}}_{\mathrm{min}}=[0,\ldots,0]\) (all rewards low), with \(\omega\) as the guidance scale.
\[ \hat{v}_\theta(x_t, c) = v_\theta(x_t, c, \hat{\mathbf{s}}^{+}) + \omega \left(v_\theta(x_t, c, \hat{\mathbf{s}}^{+}) - v_\theta(x_t, c, \hat{\mathbf{s}}^{-})\right) \]
Glossary
  • \(x_t=(1-t)x + t\epsilon\): noised sample
  • \(\hat{\mathbf{s}}\): reward targets (Aesthetic, Pick, CLIP, HPSv2, ImageReward, ...)
  • \(\omega\): guidance scale
  • \(\hat{\mathbf{s}}^{+}\)/\(\hat{\mathbf{s}}^{-}\): positive/negative reward targets
  • \(B\): number of bins for reward normalization
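Below is a minimal sketch of this default configuration, assuming the same hypothetical velocity network as in the training sketch and a plain Euler integrator; the guidance scale, step count, and bin/reward counts are placeholder values, not the authors' settings.

```python
# Sketch of sampling with multi-reward classifier-free guidance (equation above).
import torch

@torch.no_grad()
def sample_miro(model, cap_emb, s_pos, s_neg, shape, omega=3.0, steps=50, device="cpu"):
    """Euler sampler with multi-reward CFG; s_pos / s_neg are (batch, N) integer reward bins."""
    x = torch.randn(shape, device=device)                  # start from pure noise (t = 1)
    ts = torch.linspace(1.0, 0.0, steps + 1, device=device)
    for t_cur, t_next in zip(ts[:-1], ts[1:]):
        t = t_cur.expand(shape[0])                         # broadcast time to the batch
        v_pos = model(x, t, cap_emb, s_pos)                # velocity conditioned on s^+
        v_neg = model(x, t, cap_emb, s_neg)                # velocity conditioned on s^-
        v = v_pos + omega * (v_pos - v_neg)                # guided velocity (equation above)
        x = x + (t_next - t_cur) * v                       # Euler step toward t = 0 (data)
    return x

# Default configuration described above: all rewards high vs. all rewards low.
B, N = 8, 7                                                # illustrative bin / reward counts
s_max = torch.full((1, N), B - 1)
s_min = torch.zeros((1, N), dtype=torch.long)
```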

MIRO Inference Pipeline: A prompt, along with positive (s+) and negative (s-) reward targets, are input to the MIRO model. Starting from noise, the model iteratively denoises the image, guided by the reward targets, to produce the final high-quality output.

Theoretical Interpretation: This guidance formulation approximates the gradient of an implicit joint reward function. The guidance direction \(v_\theta(x_t, c, \hat{\mathbf{s}}_{\mathrm{max}}) - v_\theta(x_t, c, \hat{\mathbf{s}}_{\mathrm{min}})\) points toward regions where all rewards are simultaneously high, steering generation away from low-quality outputs.

By amplifying this direction with the guidance scale \(\omega\), we push generated samples toward parts of the distribution characterized by superior aesthetic quality, text alignment, and other desired attributes. Similar to the weak guidance framework[13], where a bad version of the model guides the good version, here guidance is provided by the contrast between high-reward and low-reward conditioning.

3. Flexible Reward Trade-offs

A key advantage of MIRO is the ability to specify custom reward targets at inference time. Users can set \(\hat{\mathbf{s}}_{\mathrm{custom}} = [\hat{s}_1, \hat{s}_2, \ldots, \hat{s}_N]\) where each component represents the desired level for that reward dimension.
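As a toy illustration (reusing `sample_miro` from the sketch above), a custom target might keep every reward at its maximum bin except aesthetics; the reward ordering (aesthetics at index 0) and the latent shape below are made up for the example.

```python
# Hypothetical custom trade-off target: high alignment rewards, moderate aesthetics.
import torch

B, N = 8, 7                                    # illustrative bin / reward counts
s_custom = torch.full((1, N), B - 1)           # start from "everything high"
s_custom[0, 0] = 4                             # dial aesthetics (assumed index 0) down to a mid bin
s_neg = torch.zeros((1, N), dtype=torch.long)  # negative target: all rewards low
# image = sample_miro(model, cap_emb, s_custom, s_neg, shape=(1, 4, 32, 32))
```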

Interactive Reward Control: As we adjust the reward weights in real-time, the generated output changes to reflect the new quality targets. The histogram on the left shows the current reward configuration, while the MIRO model processes these weights to generate the corresponding output on the right. Notice how different reward combinations lead to distinct visual characteristics.

Example use cases:

  • Dial up aesthetics without collapsing text alignment
  • Prioritize compositional correctness for complex prompts
  • Balance multiple objectives according to application needs
  • Explore the quality-alignment frontier interactively

Training efficiency

MIRO accelerates training convergence significantly: compared to the baseline, it converges 19× faster on AestheticScore, 6.2× faster on HPSv2, 3.5× faster on PickScore, and 3.3× faster on ImageReward.

MIRO demonstrates significant training efficiency gains compared to traditional pretraining. The charts below show convergence curves across different reward metrics. This dramatic acceleration stems from the additional supervisory signal provided by reward conditioning—by teaching the model to associate reward levels with visual characteristics from the start, we provide rich supervision at every training step.

Qualitative evidence of accelerated convergence

Visual quality emerges 4× faster: MIRO establishes proper compositional layout and generates visually appealing images within 50k training steps—a level of quality that requires 200k steps for the baseline to achieve.

The progression visualizations below provide compelling qualitative evidence of MIRO's training acceleration. Use the slider to step through training checkpoints and observe how quickly MIRO learns compared to the baseline:

These qualitative improvements directly complement our quantitative findings above, demonstrating that MIRO's multi-reward conditioning doesn't just improve metrics—it enables fundamentally faster learning of complex compositional concepts and visual aesthetics.

Training progression comparison (interleaved Baseline / MIRO panels): each pair shows image quality at matched training checkpoints, with the baseline improving slowly over training steps while MIRO reaches comparable composition and visual quality far earlier.

Synergizing with test-time scaling

MIRO demonstrates superior sample efficiency: MIRO consistently outperforms the baseline across all reward metrics, often by substantial margins. Most remarkably, for Aesthetic Score and HPSv2, MIRO achieves with a single sample what the baseline cannot reach even with 128 samples.

Test-time scaling—generating multiple samples and selecting the best one—has emerged as a popular method to improve reward performance[15]. We demonstrate that MIRO achieves superior sample efficiency compared to baseline models when combined with this technique. The charts below show performance across varying sample counts (1 to 128 samples, displayed on a log-2 scale).

Quantifying inference-time efficiency improvements: The efficiency gains are particularly striking for specific metrics. For ImageReward, MIRO with 8 samples matches the performance of the baseline with 128 samples, representing a 16× efficiency improvement. For PickScore, MIRO achieves equivalent performance with only 4 samples compared to the baseline's 128 samples, demonstrating a remarkable 32× efficiency gain.
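For reference, best-of-N test-time scaling amounts to the following sketch, where `generate` and `reward` stand in for any sampler and any scoring model (e.g. ImageReward or PickScore); both callables are assumptions, not part of MIRO itself.

```python
# Minimal best-of-N selection: draw n candidates, score them, keep the best one.
def best_of_n(generate, reward, prompt, n=8):
    candidates = [generate(prompt) for _ in range(n)]
    scores = [reward(img, prompt) for img in candidates]
    best = max(range(n), key=lambda i: scores[i])
    return candidates[best], scores[best]
```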

Balanced performance across all metrics

Multi-reward conditioning mitigates reward hacking: While single-reward models achieve high scores on their target metric, they severely degrade performance on others. MIRO's comprehensive optimization avoids this overfitting and delivers strong, balanced gains across all dimensions.

MIRO outperforms single-reward approaches

We evaluate three training configurations on the CC12M+LA6 dataset: (1) a baseline model trained without reward conditioning, (2) single-reward models conditioned on individual rewards (similar to Coherence Aware Diffusion[6] but using our reward suite instead of CLIP score), and (3) MIRO conditioned on all seven rewards simultaneously. The radar plots below show results across AestheticScore, PickScore, ImageReward, HPSv2, and JINA CLIP score, plus OpenAI CLIP score as an out-of-distribution metric not used during training.

The reward hacking problem. Single-reward optimization leads to severe trade-offs. This is particularly evident with AestheticScore—while the single-reward model achieves high aesthetic scores, it severely degrades performance on other metrics. Models trained on ImageReward and HPSv2 show more balanced trade-offs but still underperform MIRO's comprehensive optimization.

Select a comparison model above to visualize how MIRO performs against different baselines. Notice how single-reward models excel on their target metric but often underperform on others, while MIRO maintains strong performance across all dimensions.

Enhancing compositional understanding

GenEval improvements: MIRO achieves an overall GenEval score of 57, representing a 9.6% improvement over the baseline score of 52. The gains are particularly pronounced in challenging compositional reasoning tasks.

Beyond optimizing for specific reward metrics, MIRO demonstrates significant improvements in text-image alignment as measured by GenEval. This enhancement is particularly pronounced in challenging compositional reasoning tasks:

🎨
Color Attribution

29 → 38

+31% improvement

🔢
Two Objects

55 → 68

+24% improvement

🧮
Counting

49 → 55

+12% improvement

These results demonstrate that MIRO's multi-reward conditioning enables better understanding of complex spatial relationships, object interactions, and numerical concepts—achieving balanced optimization that excels across diverse evaluation criteria while maintaining strong performance on individual metrics.

MIRO and synthetic captions

MIRO unlocks synthetic caption potential: Combining MIRO with synthetic captions yields the strongest overall performance, achieving a remarkable GenEval score of 68 (+19% over synthetic baseline) while maintaining aesthetic quality.

Synthetic captioning has emerged as the go-to method for improving text-image alignment in generative models, offering the advantage of retaining all training data without filtering based on caption quality. We evaluate MIRO using a mixture of 50% synthetic and 50% real captions.

MIRO outperforms synthetic captioning alone

Our results demonstrate that MIRO without synthetic captions achieves comparable GenEval performance to baseline models trained with synthetic captions. More importantly, MIRO without synthetic captions significantly outperforms the synthetic caption baseline across reward metrics.

More effective and computationally efficient. MIRO provides a more effective approach to improving text-image alignment than synthetic captioning alone, while being computationally more efficient. Reward model scoring requires substantially less compute than recaptioning with large vision-language models.

Combining MIRO with synthetic captions

Combining MIRO with synthetic captions yields the strongest overall performance. While maintaining equivalent aesthetic quality to MIRO without synthetic captions, this combined approach achieves a remarkable GenEval score of 68, substantially improving over the synthetic caption baseline of 57 (+19%). The improvements are consistent across all compositional reasoning metrics:

📍
Position

30 → 46

+53% improvement

🎨
Color Attribution

43 → 52

+21% improvement

🔢
Two Objects

58 → 73

+26% improvement

🧮
Counting

44 → 61

+39% improvement

🎯
Single Object

93 → 97

+4% improvement

Overall GenEval

57 → 68

+19% improvement

These comprehensive gains across all compositional aspects demonstrate that MIRO effectively benefits from synthetic captions for text-image alignment, achieving superior compositional understanding while preserving aesthetic quality.

Detailed comparison: The charts below show GenEval categories (left) and aesthetic metrics (right) for Baseline/MIRO with real captions and Synth Baseline/Synth MIRO with 50/50 synthetic captions.

Flexible reward trade-offs at inference

User-controlled rewards at inference: MIRO allows choosing reward weights at test time, enabling principled trade-offs across capabilities. This gives users control over generation characteristics and reduces reward hacking.

Reward weighting exposes controllable trade-offs

Our test-time scaling results show that selecting samples by Aesthetic Score can reduce GenEval performance, indicating a trade-off between aesthetic quality and semantic alignment. By steering the reward vector at inference, users can choose where to land on the aesthetics ↔ alignment frontier—a key practical advantage of training on reward vectors rather than aligning after the fact.

Sweeping the aesthetic weight identifies an optimal balance. We vary the aesthetic reward weight at inference and observe the highest GenEval score of 75 at a weight of 0.625, at the cost of lowering the Aesthetic Score to 5.24. Using this optimized weighting achieves the same performance as 128-sample test-time scaling with a single sample. The charts below visualize this trade-off: the left plot shows the GenEval vs. Aesthetic Score curve, while the synchronized grid on the right shows how other reward metrics respond to the aesthetic weight adjustment.
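One plausible way to implement such a sweep, assuming the aesthetic weight \(w\) simply scales the aesthetic entry of the positive reward target while all other rewards stay at their maximum bin (the paper may realize the weighting differently), is sketched below.

```python
# Hypothetical sweep over the aesthetic reward weight w (reuses sample_miro from the CFG sketch).
import torch

def swept_targets(w, aesthetic_idx=0, B=8, N=7):
    """Positive/negative reward targets for a given aesthetic weight w in [0, 1] (assumed semantics)."""
    s_plus = torch.full((1, N), B - 1)              # keep all rewards at their maximum bin ...
    s_plus[0, aesthetic_idx] = round(w * (B - 1))   # ... except aesthetics, scaled by w (w=0.625 -> bin 4 of 7)
    s_minus = torch.zeros((1, N), dtype=torch.long) # negative target: all rewards low
    return s_plus, s_minus

# e.g. sweep w over {0, 0.125, ..., 1}, sample with sample_miro(model, cap_emb, *swept_targets(w), shape),
# then evaluate GenEval and AestheticScore at each w to trace the trade-off curve.
```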
⚖️
Optimal weighting

Weight 0.625 → GenEval 75

Maximizes compositional understanding.

🚀
Matches test-time scaling

1 sample = 128 samples

Single weighted sample matches best TTS results.

🎛️
Interactive control

Real-time exploration

Use the slider to see all metrics change live.

Visualizing per-reward controllability

Multi-reward classifier-free guidance: We visualize MIRO's controllability using multi-reward classifier-free guidance. By isolating individual rewards while keeping others anchored high, we can demonstrate true per-dimension control at sampling time.

The images below show how MIRO responds when we isolate each reward dimension. For each column, we set the positive target to maximize all rewards and the negative target to zero out one specific reward while keeping others high. This cancels the shared direction and isolates that reward's unique visual characteristics. Notice how each reward emphasizes different aspects: aesthetics affects overall visual appeal, CLIP strengthens text-image correspondence, HPSv2 influences composition, and so on.

🎨 Aesthetics

Overall visual appeal and artistic quality

📝 CLIP

Text-image semantic correspondence

⭐ HPSv2

Composition and user preference

🏆 ImageReward

Human preference alignment

Pairwise reward exploration

Interactive pairwise control: Select any two reward dimensions and smoothly interpolate between them to see how the model responds to different reward combinations.

We use multi-reward classifier-free guidance to isolate the trade-off between two specific rewards while keeping all others anchored at their maximum values. Given two selected rewards A and B, we set the positive target to maximum for all rewards: \(\hat{\mathbf{s}}^{+} = [B-1, \ldots, B-1]\). For the negative target, we also set all rewards to maximum except for the two rewards being interpolated:

\[ \hat{\mathbf{s}}^{-} = [B-1, \ldots, B-1] \quad \text{except:} \]
\[ \hat{s}_A^{-} = t \cdot (B-1), \quad \hat{s}_B^{-} = (1-t) \cdot (B-1) \]

where \(t \in [0,1]\) is the interpolation parameter. When \(t=0\), the negative target zeroes out reward A while keeping reward B high, so the guidance direction \(v_\theta(x_t, c, \hat{\mathbf{s}}^{+}) - v_\theta(x_t, c, \hat{\mathbf{s}}^{-})\) emphasizes reward A; when \(t=1\), it emphasizes reward B. This formulation cancels out the shared quality direction (since all other rewards remain constant in both \(\hat{\mathbf{s}}^{+}\) and \(\hat{\mathbf{s}}^{-}\)) and isolates the specific visual characteristics associated with the trade-off between rewards A and B.
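In code, constructing these targets is one line per reward; the sketch below assumes the same integer-bin conditioning as before, with `idx_a`, `idx_b`, `B`, and `N` as illustrative placeholders.

```python
# Sketch of the pairwise-isolation targets defined above (bins rounded to the nearest integer level).
import torch

def pairwise_targets(t, idx_a, idx_b, B=8, N=7):
    """Interpolation t in [0, 1]: t=0 isolates reward A, t=1 isolates reward B."""
    s_plus = torch.full((1, N), B - 1)              # positive target: every reward high
    s_minus = torch.full((1, N), B - 1)             # negative target: every reward high ...
    s_minus[0, idx_a] = round(t * (B - 1))          # ... except reward A, set to t * (B-1)
    s_minus[0, idx_b] = round((1 - t) * (B - 1))    # ... and reward B, set to (1-t) * (B-1)
    return s_plus, s_minus
```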

Use the dropdowns below to select two rewards, then adjust the slider to vary \(t\) from 0 to 1. The interface displays one of 32 generated images spanning this spectrum, allowing you to explore how the model navigates different quality trade-offs.


Comparison to State-of-the-Art Models

MIRO sets new benchmarks for efficiency and performance: Our compact 350M-parameter model achieves a GenEval score of 75, outperforming FLUX-dev (12B parameters, 67) while requiring 370× less computation (4.16 vs 1540 TFLOPs). With test-time scaling, MIRO reaches state-of-the-art ImageReward of 1.61 while maintaining a 3× efficiency advantage over FLUX-dev.

We evaluate MIRO against state-of-the-art text-to-image models including FLUX-dev, Stable Diffusion variants, PixArt, Sana, and SDXL. The animated charts below visualize performance on GenEval (compositional understanding) and ImageReward (user preference), with models sorted from lowest to highest scores. MIRO models are highlighted in orange.

GenEval: Exceptional training efficiency

MIRO achieves a GenEval score of 68 with synthetic captions, already outperforming FLUX-dev (12B parameters) which scores 67. With optimized inference-time reward weighting, MIRO reaches 75—setting a new state-of-the-art while requiring dramatically less computation: 4.16 TFLOPs vs 1540 TFLOPs for FLUX-dev, a remarkable 370× efficiency improvement. This demonstrates that MIRO's multi-reward conditioning enables compact models to surpass much larger architectures.

Setting new benchmarks for compositional reasoning

📍
Position Understanding

Score: 46

+35% over previous SOTA (SD3-Medium: 34)

🎨
Color Attribution

Score: 52

+11% over previous SOTA (FLUX-dev: 47)

Beyond overall performance, MIRO excels on challenging compositional metrics that have historically been difficult for text-to-image models. On the Position metric, MIRO achieves a score of 46, improving upon the previous state-of-the-art of 34 (SD3-Medium) by 35%. For Color Attribution, MIRO advances from FLUX-dev's previous best of 47 to 52 (+11%). These improvements highlight MIRO's superior understanding of complex spatial relationships and object attributes.

The chart below compares MIRO's GenEval performance against all SOTA baselines, with models sorted from lowest to highest scores. MIRO models are highlighted in orange.

User preference: Scalable efficiency with test-time optimization

On PartiPrompts, MIRO consistently outperforms larger models across multiple reward metrics. When optimizing for ImageReward with 128-sample inference scaling, MIRO achieves a state-of-the-art score of 1.61 compared to FLUX-dev's 1.19 and Sana-1.6B's 1.23.

Remarkably, even with this 128-sample inference scaling strategy, MIRO maintains a 3× efficiency advantage over FLUX-dev (532 TFLOPs vs 1540 TFLOPs) while achieving superior performance across all metrics. For Aesthetic Score optimization, MIRO reaches 6.81 compared to FLUX-dev's 6.56.

The chart below shows ImageReward scores across all models, demonstrating MIRO's superior user preference alignment. Models are sorted from lowest to highest, with MIRO variants in orange.

Cross-metric generalization through multi-reward conditioning

A key advantage of MIRO is its ability to generalize across metrics without explicit optimization. For instance, when optimizing for HPSv2, MIRO achieves an ImageReward score of 1.35, outperforming models specifically trained for that metric. This cross-metric robustness demonstrates that multi-reward conditioning naturally learns generalizable quality representations rather than exploiting individual metric idiosyncrasies.

Computational efficiency comparison

The charts below visualize the dramatic difference in model size and computational requirements using logarithmic scales to better show the magnitude of MIRO's efficiency gains. MIRO's compact 350M-parameter architecture requires 33× fewer parameters than FLUX-dev and 370× less compute for inference, while achieving superior performance on both metrics above.

Note: In the efficiency charts above, MIRO models are shown in green and baselines in slate-gray; lower is better for both metrics, and both charts use logarithmic scales to convey the magnitude of the differences.

Conclusion

We presented MIRO (MultI-Reward cOnditioning), a simple yet powerful pretraining framework that integrates alignment directly into training rather than treating it as a post-hoc stage. By conditioning on a vector of reward scores, MIRO learns \(p(x\mid c, \mathbf{s})\) and exposes reward targets as controllable inputs, disentangling content from quality and offering precise, interpretable control at inference time.

What We Achieved

🚀
19×
Faster Convergence

MIRO converges up to 19× faster than regular pretraining on aesthetic metrics, accelerating development cycles dramatically.

370×
Less Compute

Achieves comparable quality with 370× less inference compute than FLUX-dev, making high-quality generation accessible.

🏆
75
GenEval Score

Outperforms FLUX-dev (12B params) on compositional understanding with just 350M parameters—a new efficiency benchmark.

🎯
7
Reward Dimensions

Simultaneously optimizes aesthetics, user preference, semantic alignment, and more—preventing reward hacking.

🎛️
Controllability

Fine-grained control over quality trade-offs at inference time without retraining or model collapse.

📊
100%
Data Utilization

Preserves the full spectrum of training data instead of discarding "low-quality" samples, maximizing learning.

Key Takeaways

1
Single-Stage Training

MIRO eliminates the need for separate fine-tuning or RL stages, simplifying the training pipeline while achieving superior results.

2
Multi-Objective Balance

By optimizing multiple rewards simultaneously, MIRO prevents reward hacking and mode collapse that plague single-objective approaches.

3
Exceptional Efficiency

Despite being much smaller (350M vs 12B params), MIRO surpasses FLUX-dev on GenEval and PartiPrompts at a fraction of the computational cost.

4
Inference-Time Control

Users can dial individual rewards up or down at generation time, achieving precise trade-offs without expensive test-time search or retraining.

Looking Forward

We believe MIRO opens a new direction for leveraging reward models in generative AI. Rather than treating alignment as a correction mechanism applied after the fact, integrating rewards from the beginning enables models that are faster to train, more controllable, and more efficient to deploy.

Beyond Text-to-Image Generation

This paradigm shift—from post-hoc alignment to reward-conditioned pretraining—could extend to other domains where multiple quality dimensions matter: large language models (balancing helpfulness, harmlessness, and accuracy), video generation (temporal consistency, motion quality, aesthetics), 3D synthesis (geometric accuracy, visual realism, physical plausibility), and audio generation (fidelity, naturalness, clarity).

Personalized Reward Spaces

An exciting future direction is discovering a basis of fundamental reward dimensions that could represent any user preference. Just as colors can be composed from RGB primaries, could we find a minimal set of reward "basis vectors" from which any personalized reward emerges as a linear combination? MIRO's multi-reward conditioning framework could then enable users to define custom quality trade-offs on the fly—dialing in their unique preferences without requiring new models or expensive retraining. This would transform alignment from a one-size-fits-all solution into a personalized, interpretable control surface over generation quality.

Acknowledgements

💰

This work was supported by ANR project TOSAI ANR-20-IADJ-0009, and was granted access to the HPC resources of IDRIS under the allocation 2024-A0171014246 made by GENCI.

🙏

We would like to thank Alyosha Efros, Tero Karras, and Luca Eyring for their helpful comments, and Yuanzhi Zhu and Xi Wang for proofreading.

Citation

If you find MIRO useful for your research, please consider citing our paper:

@article{dufour2025miro,
  title   = {MIRO: Multi-Reward Conditioned Pretraining for Text-to-Image Generation},
  author  = {Dufour, Nicolas and Degeorge, Lucas and Ghosh, Arijit and Kalogeiton, Vicky and Picard, David},
  journal = {arXiv preprint arXiv:XXXX.XXXXX},
  year    = {2025}
}

References

  1. Christiano, P. F., Leike, J., Brown, T., Martic, M., Legg, S., & Amodei, D. (2017). Deep reinforcement learning from human preferences. Advances in Neural Information Processing Systems, 30.
  2. Rafailov, R., Sharma, A., Mitchell, E., Ermon, S., Manning, C. D., & Finn, C. (2023). Direct Preference Optimization: Your Language Model is Secretly a Reward Model. NeurIPS 2023.
  3. Fan, Y., Watkins, O., Du, Y., Liu, H., Ryu, M., Boutilier, C., Abbeel, P., Ghavamzadeh, M., Lee, K., & Lee, K. (2023). DPOK: Reinforcement Learning for Fine-tuning Text-to-Image Diffusion Models. NeurIPS 2023.
  4. Esser, P., Kulal, S., Blattmann, A., Entezari, R., Müller, J., Saini, H., ... & Rombach, R. (2024). Scaling Rectified Flow Transformers for High-Resolution Image Synthesis. ICLR 2024.
  5. Black Forest Labs. (2024). FLUX. https://github.com/black-forest-labs/flux
  6. Dufour, N., Besnier, V., Kalogeiton, V., & Picard, D. (2024). Don't drop your samples! Coherence-aware training benefits Conditional diffusion. CVPR 2024.
  7. Schuhmann, C., Beaumont, R., Vencu, R., Gordon, C., Wightman, R., Cherti, M., ... & Jitsev, J. (2022). LAION-5B: An Open Large-scale Dataset for Training Next Generation Image-Text Models. NeurIPS 2022.
  8. Wu, X., Hao, Y., Sun, K., Chen, Y., Zhu, F., Zhao, R., & Li, H. (2023). Human Preference Score v2: A Solid Benchmark for Evaluating Human Preferences of Text-to-Image Synthesis. arXiv preprint.
  9. Kirstain, Y., Polyak, A., Singer, U., Matiana, S., Penna, J., & Levy, O. (2023). Pick-a-Pic: An Open Dataset of User Preferences for Text-to-Image Generation. NeurIPS 2023.
  10. Xu, J., Liu, X., Wu, Y., Tong, Y., Li, Q., Ding, M., Tang, J., & Dong, Y. (2023). ImageReward: Learning and Leveraging Human Preferences for Text-to-Image Generation. NeurIPS 2023.
  11. Ghosh, D., Hajishirzi, H., & Schmidt, L. (2024). GenEval: An object-focused framework for evaluating text-to-image alignment. Advances in Neural Information Processing Systems, 36.
  12. Lipman, Y., Chen, R. T. Q., Ben-Hamu, H., Nickel, M., & Le, M. (2023). Flow Matching for Generative Modeling. ICLR 2023.
  13. Karras, T., Aittala, M., Kynkäänniemi, T., Lehtinen, J., Aila, T., & Laine, S. (2024). Guiding a diffusion model with a bad version of itself. NeurIPS 2024.
  14. Luo, Y., Hu, T., Luo, W., Kawaguchi, K., & Tang, J. (2025). Reward-Instruct: A Reward-Centric Approach to Fast Photo-Realistic Image Generation. arXiv preprint.
  15. Ma, N., Tong, S., Jia, H., Hu, H., Su, Y., Zhang, M., Yang, X., Li, Y., Jaakkola, T., Jia, X., & Xie, S. (2025). Inference-time scaling for diffusion models beyond scaling denoising steps. CVPR 2025.