Single-image super-resolution (SISR) remains challenging due to the inherent difficulty of recovering fine-grained details and preserving perceptual quality from low-resolution inputs. Existing methods often rely on limited image priors, leading to suboptimal results. We propose a novel approach that leverages the rich contextual information available in multiple modalities (including depth, segmentation, edges, and text prompts) to learn a powerful generative prior for SISR within a diffusion model framework. We introduce a flexible network architecture that effectively fuses multimodal information, accommodating an arbitrary number of input modalities without requiring significant modifications to the diffusion process. Crucially, we mitigate hallucinations, often introduced by text prompts, by using spatial information from the other modalities to guide regional text-based conditioning. The guidance strength of each modality can also be controlled independently, allowing the output to be steered in different directions, such as increasing bokeh through depth or adjusting object prominence via segmentation. Extensive experiments demonstrate that our model surpasses state-of-the-art generative SISR methods, achieving superior visual quality and fidelity.
*See paper and supplementary material for experimental settings.
Language models struggle to accurately represent spatial information, leading to coarse and imprecise image super-resolution. To overcome this limitation, we incorporate additional spatial modalities like depth maps and semantic segmentation maps. These modalities provide detailed spatial context, allowing our model to implicitly align language descriptions with individual pixels through a transformer network. This enriched understanding of the image significantly enhances the realism of our super-resolution results and minimizes distortion.
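To make this concrete, the sketch below shows one way the spatial modalities could be derived from the LR input with off-the-shelf predictors. Here `depth_model` and `seg_model` are hypothetical stand-ins for any pretrained monocular depth estimator and semantic segmentation network; only the Canny edge extraction is a concrete OpenCV call. This is an illustrative assumption, not the exact preprocessing pipeline used for MMSR.

```python
import cv2
import numpy as np
import torch

def extract_modalities(lr_bgr: np.ndarray, depth_model, seg_model, device="cpu"):
    """Derive spatial conditioning signals (depth, segmentation, edges) from an LR image.

    `depth_model` and `seg_model` are hypothetical placeholders for pretrained
    networks mapping a (1, 3, H, W) image tensor to a depth map / class logits.
    """
    # Normalize the LR image to a float tensor in [0, 1], NCHW layout.
    rgb = cv2.cvtColor(lr_bgr, cv2.COLOR_BGR2RGB)
    x = torch.from_numpy(rgb).permute(2, 0, 1).float().div(255.0).unsqueeze(0).to(device)

    with torch.no_grad():
        depth = depth_model(x)                               # (1, 1, H, W) relative depth
        seg = seg_model(x).argmax(dim=1, keepdim=True).float()  # (1, 1, H, W) class ids

    # Edges come from a classical detector on the grayscale LR image.
    edges = cv2.Canny(cv2.cvtColor(lr_bgr, cv2.COLOR_BGR2GRAY), 100, 200)
    edges = torch.from_numpy(edges).float().div(255.0)[None, None].to(device)

    return {"depth": depth, "segmentation": seg, "edges": edges}
```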
Starting with a low-resolution (LR) image, we extract modalities such as depth and semantic segmentation maps. These modalities are encoded into tokens and transformed into multimodal latent tokens that condition the diffusion-based super-resolution model.
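The fusion step itself can be pictured as a small transformer in which a fixed set of learned latent tokens cross-attends to the concatenated per-modality tokens. The PyTorch sketch below is a minimal illustration of that idea, assuming a Perceiver-style latent-query design; the module names, dimensions, and use of `nn.MultiheadAttention` are illustrative choices, not the exact MMSR architecture.

```python
import torch
import torch.nn as nn

class MultimodalFusion(nn.Module):
    """Fuse an arbitrary number of modality token streams into a fixed set of
    multimodal latent tokens via cross-attention (illustrative sketch)."""

    def __init__(self, dim=768, num_latents=64, num_heads=8):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, modality_tokens):
        """modality_tokens: list of (B, N_i, dim) tensors, one per modality
        (e.g. text, depth, segmentation, edges). Returns (B, num_latents, dim)."""
        tokens = torch.cat(modality_tokens, dim=1)            # concatenate along sequence axis
        b = tokens.shape[0]
        q = self.latents.unsqueeze(0).expand(b, -1, -1)        # learned latent queries

        # Latents attend to all modality tokens, then refine themselves.
        q = q + self.cross_attn(self.norm1(q), tokens, tokens, need_weights=False)[0]
        qn = self.norm2(q)
        q = q + self.self_attn(qn, qn, qn, need_weights=False)[0]
        return q + self.mlp(q)
```

Because every modality is reduced to the same latent-token interface, such tokens could be injected into the diffusion backbone wherever text embeddings are normally consumed, which is what allows an arbitrary number of modalities to be added without modifying the diffusion process.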
*Figure: visual comparisons on real-world LR inputs. Columns: LR, caption, Patch 1 (zoomed crop), and zoomed outputs from PASD, SeeSR, SUPIR, and MMSR (ours). Example captions: "Two people walk along a narrow path carved into the reddish-orange sandstone cliffs. A small evergreen tree grows in the foreground near the base of the cliff..."; "Photo of a mining town nestled on a hillside, autumnal colors. The buildings are predominantly low-slung structures with metal roofs, some..."; "Photo of a weathered exterior wall detail, grunge, aged. A rusty metal lamp is mounted on a dark vertical beam against a crumbling ochre and brown stucco wall..."; "Photo of a high-detail stone sculpture, sepia toned. The sculpture depicts Jesus Christ seated, his hands raised in a gesture of blessing..."*
Severely degraded, low-resolution (LR) images can produce inaccurate multimodal information, manifesting as distorted edges, misidentified objects, and other artifacts. To further illustrate robustness, the figure below shows how each method and strategy behaves on LR images of varying quality, as measured by MUSIQ.
Our LR pre-processing effectively mitigates the impact of inaccurate multimodal data from low-quality inputs (low MUSIQ scores), ensuring consistent SR performance. The resulting scatter plots, characterized by a shallow slope and a high y-intercept, confirm the method's stability and resilience to low-quality LR inputs.
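For completeness, the snippet below sketches how such a robustness summary can be computed: LR inputs are scored with MUSIQ via the `pyiqa` package, and a line is fit to the (LR MUSIQ, SR quality) pairs so that the slope and y-intercept summarize sensitivity to input quality. The `sr_scores` argument is a placeholder for whichever SR-side metric is being analyzed; this is an evaluation sketch, not the exact protocol from the paper.

```python
import numpy as np
import pyiqa  # pip install pyiqa; provides a MUSIQ implementation
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
musiq = pyiqa.create_metric("musiq", device=device)  # no-reference quality metric

def robustness_summary(lr_paths, sr_scores):
    """Fit SR quality against LR MUSIQ scores.

    lr_paths:  list of LR image file paths.
    sr_scores: list of per-image SR quality scores (placeholder metric).
    Returns (slope, intercept): a shallow slope with a high intercept indicates
    that SR quality is largely insensitive to the quality of the LR input.
    """
    lr_musiq = np.array([musiq(p).item() for p in lr_paths])
    slope, intercept = np.polyfit(lr_musiq, np.array(sr_scores), deg=1)
    return slope, intercept
```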
Quantitative comparison with state-of-the-art methods on both synthetic and real-world benchmarks. The best and second best results of each metric are highlighted in red and blue, respectively.
Our method allows for fine-grained control over super-resolution results by adjusting the influence of each input modality. For example, reducing the edge temperature enhances edge sharpness (first row). Lowering the segmentation temperature emphasizes distinct features, such as the star pattern on the flag (second row). Decreasing the depth temperature accentuates depth-of-field effects, like the bokeh between the foreground and background (third row).
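One plausible way to realize this control is to give each modality's cross-attention logits its own temperature before the joint softmax, so that lowering a modality's temperature amplifies its guidance. The snippet below illustrates that mechanism; it is a hedged reading of the temperature knobs described above, not the MMSR implementation itself.

```python
import torch
import torch.nn.functional as F

def tempered_cross_attention(q, kv_per_modality, temperatures):
    """Cross-attention where each modality's logits get their own temperature.

    q:                (B, Nq, D) query tokens from the diffusion backbone.
    kv_per_modality:  dict name -> (B, Nk, D) key/value tokens for that modality.
    temperatures:     dict name -> float; values below 1.0 strengthen a
                      modality's influence, values above 1.0 soften it.
    """
    d = q.shape[-1]
    logits, values = [], []
    for name, kv in kv_per_modality.items():
        scores = q @ kv.transpose(-2, -1) / (d ** 0.5)        # (B, Nq, Nk)
        logits.append(scores / temperatures.get(name, 1.0))   # per-modality temperature
        values.append(kv)
    attn = F.softmax(torch.cat(logits, dim=-1), dim=-1)       # joint softmax over all modalities
    return attn @ torch.cat(values, dim=1)                     # (B, Nq, D)

# Example: emphasize depth-driven bokeh while keeping text and segmentation neutral.
# out = tempered_cross_attention(q, kv, {"depth": 0.5, "text": 1.0, "segmentation": 1.0})
```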
@inproceedings{mei2025mmsr,
title={The Power of Context: How Multimodality Improves Image Super-Resolution},
author={Mei, Kangfu and Talebi, Hossein and Ardakani, Mojtaba and Patel, Vishal M and Milanfar, Peyman and Delbracio, Mauricio},
year={2025},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
}