Single-image super-resolution (SISR) remains challenging due to the inherent difficulty of recovering fine-grained details and preserving perceptual quality from low-resolution inputs. Existing methods often rely on limited image priors, leading to suboptimal results. We propose a novel approach that leverages the rich contextual information available in multiple modalities, including depth, segmentation, edges, and text prompts, to learn a powerful generative prior for SISR within a diffusion model framework. We introduce a flexible network architecture that effectively fuses multimodal information, accommodating an arbitrary number of input modalities without requiring significant modifications to the diffusion process. Crucially, we mitigate the hallucinations often introduced by text prompts by using spatial information from the other modalities to guide regional text-based conditioning. The guidance strength of each modality can also be controlled independently, allowing outputs to be steered in different directions, such as increasing bokeh through depth or adjusting object prominence via segmentation. Extensive experiments demonstrate that our model surpasses state-of-the-art generative SISR methods, achieving superior visual quality and fidelity.
Photo of a stained-glass ceiling, circular composition. The ceiling features a radiating pattern of yellow, white, and gray stained glass panels. Dark brown central element acts as a focal point. The style is Art Nouveau, detailed lines, and intricate design.
Photo of a large wooden building with a steeply pitched roof, multiple levels, and numerous balconies adorned with hanging flowers. The building is constructed primarily of light brown wood with a stone base, exhibiting traditional Romanian architecture. People are visible near the bottom of the building, providing a sense of scale. The photo is taken from a low angle, showcasing the full height and grandeur of the structure against a clear sky.
Photo of a baseball stadium scoreboard, close-up view. The image shows the words "CIPAL PARK" prominently displayed on a metal structure above various advertisements including Bud Light and Coca-Cola. Below the main sign are several smaller screens displaying baseball action and other promotional material. The overall composition is centered on the stadium signage against a blurred background of netting and sky.
Photo of skyscrapers viewed from below, black and white. The image features three tall buildings, two on the sides and one dominating the lower half. The composition is symmetrical, with the buildings framing a bright white sky. The style is minimalist, emphasizing geometric lines and textures.
Photo of two surfers on a grassy cliff overlooking a black sand beach. The surfers are positioned slightly off-center, each carrying a white surfboard. The dark volcanic sand contrasts sharply with the white surf and green grass. The ocean waves are visible in the upper portion of the image creating a clear horizon line.
*See paper and supplementary material for experimental settings.
Language models struggle to accurately represent spatial information, leading to coarse and imprecise image super-resolution. To overcome this limitation, we incorporate additional spatial modalities like depth maps and semantic segmentation maps. These modalities provide detailed spatial context, allowing our model to implicitly align language descriptions with individual pixels through a transformer network. This enriched understanding of the image significantly enhances the realism of our super-resolution results and minimizes distortion.
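To make this alignment step concrete, here is a minimal sketch of how text tokens could be grounded in spatial-modality tokens via cross-attention. Everything in it (the `MultimodalFusion` class, its dimensions, and the single attention layer) is an illustrative assumption, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class MultimodalFusion(nn.Module):
    """Hypothetical sketch: text tokens attend to spatial-modality tokens
    (depth, segmentation, edges) so each phrase is implicitly grounded at
    pixel locations. Names and dimensions are illustrative only."""

    def __init__(self, dim: int = 768, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text_tokens: torch.Tensor, spatial_tokens: torch.Tensor) -> torch.Tensor:
        # text_tokens:    (B, T, dim)   encoded caption
        # spatial_tokens: (B, H*W, dim) flattened depth/segmentation/edge features
        grounded, _ = self.attn(query=text_tokens,
                                key=spatial_tokens,
                                value=spatial_tokens)
        # Residual connection keeps the original text semantics intact.
        return self.norm(text_tokens + grounded)
```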
Starting with a low-resolution (LR) image, we extract modalities such as depth and semantic segmentation maps. These modalities are encoded into tokens and transformed into multimodal latent tokens ($m$). Our diffusion model uses these tokens, together with the LR input, to generate a high-resolution (SR) output. A multimodal classifier-free guidance (m-CFG) step refines the SR image for enhanced quality.
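As a rough sketch of how an m-CFG step might combine per-modality guidance terms, consider the composable formulation below. The `eps_fn` denoiser interface, the additive combination of terms, and all names are assumptions for illustration, not the paper's exact formulation.

```python
import torch

def m_cfg(eps_fn, x_t, t, lr_latent, modality_tokens, weights):
    """Hypothetical sketch of multimodal classifier-free guidance (m-CFG).

    eps_fn: denoiser predicting noise from (x_t, t, lr_latent, tokens);
    modality_tokens: dict of modality name -> token tensor (None = dropped);
    weights: dict of modality name -> guidance strength.
    Illustrative only; the paper's formulation may differ.
    """
    # Unconditional pass: every modality's tokens are dropped.
    none_tokens = {name: None for name in modality_tokens}
    eps_uncond = eps_fn(x_t, t, lr_latent, none_tokens)

    # One conditional pass per modality; each contributes its own
    # independently weighted guidance term.
    eps = eps_uncond.clone()
    for name, tokens in modality_tokens.items():
        cond = dict(none_tokens, **{name: tokens})
        eps_cond = eps_fn(x_t, t, lr_latent, cond)
        eps = eps + weights[name] * (eps_cond - eps_uncond)
    return eps
```

Keeping one guidance weight per modality is what would let the strength of depth, segmentation, edge, and text conditioning be tuned independently, as described above.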
*Figure: qualitative comparison with state-of-the-art methods. Each row shows the LR input, its caption, a selected patch (Patch 1), and the corresponding zoomed crops from PASD, SeeSR, SUPIR, and our MMSR. The four example captions:*

1. "Two people walk along a narrow path carved into the reddish-orange sandstone cliffs. A small evergreen tree grows in the foreground near the base of the cliff …"
2. "Photo of a mining town nestled on a hillside, autumnal colors. The buildings are predominantly low-slung structures with metal roofs, some …"
3. "Photo of a weathered exterior wall detail, grunge, aged. A rusty metal lamp is mounted on a dark vertical beam against a crumbling ochre and brown stucco wall …"
4. "Photo of a high-detail stone sculpture, sepia toned. The sculpture depicts Jesus Christ seated, his hands raised in a gesture of blessing …"
Severely degraded, low-resolution (LR) images can yield inaccurate multimodal information, manifesting as distorted edges, misidentified objects, and other artifacts. To further illustrate robustness, the figure below shows how each method and strategy behaves on LR images of varying quality, as measured by MUSIQ.
Our LR pre-processing effectively mitigates the impact of inaccurate multimodal data extracted from low-quality inputs (low LR MUSIQ score), ensuring consistent SR performance. The resulting scatter plots, characterized by a shallow slope and a high y-intercept, confirm the method's stability and resilience to low-quality LR inputs.
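The slope/intercept reading can be reproduced with a simple linear fit over (LR MUSIQ, SR quality) pairs. The sketch below uses placeholder numbers, not released data.

```python
import numpy as np

# Hypothetical sketch: quantify robustness by fitting a line to
# (LR MUSIQ score, SR quality metric) pairs. A shallow slope and a high
# intercept mean output quality stays stable even for low-quality inputs.
# These arrays are placeholders for illustration only.
lr_musiq = np.array([18.2, 25.7, 33.1, 41.9, 52.4, 60.8])
sr_quality = np.array([68.5, 69.1, 70.0, 70.4, 71.2, 71.6])

slope, intercept = np.polyfit(lr_musiq, sr_quality, deg=1)
print(f"slope={slope:.3f}, intercept={intercept:.1f}")
# A slope near zero indicates SR quality barely depends on LR quality.
```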
Quantitative comparison with state-of-the-art methods on both synthetic and real-world benchmarks. The best and second-best results for each metric are highlighted in red and blue, respectively.
Our method allows for fine-grained control over super-resolution results by adjusting the influence of each input modality. For example, reducing the edge temperature enhances edge sharpness (first row). Lowering the segmentation temperature emphasizes distinct features, such as the star pattern on the flag (second row). Decreasing the depth temperature accentuates depth-of-field effects, like the bokeh between the foreground and background (third row).
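One plausible way to implement such per-modality control is to divide that modality's attention logits by a temperature before the softmax, sharpening (low temperature) or flattening (high temperature) the model's reliance on it. The sketch below is an assumption for illustration; the function name, shapes, and single dot-product attention are all hypothetical.

```python
import torch
import torch.nn.functional as F

def modality_attention(query: torch.Tensor,
                       modality_tokens: torch.Tensor,
                       temperature: float = 1.0) -> torch.Tensor:
    """Hypothetical sketch of per-modality temperature control.

    A temperature below 1 sharpens the attention over this modality's
    tokens (e.g., a low depth temperature accentuates depth-of-field
    cues); above 1 flattens it. Illustrative only.
    """
    # query: (B, Q, d); modality_tokens: (B, M, d)
    d = query.shape[-1]
    logits = query @ modality_tokens.transpose(-2, -1) / (d ** 0.5)
    attn = F.softmax(logits / temperature, dim=-1)
    return attn @ modality_tokens
```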
@inproceedings{mei2025mmsr,
  title     = {The Power of Context: How Multimodality Improves Image Super-Resolution},
  author    = {Mei, Kangfu and Talebi, Hossein and Ardakani, Mojtaba and Patel, Vishal M and Milanfar, Peyman and Delbracio, Mauricio},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year      = {2025},
}