CVPR 2026

MV-RoMa: From Pairwise Matching into
Multi-View Track Reconstruction

Jongmin Lee1,2   Seungyeop Kang1   Sungjoo Yoo1
1Seoul National University   2KAIST

Summary

Overview

We propose MV-RoMa, a multi-view dense matching model that jointly estimates dense correspondences from a source image to multiple co-visible targets in a single forward pass. By embedding sparse geometric priors as track tokens into DINOv2 and applying efficient pixel-aligned attention at the refinement stage, MV-RoMa produces geometrically consistent multi-view tracks without the prohibitive cost of full cross-attention — enabling dense and accurate 3D reconstruction in the SfM pipeline.

Qualitative Results

Left: Aachen dataset. Right: Gangnam Station dataset by NaverLabs.


Method

Pipeline

Stage 1

Track token construction

Pairwise matches are distilled into compact multi-view tracks via clustering-based sampling, forming a sparse geometric prior.
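The clustering idea can be illustrated with a small sketch (all names are hypothetical and the paper's actual sampling strategy may differ): union-find merges pairwise matches that share a keypoint into connected components, and each component becomes a candidate multi-view track.

```python
# Hypothetical sketch: distill pairwise matches into multi-view tracks.
# A keypoint observation is an (image_id, keypoint_id) pair; a pairwise match
# links two observations. Union-find merges linked observations, and each
# resulting cluster is one track.

class UnionFind:
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            # Path halving keeps the trees shallow.
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[ra] = rb

def build_tracks(pairwise_matches, min_views=3):
    """pairwise_matches: iterable of ((img_i, kp_i), (img_j, kp_j))."""
    uf = UnionFind()
    for a, b in pairwise_matches:
        uf.union(a, b)
    clusters = {}
    for obs in list(uf.parent):
        clusters.setdefault(uf.find(obs), []).append(obs)
    # Keep only tracks observed in enough distinct views
    # (one plausible sampling criterion, not necessarily the paper's).
    return [t for t in clusters.values()
            if len({img for img, _ in t}) >= min_views]
```

For example, matches (A,kp0)-(B,kp0) and (B,kp0)-(C,kp0) merge into a single three-view track, while an isolated two-view match is dropped under the `min_views=3` criterion.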

Stage 2

Track-guided encoder

Track tokens are injected into DINOv2 via attentional sampling, a track transformer, and attentional splatting — yielding view-consistent features.
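The sampling and splatting mechanics can be sketched as follows, assuming bilinear interpolation and omitting the track transformer itself (function names and the nearest-neighbour additive splatting rule are illustrative simplifications, not the paper's implementation):

```python
import numpy as np

def attentional_sample(feat, xy):
    """Gather one feature vector per track point via bilinear interpolation.
    feat: (H, W, C) feature map; xy: (N, 2) subpixel (x, y) locations."""
    H, W, _ = feat.shape
    x, y = xy[:, 0], xy[:, 1]
    x0, y0 = np.floor(x).astype(int), np.floor(y).astype(int)
    x1, y1 = x0 + 1, y0 + 1
    x0c, x1c = np.clip(x0, 0, W - 1), np.clip(x1, 0, W - 1)
    y0c, y1c = np.clip(y0, 0, H - 1), np.clip(y1, 0, H - 1)
    wx, wy = (x - x0)[:, None], (y - y0)[:, None]
    top = (1 - wx) * feat[y0c, x0c] + wx * feat[y0c, x1c]
    bot = (1 - wx) * feat[y1c, x0c] + wx * feat[y1c, x1c]
    return (1 - wy) * top + wy * bot

def attentional_splat(feat, xy, tokens):
    """Scatter updated track tokens back into the feature map, here
    simplified to nearest-neighbour additive splatting."""
    out = feat.copy()
    for (x, y), t in zip(np.round(xy).astype(int), tokens):
        out[y, x] += t
    return out
```

In between these two steps, a transformer operating on the sampled track tokens (one token per view per track) would propagate information across views before the tokens are splatted back, which is what makes the resulting features view-consistent.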

Stage 3

Multi-view refiner

Source-aligned target features enable pixel-aligned attention at the finest scale, refining correspondences without quadratic cross-attention cost.
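The efficiency argument can be made concrete with a minimal sketch (names and the single-head dot-product formulation are assumptions for illustration): because the target features are already warped into the source frame, each source pixel only attends to the N features at its own location, giving O(HWNC) cost instead of the O((HW)²) of full cross-attention.

```python
import numpy as np

def pixel_aligned_attention(src, aligned_tgts):
    """src: (H, W, C) source features; aligned_tgts: (N, H, W, C) target
    features already warped into the source frame. Each source pixel attends
    only to the N features at its own pixel, never to other locations."""
    C = src.shape[-1]
    # (H, W, N) attention logits: one dot product per pixel and view.
    logits = np.einsum('hwc,nhwc->hwn', src, aligned_tgts) / np.sqrt(C)
    w = np.exp(logits - logits.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    # (H, W, C) output: per-pixel weighted average over the N views.
    return np.einsum('hwn,nhwc->hwc', w, aligned_tgts)
```

With a single target view the softmax weight is identically 1, so the output reduces to that view's aligned features; with multiple views each pixel blends its candidates by feature similarity.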


Contributions

1. First dense multi-view matcher — direct 1-to-N correspondence estimation in a single forward pass, naturally producing geometrically consistent tracks.
2. Efficient multi-view architecture — track-guided encoder and pixel-aligned refiner avoid full cross-attention overhead.
3. State-of-the-art across benchmarks — HPatches, ETH3D, Texture-Poor SfM, and IMC PhotoTourism.

Related Links

Several excellent works are closely related to ours and inspired this project.

Dense-SfM proposes a dense SfM framework that integrates dense matching with Gaussian Splatting-based track extension and multi-view kernelized refinement for accurate and dense 3D reconstruction.

RoMa proposes a robust dense feature matcher that serves as the pairwise matching backbone in our framework.

TrackTention proposes a video understanding framework that leverages point tracks as spatial anchors for efficient temporal attention, enabling strong spatiotemporal reasoning across frames.

BibTeX

@misc{lee2026mvromapairwisematchingmultiview,
      title={MV-RoMa: From Pairwise Matching into Multi-View Track Reconstruction},
      author={Jongmin Lee and Seungyeop Kang and Sungjoo Yoo},
      year={2026},
      eprint={2603.27542},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2603.27542},
}