CVPR 2026

MV-RoMa: From Pairwise Matching into
Multi-View Track Reconstruction

Jongmin Lee1,2   Seungyeop Kang1   Sungjoo Yoo1
1Seoul National University   2KAIST

Summary

Overview

We propose MV-RoMa, a multi-view dense matching model that jointly estimates dense correspondences from a source image to multiple co-visible targets in a single forward pass. By embedding sparse geometric priors as track tokens into DINOv2 and applying efficient pixel-aligned attention at the refinement stage, MV-RoMa produces geometrically consistent multi-view tracks without the prohibitive cost of full cross-attention — enabling dense and accurate 3D reconstruction in the SfM pipeline.

Qualitative Results

Left: Aachen dataset. Right: Gangnam Station dataset by NaverLabs.


Method

Pipeline

Stage 1

Track token construction

Pairwise matches are distilled into compact multi-view tracks via clustering-based sampling, forming a sparse geometric prior.
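The clustering idea can be illustrated with a small sketch (all names are hypothetical and the paper's actual sampling strategy may differ): union-find merges pairwise matches that share a keypoint into connected components, and each component becomes a candidate multi-view track.

```python
# Hypothetical sketch: distill pairwise matches into multi-view tracks.
# A keypoint observation is an (image_id, keypoint_id) pair; a pairwise match
# links two observations. Union-find merges linked observations, and each
# resulting cluster is one track.

class UnionFind:
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            # Path halving keeps the trees shallow.
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[ra] = rb

def build_tracks(pairwise_matches, min_views=3):
    """pairwise_matches: iterable of ((img_i, kp_i), (img_j, kp_j))."""
    uf = UnionFind()
    for a, b in pairwise_matches:
        uf.union(a, b)
    clusters = {}
    for obs in list(uf.parent):
        clusters.setdefault(uf.find(obs), []).append(obs)
    # Keep only tracks observed in enough distinct views
    # (one plausible sampling criterion, not necessarily the paper's).
    return [t for t in clusters.values()
            if len({img for img, _ in t}) >= min_views]
```

For example, matches (A,kp0)-(B,kp0) and (B,kp0)-(C,kp0) merge into a single three-view track, while an isolated two-view match is dropped under the `min_views=3` criterion.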

Stage 2

Track-guided encoder

Track tokens are injected into DINOv2 via attentional sampling, a track transformer, and attentional splatting — yielding view-consistent features.
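The sampling and splatting mechanics can be sketched as follows, assuming bilinear interpolation and omitting the track transformer itself (function names and the nearest-neighbour additive splatting rule are illustrative simplifications, not the paper's implementation):

```python
import numpy as np

def attentional_sample(feat, xy):
    """Gather one feature vector per track point via bilinear interpolation.
    feat: (H, W, C) feature map; xy: (N, 2) subpixel (x, y) locations."""
    H, W, _ = feat.shape
    x, y = xy[:, 0], xy[:, 1]
    x0, y0 = np.floor(x).astype(int), np.floor(y).astype(int)
    x1, y1 = x0 + 1, y0 + 1
    x0c, x1c = np.clip(x0, 0, W - 1), np.clip(x1, 0, W - 1)
    y0c, y1c = np.clip(y0, 0, H - 1), np.clip(y1, 0, H - 1)
    wx, wy = (x - x0)[:, None], (y - y0)[:, None]
    top = (1 - wx) * feat[y0c, x0c] + wx * feat[y0c, x1c]
    bot = (1 - wx) * feat[y1c, x0c] + wx * feat[y1c, x1c]
    return (1 - wy) * top + wy * bot

def attentional_splat(feat, xy, tokens):
    """Scatter updated track tokens back into the feature map, here
    simplified to nearest-neighbour additive splatting."""
    out = feat.copy()
    for (x, y), t in zip(np.round(xy).astype(int), tokens):
        out[y, x] += t
    return out
```

In between these two steps, a transformer operating on the sampled track tokens (one token per view per track) would propagate information across views before the tokens are splatted back, which is what makes the resulting features view-consistent.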

Stage 3

Multi-view refiner

Source-aligned target features enable pixel-aligned attention at the finest scale, refining correspondences without quadratic cross-attention cost.
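The efficiency argument can be made concrete with a minimal sketch (names and the single-head dot-product formulation are assumptions for illustration): because the target features are already warped into the source frame, each source pixel only attends to the N features at its own location, giving O(HWNC) cost instead of the O((HW)²) of full cross-attention.

```python
import numpy as np

def pixel_aligned_attention(src, aligned_tgts):
    """src: (H, W, C) source features; aligned_tgts: (N, H, W, C) target
    features already warped into the source frame. Each source pixel attends
    only to the N features at its own pixel, never to other locations."""
    C = src.shape[-1]
    # (H, W, N) attention logits: one dot product per pixel and view.
    logits = np.einsum('hwc,nhwc->hwn', src, aligned_tgts) / np.sqrt(C)
    w = np.exp(logits - logits.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    # (H, W, C) output: per-pixel weighted average over the N views.
    return np.einsum('hwn,nhwc->hwc', w, aligned_tgts)
```

With a single target view the softmax weight is identically 1, so the output reduces to that view's aligned features; with multiple views each pixel blends its candidates by feature similarity.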


Contributions

1. First dense multi-view matcher — direct 1-to-N correspondence estimation in a single forward pass, naturally producing geometrically consistent tracks.
2. Efficient multi-view architecture — track-guided encoder and pixel-aligned refiner avoid full cross-attention overhead.
3. State-of-the-art across benchmarks — HPatches, ETH3D, Texture-Poor SfM, and IMC PhotoTourism.

Related Links

Several excellent works are closely related to ours and inspired this project.

Dense-SfM proposes a dense SfM framework that integrates dense matching with Gaussian Splatting-based track extension and multi-view kernelized refinement for accurate and dense 3D reconstruction.

RoMa proposes a robust dense feature matcher that serves as the pairwise matching backbone in our framework.

TrackTention proposes a video understanding framework that leverages point tracks as spatial anchors for efficient temporal attention, enabling strong spatiotemporal reasoning across frames.

BibTeX

@misc{lee2026mvromapairwisematchingmultiview,
      title={MV-RoMa: From Pairwise Matching into Multi-View Track Reconstruction},
      author={Jongmin Lee and Seungyeop Kang and Sungjoo Yoo},
      year={2026},
      eprint={2603.27542},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2603.27542},
}