DINeMo: Learning Neural Mesh Models with no 3D Annotations

CVPR 2025 C3DV Workshop

Weijie Guo, Guofeng Zhang, Wufei Ma, Alan Yuille
Johns Hopkins University

In this work, we present DINeMo, a novel neural mesh model that is trained without 3D annotations by leveraging pseudo-correspondence obtained from large visual foundation models. We adopt a bidirectional pseudo-correspondence generation method, which produces pseudo-correspondence utilizing both local appearance features and global context information.

Experimental results on car datasets demonstrate that our DINeMo outperforms previous zero- and few-shot 3D pose estimation methods by a wide margin, narrowing the gap with fully-supervised methods by 67.3%. Our DINeMo also scales effectively and efficiently when incorporating more unlabeled images, demonstrating its advantage over supervised learning methods that rely on 3D annotations.

DINeMo

Preliminaries. Neural mesh models define a probabilistic generative model of feature activations using a 3D neural mesh $\mathfrak{N} = \{\mathcal{V}, \mathcal{E}, \mathcal{C}\}$ with vertices $\mathcal{V}$, edge set $\mathcal{E}$, and vertex-level features $\mathcal{C}$. Given a pose $m$, we define the likelihood of a target feature map $F$ as

$$p(F \mid \mathfrak{N}, m, C_b) = \prod_{i \in \mathcal{FG}} p(f_i \mid \mathfrak{N}, m) \prod_{i' \in \mathcal{BG}} p(f_{i'} \mid C_b)$$
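For concreteness, in the NeMo line of work each foreground factor is commonly modeled as an isotropic Gaussian around the feature of the vertex rendered at that pixel; the exact parameterization is not stated on this page, so the following is a hedged sketch of that standard choice:

$$p(f_i \mid \mathfrak{N}, m) = \frac{1}{Z} \exp\left(-\frac{\|f_i - c_{v(i)}\|^2}{2\sigma^2}\right),$$

where $v(i)$ denotes the mesh vertex that projects to pixel $i$ under pose $m$, $c_{v(i)} \in \mathcal{C}$ is its vertex feature, and $Z$ is a normalization constant; the background factors $p(f_{i'} \mid C_b)$ take the same form around a shared background feature $C_b$.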

This allows us to predict object category, 3D/6D object pose, and object shape in an analysis-by-synthesis approach. Neural mesh models are often trained with a part-contrastive loss using vertex correspondence labels, which are typically obtained from 3D pose annotations in PASCAL3D+ or ImageNet3D.
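To make the training objective concrete, the following is a minimal sketch of a part-contrastive loss of this kind, assuming a backbone feature map and per-pixel vertex labels (from annotations or, in our setting, pseudo-correspondence) are already given; the function name, tensor layout, and temperature are illustrative, not the paper's exact implementation:

```python
import torch
import torch.nn.functional as F

def part_contrastive_loss(feat_map, vertex_feats, pixel_to_vertex, tau=0.07):
    """Pull each foreground pixel feature toward its (pseudo-)corresponding
    vertex feature and push it away from all other vertex features.

    feat_map:        (D, H, W) feature map from the backbone
    vertex_feats:    (V, D) learnable vertex features (the neural mesh C)
    pixel_to_vertex: (H, W) long tensor; vertex index per pixel, -1 = background
    """
    D, H, W = feat_map.shape
    feats = feat_map.permute(1, 2, 0).reshape(-1, D)    # (H*W, D)
    labels = pixel_to_vertex.reshape(-1)                # (H*W,)
    fg = labels >= 0                                    # foreground pixels only
    feats = F.normalize(feats[fg], dim=-1)
    verts = F.normalize(vertex_feats, dim=-1)           # (V, D)
    logits = feats @ verts.t() / tau                    # (N_fg, V) similarities
    # cross-entropy over vertices = InfoNCE with the matched vertex as positive
    return F.cross_entropy(logits, labels[fg])
```

Minimizing this loss makes each vertex feature in $\mathcal{C}$ discriminative for its corresponding image region, which is what the analysis-by-synthesis inference above relies on.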

Our approach. Our goal is to exploit pseudo-correspondence obtained from large visual foundation models, such as DINOv2. Given object images without 3D annotations, we first generate pseudo-vertex-correspondence from a pretrained SD-DINO model using our bidirectional pseudo-correspondence generation. Then we train our DINeMo model with a part correspondence loss on car images from standard 3D pose estimation datasets or abundant unannotated images from the Internet, e.g., from the StanfordCars dataset.

DINeMo overview
Figure 1. Overview of our DINeMo model.

Bidirectional pseudo-correspondence generation. We find that raw pseudo-correspondence can be quite noisy, e.g., keypoints are often mismatched between the left and right sides of the object. We argue that keypoint correspondence matching should consider both local information, i.e., per-patch feature similarities, and global context information, i.e., the 3D orientation of the object. Based on this motivation, we propose a novel bidirectional pseudo-correspondence generation, which consists of two steps: (i) matching a global pose label from the raw keypoint correspondences, and (ii) refining the local keypoint correspondences based on the predicted global pose label.
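The two steps above can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the paper's exact procedure: we assume patch-to-vertex similarities (e.g., from SD-DINO) and precomputed vertex-visibility masks for a set of discretized candidate poses are given:

```python
import numpy as np

def bidirectional_pseudo_correspondence(sim, visible_by_pose):
    """Two-step pseudo-correspondence generation (illustrative sketch).

    sim:             (P, V) patch-to-vertex feature similarities
    visible_by_pose: (K, V) bool; vertex visibility under each candidate pose
    returns: (pose_id, matches) with matches[p] = refined vertex for patch p
    """
    # (i) global step: raw argmax matches vote for the candidate pose under
    # which the most matched vertices are actually visible
    raw = sim.argmax(axis=1)                      # (P,) raw local matches
    votes = visible_by_pose[:, raw].sum(axis=1)   # (K,) votes per pose
    pose_id = int(votes.argmax())
    # (ii) local step: re-match each patch, restricted to vertices visible
    # under the predicted pose (suppresses left/right mismatches)
    masked = np.where(visible_by_pose[pose_id], sim, -np.inf)
    matches = masked.argmax(axis=1)               # (P,) refined matches
    return pose_id, matches
```

The key design choice is that the global vote in step (i) aggregates many noisy local matches, so a few left/right flips cannot change the pose estimate, and step (ii) then removes exactly those flips by masking out invisible vertices.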

Figure 2. Bidirectional pseudo-correspondence generation.

Occlusion-aware analysis-by-synthesis. Previous neural mesh models adopt explicit occlusion reasoning during inference. However, directly predicting occlusion from neural features yields noisy masks and hurts final performance. With the recent rise of foundation models like Segment Anything, we extend standard analysis-by-synthesis inference with Grounded-SAM masks, achieving enhanced occlusion robustness.
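One simple way to use such a mask during render-and-compare inference is to score candidate poses only on pixels the segmenter assigns to the object, treating everything else as background. A minimal sketch, assuming the mask is already computed (e.g., by Grounded-SAM) and the mesh features have been rendered under the candidate pose; the function name and constant background score are assumptions for illustration:

```python
import torch

def masked_pose_score(feat_map, rendered_vertex_feats, obj_mask, bg_score=0.0):
    """Score a candidate pose using an external segmentation mask (sketch).

    feat_map:              (D, H, W) target feature map
    rendered_vertex_feats: (D, H, W) vertex features rendered under the pose
    obj_mask:              (H, W) bool, True where the segmenter sees the object

    Pixels outside the mask contribute a constant background score, so
    occluders no longer corrupt the foreground matching term.
    """
    sim = torch.nn.functional.cosine_similarity(
        feat_map, rendered_vertex_feats, dim=0)          # (H, W) per-pixel match
    score = torch.where(obj_mask, sim, torch.full_like(sim, bg_score))
    return score.mean()
```

The pose estimate is then the candidate $m$ maximizing this score, exactly as in standard analysis-by-synthesis, but with occluded pixels neutralized by the mask.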

Main Results

3D object pose estimation on the car split of PASCAL3D+ and occluded PASCAL3D+.

Figure 3. 3D object pose estimation on the car split of PASCAL3D+ and occluded PASCAL3D+.

Semantic correspondence evaluation on the car split of SPair71k. Our DINeMo outperforms all previous methods by a wide margin and achieves comparable performance with Telling Left from Right, which uses indices to flip source keypoints at test time.

Figure 4. Semantic correspondence evaluation on the car split of SPair71k.

Scaling properties of DINeMo with more unlabeled object images used for training.

Figure 5. Scaling properties of DINeMo with more unlabeled object images used for training.

Qualitative Examples

Qualitative comparisons with and without our bidirectional pseudo-correspondence generation.

Figure 6. Qualitative comparisons with and without our bidirectional pseudo-correspondence generation.

Qualitative comparisons between DINOv2 and our DINeMo on the SPair71k dataset.

Figure 7. Qualitative comparisons between DINOv2 and our DINeMo on the SPair71k dataset.

Qualitative pose estimation results on the PASCAL3D+ dataset.

Figure 8. Qualitative pose estimation results on the PASCAL3D+ dataset.

BibTeX

@article{guo2025dinemo,
  title={DINeMo: Learning Neural Mesh Models with no 3D Annotations},
  author={Guo, Weijie and Zhang, Guofeng and Ma, Wufei and Yuille, Alan},
  journal={arXiv preprint arXiv:2412.07825},
  year={2025}
}

Notes

This website template is adapted from Image Sculpting.