DINeMo: Learning Neural Mesh Models with no 3D Annotations

CVPR 2025 C3DV Workshop

Weijie Guo, Guofeng Zhang, Wufei Ma, Alan Yuille
Johns Hopkins University

In this work, we present DINeMo, a novel neural mesh model that is trained without 3D annotations by leveraging pseudo-correspondence obtained from large visual foundation models. We adopt a bidirectional pseudo-correspondence generation method, which produces pseudo-correspondence utilizing both local appearance features and global context information.

Experimental results on car datasets demonstrate that our DINeMo outperforms previous zero- and few-shot 3D pose estimation methods by a wide margin, narrowing the gap with fully-supervised methods by 67.3%. Our DINeMo also scales effectively and efficiently when incorporating more unlabeled images, demonstrating its advantage over supervised learning methods that rely on 3D annotations.

DINeMo

Preliminaries. Neural mesh models define a probabilistic generative model of feature activations using a 3D neural mesh $\mathfrak{N} = \{\mathcal{V}, \mathcal{E}, \mathcal{C}\}$ with vertices $\mathcal{V}$, edge set $\mathcal{E}$, and vertex-level features $\mathcal{C}$. Given a pose $m$, we define the likelihood of a target feature map $F$ as

$$p(F \mid \mathfrak{N}, m, C_b) = \prod_{i \in \mathcal{FG}} p(f_i \mid \mathfrak{N}, m) \prod_{i' \in \mathcal{BG}} p(f_{i'} \mid C_b)$$
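For concreteness, in the NeMo line of work each foreground factor is commonly modeled as an isotropic Gaussian around the feature of the vertex rendered at that pixel; the exact parameterization is not stated on this page, so the following is a hedged sketch of that standard choice:

$$p(f_i \mid \mathfrak{N}, m) = \frac{1}{Z} \exp\left(-\frac{\|f_i - c_{v(i)}\|^2}{2\sigma^2}\right),$$

where $v(i)$ denotes the mesh vertex that projects to pixel $i$ under pose $m$, $c_{v(i)} \in \mathcal{C}$ is its vertex feature, and $Z$ is a normalization constant; the background factors $p(f_{i'} \mid C_b)$ take the same form around a shared background feature $C_b$.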

This allows us to predict object category, 3D/6D object pose, and object shape in an analysis-by-synthesis approach. Neural mesh models are often trained with a part-contrastive loss using vertex correspondence labels, which are typically obtained from 3D pose annotations in PASCAL3D+ or ImageNet3D.
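To make the training objective concrete, the following is a minimal sketch of a part-contrastive loss of this kind, assuming a backbone feature map and per-pixel vertex labels (from annotations or, in our setting, pseudo-correspondence) are already given; the function name, tensor layout, and temperature are illustrative, not the paper's exact implementation:

```python
import torch
import torch.nn.functional as F

def part_contrastive_loss(feat_map, vertex_feats, pixel_to_vertex, tau=0.07):
    """Pull each foreground pixel feature toward its (pseudo-)corresponding
    vertex feature and push it away from all other vertex features.

    feat_map:        (D, H, W) feature map from the backbone
    vertex_feats:    (V, D) learnable vertex features (the neural mesh C)
    pixel_to_vertex: (H, W) long tensor; vertex index per pixel, -1 = background
    """
    D, H, W = feat_map.shape
    feats = feat_map.permute(1, 2, 0).reshape(-1, D)    # (H*W, D)
    labels = pixel_to_vertex.reshape(-1)                # (H*W,)
    fg = labels >= 0                                    # foreground pixels only
    feats = F.normalize(feats[fg], dim=-1)
    verts = F.normalize(vertex_feats, dim=-1)           # (V, D)
    logits = feats @ verts.t() / tau                    # (N_fg, V) similarities
    # cross-entropy over vertices = InfoNCE with the matched vertex as positive
    return F.cross_entropy(logits, labels[fg])
```

Minimizing this loss makes each vertex feature in $\mathcal{C}$ discriminative for its corresponding image region, which is what the analysis-by-synthesis inference above relies on.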

Our approach. Our goal is to exploit pseudo-correspondence obtained from large visual foundation models, such as DINOv2. Given object images without 3D annotations, we first generate pseudo-vertex-correspondence from a pretrained SD-DINO model using our bidirectional pseudo-correspondence generation. Then we train our DINeMo model with a part correspondence loss on car images from standard 3D pose estimation datasets or abundant unannotated images from the Internet, e.g., from the StanfordCars dataset.

DINeMo overview
Figure 1. Overview of our DINeMo model.

Bidirectional pseudo-correspondence generation. We find that raw pseudo-correspondence can be quite noisy, e.g., keypoints are often mismatched between the left and right sides of the object. We argue that keypoint correspondence matching should consider both local information, i.e., per-patch feature similarities, and global context information, i.e., the 3D orientation of the object. Based on this motivation, we propose a novel bidirectional pseudo-correspondence generation, which consists of two steps: (i) matching a global pose label from the raw keypoint correspondences, and (ii) refining the local keypoint correspondences based on the predicted global pose label.
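The two steps above can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the paper's exact procedure: we assume patch-to-vertex similarities (e.g., from SD-DINO) and precomputed vertex-visibility masks for a set of discretized candidate poses are given:

```python
import numpy as np

def bidirectional_pseudo_correspondence(sim, visible_by_pose):
    """Two-step pseudo-correspondence generation (illustrative sketch).

    sim:             (P, V) patch-to-vertex feature similarities
    visible_by_pose: (K, V) bool; vertex visibility under each candidate pose
    returns: (pose_id, matches) with matches[p] = refined vertex for patch p
    """
    # (i) global step: raw argmax matches vote for the candidate pose under
    # which the most matched vertices are actually visible
    raw = sim.argmax(axis=1)                      # (P,) raw local matches
    votes = visible_by_pose[:, raw].sum(axis=1)   # (K,) votes per pose
    pose_id = int(votes.argmax())
    # (ii) local step: re-match each patch, restricted to vertices visible
    # under the predicted pose (suppresses left/right mismatches)
    masked = np.where(visible_by_pose[pose_id], sim, -np.inf)
    matches = masked.argmax(axis=1)               # (P,) refined matches
    return pose_id, matches
```

The key design choice is that the global vote in step (i) aggregates many noisy local matches, so a few left/right flips cannot change the pose estimate, and step (ii) then removes exactly those flips by masking out invisible vertices.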

Figure 2. Bidirectional pseudo-correspondence generation.

Occlusion-aware analysis-by-synthesis. Previous neural mesh models adopt explicit occlusion reasoning during inference. However, directly predicting occlusion from neural features yields noisy masks and hurts final performance. With the recent rise of foundation models like Segment Anything, we extend standard analysis-by-synthesis inference with Grounded-SAM masks, achieving enhanced occlusion robustness.
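One simple way to use such a mask during render-and-compare inference is to score candidate poses only on pixels the segmenter assigns to the object, treating everything else as background. A minimal sketch, assuming the mask is already computed (e.g., by Grounded-SAM) and the mesh features have been rendered under the candidate pose; the function name and constant background score are assumptions for illustration:

```python
import torch

def masked_pose_score(feat_map, rendered_vertex_feats, obj_mask, bg_score=0.0):
    """Score a candidate pose using an external segmentation mask (sketch).

    feat_map:              (D, H, W) target feature map
    rendered_vertex_feats: (D, H, W) vertex features rendered under the pose
    obj_mask:              (H, W) bool, True where the segmenter sees the object

    Pixels outside the mask contribute a constant background score, so
    occluders no longer corrupt the foreground matching term.
    """
    sim = torch.nn.functional.cosine_similarity(
        feat_map, rendered_vertex_feats, dim=0)          # (H, W) per-pixel match
    score = torch.where(obj_mask, sim, torch.full_like(sim, bg_score))
    return score.mean()
```

The pose estimate is then the candidate $m$ maximizing this score, exactly as in standard analysis-by-synthesis, but with occluded pixels neutralized by the mask.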

Main Results

3D object pose estimation on the car split of PASCAL3D+ and occluded PASCAL3D+.

Figure 3. 3D object pose estimation on the car split of PASCAL3D+ and occluded PASCAL3D+.

Semantic correspondence evaluation on the car split of SPair71k. Our DINeMo outperforms all previous methods by a wide margin and achieves comparable performance with Telling Left from Right, which uses indices to flip source keypoints at test time.

Figure 4. Semantic correspondence evaluation on the car split of SPair71k.

Scaling properties of DINeMo with more unlabeled object images used for training.

Figure 5. Scaling properties of DINeMo with more unlabeled object images used for training.

Qualitative Examples

Qualitative comparisons with and without our bidirectional pseudo-correspondence generation.

Figure 6. Qualitative comparisons with and without our bidirectional pseudo-correspondence generation.

Qualitative comparisons between DINOv2 and our DINeMo on the SPair71k dataset.

Figure 7. Qualitative comparisons between DINOv2 and our DINeMo on the SPair71k dataset.

Qualitative pose estimation results on the PASCAL3D+ dataset.

Figure 8. Qualitative pose estimation results on the PASCAL3D+ dataset.

BibTeX

@article{guo2025dinemo,
  title={DINeMo: Learning Neural Mesh Models with no 3D Annotations},
  author={Guo, Weijie and Zhang, Guofeng and Ma, Wufei and Yuille, Alan},
  journal={arXiv preprint arXiv:2412.07825},
  year={2025}
}

Notes

This website template is adapted from Image Sculpting.