Cupid: feed-forward 3D reconstruction with joint camera pose estimation from single images

Cupid tackles a hard problem in 3D vision: reconstructing an object’s 3D geometry and estimating the camera pose from one 2D image, in a single feed-forward pass. This is notable because most pipelines treat pose estimation and reconstruction as separate steps, often requiring iterative optimization or multi-view input. Because Cupid models both jointly, it can output 3D Gaussians, textured meshes, and radiance fields quickly, with camera parameters aligned to the reconstruction. The reconstructed object can then be composited back into the original scene with accurate scale, placement, and lighting.

what cupid does: joint 3d reconstruction and camera pose estimation

Cupid is a generative model that takes a single RGB image as input and produces a 3D reconstruction of the object depicted, along with the camera extrinsics and intrinsics used to capture that image. Its architecture is built on top of TRELLIS, a 3D generative model whose structured latent representation can be decoded into several 3D output formats.
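To make "extrinsics and intrinsics" concrete, here is a minimal sketch of the standard pinhole camera model. It is not taken from Cupid's code: the helper name `project_point`, the matrices `K`, `R`, `t`, and all numeric values are illustrative assumptions, and the convention assumed is world-to-camera extrinsics with the camera looking down +z.

```python
def project_point(point_world, K, R, t):
    """Project a 3D world point to pixel coordinates with a pinhole camera.

    K is the 3x3 intrinsics matrix; (R, t) are world-to-camera extrinsics.
    """
    # Camera-space coordinates: X_cam = R @ X_world + t
    x_cam = [sum(R[i][j] * point_world[j] for j in range(3)) + t[i] for i in range(3)]
    if x_cam[2] <= 0:
        raise ValueError("point is behind the camera")
    # Homogeneous image coordinates: K @ X_cam, then divide by depth
    uvw = [sum(K[i][j] * x_cam[j] for j in range(3)) for i in range(3)]
    return uvw[0] / uvw[2], uvw[1] / uvw[2]


# Hypothetical camera: 500 px focal length, principal point at the
# centre of a 512x512 image, identity rotation, object 2 units away.
K = [[500.0, 0.0, 256.0], [0.0, 500.0, 256.0], [0.0, 0.0, 1.0]]
R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
t = [0.0, 0.0, 2.0]

u, v = project_point([0.1, -0.2, 0.0], K, R, t)
```

When the estimated camera is aligned with the reconstruction, projecting the reconstructed geometry this way should land it on the same pixels the object occupies in the input image, which is what makes compositing possible.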

The model outputs several 3D representations:

3D Gaussians that represent the scene geometry as volumetric elements
Radiance fields for view-dependent appearance
Textured meshes in GLB format, which can be exported to Blender for further editing or rendering
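Before handing an exported mesh to Blender, it can be useful to confirm the file really is a GLB container. The sketch below reads the 12-byte header defined by the glTF 2.0 binary format (ASCII magic `glTF`, then version and total length as little-endian 32-bit integers). The helper name `read_glb_header` is hypothetical, and the sample bytes are synthetic rather than a real Cupid export.

```python
import struct

GLB_MAGIC = b"glTF"


def read_glb_header(data: bytes):
    """Return (version, declared_length) from the 12-byte GLB header."""
    if len(data) < 12:
        raise ValueError("too short to be a GLB file")
    magic, version, length = struct.unpack("<4sII", data[:12])
    if magic != GLB_MAGIC:
        raise ValueError("missing glTF magic bytes")
    return version, length


# Synthetic header for illustration; a real export would be read
# with open(path, "rb") and passed in the same way.
sample = struct.pack("<4sII", GLB_MAGIC, 2, 12)
version, length = read_glb_header(sample)
```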

It supports both single-object reconstruction and multi-object scenes when provided with segmentation masks to separate individual objects. This compositional capability allows complex scenes to be rebuilt with correct object placement.
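How Cupid consumes masks internally is not described here, so the following is only an illustration of the general idea: each binary mask isolates one object's pixels so that objects can be processed separately. The function `split_by_masks` and the tiny 2x2 image are hypothetical.

```python
def split_by_masks(image, masks):
    """Return one masked copy of `image` per object mask.

    `image` is a list of rows of (r, g, b) tuples; each mask is a list
    of rows of 0/1 flags with the same height and width as the image.
    """
    background = (0, 0, 0)
    crops = []
    for mask in masks:
        crops.append([
            [pixel if keep else background for pixel, keep in zip(img_row, mask_row)]
            for img_row, mask_row in zip(image, mask)
        ])
    return crops


# Hypothetical 2x2 image containing two objects, one per column.
image = [[(255, 0, 0), (0, 0, 255)],
         [(255, 0, 0), (0, 0, 255)]]
masks = [[[1, 0], [1, 0]],
         [[0, 1], [0, 1]]]
per_object = split_by_masks(image, masks)
```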

Under the hood, the feed-forward pipeline avoids costly optimization loops, producing results in seconds on a suitable GPU. The pretrained weights are downloaded automatically from Hugging Face (hbb1/Cupid), which simplifies the inference workflow.

The codebase is Python, distributed largely as Jupyter notebooks, and uses CUDA for GPU acceleration. It requires Linux and an NVIDIA GPU with at least 16GB of VRAM. The tested CUDA versions are 11.8 and 12.2.

why cupid stands out: joint pose and geometry modeling with 3d gaussian splatting

Estimating camera pose and 3D geometry jointly in a single feed-forward pass is what differentiates Cupid from many other 3D reconstruction methods. Typically, pose estimation is done separately via classical methods or learned pose regressors, and 3D reconstruction relies on multi-view input or time-consuming optimization. Cupid sidesteps this by integrating both tasks into one model.

Another technical strength is its use of 3D Gaussian splatting, which represents the scene as a collection of 3D Gaussians. This approach allows efficient rendering and reconstruction, balancing quality and speed. The output of textured meshes (GLB) and radiance fields means the results are versatile for downstream applications like rendering, animation, or augmented reality.
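As a rough picture of what "a collection of 3D Gaussians" means, the sketch below uses the parameterization common in the Gaussian splatting literature: a centre, per-axis scales, a rotation quaternion, an opacity, and a color, with covariance Σ = R S Sᵀ Rᵀ. The class and function names are hypothetical, and Cupid's or TRELLIS's internal layout may differ (for example, log-scales or richer appearance features).

```python
from dataclasses import dataclass


@dataclass
class Gaussian3D:
    mean: tuple      # (x, y, z) centre
    scale: tuple     # per-axis standard deviations (sx, sy, sz)
    rotation: tuple  # unit quaternion (w, x, y, z)
    opacity: float
    color: tuple     # (r, g, b) in [0, 1]


def quat_to_matrix(q):
    """Rotation matrix for a unit quaternion (w, x, y, z)."""
    w, x, y, z = q
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ]


def covariance(g):
    """Sigma = R S S^T R^T, with S = diag(scale)."""
    R = quat_to_matrix(g.rotation)
    var = [s * s for s in g.scale]
    return [
        [sum(R[i][k] * var[k] * R[j][k] for k in range(3)) for j in range(3)]
        for i in range(3)
    ]


# Illustrative values: an axis-aligned Gaussian with no rotation.
g = Gaussian3D(mean=(0.0, 0.0, 0.0), scale=(0.1, 0.2, 0.3),
               rotation=(1.0, 0.0, 0.0, 0.0), opacity=0.9,
               color=(1.0, 0.5, 0.2))
sigma = covariance(g)
```

Storing scale and rotation separately, rather than the covariance itself, keeps Σ symmetric and positive semi-definite by construction.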

The code quality is fairly high for a research-oriented repo: the implementation is organized in notebooks, making it accessible for experimentation and visualization. The use of a pretrained model means users don’t need to train from scratch, which would be resource-intensive.

The main tradeoff is hardware: a GPU with at least 16GB of VRAM is a significant requirement that limits accessibility. The software stack is Linux-focused and depends on CUDA, which can complicate setup on other platforms.

The design decision to do everything feed-forward means the method is fast but might sacrifice some reconstruction accuracy compared to methods that iteratively optimize camera pose and geometry. However, for many use cases, the speed and joint modeling advantages outweigh these potential downsides.

quick start

prerequisites

Linux system (Windows support is not fully tested)
NVIDIA GPU with 16GB+ VRAM (tested on A100 and A6000)
CUDA Toolkit 11.8 or 12.2
Conda for environment management
Python 3.8 or higher
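The checklist above can be verified with a short script before running the installer. This is a sketch, not part of the Cupid repo: the function names and the VRAM threshold are assumptions, and it relies only on `nvidia-smi` being on the PATH.

```python
import platform
import subprocess
import sys

# Slightly below 16 * 1024 because nominal 16 GB cards often report
# a little less than 16384 MiB; adjust to taste.
MIN_VRAM_MIB = 15 * 1024


def parse_vram_mib(nvidia_smi_output):
    """Parse per-GPU total memory (MiB) from nvidia-smi CSV output."""
    return [int(line.strip()) for line in nvidia_smi_output.splitlines() if line.strip()]


def query_vram_mib():
    """Ask nvidia-smi for total memory per GPU; return [] if unavailable."""
    try:
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader,nounits"],
            capture_output=True, text=True, check=True,
        ).stdout
        return parse_vram_mib(out)
    except (OSError, ValueError, subprocess.CalledProcessError):
        return []


def check_prerequisites(system, python_version, vram_mib):
    """Return a list of problems; an empty list means all checks passed."""
    problems = []
    if system != "Linux":
        problems.append(f"OS is {system}, expected Linux")
    if tuple(python_version[:2]) < (3, 8):
        problems.append("Python 3.8 or higher is required")
    if not any(v >= MIN_VRAM_MIB for v in vram_mib):
        problems.append("no NVIDIA GPU with 16GB+ VRAM found")
    return problems


if __name__ == "__main__":
    for problem in check_prerequisites(platform.system(), sys.version_info, query_vram_mib()):
        print("WARNING:", problem)
```

The script does not check the CUDA Toolkit version or Conda; those are easiest to confirm by hand with `nvcc --version` and `conda --version`.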

installation steps

# Clone the repo with submodules
git clone --recurse-submodules https://github.com/cupid3d/Cupid.git
cd Cupid

# Create conda environment and install dependencies
# Note the flags and environment options described below
. ./setup.sh --new-env --basic --xformers --flash-attn --diffoctreerast --spconv --mipgaussian --kaolin --nvdiffrast --pytorch3d --moge

The setup script supports several options:

--new-env creates a new conda environment named cupid.
By default the environment uses CUDA 11.8 with PyTorch 2.4.0. If you have CUDA 12.2, install PyTorch manually.
flash-attn is the default attention backend; GPUs that do not support it (e.g., V100) need the backend set to xformers.
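One way to handle the backend choice is to derive it from the GPU's CUDA compute capability. The sketch below assumes FlashAttention-2's documented support for Ampere and newer GPUs (compute capability 8.0+). The helper `pick_attention_backend` is hypothetical, and `ATTN_BACKEND` is the environment variable used by upstream TRELLIS; that Cupid reads the same variable is an assumption to confirm against the repo's README.

```python
import os


def pick_attention_backend(compute_capability):
    """Choose an attention backend from a (major, minor) compute capability.

    Ampere and newer (8.0+) get flash-attn; older cards such as the
    V100 (7.0) fall back to xformers.
    """
    return "flash-attn" if tuple(compute_capability) >= (8, 0) else "xformers"


# Assumption: Cupid honours ATTN_BACKEND the way upstream TRELLIS does.
# Set it before importing any model code so the choice takes effect.
os.environ["ATTN_BACKEND"] = pick_attention_backend((7, 0))  # V100
```

With PyTorch installed, `torch.cuda.get_device_capability()` returns the `(major, minor)` tuple for the current GPU, so the hard-coded `(7, 0)` can be replaced with a live query.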

Once installed, the pretrained model is auto-downloaded, and you can run inference notebooks to reconstruct 3D objects from images.

verdict

Cupid is a specialized tool for researchers and practitioners working on 3D vision problems where joint camera pose estimation and 3D object reconstruction from a single image is desired. It excels in producing multi-representation outputs quickly, making it suitable for workflows that require fast turnaround and compositional scene understanding.

The hardware and OS requirements are the main limitations. If you have access to a suitable Linux machine with a high-memory NVIDIA GPU, the repo is worth exploring. Because the model is feed-forward, you get results in seconds without iterative refinement, which makes for a much shorter experimentation loop.

For general use, the complexity of setup and dependency on specific CUDA versions might be a barrier. But as a research tool or prototype for integrating 3D reconstruction with pose estimation, Cupid provides a clear example of how to unify these tasks efficiently.

In short, if your work involves 3D reconstruction pipelines and you want to experiment with joint pose modeling in a feed-forward framework, Cupid is a solid choice to study and build upon.


→ GitHub Repo: cupid3d/Cupid ⭐ 205 · Jupyter Notebook

Noureddine RAMDI / Cupid: feed-forward 3D reconstruction with joint camera pose estimation from single images