NOVA3R tackles a known limitation in 3D reconstruction: most methods rely on pixel-aligned correspondences between images and 3D space, which restricts their ability to recover occluded geometry or produce physically plausible results. This repo presents a non-pixel-aligned visual transformer approach that enables amodal 3D reconstruction from unposed multi-view images, meaning it does not require camera intrinsics or extrinsics. It reconstructs complete 3D geometry, including regions hidden from view.
what nova3r does: non-pixel-aligned amodal 3d reconstruction
At its core, NOVA3R implements an architecture described in an ICLR 2026 paper for amodal 3D reconstruction using a visual transformer that is not tied to pixel alignment. Unlike pixel-aligned methods such as DUSt3R, which map pixels directly to 3D points, NOVA3R uses a two-stage pipeline.
The first stage is a point-conditioned autoencoder (AE) that learns a latent representation of 3D point clouds capturing full geometry, including occluded parts. The second stage is an image-to-point reconstruction transformer that maps image features to this latent space to reconstruct the full 3D point cloud.
This decoupling allows the model to handle unposed images — images without known camera parameters — and still produce physically plausible reconstructions. It supports single-image inputs as well as two-image (multi-view) inputs.
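The data flow of the two stages can be sketched as a pair of interfaces. This is a minimal, shape-level illustration and not the repo's code: `TwoStagePipeline`, its field names, and the toy stand-in functions are all hypothetical, and the real stages are learned networks rather than the hand-written functions used here. Letting the caller pick the number of output points is likewise an assumption made to show that the output is not tied to the pixel grid.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Point = Tuple[float, float, float]
Latent = List[float]       # stand-in for the learned latent representation
Image = List[List[float]]  # stand-in for an H x W image, no camera parameters attached


@dataclass
class TwoStagePipeline:
    """Hypothetical wiring of the two stages; all names are illustrative."""

    # Stage 1 (autoencoder): complete point cloud -> latent -> point cloud.
    encode_points: Callable[[Sequence[Point]], Latent]
    decode_latent: Callable[[Latent, int], List[Point]]
    # Stage 2: one or more unposed images -> the same latent space.
    images_to_latent: Callable[[Sequence[Image]], Latent]

    def reconstruct(self, images: Sequence[Image], num_points: int) -> List[Point]:
        # Inference uses only stage 2 plus the stage-1 decoder.
        latent = self.images_to_latent(images)
        return self.decode_latent(latent, num_points)


# Toy stand-ins so the sketch runs; the real stages are trained networks.
def toy_encode(points: Sequence[Point]) -> Latent:
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(3)]  # centroid as "latent"


def toy_decode(latent: Latent, num_points: int) -> List[Point]:
    # Spread points along x, starting at the latent centroid.
    return [(latent[0] + 0.1 * k, latent[1], latent[2]) for k in range(num_points)]


def toy_images_to_latent(images: Sequence[Image]) -> Latent:
    # Mean intensity over all images, repeated three times.
    means = [sum(map(sum, img)) / (len(img) * len(img[0])) for img in images]
    avg = sum(means) / len(means)
    return [avg, avg, avg]


pipeline = TwoStagePipeline(toy_encode, toy_decode, toy_images_to_latent)
single_view = [[[0.5, 0.5], [0.5, 0.5]]]  # one 2x2 "image" (4 pixels)
points = pipeline.reconstruct(single_view, num_points=8)
print(len(points))  # 8
```

In this sketch the image-side stage only has to produce a latent, so the decoder from the first stage can be trained, swapped, or fine-tuned without touching it.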
The outputs are .ply point cloud files representing the reconstructed geometry and .mp4 360° rotation videos that visualize the 3D shapes.
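To sanity-check a generated .ply file without extra dependencies, its header can be read directly. The sketch below relies only on the standard PLY layout (a text header ending in `end_header`, whether the body is ASCII or binary). The function names are hypothetical, and the assumption that NOVA3R's files declare a `vertex` element is illustrative rather than taken from the repo.

```python
import os
import tempfile
from typing import Dict, Iterable, Tuple


def write_ascii_ply(path: str, points: Iterable[Tuple[float, float, float]]) -> None:
    """Write xyz points as a minimal ASCII PLY file (for demonstration only)."""
    pts = list(points)
    with open(path, "w", encoding="ascii", newline="\n") as f:
        f.write("ply\nformat ascii 1.0\n")
        f.write(f"element vertex {len(pts)}\n")
        f.write("property float x\nproperty float y\nproperty float z\n")
        f.write("end_header\n")
        for x, y, z in pts:
            f.write(f"{x} {y} {z}\n")


def read_ply_header(path: str) -> Tuple[str, Dict[str, int]]:
    """Return (format, {element name: count}) from a PLY header."""
    fmt = ""
    elements: Dict[str, int] = {}
    with open(path, "rb") as f:  # binary mode: the body may not be text
        if f.readline().strip() != b"ply":
            raise ValueError(f"{path} is not a PLY file")
        while True:
            line = f.readline()
            if not line:
                raise ValueError("PLY header has no end_header line")
            tokens = line.decode("ascii", errors="replace").split()
            if not tokens:
                continue
            if tokens[0] == "end_header":
                break
            if tokens[0] == "format":
                fmt = tokens[1]
            elif tokens[0] == "element":
                elements[tokens[1]] = int(tokens[2])
    return fmt, elements


# Round-trip a three-point cloud through a temporary file.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.ply")
    write_ascii_ply(demo_path, [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)])
    header = read_ply_header(demo_path)
print(header)  # ('ascii', {'vertex': 3})
```

Pointing `read_ply_header` at a file produced by the inference scripts should report how many points were reconstructed, provided the file follows the standard header layout.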
The model is trained on the 3D-FRONT and ScanNet++ datasets, both of which consist of indoor scenes.
The stack is Python-based using PyTorch 2.2+ with CUDA 12.1+ for GPU acceleration. Due to model and data size, a high-end NVIDIA GPU with at least 24GB VRAM is required (48GB recommended). Checkpoints range from 262MB for the autoencoder to 5.8GB for the image-to-point models.
technical strengths and tradeoffs: transformer without pixel alignment
The main technical strength of NOVA3R is the non-pixel-aligned transformer design. Most 3D reconstruction approaches rely on correspondences between pixels and 3D points, which inherently limits their ability to recover occluded surfaces or regions behind visible geometry. NOVA3R sidesteps this limitation by encoding point clouds into a latent space independent of pixel positions before reconstructing from images.
This leads to better occlusion completion and more physically plausible outputs.
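To see why pixel alignment caps what can be recovered, consider the structure of a pixel-aligned output: one 3D point per pixel. The sketch below uses pinhole depth unprojection as a simple stand-in for that structure. It is not how DUSt3R or NOVA3R computes points; the function name and the intrinsics `fx`, `fy`, `cx`, `cy` are assumptions made for illustration.

```python
from typing import List, Tuple

Point = Tuple[float, float, float]


def unproject_depth(
    depth: List[List[float]], fx: float, fy: float, cx: float, cy: float
) -> List[List[Point]]:
    """Turn an H x W depth map into an H x W grid of 3D points (a pointmap)."""
    pointmap: List[List[Point]] = []
    for v, row in enumerate(depth):          # v: pixel row
        out_row: List[Point] = []
        for u, z in enumerate(row):          # u: pixel column, z: depth along the ray
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            out_row.append((x, y, z))
        pointmap.append(out_row)
    return pointmap


# A 2 x 3 depth map: every pixel sees a surface 2 units away.
depth = [[2.0, 2.0, 2.0], [2.0, 2.0, 2.0]]
pointmap = unproject_depth(depth, fx=1.0, fy=1.0, cx=1.0, cy=0.5)

num_points = sum(len(row) for row in pointmap)
print(num_points)      # 6
print(pointmap[0][0])  # (-2.0, -1.0, 2.0)
```

Every output point sits on the first surface hit along its pixel's ray, and the point count is fixed at H × W, so geometry behind that surface has no slot in the output. Predicting points through a latent that is independent of pixel positions removes that constraint.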
The two-stage pipeline cleanly separates the problem: the autoencoder learns a compact 3D representation, and the image-to-point transformer learns to map images to that space. This modularity means the AE can be reused or fine-tuned independently.
Tradeoffs are clear:
- The model size and VRAM requirements are significant: training or inference requires a GPU with ≥24GB VRAM, which limits accessibility.
- The approach requires pretraining the AE and then training the image-to-point model, increasing the training complexity.
- Since it does not rely on camera poses, the method is robust to unposed images but may lose accuracy compared to pose-aware methods on well-calibrated multi-view data.
The codebase itself is surprisingly clean given the complexity. The repo includes:
- PyTorch model implementations for both stages
- Dataset loaders for 3D-FRONT and ScanNet++
- Scripts for training, evaluation, and inference
- Python API for programmatic access
This structure supports experimenting with single or multi-view inputs and producing visualization videos.
quick start
The repo provides environment and setup instructions for getting started with NOVA3R:
# Requirements
- Python 3.10
- PyTorch 2.2+ with CUDA 12.1+
- NVIDIA GPU with ≥24GB VRAM (48GB recommended)
# Automated setup
bash setup.sh
This will install dependencies and prepare the environment. Note the hardware requirements — running on consumer GPUs below 24GB VRAM will likely fail due to memory constraints.
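Before running the setup, it can save time to check the GPU against the 24GB threshold. The following sketch parses the output of `nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits`, which reports total memory in MiB, one line per GPU. The helper names are hypothetical, and reading "24GB" as 24 × 10^9 bytes (so that cards reporting slightly under 24576 MiB still pass) is an assumption, not something the repo specifies.

```python
import subprocess
from typing import List

# Assumption: "24GB" means 24 * 10**9 bytes, expressed here in MiB.
MIN_VRAM_MIB = 24 * 10**9 // 2**20


def parse_vram_mib(nvidia_smi_output: str) -> List[int]:
    """Parse one integer (MiB) per non-empty line of nvidia-smi output."""
    return [int(line.strip()) for line in nvidia_smi_output.splitlines() if line.strip()]


def query_vram_mib() -> List[int]:
    """Ask nvidia-smi for the total memory of each visible GPU, in MiB."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_vram_mib(result.stdout)


def meets_requirement(vram_mib: List[int], minimum_mib: int = MIN_VRAM_MIB) -> bool:
    """True if at least one GPU has enough total memory."""
    return any(v >= minimum_mib for v in vram_mib)


# Parsing a sample report for a single GPU with 24564 MiB.
print(meets_requirement(parse_vram_mib("24564\n")))  # True
```

On a machine with NVIDIA drivers installed, `meets_requirement(query_vram_mib())` performs the same check against the real hardware. Total memory is only an upper bound; other processes using the GPU reduce what is actually available.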
verdict: for practitioners with access to high-end GPUs aiming for amodal 3d reconstruction
NOVA3R is worth exploring if you need to reconstruct complete 3D geometry from images without camera calibration, especially if occlusion completion and physical plausibility are priorities. Its non-pixel-aligned transformer approach is a notable departure from traditional pixel-correspondence methods.
However, the computational and memory demands are high. The large model sizes and GPU VRAM requirements put it out of reach for casual users or those without access to powerful hardware.
The repo offers a clean, modular implementation that can serve as a solid foundation for research or production projects in 3D reconstruction from unposed multi-view images. It’s particularly relevant if you are interested in transformer architectures for 3D tasks and want to explore alternatives to pixel-aligned pipelines.
→ GitHub Repo: wrchen530/nova3r ⭐ 105 · Python