OVIE: Monocular novel view synthesis without multi-view supervision

OVIE tackles a well-known bottleneck in novel view synthesis: the need for multi-view image pairs with calibrated cameras. Instead, it trains entirely on unpaired monocular images scraped from the internet, sidestepping the costly data acquisition and calibration steps. This approach opens up new possibilities for learning 3D representations from abundant 2D data without explicit multi-view supervision.

what ovie does: monocular novel view synthesis from unpaired images

OVIE is a monocular novel view synthesis framework: from a single input image it generates the scene as seen from a different camera pose. It needs neither calibrated camera pairs nor multi-view datasets, and learns purely from unpaired images collected from the internet.

At its core, OVIE uses a Vision Transformer backbone enhanced with several modern architectural modifications:

QK-norm for stabilized attention computations
SwiGLU activation for improved non-linearity
RMSNorm normalization
Rotary positional embeddings (RoPE) for encoding spatial information

Beyond the backbone, OVIE integrates pretrained foundation models to bootstrap learning:

DINOv2 and DINOv3 for self-supervised feature extraction
MoGe depth estimator for pseudo-depth supervision
VGGT model for geometric feature extraction
RAE (Representation Autoencoder) for pose encoding from camera extrinsics

The model is designed to encode pose information via camera extrinsics and leverages these foundation models to compensate for the lack of multi-view supervision.

The repository is implemented primarily as Jupyter notebooks, making it accessible for experimentation and inference. It supports distributed training through PyTorch’s torchrun utility, enabling scalability across multiple GPUs. Pretrained weights are publicly available on the Hugging Face Hub, with automatic download support integrated into the codebase.

The use of uv for Python environment management stands out in the developer experience, offering a faster and more reliable alternative to traditional Python packaging tools.

technical strengths and architectural tradeoffs

OVIE’s key technical strength is learning novel view synthesis from unpaired monocular images. This is a hard problem because unpaired images carry no explicit geometric supervision.

The choice of a Vision Transformer backbone with QK-norm, SwiGLU, and RMSNorm reflects current practice in transformer design. QK-norm normalizes queries and keys before the attention dot product, which keeps attention logits in a bounded range and stabilizes gradients, a useful property given the complexity of the task and the size of the datasets. SwiGLU is a gated linear unit that uses the Swish (SiLU) function as its gate and often improves convergence over a plain feed-forward layer. RMSNorm is a lighter alternative to LayerNorm: it rescales activations by their root mean square only, skipping the mean subtraction and bias, which reduces computational overhead.
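The three components described above can be sketched in a few lines of plain Python. This is an illustrative sketch only, not OVIE's implementation: the function names (`rms_norm`, `swiglu`, `qk_norm_score`) are hypothetical, vectors are plain lists rather than tensors, and applying RMSNorm to queries and keys is one common QK-norm variant assumed here for illustration.

```python
import math

def rms_norm(x, gain=None, eps=1e-6):
    """RMSNorm: divide by the root mean square; no mean subtraction, no bias."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    if gain is None:
        gain = [1.0] * len(x)  # learnable in a real model; fixed to 1 here
    return [g * v / rms for g, v in zip(gain, x)]

def silu(z):
    """Swish / SiLU: z * sigmoid(z)."""
    return z / (1.0 + math.exp(-z))

def matvec(weights, x):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]

def swiglu(x, w_gate, w_up, w_down):
    """SwiGLU feed-forward: w_down @ (SiLU(w_gate @ x) * (w_up @ x))."""
    gate = [silu(g) for g in matvec(w_gate, x)]
    up = matvec(w_up, x)
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(w_down, hidden)

def qk_norm_score(q, k):
    """Attention logit with QK-norm: normalize q and k, then scaled dot product."""
    qn, kn = rms_norm(q), rms_norm(k)
    return sum(a * b for a, b in zip(qn, kn)) / math.sqrt(len(q))
```

Because queries and keys are normalized first, scaling `q` by a large factor leaves `qk_norm_score(q, k)` essentially unchanged, which is the stabilizing effect the paragraph refers to.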

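Rotary positional embeddings (RoPE) inject spatial information by rotating query and key vectors through position-dependent angles. As a result, attention scores depend on the relative rather than absolute positions of tokens, which matters for encoding where image patches sit in relation to one another.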
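The relative-position property of RoPE can be checked with a small sketch. It assumes the standard one-dimensional formulation with frequencies `base ** (-2i/d)`; image models often use a two-dimensional variant, and which variant OVIE uses is not stated here. The helper names `rope_rotate` and `dot` are hypothetical.

```python
import math

def rope_rotate(vec, position, base=10000.0):
    """Rotate each pair (vec[i], vec[i+1]) by the angle position * base**(-i/d)."""
    d = len(vec)  # must be even
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [0.5, -1.0, 2.0, 0.25]
k = [1.5, 0.3, -0.7, 1.0]

# Both pairs of positions are 2 apart, so the scores should match.
score_a = dot(rope_rotate(q, 3), rope_rotate(k, 1))
score_b = dot(rope_rotate(q, 10), rope_rotate(k, 8))
```

The two scores agree up to floating-point error because each pair of rotations differs only by the offset between the positions, not by the positions themselves.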

The integration of foundation models is a standout design choice. Instead of training everything from scratch, OVIE uses pretrained DINOv2/v3 to extract robust image features learned through self-supervision on large datasets. The MoGe depth estimator provides pseudo-depth cues, which act as a proxy supervision signal to guide the model’s understanding of scene geometry. The VGGT and RAE models further enrich the feature space and pose encoding.
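To make the link between pseudo-depth and camera extrinsics concrete, here is a generic pinhole-camera sketch of how a depth value lets a pixel be moved from one view to another. It is an assumption-laden illustration, not OVIE's pipeline: the function names, the intrinsics tuple `(fx, fy, cx, cy)`, and the example numbers are all hypothetical.

```python
def unproject(u, v, depth, fx, fy, cx, cy):
    """Lift pixel (u, v) with a depth value to a 3D point in the source camera frame."""
    return [(u - cx) / fx * depth, (v - cy) / fy * depth, depth]

def transform(point, rotation, translation):
    """Apply extrinsics: 3x3 rotation (list of rows), then translation."""
    return [sum(r * p for r, p in zip(row, point)) + t
            for row, t in zip(rotation, translation)]

def project(point, fx, fy, cx, cy):
    """Project a 3D point in the target camera frame back to pixel coordinates."""
    x, y, z = point
    return (fx * x / z + cx, fy * y / z + cy)

def reproject_pixel(u, v, depth, intrinsics, rotation, translation):
    """Source pixel -> 3D point -> target camera frame -> target pixel."""
    point = unproject(u, v, depth, *intrinsics)
    return project(transform(point, rotation, translation), *intrinsics)

# Hypothetical values: a 640x480 camera and a pure sideways shift of 0.1 units.
intrinsics = (500.0, 500.0, 320.0, 240.0)
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
target_pixel = reproject_pixel(320.0, 240.0, 2.0, intrinsics, identity, [0.1, 0.0, 0.0])
```

The horizontal displacement of the pixel is proportional to `fx * shift / depth`, so nearer points move more than distant ones. This parallax is the geometric cue that a depth estimate supplies when no second view is available.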

This modular approach means OVIE benefits from the strengths of each foundation model. However, the tradeoff is increased complexity in dependencies and potentially higher resource requirements during training and inference.

The codebase’s use of Jupyter notebooks aids in rapid prototyping and visualization but may not be ideal for production deployment without refactoring into standalone scripts or services.

Distributed training support via torchrun is a practical addition, allowing researchers and engineers to scale experiments across multiple GPUs, which is often necessary for transformer-based models and large datasets.
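For readers new to torchrun: it starts one process per GPU and passes each process its identity through the `RANK`, `LOCAL_RANK`, and `WORLD_SIZE` environment variables. The sketch below shows how a training script might read them and split work across processes. The helper names are hypothetical, and OVIE's actual training entry point may be organized differently.

```python
import os

def read_torchrun_env(env=None):
    """Read the process identity that torchrun exports, with single-process defaults."""
    env = os.environ if env is None else env
    return {
        "rank": int(env.get("RANK", 0)),              # global process index
        "local_rank": int(env.get("LOCAL_RANK", 0)),  # index on this machine, usually the GPU id
        "world_size": int(env.get("WORLD_SIZE", 1)),  # total number of processes
    }

def shard_indices(num_samples, rank, world_size):
    """Give each process a disjoint, strided slice of the dataset indices."""
    return list(range(rank, num_samples, world_size))

info = read_torchrun_env()
my_indices = shard_indices(10, info["rank"], info["world_size"])
```

Run without torchrun, the defaults describe a single process that sees every sample. Under a launch such as `torchrun --nproc_per_node=4 train.py` (where `train.py` is a placeholder script name), each of the four processes would receive a different rank and therefore a different shard.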

Finally, the use of uv for environment management is a nice touch. It ensures all dependencies, including the precise Python version (3.10.9), are handled consistently. This contributes to reproducibility and a smoother setup process.

quick start

The project provides a clear and streamlined installation process using uv by Astral, which manages the Python environment and dependencies efficiently.

Install uv (macOS/Linux):

curl -LsSf https://astral.sh/uv/install.sh | sh

Alternatively, on macOS you can use Homebrew:

brew install uv

Clone the OVIE repository and synchronize dependencies:

git clone https://github.com/AdrienRR/ovie.git
cd ovie
uv sync

Prefix commands with uv run to execute them inside the managed environment.

This setup gives you the exact Python version and dependencies the project maintainers intended, which improves the developer experience and reduces “works on my machine” issues.

verdict

OVIE is a technically interesting project that pushes the boundaries of monocular novel view synthesis by removing the need for multi-view supervision. Its architecture skillfully combines a modern Vision Transformer backbone with multiple foundation models for pose, depth, and feature extraction.

The codebase is accessible for experimentation thanks to Jupyter notebooks and offers distributed training capabilities for larger-scale experiments. The use of uv for environment management is a solid choice for reproducibility.

That said, OVIE is primarily research-oriented. The reliance on multiple pretrained models and the complexity of the pipeline may limit straightforward production deployment. Also, training such transformers demands significant compute resources.

This repo is most relevant for researchers and engineers working on 3D vision or novel view synthesis, and for anyone interested in leveraging foundation models for geometry tasks. It’s worth understanding OVIE’s approach even if you don’t adopt it directly, as it tackles a real problem in 3D reconstruction from a fresh angle.

Overall, OVIE offers a clean and practical codebase to explore monocular novel view synthesis without the traditional multi-view data bottleneck.

→ GitHub Repo: kyutai-labs/ovie ⭐ 62 · Jupyter Notebook

Noureddine RAMDI / OVIE: Monocular novel view synthesis without multi-view supervision
