MotionCrafter: unified 4D geometry and motion reconstruction from monocular video

MotionCrafter tackles a persistent challenge in computer vision: reconstructing both the geometry and the motion of objects from a single camera without a separate post-optimization stage. The repo presents a unified framework that predicts dense 4D geometry and scene flow simultaneously using a shared Variational Autoencoder (VAE), with both outputs expressed in a consistent world coordinate system. The result is a streamlined pipeline for monocular video that avoids the complexity and error accumulation common in prior approaches.

unified 4D geometry and motion reconstruction framework

At its core, MotionCrafter is a Python-based implementation of a CVPR 2026 Highlight paper focused on video diffusion models for 4D geometry and motion estimation. The pipeline reconstructs dense point maps and estimates dense object motion (scene flow) directly from monocular video sequences.

The architecture revolves around a shared 4D VAE that jointly models geometry and motion. This contrasts with traditional pipelines, where geometry is estimated first and motion is refined in a separate step that often involves costly post-optimization. Here, the VAE encodes spatial and temporal information into a unified latent space aligned with a global world coordinate system, so geometry and motion are predicted together.
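To make the two outputs concrete, here is a toy NumPy sketch of what "point maps plus scene flow in one world frame" means. It is not MotionCrafter code: the `(T, H, W, 3)` array layout, the variable names, and the `advect_points` helper are all assumptions chosen for illustration. The idea it shows is that a point map stores one 3D world-space point per pixel, scene flow stores one 3D displacement per pixel, and adding the two gives where each point ends up.

```python
import numpy as np


def advect_points(point_map, scene_flow):
    """Move every per-pixel 3D point by its per-pixel 3D displacement.

    Both arrays are assumed to have shape (H, W, 3) and to live in the
    same world coordinate frame (an assumption of this sketch).
    """
    if point_map.shape != scene_flow.shape or point_map.shape[-1] != 3:
        raise ValueError("expected two arrays of identical shape (H, W, 3)")
    return point_map + scene_flow


# Hypothetical toy data: 2 frames of a 4x6 image, one 3D point per pixel.
T, H, W = 2, 4, 6
point_maps = np.zeros((T, H, W, 3), dtype=np.float32)
point_maps[..., 2] = 2.0  # a flat surface at z = 2 in world space

# Scene flow for frame 0: the left half of the image moves 0.1 along x.
scene_flow = np.zeros_like(point_maps)
scene_flow[0, :, : W // 2, 0] = 0.1

# Predicted world-space positions of frame-0 points one step later.
moved = advect_points(point_maps[0], scene_flow[0])

# In a world frame, static surfaces have (near) zero flow even when the
# camera moves, so a simple norm threshold separates static from dynamic.
static_mask = np.linalg.norm(scene_flow[0], axis=-1) < 1e-6
print("static pixels:", int(static_mask.sum()), "of", H * W)
```

Because both quantities share a coordinate system, no per-frame alignment is needed before adding them, which is the property the unified design relies on.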

The repo builds on the GeometryCrafter codebase, which provides a foundation for 3D geometry reconstruction. MotionCrafter extends this with a training pipeline that covers:

Geometry VAE stage
Unified 4D VAE stage
Diffusion UNet stage for video diffusion

This staged training process supports both deterministic and diffusion-based inference modes, so users can choose how to trade accuracy against computational cost.

Visualization is handled via integration with Viser, allowing 3D inspection of predicted point maps and scene flows. This is crucial for debugging and qualitative evaluation in 4D reconstruction tasks.

shared latent space and elimination of post-optimization

The standout technical feature here is the shared 4D VAE that unifies geometry and motion estimation within a single latent representation and coordinate system. This design choice:

Removes the need for separate post-optimization steps common in monocular reconstruction pipelines
Simplifies the inference process by predicting dense scene flow and geometry jointly
Enables the system to operate directly in a consistent world coordinate frame, improving coherence across frames

The tradeoff is added model complexity and a training pipeline that requires careful stage-wise training, including the geometry VAE and the diffusion UNet components. The codebase reflects this complexity with modular training scripts and checkpoints.

Under the hood, the code quality is solid and modular, with clear separation of concerns between geometry encoding, diffusion model training, and inference. The repo includes evaluation scripts that measure reconstruction quality, and visualization tools that tie directly to the output data structures.

This unified approach is worth understanding even if you don’t adopt it wholesale, as it challenges the typical two-stage paradigm in monocular video reconstruction.

quick start: install and run inference

If you want to try MotionCrafter quickly, the repo provides a straightforward installation and inference process. Here’s how to get it running with the default model:

# Clone the repo and enter it
git clone https://github.com/TencentARC/MotionCrafter
cd MotionCrafter

# Install dependencies
pip install -r requirements.txt

# Run inference on a sample video
python run.py \
  --video_path examples/video.mp4 \
  --save_folder examples_output

To run inference with your own trained model, you can specify paths and parameters explicitly:

python run.py \
  --video_path examples/video.mp4 \
  --save_folder examples_output \
  --cache_dir workspace/pretrained_models \
  --unet_path path/to/your/unet \
  --vae_path path/to/your/vae \
  --model_type determ \
  --height 320 --width 640 \
  --adjust_resolution True \
  --num_frames 25

The --model_type flag switches between deterministic (determ) and diffusion-based (diff) inference modes.
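If you script several runs, it can help to assemble the command in Python and reject an invalid mode before anything is launched. The sketch below is a hypothetical helper, not part of the repo: `build_inference_command` and `VALID_MODEL_TYPES` are names invented here, and it only uses flags that appear in the commands above. Whether the default checkpoints match a given mode is something to confirm in the repo itself.

```python
import shlex

# Mode names taken from the commands above; the constant itself is hypothetical.
VALID_MODEL_TYPES = {"determ", "diff"}


def build_inference_command(video_path, save_folder, model_type="determ",
                            unet_path=None, vae_path=None, num_frames=None):
    """Return the run.py invocation as an argument list (nothing is executed)."""
    if model_type not in VALID_MODEL_TYPES:
        raise ValueError(
            f"model_type must be one of {sorted(VALID_MODEL_TYPES)}, got {model_type!r}"
        )
    cmd = [
        "python", "run.py",
        "--video_path", video_path,
        "--save_folder", save_folder,
        "--model_type", model_type,
    ]
    # Only append optional flags that the caller actually set.
    optional = {
        "--unet_path": unet_path,
        "--vae_path": vae_path,
        "--num_frames": num_frames,
    }
    for flag, value in optional.items():
        if value is not None:
            cmd += [flag, str(value)]
    return cmd


cmd = build_inference_command(
    "examples/video.mp4", "examples_output", model_type="diff", num_frames=25
)
print(shlex.join(cmd))  # shell-quoted form, handy for logging
```

The returned list can be passed to `subprocess.run(cmd, check=True)` from the repo root once you are happy with the printed command.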

For visualization of the results, the repo includes a script integrating with Viser:

python visualize/visualize.py \
  --video_path examples/video.mp4 \
  --data_path examples_output/video.npz

This allows qualitative inspection of the reconstructed 3D point clouds and scene flow.
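Before opening the viewer, it can be useful to check what the output archive actually contains. The sketch below lists every array in an `.npz` file with its shape and dtype, without assuming any particular key names. `summarize_npz` is a hypothetical helper, and the demo writes a small stand-in file whose keys (`point_map`, `scene_flow`) are placeholders rather than the repo's documented output format.

```python
import os
import tempfile

import numpy as np


def summarize_npz(path):
    """Map each array name in an .npz archive to (shape, dtype string)."""
    summary = {}
    with np.load(path) as data:  # NpzFile closes the archive on exit
        for name in data.files:
            arr = data[name]
            summary[name] = (arr.shape, str(arr.dtype))
    return summary


# Demo on a stand-in file; the key names and shapes here are placeholders.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "video.npz")
    np.savez(
        demo_path,
        point_map=np.zeros((2, 4, 6, 3), dtype=np.float32),
        scene_flow=np.zeros((2, 4, 6, 3), dtype=np.float32),
    )
    summary = summarize_npz(demo_path)

for name, (shape, dtype) in summary.items():
    print(f"{name}: shape={shape}, dtype={dtype}")
```

Pointing `summarize_npz` at `examples_output/video.npz` after an inference run shows the real keys and shapes, which is a quick sanity check before visualizing.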

verdict: who should explore MotionCrafter

MotionCrafter is a solid research-grade framework for anyone interested in monocular video 4D reconstruction, especially if you want to bypass the often cumbersome post-optimization step. Its unified 4D VAE architecture is a valuable reference for advancing video diffusion models and joint geometry-motion estimation.

The tradeoff is the complexity of training the staged pipeline and the computational cost involved. It assumes a monocular setup, which inherently limits depth accuracy compared to multi-view or depth sensor approaches. The codebase is Pythonic and well-structured but requires familiarity with deep learning pipelines and diffusion models to extend or retrain effectively.

For practitioners working on video-based 3D reconstruction or motion capture, this repo offers a glimpse into next-generation architectures that integrate geometry and motion tightly. It’s less of a plug-and-play product and more of a technical foundation for experimentation and research.

Overall, MotionCrafter is worth understanding if you deal with monocular video reconstruction and want to explore alternatives to traditional two-stage pipelines that separate geometry and motion estimation.

→ GitHub Repo: TencentARC/MotionCrafter ⭐ 158 · Python

Noureddine RAMDI / MotionCrafter: unified 4D geometry and motion reconstruction from monocular video
