SimRecon: compositional 3D scene reconstruction with viewpoint optimization and semantic graph synthesis

SimRecon tackles a common challenge in 3D computer vision: turning raw video footage into physically plausible, object-centric 3D scenes that are ready for downstream simulation tasks. Unlike many pipelines that focus solely on geometry reconstruction or semantic segmentation, SimRecon integrates multiple specialized modules to bridge perception, generation, and simulation in a compositional framework.

compositional 3D scene reconstruction from video

At its core, SimRecon converts real-world videos into 3D scenes that can be simulated realistically. The pipeline consists of three main stages: perception, generation, and simulation.

The perception stage extracts detailed geometry and semantic segmentation from video frames. Geometry reconstruction uses 2D Gaussian Splatting (2DGS), which represents surfaces as a collection of oriented, planar 2D Gaussian disks that render efficiently and support novel view synthesis. For semantic understanding, it employs CropFormer, an instance-level segmentation model, to detect and segment individual objects within the scene.
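To make the 2DGS representation concrete, here is a minimal sketch assuming the usual parameterization of a planar Gaussian: a center, two orthogonal tangent directions, and a scale along each. The function names are illustrative and are not taken from SimRecon's code.

```python
import math

def splat_point(center, t_u, t_v, s_u, s_v, u, v):
    """World-space point at local coordinates (u, v) on a planar Gaussian.

    center: 3D position of the splat
    t_u, t_v: unit tangent vectors spanning the splat's plane
    s_u, s_v: scales along each tangent
    """
    return tuple(
        c + s_u * a * u + s_v * b * v
        for c, a, b in zip(center, t_u, t_v)
    )

def gaussian_weight(u, v):
    """Unnormalized Gaussian falloff in the splat's local coordinates."""
    return math.exp(-(u * u + v * v) / 2.0)

# A splat in the XY plane, stretched along X.
point = splat_point((0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0),
                    s_u=2.0, s_v=0.5, u=1.0, v=1.0)
weight = gaussian_weight(1.0, 1.0)
```

Because each primitive is a flat disk rather than a volume, it has a well-defined normal (the cross product of the two tangents), which is part of why this representation suits surface reconstruction.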

One of the key innovations here is the Active Viewpoint Optimization (AVO) module. Instead of relying on arbitrary or fixed camera poses, AVO runs an optimization loop per object instance to select the best viewpoints for reconstruction and semantic inference, improving both accuracy and completeness.

The generation stage uses these optimized views and segmentation masks to synthesize a layered, object-centric 3D scene representation. This includes a Scene Graph Synthesizer (SGS), which leverages Vision-Language Models (VLMs) to infer spatial and semantic relationships between objects. This step connects raw geometry with high-level scene understanding by building a graph that describes how objects relate to each other, which is critical for realistic simulation.

Finally, the simulation stage activates physical properties layer by layer, assembling the scene into a physically plausible structure ready for downstream simulation engines.
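The article does not spell out how layers are defined, so the following is only a sketch of one plausible reading: objects are activated in order of physical support, so that whatever an object rests on is already in place. The `supported_by` mapping and the `activation_layers` function are hypothetical and are not part of SimRecon.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

def activation_layers(supported_by):
    """Group objects into layers so supporters come before what they support.

    supported_by maps each object name to the set of objects it rests on
    (an assumed input format for this sketch).
    """
    sorter = TopologicalSorter(supported_by)
    sorter.prepare()
    layers = []
    while sorter.is_active():
        # Everything whose supporters have already been activated.
        ready = sorted(sorter.get_ready())
        layers.append(ready)
        sorter.done(*ready)
    return layers

scene = {
    "floor": set(),
    "table": {"floor"},
    "chair": {"floor"},
    "cup": {"table"},
}
layers = activation_layers(scene)
# layers == [["floor"], ["chair", "table"], ["cup"]]
```

Activating physics one layer at a time in this order means a cup is never simulated before the table beneath it exists, which is one way to avoid objects falling or interpenetrating during assembly.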

The project is implemented in Python and combines pretrained deep learning models with Gaussian-splatting-based rendering and physics simulation. It integrates external modules such as CropFormer for segmentation, with Detectron2 and panopticapi installed as supporting dependencies.

bridging geometry and semantics through viewpoint optimization and scene graph synthesis

What distinguishes SimRecon is how it connects different modalities and stages into a coherent pipeline rather than focusing on a single task.

The AVO module is particularly interesting. It runs an optimization loop to find the best camera viewpoints for each object instance. This matters because inaccurate or redundant views degrade both reconstruction quality and semantic inference. By optimizing viewpoints, the system improves the quality of the input data for every subsequent stage.
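The article does not describe AVO's objective or optimizer, so the following is a minimal sketch of the general idea only, not SimRecon's implementation. It assumes a hypothetical objective (how directly the object's surface normals face the camera) and a simple random local search over camera azimuth and elevation.

```python
import math
import random

def view_direction(azimuth, elevation):
    """Unit vector pointing from the object center toward the camera."""
    return (
        math.cos(elevation) * math.cos(azimuth),
        math.cos(elevation) * math.sin(azimuth),
        math.sin(elevation),
    )

def visibility_score(normals, azimuth, elevation):
    """Hypothetical objective: sum of cosines between each surface normal
    and the view direction. Back-facing surfaces contribute zero."""
    v = view_direction(azimuth, elevation)
    return sum(
        max(0.0, n[0] * v[0] + n[1] * v[1] + n[2] * v[2]) for n in normals
    )

def optimize_viewpoint(normals, steps=200, step_size=0.3, seed=0):
    """Random local search: keep a perturbed viewpoint only if it scores higher."""
    rng = random.Random(seed)
    azimuth, elevation = 0.0, 0.0
    best = visibility_score(normals, azimuth, elevation)
    for _ in range(steps):
        cand_az = azimuth + rng.uniform(-step_size, step_size)
        cand_el = elevation + rng.uniform(-step_size, step_size)
        # Clamp elevation to the valid range [-90, +90] degrees.
        cand_el = max(-math.pi / 2, min(math.pi / 2, cand_el))
        score = visibility_score(normals, cand_az, cand_el)
        if score > best:
            azimuth, elevation, best = cand_az, cand_el, score
    return azimuth, elevation, best

# A flat, upward-facing surface (e.g. a tabletop) is best seen from above.
tabletop_normals = [(0.0, 0.0, 1.0)] * 4
az, el, score = optimize_viewpoint(tabletop_normals)
```

Starting from a side-on view that sees none of the tabletop, the search climbs toward a top-down view. A real system would presumably score rendered images (coverage, occlusion, segmentation confidence) rather than raw normals, but the per-object loop has the same shape.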

The Scene Graph Synthesizer (SGS) uses Vision-Language Models (VLMs) to infer relationships between segmented objects. This is a step beyond typical instance segmentation or object detection pipelines that stop at labeling. SGS projects object instances to 2D frames and queries VLMs to understand spatial and functional relationships (e.g., “cup on table” or “chair near desk”). This semantic layering makes the final 3D scene more meaningful and usable for simulations where object interactions matter.
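A scene graph of this kind can be sketched as a list of object nodes plus (subject, relation, object) edges. In the sketch below every name is hypothetical, and the VLM query is replaced by a simple geometric heuristic so that the example runs without any model; it illustrates the data structure, not how SGS prompts a VLM.

```python
import math
from dataclasses import dataclass, field
from itertools import permutations
from typing import Optional

@dataclass(frozen=True)
class ObjectInstance:
    name: str
    center: tuple  # (x, y, z) in meters, z pointing up

@dataclass
class SceneGraph:
    nodes: list = field(default_factory=list)
    edges: list = field(default_factory=list)  # (subject, relation, object)

def height_heuristic(subject, obj) -> Optional[str]:
    """Stand-in for a VLM query: 'on' if subject sits just above obj."""
    dx = subject.center[0] - obj.center[0]
    dy = subject.center[1] - obj.center[1]
    dz = subject.center[2] - obj.center[2]
    if math.hypot(dx, dy) < 0.5 and 0.0 < dz < 1.0:
        return "on"
    return None

def build_scene_graph(instances, query_relation):
    """Ask query_relation about every ordered pair and keep the answers."""
    graph = SceneGraph(nodes=list(instances))
    for subject, obj in permutations(instances, 2):
        relation = query_relation(subject, obj)
        if relation is not None:
            graph.edges.append((subject.name, relation, obj.name))
    return graph

instances = [
    ObjectInstance("cup", (0.0, 0.0, 0.8)),
    ObjectInstance("table", (0.0, 0.0, 0.4)),
    ObjectInstance("chair", (2.0, 0.0, 0.4)),
]
graph = build_scene_graph(instances, height_heuristic)
# graph.edges == [("cup", "on", "table")]
```

Passing the relation query in as a function keeps the graph construction independent of the model behind it, so the heuristic could be swapped for a function that renders the pair into a 2D frame and asks a VLM.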

The codebase is modular, with a clear separation between the perception, generation, and simulation stages. Bridging modules between the stages handle data transformations and keep per-object information consistent as it moves through the pipeline.

Tradeoffs include a heavy dependency setup: GPU-enabled PyTorch with CUDA, NVIDIA RAPIDS libraries (cuDF and cuML), a from-source Detectron2 install, and a manual checkpoint download for CropFormer. The pipeline also depends on pretrained models and external repositories, which can complicate reproducibility.

From a performance standpoint, the optimization loops and multi-stage processing are computationally intensive, which may rule out real-time applications but suits offline scene reconstruction workflows.

quick start: setting up the SimRecon environment

To get started with SimRecon, follow these steps exactly as provided in the README:

# 1. Clone Repository
git clone https://github.com/xiac20/SimRecon.git
cd SimRecon

# 2. Environment Setup

# Create conda environment
conda create -n simrecon python=3.9 -y
conda activate simrecon

# Install dependencies
pip install torch==2.1.0+cu118 torchvision==0.16.0+cu118 torchaudio==2.1.0+cu118 --extra-index-url https://download.pytorch.org/whl/cu118

pip install --extra-index-url=https://pypi.nvidia.com "cudf-cu11==24.2.*" "cuml-cu11==24.2.*"

pip install -r requirements.txt

# Additional Setup for CropFormer
cd semantic_modules/CropFormer
cd mask2former/modeling/pixel_decoder/ops
sh make.sh
cd ../../../../
git clone git@github.com:facebookresearch/detectron2.git
cd detectron2
pip install -e .
pip install git+https://github.com/cocodataset/panopticapi.git
pip install git+https://github.com/mcordts/cityscapesScripts.git
cd ..
pip install -r requirements.txt
pip install -U openmim
mim install mmcv
pip install transformers
mkdir ckpts

Finally, you need to manually download the CropFormer checkpoint and place it into semantic_modules/CropFormer/ckpts.

With these steps complete, the environment should be ready to run the segmentation, geometry reconstruction, and simulation components.
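Since the install spans several package sources, a quick check that the main packages are importable can save time before a long run. The script below is a suggested sanity check, not part of the SimRecon repository; the module list is an assumption derived from the install commands above.

```python
import importlib.util

# Top-level import names for the packages installed above (assumed list).
REQUIRED_MODULES = [
    "torch",
    "torchvision",
    "cudf",
    "cuml",
    "detectron2",
    "mmcv",
    "transformers",
]

def missing_modules(names):
    """Return the names that cannot be located in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]

if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages were found.")
```

This only confirms that the packages can be located. It does not verify that CUDA is usable or that the CropFormer checkpoint is in place.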

verdict: a modular pipeline for simulation-ready 3D scene reconstruction

SimRecon is a solid, technically thorough framework for converting real-world videos into compositional 3D scenes with semantic understanding and physical plausibility. Its strength lies in the integrated pipeline combining optimized viewpoints, instance segmentation, and semantic scene graph synthesis.

It’s relevant for researchers and practitioners working on 3D reconstruction, robotics simulation, or any application requiring detailed, object-centric scene models. The complexity of dependencies and setup means it’s less suited for quick experiments or casual use, but the modular codebase and clear pipeline stages make it a useful foundation for further development.

The tradeoff is clear: you get a richer, semantically aware 3D scene at the cost of increased computational demands and setup complexity. Those willing to invest the time will find a capable framework bridging 3D geometry with semantic scene understanding, a gap that many other projects only partially address.


→ GitHub Repo: xiac20/SimRecon ⭐ 100 · Python

Noureddine RAMDI / SimRecon: compositional 3D scene reconstruction with viewpoint optimization and semantic graph synthesis
