SceneMaker tackles a persistent challenge in 3D vision: generating accurate 3D scenes from images in which objects are heavily occluded or belong to unseen categories. The key idea is to separate the de-occlusion step, which reasons about what is hidden behind visible objects, from the actual 3D object generation. This decoupled architecture lets each component specialize and reduces the complexity of end-to-end training on open-set scenes with severe occlusion.
What SceneMaker does: decoupled 3D scene generation with de-occlusion
SceneMaker is an open-source framework from IDEA Research that implements a pipeline for 3D scene generation from images. Unlike typical monolithic models that try to predict 3D scenes in one end-to-end pass, SceneMaker splits the problem into distinct stages.
The core stages are:
- De-occlusion using FLUX Kontext, a model that predicts the full scene context beyond visible surfaces, effectively inferring what’s hidden.
- 3D object generation using Step1X-3D, a separate model specialized in reconstructing 3D shapes and textures.
- Pose estimation with a unified model that combines global and local attention to accurately predict object poses within the scene.
This modular design is intended to handle open-set scenes where objects may belong to unknown classes and where heavy occlusion is common. By decoupling de-occlusion, the system can better hypothesize occluded parts without confusing the 3D reconstruction model.
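The staged flow described above can be sketched as a small orchestration class. This is a minimal illustration under stated assumptions, not the repo's actual API: `ScenePipeline`, the three stage callables, and the dictionary-based data are hypothetical stand-ins for FLUX Kontext, Step1X-3D, and the pose model, and the dummy stages exist only so the sketch runs without model weights.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

# All names below are hypothetical; none of them come from the SceneMaker codebase.
Crop = Dict[str, Any]   # one object crop taken from the input image
Asset = Dict[str, Any]  # one generated 3D object
Pose = Dict[str, Any]   # one object pose within the scene


@dataclass
class ScenePipeline:
    """Chains three independent stages; each one can be replaced on its own."""
    deocclude: Callable[[Crop], Crop]             # stand-in for the de-occlusion model
    generate_3d: Callable[[Crop], Asset]          # stand-in for the 3D generator
    estimate_pose: Callable[[Crop, Asset], Pose]  # stand-in for the pose model

    def run(self, crops: List[Crop]) -> List[Dict[str, Any]]:
        scene = []
        for crop in crops:
            completed = self.deocclude(crop)        # fill in hidden regions first
            asset = self.generate_3d(completed)     # generator only sees a completed view
            pose = self.estimate_pose(crop, asset)  # place the asset back in the scene
            scene.append({"asset": asset, "pose": pose})
        return scene


# Dummy stages so the sketch runs without any checkpoints.
def fake_deocclude(crop: Crop) -> Crop:
    return {**crop, "occluded": False}


def fake_generate(crop: Crop) -> Asset:
    return {"mesh_for": crop["name"], "from_complete_view": not crop["occluded"]}


def fake_pose(crop: Crop, asset: Asset) -> Pose:
    return {"object": crop["name"], "translation": (0.0, 0.0, 0.0)}


pipeline = ScenePipeline(fake_deocclude, fake_generate, fake_pose)
scene = pipeline.run([
    {"name": "chair", "occluded": True},
    {"name": "lamp", "occluded": True},
])
```

Because each stage is just a callable with a narrow input and output, replacing one of them (for example, trying a different 3D generator) would not require touching the other two in this sketch.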
The repo is primarily Python-based, relying on deep learning frameworks and external dependencies such as the MoGe repo for depth estimation and Step1X-3D for 3D generation. Checkpoints for pretrained models are available on Hugging Face, enabling inference and further training.
Why SceneMaker’s decoupled architecture matters
The standout feature of SceneMaker is its separation of concerns. End-to-end 3D scene generation models often struggle with occlusion because the model must simultaneously guess hidden geometry and generate 3D shapes. This conflation leads to blurry or inaccurate results, especially in open-set scenarios with novel objects.
By splitting de-occlusion from 3D generation, SceneMaker allows each module to focus on a narrower task:
- The de-occlusion module (FLUX Kontext) is optimized to infer scene context and occluded regions using global scene understanding.
- The 3D generation module (Step1X-3D) can focus solely on reconstructing visible object geometry and texture.
This division plays out in the training and inference workflow, simplifying each model’s objective and improving robustness. The unified pose estimation model adds another layer of precision by combining global context and local attention, which helps in complex scenes with multiple occluded objects.
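To make the idea of combining global and local attention more concrete, here is a toy sketch in plain Python. It is an assumption about what such a combination could look like, not the pose model's actual architecture: the global pass lets a query attend to every scene token, the local pass restricts it to tokens of its own object, and the two results are simply concatenated. The function names, the token layout, and the concatenation step are all hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query: Vector, keys: List[Vector], values: List[Vector]) -> List[float]:
    """Scaled dot-product attention for a single query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]


def global_local_feature(
    query: Vector,
    scene_tokens: List[Vector],
    object_ids: List[int],
    own_id: int,
) -> List[float]:
    """Concatenate a global pass (all tokens) with a local pass (own-object tokens)."""
    global_out = attend(query, scene_tokens, scene_tokens)
    local_tokens = [t for t, oid in zip(scene_tokens, object_ids) if oid == own_id]
    local_out = attend(query, local_tokens, local_tokens)
    return global_out + local_out  # list concatenation, assumed fusion strategy


# Three scene tokens; tokens 0 and 2 belong to object 0, token 1 to object 1.
scene_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
object_ids = [0, 1, 0]
feature = global_local_feature([1.0, 0.0], scene_tokens, object_ids, own_id=0)
```

The first half of `feature` mixes information from the whole scene, while the second half depends only on the tokens of object 0.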
The tradeoff is that the pipeline is more complex to set up, requiring multiple repos, checkpoints, and careful orchestration. The open-source release also notes some deviations from the original paper’s implementation, so results may differ slightly. However, this modularity improves maintainability and makes it easier to swap or upgrade components independently.
Under the hood, the codebase is surprisingly clean given the complexity. The repo provides training scripts, inference workflows, and dataset preparation code. The documentation points clearly to the dependencies, checkpoint locations, and installation steps.
Quick start: installation and setup
SceneMaker requires Python 3.10 and several dependencies. The installation process involves multiple steps, reflecting the decoupled architecture:
- Install Python dependencies:
  pip install -r requirements.txt
- Install the MoGe repository for depth estimation, following the instructions at https://github.com/microsoft/MoGe.
- Clone the Step1X-3D repository for 3D object generation:
  git clone --depth 1 --branch main https://github.com/stepfun-ai/Step1X-3D.git
- Download pretrained checkpoints from Hugging Face and place them in the ckpts/ folders:
  - SceneMaker checkpoints: https://huggingface.co/horizon171852/SceneMakerSceneMaker
This setup reflects the pipeline’s modular nature, requiring you to coordinate multiple repos and checkpoints. Once installed, the repo includes scripts for training and inference that leverage these components.
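With several pieces to coordinate, a small preflight check can catch a missing clone or an empty checkpoint folder before a long run starts. The sketch below is a hypothetical helper, not part of the repo: it assumes Step1X-3D is cloned into the project root, that checkpoints live under `ckpts/`, and it treats Python 3.10 as a minimum version. Adjust those assumptions to match the actual layout described in the README.

```python
import sys
from pathlib import Path
from typing import List, Tuple


def check_setup(
    root: Path,
    python: Tuple[int, int] = sys.version_info[:2],
    required: Tuple[int, int] = (3, 10),
) -> List[str]:
    """Return a list of problems; an empty list means the layout looks ready."""
    problems = []
    # Assumption: 3.10 is treated as a minimum rather than an exact requirement.
    if python < required:
        problems.append(
            f"Python {required[0]}.{required[1]}+ expected, found {python[0]}.{python[1]}"
        )
    # Assumption: the Step1X-3D clone sits directly under the project root.
    if not (root / "Step1X-3D").is_dir():
        problems.append("Step1X-3D clone not found")
    # Assumption: downloaded checkpoints are placed somewhere under ckpts/.
    ckpts = root / "ckpts"
    if not ckpts.is_dir() or not any(ckpts.iterdir()):
        problems.append("ckpts/ is missing or empty")
    return problems


if __name__ == "__main__":
    for problem in check_setup(Path(".")):
        print(f"[setup] {problem}")
```

Returning a list of messages instead of raising on the first failure makes it possible to report every missing piece in one pass.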
Who should explore SceneMaker
SceneMaker is relevant for researchers and developers dealing with 3D scene understanding, especially in scenarios with heavy occlusion and open-set conditions where novel objects appear. Its decoupled approach offers a cleaner architectural pattern compared to monolithic end-to-end models.
That said, it’s not a plug-and-play tool for casual use. The dependencies on external repos and checkpoints, plus the slight differences from the original paper’s implementation, mean you’ll need to invest some time in setup and experimentation.
The code is accessible and well-organized, making it a good base for further research or adaptation. If your work involves 3D reconstruction in cluttered scenes or you want to explore modular AI pipelines for vision tasks, SceneMaker’s design is worth understanding.
In production contexts where tight integration or real-time performance is critical, the multi-repo setup and inference complexity might be limiting. But as a research platform, it cleanly separates hard problems and offers components that can be improved independently.
Overall, SceneMaker demonstrates that separating de-occlusion from 3D generation is a practical architectural choice for tackling occluded open-set scenes, a problem most existing pipelines struggle with.
→ GitHub Repo: IDEA-Research/SceneMaker ⭐ 115 · Python