PartCrafter: compositional 3D mesh generation with latent diffusion transformers

PartCrafter approaches 3D mesh generation differently by producing multiple semantically meaningful parts simultaneously from a single RGB image, rather than generating a single monolithic mesh. This compositional paradigm opens up possibilities for downstream editing, manipulation, and scene composition that typical 3D generators don’t support.

compositional 3d mesh generation from a single rgb image

At its core, PartCrafter is a Python-based implementation of a NeurIPS 2025 paper that introduces compositional 3D mesh generation using Latent Diffusion Transformers. The system generates multiple part-level meshes (e.g., chair legs, seat, back) in one shot from a single RGB image, explicitly modeling the structure of the object.

The architecture builds on a pretrained DiT (Diffusion Transformer) backbone from TripoSG, fine-tuned while the Variational Autoencoder (VAE) is kept fixed, so that diffusion runs in a latent space over 3D mesh parts. The latent diffusion transformer outputs all parts at once, unlike monolithic approaches that produce a single undifferentiated mesh for the entire object.

Additionally, the system supports both object-level and scene-level generation, enabling compositional scene synthesis from images.

To bridge the gap between real-world photos and the Objaverse domain, PartCrafter integrates style transfer techniques via Gemini, allowing the model to generalize better to real-world inputs. Another notable integration is a Vision-Language Model (VLM) based part count suggestion mechanism that automatically infers how many parts to generate from an input image, reducing manual parameter tuning.

The repo is built in Python and uses PyTorch 2.5.1 with CUDA 12.4 support, targeting NVIDIA H20 GPUs for training and inference. The codebase is fully open source, with pretrained model weights available on HuggingFace.

architectural strengths and design tradeoffs

What sets PartCrafter apart is its compositional approach to 3D mesh generation. Instead of producing a single mesh, it simultaneously generates multiple part-level meshes with semantic meaning. This explicit part-level structure enables downstream tasks such as part manipulation, editing, or recomposition, which are difficult to achieve with monolithic mesh outputs.

The use of a latent diffusion transformer fine-tuned from the pretrained DiT backbone is a key architectural choice. Diffusion models are strong generators, and pairing them with a transformer lets the model capture long-range dependencies and complex structure in the latent space. The fixed VAE maps high-dimensional mesh data into a compact latent space where diffusion operates efficiently.
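To make "diffusion in a latent space over several parts" concrete, here is a toy sketch of a generic DDPM-style forward noising step applied to a stack of per-part latents. It is not PartCrafter's code: the function name, the latent sizes, and the noise schedule value are all assumptions chosen for illustration, and the real model may use a different noising formulation.

```python
import math
import random


def noise_part_latents(part_latents, alpha_bar, rng):
    """Apply one DDPM-style forward noising step to every part's latent.

    part_latents: list of parts, each a list of token vectors (lists of floats).
    alpha_bar: cumulative signal fraction in [0, 1]; 1.0 means no noise added.
    rng: a random.Random instance used to draw Gaussian noise.
    """
    signal = math.sqrt(alpha_bar)
    noise = math.sqrt(1.0 - alpha_bar)
    return [
        [[signal * x + noise * rng.gauss(0.0, 1.0) for x in token] for token in part]
        for part in part_latents
    ]


# Hypothetical sizes: 3 parts, 4 latent tokens per part, 8 channels per token.
rng = random.Random(0)
clean = [[[rng.uniform(-1, 1) for _ in range(8)] for _ in range(4)] for _ in range(3)]
noisy = noise_part_latents(clean, alpha_bar=0.5, rng=rng)

# The part/token/channel layout is unchanged; only the values are perturbed.
print(len(noisy), len(noisy[0]), len(noisy[0][0]))  # 3 4 8
```

In a latent diffusion setup, a transformer would be trained to undo this corruption for all parts jointly, and a frozen VAE decoder would turn each denoised latent back into a mesh.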

The integration of VLM-based part count suggestion is a practical enhancement. Instead of requiring users to guess the number of parts, the system uses a vision-language model to propose a part count, improving usability for real-world inputs.

Style transfer via Gemini is another thoughtful addition, addressing the domain gap between synthetic Objaverse images used for training and real-world photos used for inference. This improves robustness but adds complexity and dependency on external models.

The main tradeoff is compute: the repo targets high-end NVIDIA H20 GPUs and pins a specific PyTorch 2.5.1 / CUDA 12.4 stack. Handling multiple parts simultaneously also likely means longer training and inference times than simpler single-mesh generators.

The codebase appears well organized, with modular scripts for inference and training and explicit environment setup instructions. However, the model and tooling are research-focused; users should expect some setup overhead and experimental behavior.

quick start with partcrafter

The repo provides clear installation and quickstart instructions. Environment setup requires Python 3.11 and PyTorch 2.5.1 with CUDA 12.4. Here are the commands copied verbatim from the README:

conda create -n partcrafter python=3.11.13
conda activate partcrafter
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu124
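After running the commands above, it can help to confirm that the active environment actually matches the pinned versions. The following is a minimal sketch, not part of the repo: `check_versions` and `EXPECTED` are hypothetical names, and the only library attributes used are `torch.__version__` and `torch.version.cuda`.

```python
import sys

# Versions quoted in the setup commands above.
EXPECTED = {"python": (3, 11), "torch": "2.5.1", "cuda": "12.4"}


def check_versions(python_version, torch_version, cuda_version):
    """Return a list of human-readable mismatches (empty if everything matches)."""
    problems = []
    if tuple(python_version[:2]) != EXPECTED["python"]:
        problems.append(f"python {python_version[0]}.{python_version[1]}, expected 3.11")
    # torch.__version__ may carry a local suffix such as "2.5.1+cu124".
    if torch_version.split("+")[0] != EXPECTED["torch"]:
        problems.append(f"torch {torch_version}, expected {EXPECTED['torch']}")
    # torch.version.cuda is None on CPU-only builds.
    if cuda_version != EXPECTED["cuda"]:
        problems.append(f"cuda {cuda_version}, expected {EXPECTED['cuda']}")
    return problems


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("torch is not installed in this environment")
    else:
        issues = check_versions(sys.version_info, torch.__version__, torch.version.cuda)
        print("\n".join(issues) if issues else "environment matches the pinned versions")
```

Keeping the comparison in a pure function makes it easy to check without a GPU; only the `__main__` block touches the installed `torch`.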

Then clone and install dependencies:

git clone https://github.com/wgsxm/PartCrafter.git
cd PartCrafter
bash settings/setup.sh

For users without root access, additional graphics libraries can be installed via conda:

conda install -c conda-forge libegl libglu pyopengl

To generate a part-level 3D mesh with 3 parts from an image and render the result, use:

python scripts/inference_partcrafter.py \
  --image_path assets/images/np3_2f6ab901c5a84ed6bbdf85a67b22a2ee.png \
  --num_parts 3 --tag robot --render

The needed pretrained weights are fetched automatically:

PartCrafter model from wgsxm/PartCrafter → pretrained_weights/PartCrafter
RMBG model from briaai/RMBG-1.4 → pretrained_weights/RMBG-1.4
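On machines where the inference job itself has no network access, the same two repositories could be downloaded ahead of time. This is a sketch under assumptions, not the repo's own download logic: `missing_weights` and `prefetch` are hypothetical helpers, and the only external call is `huggingface_hub.snapshot_download` with its `repo_id` and `local_dir` parameters.

```python
from pathlib import Path

# Hugging Face repo id -> local directory, as listed above.
WEIGHTS = {
    "wgsxm/PartCrafter": "pretrained_weights/PartCrafter",
    "briaai/RMBG-1.4": "pretrained_weights/RMBG-1.4",
}


def missing_weights(root="."):
    """Return the repo ids whose local directory is absent or empty."""
    missing = []
    for repo_id, rel_dir in WEIGHTS.items():
        target = Path(root) / rel_dir
        if not target.is_dir() or not any(target.iterdir()):
            missing.append(repo_id)
    return missing


def prefetch(root="."):
    """Download whatever is missing; needs network access and huggingface_hub."""
    from huggingface_hub import snapshot_download  # imported lazily on purpose

    for repo_id in missing_weights(root):
        snapshot_download(repo_id=repo_id, local_dir=str(Path(root) / WEIGHTS[repo_id]))
```

Calling `prefetch()` from the repository root should place the files where the arrows above point; whether the inference script then skips its own download is an assumption worth verifying.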

The generated results are saved under ./results/robot. Several example images are included in ./assets/images with filenames encoding recommended part counts.
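Since the example filenames encode a recommended part count, a small wrapper can read it and assemble the inference command. The sketch below assumes the count appears as a leading `np<N>_` prefix, inferred from the single example filename above; both helper names are hypothetical, and only the flags shown in the command above are used.

```python
import re
from pathlib import Path


def part_count_from_name(image_path):
    """Read N from a leading "np<N>_" prefix, or return None if it is absent."""
    match = re.match(r"np(\d+)_", Path(image_path).name)
    return int(match.group(1)) if match else None


def build_inference_command(image_path, tag, num_parts=None, render=True):
    """Assemble the argv list for scripts/inference_partcrafter.py."""
    if num_parts is None:
        num_parts = part_count_from_name(image_path)
    if num_parts is None:
        raise ValueError("no part count given and none found in the filename")
    cmd = [
        "python", "scripts/inference_partcrafter.py",
        "--image_path", str(image_path),
        "--num_parts", str(num_parts),
        "--tag", tag,
    ]
    if render:
        cmd.append("--render")
    return cmd


cmd = build_inference_command(
    "assets/images/np3_2f6ab901c5a84ed6bbdf85a67b22a2ee.png", tag="robot"
)
print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the repository root; with `tag="robot"` the outputs would land in ./results/robot as described above.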

The system also supports automatic part count suggestion using a Vision-Language Model (VLM) if the user prefers not to specify --num_parts manually.

verdict

PartCrafter offers a concrete implementation of a compositional 3D mesh generator using latent diffusion transformers, which stands out by producing multiple semantically distinct parts from single RGB inputs. This is a valuable step towards structured 3D generation enabling downstream editing and scene composition.

The repo’s code quality and documentation make it accessible for researchers and practitioners familiar with diffusion models and 3D generation pipelines. However, the resource requirements and complexity mean it’s best suited for experimental research setups or advanced prototyping rather than lightweight or production use.

If you’re working on 3D generative models, especially with a focus on compositionality or part-level manipulation, PartCrafter is worth exploring. The integration of VLM-based part count prediction and style transfer adds practical usability.

That said, expect a learning curve due to the model complexity and dependencies on high-end GPUs. The current pretrained weights and inference scripts provide a good starting point to experiment with this compositional approach.

Overall, PartCrafter is a solid research artifact that pushes the boundary of single-shot, part-level 3D mesh generation from images, balancing architectural innovation with practical tooling for exploration and development.


→ GitHub Repo: wgsxm/PartCrafter ⭐ 2,420 · Python

Noureddine RAMDI / PartCrafter: compositional 3D mesh generation with latent diffusion transformers
