PromptHMR: integrating promptable architecture for 3D human mesh recovery from monocular inputs

PromptHMR applies a promptable architecture originally designed for 2D segmentation to the challenging task of 3D human mesh recovery from monocular images and videos. By combining multiple open-source components into a cohesive pipeline, it supports single-image reconstruction as well as multi-person video reconstruction in world coordinates, outputting 3D mesh files for visualization. The codebase focuses on inference and evaluation, with training code withheld due to licensing.

PromptHMR: a unified pipeline for promptable 3D human mesh recovery

PromptHMR is a research implementation targeting promptable 3D human mesh recovery, released alongside a CVPR 2025 paper. The system builds on the promptable architecture of SAM (Segment Anything Model) and extends it to estimate 3D human pose and shape from monocular RGB images or videos.

Under the hood, PromptHMR integrates several open-source components:

SAM2: provides the base promptable segmentation architecture adapted for human mesh recovery.
DROID-SLAM: a state-of-the-art SLAM system used for camera pose and world coordinate tracking in videos.
Metric3D: a monocular metric depth estimation model, presumably used here to supply metric-scale depth.
ViTPose: a pose estimation model that aids in detecting human keypoints.
SPEC: a camera estimation method from prior human mesh recovery work, presumably used here to estimate camera parameters from the input image.

The pipeline supports two main modes:

Single-image reconstruction: estimating 3D human mesh from a single monocular image.
Multi-person video reconstruction: recovering meshes for multiple people over video frames, aligning them in world coordinates using DROID-SLAM.
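To make the world-coordinate step concrete, here is a minimal sketch of how camera-frame mesh vertices could be moved into a shared world frame once a SLAM system has supplied a camera-to-world pose for a frame. The function name `camera_to_world`, the pose convention (rotation `R_wc` and translation `t_wc` mapping camera coordinates to world coordinates), and the toy values are assumptions for illustration, not PromptHMR's actual code.

```python
import numpy as np


def camera_to_world(verts_cam: np.ndarray, R_wc: np.ndarray, t_wc: np.ndarray) -> np.ndarray:
    """Map (N, 3) camera-frame points into the world frame.

    Assumed convention: x_world = R_wc @ x_cam + t_wc.
    """
    return verts_cam @ R_wc.T + t_wc


# Toy pose: camera rotated 90 degrees about the world Y axis, shifted 2 m along X.
theta = np.pi / 2
R_wc = np.array([
    [np.cos(theta), 0.0, np.sin(theta)],
    [0.0, 1.0, 0.0],
    [-np.sin(theta), 0.0, np.cos(theta)],
])
t_wc = np.array([2.0, 0.0, 0.0])

verts_cam = np.array([[0.0, 0.0, 1.0]])  # a point 1 m in front of the camera
verts_world = camera_to_world(verts_cam, R_wc, t_wc)
```

Applying the same transform with each frame's pose is what would keep several people, reconstructed frame by frame, in one consistent coordinate system.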

Outputs are written as MCS and GLB files. GLB, the binary form of glTF, opens in most 3D viewers; MCS is a less widely supported format.
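A quick way to sanity-check a GLB output is to read its 12-byte header, which the glTF 2.0 binary container defines as a magic string, a version, and the total file length. The sketch below parses that header with the standard library only; the helper name `read_glb_header` is made up for this example, and the bytes are a synthetic header rather than a real PromptHMR output.

```python
import struct

GLB_MAGIC = b"glTF"  # first four bytes of every binary glTF file


def read_glb_header(data: bytes) -> dict:
    """Parse the 12-byte GLB header: magic, container version, total length."""
    if len(data) < 12:
        raise ValueError("GLB header needs at least 12 bytes")
    # Little-endian: 4-byte magic, uint32 version, uint32 length.
    magic, version, length = struct.unpack("<4sII", data[:12])
    if magic != GLB_MAGIC:
        raise ValueError("not a GLB file")
    return {"version": version, "length": length}


# Synthetic header standing in for the first 12 bytes of a real file.
fake_header = struct.pack("<4sII", GLB_MAGIC, 2, 12)
header_info = read_glb_header(fake_header)
```

For a real file, the same function could be fed `open(path, "rb").read(12)`.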

The project requires the SMPL or SMPL-X parametric body models, which users must register to download. These models are widely used to represent human pose and shape with a compact set of parameters. PromptHMR provides pretrained checkpoints trained on the synthetic datasets BEDLAM1 and BEDLAM2.
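For readers new to parametric body models, the sketch below shows the kind of parameters SMPL works with: one axis-angle rotation per joint plus a short vector of shape coefficients. It converts axis-angle vectors to rotation matrices with Rodrigues' formula. The constants reflect the standard SMPL layout (24 joints, commonly 10 shape coefficients); the variable names are illustrative, and nothing here loads the actual SMPL model files.

```python
import numpy as np

NUM_SMPL_JOINTS = 24  # root + 23 body joints in the standard SMPL model
NUM_BETAS = 10        # commonly used number of shape coefficients


def axis_angle_to_matrix(aa: np.ndarray) -> np.ndarray:
    """Convert one axis-angle vector (3,) into a 3x3 rotation matrix (Rodrigues)."""
    angle = np.linalg.norm(aa)
    if angle < 1e-8:
        return np.eye(3)
    x, y, z = aa / angle
    K = np.array([[0.0, -z, y], [z, 0.0, -x], [-y, x, 0.0]])
    return np.eye(3) + np.sin(angle) * K + (1.0 - np.cos(angle)) * (K @ K)


# Rest pose: all joint rotations zero; mean shape: all betas zero.
pose = np.zeros((NUM_SMPL_JOINTS, 3))
betas = np.zeros(NUM_BETAS)

# Bend one joint 90 degrees about the X axis (joint index chosen arbitrarily).
pose[1] = [np.pi / 2, 0.0, 0.0]
rotations = np.stack([axis_angle_to_matrix(p) for p in pose])
```

A mesh recovery model predicts values like `pose` and `betas`; the body model then turns them into mesh vertices.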

Technical strengths: promptable adaptation and modular integration

The key technical strength of PromptHMR lies in adapting SAM’s promptable architecture, originally built for 2D segmentation, to 3D human mesh recovery. The promptable design enables flexible querying of human meshes, for example by indicating which person in an image to reconstruct, and it may improve both user interaction and reconstruction quality.
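As a rough illustration of what a spatial prompt can look like in a SAM-style model, the sketch below represents a bounding-box prompt and normalizes it to image-relative coordinates before it would be handed to an encoder. The `BoxPrompt` class and `normalize_box` helper are hypothetical and do not mirror PromptHMR's or SAM's actual interfaces; they only show the idea of selecting a person by pointing at a region.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BoxPrompt:
    """Hypothetical box prompt in pixel coordinates, (x1, y1) to (x2, y2)."""
    x1: float
    y1: float
    x2: float
    y2: float


def normalize_box(box: BoxPrompt, width: int, height: int) -> tuple:
    """Scale pixel coordinates into [0, 1] so the prompt is resolution independent."""
    if not (0 <= box.x1 < box.x2 <= width and 0 <= box.y1 < box.y2 <= height):
        raise ValueError("box must lie inside the image and have positive area")
    return (box.x1 / width, box.y1 / height, box.x2 / width, box.y2 / height)


# One person roughly centered in a 640x480 image.
person_box = BoxPrompt(x1=160, y1=120, x2=480, y2=360)
normalized = normalize_box(person_box, width=640, height=480)
```

In a multi-person image, one such prompt per person would tell the model whose mesh to recover.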

Integrating multiple specialized models is a non-trivial engineering challenge. Combining SLAM for world-coordinate tracking with pose estimation and mesh recovery requires careful synchronization and data handling. DROID-SLAM supplies the camera pose estimates over a video, which are needed to place multiple people in a consistent coordinate frame.

The code quality appears reasonable for a research codebase, focusing on inference and evaluation pipelines. The training code is not released, which is common when licensing or dataset restrictions apply, but this limits experimentation with training or fine-tuning.

The tradeoffs include:

Inference-only availability: the absence of training scripts limits extending or adapting the models.
Dependency complexity: multiple external models and datasets mean a steep setup curve.
Synthetic dataset training: the pretrained checkpoints are from synthetic data, which may affect real-world performance.

Still, the project consolidates a sophisticated multi-model pipeline that is worth exploring for those interested in 3D human pose and shape estimation.

Quick start: installation with conda environment and optional multi-human video support

PromptHMR provides an installation script that sets up a conda environment and installs the required dependencies for one of two supported PyTorch versions.

Here are the exact commands from the README:

git clone https://github.com/yufu-wang/PromptHMR

Then run the installation script with your PyTorch version of choice (either 2.4.0+cu121 or 2.6.0+cu126). If you want to enable the world-coordinate multi-human video pipeline, additional third-party wheels will be installed.

Usage: scripts/install.sh --pt_version <version> [--world-video=<true|false>]

Options:
  --pt_version <version>       PyTorch version to install (2.4 or 2.6)
  --world-video <true|false>   Download required wheels for world-coordinate multi-human video (default: false)
  --help                       Show this help message

Examples:
  scripts/install.sh --pt_version=2.4
  scripts/install.sh --pt_version=2.6
  scripts/install.sh --pt_version=2.4 --world-video=true
  scripts/install.sh --pt_version=2.6 --world-video=false

This script handles the environment setup and dependencies. After installation, users can explore the provided inference and evaluation scripts to run on images and videos.
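When the installation needs to be driven from a setup script, the documented flags can be assembled programmatically. The sketch below builds the argument list for `scripts/install.sh` using only the options shown in the usage text above; the helper name `build_install_command` is hypothetical, and the snippet does not execute anything on its own.

```python
SUPPORTED_PT_VERSIONS = ("2.4", "2.6")  # the two versions listed in the usage text


def build_install_command(pt_version: str, world_video: bool = False) -> list:
    """Return the argv list for scripts/install.sh using the documented flags."""
    if pt_version not in SUPPORTED_PT_VERSIONS:
        raise ValueError(f"pt_version must be one of {SUPPORTED_PT_VERSIONS}")
    flag = "true" if world_video else "false"
    return ["scripts/install.sh", f"--pt_version={pt_version}", f"--world-video={flag}"]


cmd = build_install_command("2.6", world_video=True)
# To run it from the repository root: subprocess.run(cmd, check=True)
```

Validating the version up front avoids launching a long install with an unsupported option.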

Verdict: a valuable research pipeline with practical limits

PromptHMR is a solid research codebase integrating a promptable architecture for 3D human mesh recovery with multiple specialized models into a unified pipeline. It is particularly relevant for researchers and developers interested in monocular 3D human pose and shape estimation and multi-person video reconstruction in world coordinates.

The inference-only nature and reliance on synthetic training datasets mean it’s less suited for production use or custom training workflows. The setup involves managing several dependencies and registering parametric human models, which adds complexity.

However, the project showcases how promptable mechanisms designed for 2D segmentation can be adapted for 3D mesh recovery, which is worth understanding for practitioners in computer vision and 3D reconstruction.

If you want to experiment with state-of-the-art 3D human mesh recovery pipelines that combine SLAM, pose detection, and promptable architectures, PromptHMR is a good starting point. Prepare for a somewhat involved setup and limited training flexibility, but expect a well-structured inference pipeline with practical outputs for 3D visualization.


→ GitHub Repo: yufu-wang/PromptHMR ⭐ 411 · Python

Noureddine RAMDI
