Noureddine RAMDI / ROMP: from real-time monocular 3D human mesh recovery to temporal tracking with dynamic cameras

Created Mon, 04 May 2026 10:03:52 +0000 Modified Sat, 23 May 2026 20:41:27 +0000

Arthur151/ROMP

ROMP is a rare example of a research codebase that has evolved gracefully across several top-tier publications (ICCV and CVPR), each adding a meaningful layer of capability and real-world relevance. It started with ROMP, a one-stage monocular multi-person 3D mesh recovery method, then extended to BEV, which adds depth-aware placement and support for all age groups, and finally to TRACE, which adds temporal tracking of 5D avatars (3D space plus time and identity) so that each person’s pose and global trajectory can be recovered under dynamic camera motion. The progression is not just academic: the repo ships a pip package with ONNX acceleration, export capabilities, and Docker deployment.

architecture and core functionality of ROMP, BEV, and TRACE

At its core, the ROMP family tackles monocular 3D human mesh recovery: reconstructing detailed 3D human body meshes from a single RGB camera. The key challenge is inferring depth and pose for multiple people in real time. ROMP introduced a one-stage regression scheme that predicts the parameters of the SMPL parametric body model directly from the image, bypassing expensive multi-stage pipelines. It uses a center-map based representation for multi-person detection: each person is located by a peak in a body-center heatmap, and that person’s mesh parameters are read at the same location, so detection and mesh reconstruction happen in a single pass.
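The center-map idea can be illustrated with a toy decoder. This is a minimal sketch in plain Python, not the repo's implementation: the function names, the threshold, the 4x4 grid, and the two-value parameter vectors are all assumptions made for illustration. It finds local maxima in a heatmap and reads one parameter vector per detected person.

```python
from typing import List, Tuple

Grid = List[List[float]]


def find_centers(heatmap: Grid, threshold: float = 0.5) -> List[Tuple[int, int]]:
    """Return (row, col) cells that are strict local maxima above the threshold."""
    rows, cols = len(heatmap), len(heatmap[0])
    centers = []
    for r in range(rows):
        for c in range(cols):
            value = heatmap[r][c]
            if value < threshold:
                continue
            # Compare against the 8-connected neighbourhood, clipped at the borders.
            neighbours = [
                heatmap[rr][cc]
                for rr in range(max(0, r - 1), min(rows, r + 2))
                for cc in range(max(0, c - 1), min(cols, c + 2))
                if (rr, cc) != (r, c)
            ]
            if all(value > n for n in neighbours):
                centers.append((r, c))
    return centers


def sample_parameters(param_map, centers):
    """Read the parameter vector stored at each detected center."""
    return [param_map[r][c] for r, c in centers]


# Toy heatmap with two confident peaks (two people).
heatmap = [
    [0.1, 0.2, 0.0, 0.0],
    [0.2, 0.9, 0.1, 0.0],
    [0.0, 0.1, 0.0, 0.3],
    [0.0, 0.0, 0.3, 0.8],
]
# Hypothetical parameter map: each cell stores a tiny vector (here its own coordinates).
param_map = [[[float(r), float(c)] for c in range(4)] for r in range(4)]

centers = find_centers(heatmap)                  # [(1, 1), (3, 3)]
people = sample_parameters(param_map, centers)   # one vector per detected person
```

The point of the design is that the cost of decoding depends on the map size, not on a per-person crop-and-regress loop.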

BEV builds on ROMP by explicitly modeling depth relationships between people and by supporting all age groups, so it can handle children as well as adults. This matters for family and crowd scenes, where people of different heights stand at different distances. TRACE adds a temporal dimension: it tracks 5D avatars (3D space plus time and identity), recovering not just each person’s 3D pose but also their global trajectory, even when the camera itself moves.
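To see why camera motion matters for a global trajectory, consider a person who stands still while the camera moves. The sketch below is an illustration only and does not reflect how TRACE estimates motion: it assumes the camera pose (a yaw angle and a position) is already known per frame, and the function name and sample values are hypothetical. It maps camera-space positions into a fixed world frame.

```python
import math
from typing import List, Tuple

Vec3 = Tuple[float, float, float]


def camera_to_world(p_cam: Vec3, yaw: float, cam_pos: Vec3) -> Vec3:
    """Rotate a camera-space point by the camera yaw (about the y axis),
    then add the camera position to express it in world coordinates."""
    x, y, z = p_cam
    cos_y, sin_y = math.cos(yaw), math.sin(yaw)
    x_rot = cos_y * x + sin_y * z
    z_rot = -sin_y * x + cos_y * z
    return (x_rot + cam_pos[0], y + cam_pos[1], z_rot + cam_pos[2])


# Each frame: (person position in camera space, camera yaw, camera position).
# The person never moves; only the camera does.
frames: List[Tuple[Vec3, float, Vec3]] = [
    ((0.0, 0.0, 5.0), 0.0, (0.0, 0.0, 0.0)),           # camera at the origin
    ((-1.0, 0.0, 5.0), 0.0, (1.0, 0.0, 0.0)),          # camera slid 1 m sideways
    ((-5.0, 0.0, 0.0), math.pi / 2, (0.0, 0.0, 0.0)),  # camera turned 90 degrees
]

trajectory = [camera_to_world(p, yaw, pos) for p, yaw, pos in frames]
```

In camera space the person appears to jump around, but after the transform every frame lands on (approximately) the same world position, which is the kind of consistency a global trajectory needs.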

The implementation uses PyTorch for training and inference, with a focus on real-time performance. The repo includes a cross-platform pip package called simple-romp that supports ONNX acceleration for faster inference. It also supports export to common 3D formats such as fbx, glb, and bvh, making the results usable in Blender and other 3D tools. Docker support simplifies deployment and environment management.

technical strengths and design tradeoffs

What stands out in ROMP is the one-stage regression approach for multi-person 3D mesh recovery. Many systems rely on complex multi-stage pipelines that detect keypoints first, then fit meshes or optimize parameters. ROMP simplifies this with a direct regression model, which improves speed and reduces pipeline complexity.

The use of the SMPL parametric body model is standard in the field but well integrated here. Relying on a parametric model trades some flexibility for predictability: every output mesh is constrained to a realistic human shape and pose.
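As a rough illustration of what "regressing SMPL parameters" means, the sketch below splits a flat vector into camera, pose, and shape parts. The 72 pose values (24 joints with 3 axis-angle values each) and the 10 shape coefficients are standard SMPL sizes; the 3-value camera, the ordering of the vector, and the function name are assumptions for this example and are not taken from the repo.

```python
from typing import Dict, List, Sequence

CAM_DIM = 3     # assumed: scale plus 2D translation of a weak-perspective camera
POSE_DIM = 72   # SMPL: 24 joints x 3 axis-angle values
SHAPE_DIM = 10  # SMPL: 10 shape coefficients (betas)


def split_smpl_vector(vec: Sequence[float]) -> Dict[str, List]:
    """Split a flat [cam | pose | shape] vector into named parts."""
    expected = CAM_DIM + POSE_DIM + SHAPE_DIM
    if len(vec) != expected:
        raise ValueError(f"expected {expected} values, got {len(vec)}")
    cam = list(vec[:CAM_DIM])
    pose_flat = vec[CAM_DIM:CAM_DIM + POSE_DIM]
    # Group the pose into one 3-value axis-angle rotation per joint.
    joints = [list(pose_flat[i:i + 3]) for i in range(0, POSE_DIM, 3)]
    shape = list(vec[CAM_DIM + POSE_DIM:])
    return {
        "cam": cam,
        "global_orient": joints[0],  # root joint rotation
        "body_pose": joints[1:],     # remaining 23 joint rotations
        "shape": shape,
    }


vec = [float(i) for i in range(85)]  # placeholder values, not real predictions
params = split_smpl_vector(vec)
```

Because the output is always a point in this low-dimensional parameter space, the recovered mesh stays human-shaped even when the image evidence is weak.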

Center-map based detection is an elegant design choice: every person is handled in the same forward pass rather than one crop at a time, so the system scales to multiple people without processing time growing with each additional person.

Adding depth relationship modeling in BEV addresses a common limitation in monocular 3D reconstruction — ambiguity in relative depth ordering. Supporting all age groups broadens applicability but requires additional data and model adjustments.
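The link between age and depth can be made concrete with pinhole projection, where pixel height equals focal length times real height divided by depth. The sketch below uses a hypothetical focal length and made-up heights and distances; it is a worked example of the ambiguity, not BEV's method.

```python
import math


def pixel_height(real_height_m: float, depth_m: float, focal_px: float) -> float:
    """Projected height in pixels under a pinhole camera model."""
    return focal_px * real_height_m / depth_m


def depth_from_height(pixel_h: float, assumed_height_m: float, focal_px: float) -> float:
    """Depth implied by an observed pixel height and an assumed real height."""
    return focal_px * assumed_height_m / pixel_h


FOCAL_PX = 1000.0  # hypothetical focal length in pixels

# A 1.8 m adult at 6 m and a 0.9 m child at 3 m project to the same pixel height.
adult_px = pixel_height(1.8, 6.0, FOCAL_PX)
child_px = pixel_height(0.9, 3.0, FOCAL_PX)

# The same observation implies very different depths depending on the assumed height.
depth_if_adult = depth_from_height(adult_px, 1.8, FOCAL_PX)
depth_if_child = depth_from_height(adult_px, 0.9, FOCAL_PX)
```

A model that assumes everyone is adult-sized would place the child at twice the true distance, which is why handling age and depth together makes sense.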

TRACE’s temporal tracking is a significant technical addition, enabling the system to work with moving cameras and maintain consistent identity tracking over time. This adds complexity and computational overhead but is essential for dynamic scenes and video applications.
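To make "consistent identity tracking" concrete, here is a deliberately simple baseline: greedy nearest-neighbour association between the previous frame's tracks and the current detections. This is a swapped-in textbook technique, not TRACE's learned approach, and the function name, distance threshold, and positions are assumptions for illustration.

```python
import math
from typing import Dict, List, Tuple

Vec3 = Tuple[float, float, float]


def associate(tracks: Dict[int, Vec3], detections: List[Vec3],
              max_dist: float = 1.0) -> Dict[int, Vec3]:
    """Match detections to existing track IDs by 3D distance, closest pairs first.
    Unmatched detections start new tracks; unmatched tracks are dropped."""
    pairs = sorted(
        (math.dist(pos, det), tid, j)
        for tid, pos in tracks.items()
        for j, det in enumerate(detections)
    )
    updated: Dict[int, Vec3] = {}
    used = set()
    for dist, tid, j in pairs:
        if dist > max_dist:
            break  # pairs are sorted, so all remaining ones are too far apart
        if tid in updated or j in used:
            continue
        updated[tid] = detections[j]
        used.add(j)
    # New IDs continue after the largest ID in the previous frame (a simplification).
    next_id = max(tracks, default=-1) + 1
    for j, det in enumerate(detections):
        if j not in used:
            updated[next_id] = det
            next_id += 1
    return updated


frame1 = {0: (0.0, 0.0, 5.0), 1: (2.0, 0.0, 6.0)}
detections = [(2.1, 0.0, 6.0), (0.1, 0.0, 5.0), (5.0, 0.0, 8.0)]
frame2 = associate(frame1, detections)
# Persons 0 and 1 keep their IDs despite the changed detection order; the newcomer gets ID 2.
```

A baseline like this breaks down under occlusion and fast camera motion, which is part of why tracking in dynamic scenes adds real complexity.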

The codebase’s support for ONNX acceleration is a practical strength, enabling faster inference on various hardware without deep framework dependencies. Export functionality to standard 3D file formats facilitates integration with downstream tools and pipelines.

The tradeoff is the inherent limitation of monocular input: depth ambiguity and occlusion remain challenging, and the accuracy depends on the quality of training data and model assumptions. Real-time performance is impressive but may come at the cost of some precision compared to heavier optimization-based methods.

explore the project

The repo documentation points users primarily to the simple-romp pip package for inference. The rest of the code is mainly for training and research purposes.

There is no direct quickstart command in the README, but the Docker usage is documented separately in docker.md, providing a straightforward way to deploy and run the system in a controlled environment.

The codebase is organized around the three main papers: ROMP, BEV, and TRACE, each with its own model definitions, training code, and evaluation scripts. The parametric body models (SMPL) and center-map based detection layers are the core components.

Key resources to explore include:

  • The simple-romp package for inference with ONNX support
  • Export scripts for fbx, glb, and bvh formats
  • Docker deployment instructions in docker.md
  • The training and evaluation code under respective folders for ROMP, BEV, and TRACE

Reading through the documentation and understanding the model’s parameters and expected inputs is essential before attempting training or fine-tuning.

verdict

ROMP and its extensions BEV and TRACE offer a well-engineered progression for monocular multi-person 3D mesh recovery, moving from single-frame regression to full temporal tracking under dynamic cameras. The repo strikes a good balance between research innovation and practical deployment with its pip package, ONNX acceleration, and Docker support.

It’s relevant for researchers and developers working on human pose estimation, 3D reconstruction, and avatar tracking from monocular video. The tradeoff is the usual one for monocular methods — depth ambiguity and occlusion challenges remain.

For production use, the simple-romp package and Docker deployment provide a practical, accelerated inference path. Real-time tracking in dynamic scenes, however, will require capable hardware and careful tuning.

Overall, ROMP is worth understanding if you are building systems involving human 3D pose and mesh recovery from monocular inputs and want a research-backed, production-aware codebase.


→ GitHub Repo: Arthur151/ROMP ⭐ 1,524 · Python