3D-RE-GEN reconstructs complete editable 3D indoor scenes from a single RGB photo. It integrates SAM, Hunyuan3D-2.0, and VGGT models in a modular Python pipeline.
Autodistill automates distilling large foundation models into edge-ready vision models, using a plugin system and a natural-language ontology for zero-shot labeling.
Comic Translate translates comics across languages with a multi-step pipeline that combines speech bubble detection, OCR, and LLMs given full-page context.
Fast3R from Meta FAIR processes 1000+ unordered images simultaneously for 3D reconstruction using a ViT-Large backbone and multi-view attention, eliminating iterative matching.
MASt3R-SLAM integrates a pretrained 3D reconstruction model as a geometry prior in a dense SLAM pipeline, enabling real-time tracking and mapping without classical bundle adjustment or depth sensors.
PartCrafter generates multiple semantically distinct 3D mesh parts from a single RGB image using latent diffusion transformers, enabling structured 3D generation with pretrained models and VLM-based part suggestions.
Pixal3D generates high-fidelity 3D assets with PBR textures from a single image using pixel-aligned projection conditioning. It offers a three-stage cascade and a low-VRAM mode for consumer GPUs.
SAM3-UNet adapts Meta’s SAM3 foundation model for dense prediction tasks using a parameter-efficient adapter and a U-Net decoder, enabling training with under 6 GB of GPU memory.
Tencent’s HY-World 2.0 generates persistent 3D assets from text, images, or video using a four-stage pipeline. It outputs editable worlds compatible with Blender, Unity, and Unreal Engine.
CodeFormer uses a codebook transformer architecture for blind face restoration, letting users control the tradeoff between quality and fidelity with a single fidelity weight parameter.
OVIE trains novel view synthesis models using unpaired internet images, avoiding the need for calibrated multi-view datasets. It uses Vision Transformers and foundation models for pose and depth encoding.
StereoWorld uses binocular stereo vision cues to guide 3D-consistent stereo video generation, offering a biologically inspired approach to scene geometry understanding.
Awesome-Deblurring compiles 100+ key papers tracing image and video deblurring from classical optimization to modern deep learning, serving as a go-to bibliography for researchers and developers.
MotionCrafter jointly reconstructs 4D geometry and dense motion from monocular video using a unified 4D VAE, eliminating post-optimization. This Python framework offers training and visualization tools.
MultiWorld offers a unified framework for multi-agent multi-view video world modeling using a frozen VGGT backbone for implicit 3D understanding. It supports scalable multi-agent control and autoregressive inference.
OpenPose is a C++ library for real-time multi-person 2D pose estimation using Part Affinity Fields, keeping body-detection inference time constant regardless of the number of people in the image.
PEAR predicts expressive 3D human mesh parameters for body, hands, and face simultaneously at 100 FPS using a pixel-aligned architecture based on PyTorch and SMPL-X models.
Viseron is a self-hosted, local-only AI NVR platform written in Python, with modular AI features for privacy-focused video surveillance. It is deployed with Docker.
Cupid is a feed-forward 3D reconstruction model that jointly estimates camera pose and reconstructs 3D objects from single 2D images, outputting textured 3D meshes and radiance fields in seconds.
NAS3R enables self-supervised 3D geometry and camera parameter estimation without ground-truth data, using Gaussian splatting and a VGGT backbone. It supports multi-view setups and optional pretrained initialization.