APISR is a Python repo for AI-powered image and video super-resolution, offering fast Gradio inference and full-featured regular inference with dataset curation tools.
DeepSpeed is a Python library that optimizes large-scale deep learning training with multi-hardware support and JIT CUDA extensions. Explore its architecture, strengths, and quick installation.
DualSDF separates coarse semantic structure from fine geometric detail in 3D shape modeling using a two-level signed distance function. It enables intuitive shape edits with pretrained models and a WebGL demo.
Fast3R from Meta FAIR processes 1000+ unordered images simultaneously for 3D reconstruction using a ViT-Large backbone and multi-view attention, eliminating iterative matching.
Hivemind is a PyTorch library enabling decentralized deep learning over the internet using a peer-to-peer Distributed Hash Table (DHT). It supports fault-tolerant training and decentralized parameter averaging without global sync.
MASt3R-SLAM integrates a pretrained 3D reconstruction model as a geometry prior in a dense SLAM pipeline, enabling real-time tracking and mapping without classical bundle adjustment or depth sensors.
OmniGen2 unifies visual understanding, text-to-image generation, and image editing using distinct decoding pathways for text and images. It is built on Qwen2.5-VL and supports CPU offloading for lower-memory hardware.
PartCrafter generates multiple semantically distinct 3D mesh parts from a single RGB image using latent diffusion transformers, enabling structured 3D generation with pretrained models and VLM-based part suggestions.
SVFR combines blind face restoration, colorization, and inpainting in a single stable video diffusion model, enabling efficient multi-task video face enhancement.
CodeFormer uses a codebook transformer architecture for blind face restoration, letting users control the tradeoff between quality and fidelity with a single fidelity weight parameter.
AniGen is a Linux-only Python project for 3D animation generation using NVIDIA GPUs and CUDA. It integrates PyTorch, spconv, and pytorch3d, and ships a setup script that handles the complex dependencies.
ComfyUI-Trellis2 integrates Meta’s DINOv3 model into ComfyUI for advanced 3D-aware diffusion workflows. This article breaks down its architecture, strengths, and installation steps.
DIMO distills motion priors from text-conditioned and multi-view video models into a shared latent space, enabling diverse 3D motion generation for arbitrary objects using 3D Gaussian splatting and 4D rendering.
DROID-W builds on DROID-SLAM to handle dynamic scenes in the wild by jointly estimating camera pose, scene structure, and dynamic uncertainty using Lie group optimization and metric depth estimation.
Falcon-Perception is a PyTorch engine for multimodal autoregressive Transformers handling detection, segmentation, and OCR with FlexAttention and efficient caching.
Omni-Diffusion models text, image, and speech tokens jointly via masked discrete diffusion, enabling any-to-any multimodal generation with a single unified model.
PEAR predicts expressive 3D human mesh parameters for body, hands, and face simultaneously at 100 FPS using a pixel-aligned architecture based on PyTorch and SMPL-X models.
LingBot-Map performs streaming 3D reconstruction from long image sequences at ~20 FPS using a geometric context transformer and paged KV cache attention for efficient memory management.
tribev2 offers pretrained models that predict brain responses to videos using cortical mesh modeling. It supports video, text, and audio inputs and keeps inference setup simple.
ByteDance’s In-Place TTT enables adaptive transformer inference by updating MLP down-projection weights in-place at test time, supporting long-context reasoning without extra modules.