A curated GPU performance engineering curriculum focusing on CUDA, kernel optimization, and NVIDIA architectures, guiding engineers from fundamentals to advanced production techniques.
DeepSpeed is a Python library that optimizes large-scale deep learning training with multi-hardware support and JIT CUDA extensions. Explore its architecture, strengths, and quick installation.
DualSDF separates coarse semantic structure from fine geometric detail in 3D shape modeling using a two-level signed distance function. It enables intuitive shape edits with pretrained models and a WebGL demo.
GS-Playground combines 3D Gaussian Splatting rendering with a velocity-impulse physics engine to enable large-scale visual reinforcement learning at up to 10^4 FPS. Preview release with core simulation API and demos.
Lynx generates personalized videos from a single image using a frozen Diffusion Transformer with ID and Ref adapters. This modular design balances fidelity and efficiency.
TurboOCR is a C++/CUDA OCR server leveraging TensorRT FP16 for high throughput and low latency, featuring a zero-decode pixel pipeline and multi-protocol API.
NVIDIA Warp lets you write Python functions JIT-compiled into CUDA kernels for GPU-accelerated differentiable physics and ML integration, simplifying GPU programming in Python.
AniGen is a Linux-only Python project for 3D animation generation using NVIDIA GPUs and CUDA. It integrates PyTorch, spconv, and pytorch3d with a smooth setup script for complex dependencies.
DIMO distills motion priors from text-conditioned and multi-view video models into a shared latent space, enabling diverse 3D motion generation for arbitrary objects using 3D Gaussian splatting and 4D rendering.
Falcon-Perception is a PyTorch engine for multimodal autoregressive Transformers handling detection, segmentation, and OCR with FlexAttention and efficient caching.
Lucebox Hub optimizes LLM inference on consumer GPUs using a megakernel CUDA approach and speculative decoding, achieving high throughput on RTX 3090 and newer Nvidia GPUs.
OpenPose is a C++ library for real-time multi-person 2D pose estimation using Part Affinity Fields, enabling constant inference time for body detection regardless of person count.
LingBot-Map performs streaming 3D reconstruction from long image sequences at ~20 FPS using a geometric context transformer and paged KV cache attention for efficient memory management.
Cupid is a feed-forward 3D reconstruction model that jointly estimates camera pose and reconstructs 3D objects from single 2D images, outputting textured 3D meshes and radiance fields in seconds.
MR.ScaleMaster fuses scale-ambiguous monocular SLAM trajectories from multiple robots using Sim(3) graph optimization, enabling heterogeneous SLAM frontends and consistent global maps.
DeepEP is a CUDA-based communication library designed for Mixture-of-Experts models, delivering high-throughput GPU kernels with NVLink and RDMA support for efficient expert parallelism.