Cuda on Noureddine RAMDI

https://ramdi.fr/tags/cuda/
Recent content in Cuda on Noureddine RAMDI
Generator: Hugo. Language: en. Last updated: Sat, 23 May 2026 20:41:27 +0000

A structured GPU performance engineering curriculum from fundamentals to frontier labs
https://ramdi.fr/github-stars/a-structured-gpu-performance-engineering-curriculum-from-fundamentals-to-frontier-labs/
Sat, 23 May 2026 20:41:14 +0000
A curated GPU performance engineering curriculum focusing on CUDA, kernel optimization, and NVIDIA architectures, guiding engineers from fundamentals to advanced production techniques.

DeepSpeed: scalable deep learning optimization with extensible hardware support
https://ramdi.fr/github-stars/deepspeed-scalable-deep-learning-optimization-with-extensible-hardware-support/
Sat, 23 May 2026 20:41:14 +0000
DeepSpeed is a Python library that optimizes large-scale deep learning training with multi-hardware support and JIT CUDA extensions. Explore its architecture, strengths, and quick installation.

DualSDF: A two-level signed distance function approach for semantic 3D shape manipulation
https://ramdi.fr/github-stars/dualsdf-a-two-level-signed-distance-function-approach-for-semantic-3d-shape-manipulation/
Sat, 23 May 2026 20:41:14 +0000
DualSDF separates coarse semantic structure from fine geometric detail in 3D shape modeling using a two-level signed distance function. It enables intuitive shape edits with pretrained models and a WebGL demo.

GS-Playground: High-throughput photorealistic simulation for vision-based robot learning
https://ramdi.fr/github-stars/gs-playground-high-throughput-photorealistic-simulation-for-vision-based-robot-learning/
Sat, 23 May 2026 20:41:14 +0000
GS-Playground combines 3D Gaussian Splatting rendering with a velocity-impulse physics engine to enable large-scale visual reinforcement learning at up to 10^4 FPS. Preview release with core simulation API and demos.

Lynx: modular personalized video generation with dual adapters on a frozen diffusion transformer
https://ramdi.fr/github-stars/lynx-modular-personalized-video-generation-with-dual-adapters-on-a-frozen-diffusion-transformer/
Sat, 23 May 2026 20:41:14 +0000
Lynx generates personalized videos from a single image using a frozen Diffusion Transformer with ID and Ref adapters. This modular design balances fidelity and efficiency.

TurboOCR: a GPU-accelerated OCR server optimized for raw pixel input and high throughput
https://ramdi.fr/github-stars/turboocr-a-gpu-accelerated-ocr-server-optimized-for-raw-pixel-input-and-high-throughput/
Tue, 05 May 2026 13:37:39 +0000
TurboOCR is a C++/CUDA OCR server leveraging TensorRT FP16 for high throughput and low latency, featuring a zero-decode pixel pipeline and multi-protocol API.

NVIDIA Warp: JIT-compiling Python for CUDA-powered differentiable physics
https://ramdi.fr/github-stars/nvidia-warp-jit-compiling-python-for-cuda-powered-differentiable-physics/
Mon, 04 May 2026 10:23:03 +0000
NVIDIA Warp lets you write Python functions JIT-compiled into CUDA kernels for GPU-accelerated differentiable physics and ML integration, simplifying GPU programming in Python.

AniGen: GPU-accelerated 3D animation generation with Python and CUDA
https://ramdi.fr/github-stars/anigen-gpu-accelerated-3d-animation-generation-with-python-and-cuda/
Mon, 04 May 2026 10:23:02 +0000
AniGen is a Linux-only Python project for 3D animation generation using NVIDIA GPUs and CUDA. It integrates PyTorch, spconv, and pytorch3d with a smooth setup script for complex dependencies.

DIMO: Distilling Diverse 3D Motion Priors for Arbitrary Object Motion Synthesis
https://ramdi.fr/github-stars/dimo-distilling-diverse-3d-motion-priors-for-arbitrary-object-motion-synthesis/
Mon, 04 May 2026 10:23:02 +0000
DIMO distills motion priors from text-conditioned and multi-view video models into a shared latent space, enabling diverse 3D motion generation for arbitrary objects using 3D Gaussian splatting and 4D rendering.

Falcon-Perception: a minimal multimodal PyTorch engine for object detection, segmentation, and OCR
https://ramdi.fr/github-stars/falcon-perception-a-minimal-multimodal-pytorch-engine-for-object-detection-segmentation-and-ocr/
Mon, 04 May 2026 10:23:02 +0000
Falcon-Perception is a PyTorch engine for multimodal autoregressive Transformers handling detection, segmentation, and OCR with FlexAttention and efficient caching.

Lucebox Hub: hand-optimized CUDA kernels for efficient LLM inference on RTX 3090 and beyond
https://ramdi.fr/github-stars/lucebox-hub-hand-optimized-cuda-kernels-for-efficient-llm-inference-on-rtx-3090-and-beyond/
Mon, 04 May 2026 10:23:02 +0000
Lucebox Hub optimizes LLM inference on consumer GPUs using a megakernel CUDA approach and speculative decoding, achieving high throughput on RTX 3090 and newer Nvidia GPUs.

OpenPose: real-time multi-person 2D pose estimation with constant-time body detection
https://ramdi.fr/github-stars/openpose-real-time-multi-person-2d-pose-estimation-with-constant-time-body-detection/
Mon, 04 May 2026 10:23:02 +0000
OpenPose is a C++ library for real-time multi-person 2D pose estimation using Part Affinity Fields, enabling constant inference time for body detection regardless of person count.

Streaming 3D scene reconstruction with LingBot-Map’s geometric context transformer
https://ramdi.fr/github-stars/streaming-3d-scene-reconstruction-with-lingbot-maps-geometric-context-transformer/
Mon, 04 May 2026 10:23:02 +0000
LingBot-Map performs streaming 3D reconstruction from long image sequences at ~20 FPS using a geometric context transformer and paged KV cache attention for efficient memory management.

Cupid: feed-forward 3D reconstruction with joint camera pose estimation from single images
https://ramdi.fr/github-stars/cupid-feed-forward-3d-reconstruction-with-joint-camera-pose-estimation-from-single-images/
Mon, 04 May 2026 10:23:01 +0000
Cupid is a feed-forward 3D reconstruction model that jointly estimates camera pose and reconstructs 3D objects from single 2D images, outputting textured 3D meshes and radiance fields in seconds.

MR.ScaleMaster: heterogeneous multi-robot monocular SLAM fusion via Sim(3) optimization
https://ramdi.fr/github-stars/mr-scalemaster-heterogeneous-multi-robot-monocular-slam-fusion-via-sim-3-optimization/
Mon, 04 May 2026 10:23:01 +0000
MR.ScaleMaster fuses scale-ambiguous monocular SLAM trajectories from multiple robots using Sim(3) graph optimization, enabling heterogeneous SLAM frontends and consistent global maps.

DeepEP: Optimizing communication for large Mixture-of-Experts models with CUDA kernels
https://ramdi.fr/github-stars/deepep-optimizing-communication-for-large-mixture-of-experts-models-with-cuda-kernels/
Sat, 02 May 2026 20:07:04 +0000
DeepEP is a CUDA-based communication library designed for Mixture-of-Experts models, delivering high-throughput GPU kernels with NVLink and RDMA support for efficient expert parallelism.