<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Cuda on Noureddine RAMDI</title><link>https://ramdi.fr/tags/cuda/</link><description>Recent content in Cuda on Noureddine RAMDI</description><generator>Hugo</generator><language>en</language><lastBuildDate>Sat, 23 May 2026 20:41:27 +0000</lastBuildDate><atom:link href="https://ramdi.fr/tags/cuda/index.xml" rel="self" type="application/rss+xml"/><item><title>A structured GPU performance engineering curriculum from fundamentals to frontier labs</title><link>https://ramdi.fr/github-stars/a-structured-gpu-performance-engineering-curriculum-from-fundamentals-to-frontier-labs/</link><pubDate>Sat, 23 May 2026 20:41:14 +0000</pubDate><guid>https://ramdi.fr/github-stars/a-structured-gpu-performance-engineering-curriculum-from-fundamentals-to-frontier-labs/</guid><description>A curated GPU performance engineering curriculum focusing on CUDA, kernel optimization, and NVIDIA architectures, guiding engineers from fundamentals to advanced production techniques.</description></item><item><title>DeepSpeed: scalable deep learning optimization with extensible hardware support</title><link>https://ramdi.fr/github-stars/deepspeed-scalable-deep-learning-optimization-with-extensible-hardware-support/</link><pubDate>Sat, 23 May 2026 20:41:14 +0000</pubDate><guid>https://ramdi.fr/github-stars/deepspeed-scalable-deep-learning-optimization-with-extensible-hardware-support/</guid><description>DeepSpeed is a Python library that optimizes large-scale deep learning training with multi-hardware support and JIT CUDA extensions. 
Explore its architecture, strengths, and quick installation.</description></item><item><title>DualSDF: A two-level signed distance function approach for semantic 3D shape manipulation</title><link>https://ramdi.fr/github-stars/dualsdf-a-two-level-signed-distance-function-approach-for-semantic-3d-shape-manipulation/</link><pubDate>Sat, 23 May 2026 20:41:14 +0000</pubDate><guid>https://ramdi.fr/github-stars/dualsdf-a-two-level-signed-distance-function-approach-for-semantic-3d-shape-manipulation/</guid><description>DualSDF separates coarse semantic structure from fine geometric detail in 3D shape modeling using a two-level signed distance function. It enables intuitive shape edits with pretrained models and a WebGL demo.</description></item><item><title>GS-Playground: High-throughput photorealistic simulation for vision-based robot learning</title><link>https://ramdi.fr/github-stars/gs-playground-high-throughput-photorealistic-simulation-for-vision-based-robot-learning/</link><pubDate>Sat, 23 May 2026 20:41:14 +0000</pubDate><guid>https://ramdi.fr/github-stars/gs-playground-high-throughput-photorealistic-simulation-for-vision-based-robot-learning/</guid><description>GS-Playground combines 3D Gaussian Splatting rendering with a velocity-impulse physics engine to enable large-scale visual reinforcement learning at up to 10^4 FPS. Preview release with core simulation API and demos.</description></item><item><title>Lynx: modular personalized video generation with dual adapters on a frozen diffusion transformer</title><link>https://ramdi.fr/github-stars/lynx-modular-personalized-video-generation-with-dual-adapters-on-a-frozen-diffusion-transformer/</link><pubDate>Sat, 23 May 2026 20:41:14 +0000</pubDate><guid>https://ramdi.fr/github-stars/lynx-modular-personalized-video-generation-with-dual-adapters-on-a-frozen-diffusion-transformer/</guid><description>Lynx generates personalized videos from a single image using a frozen Diffusion Transformer with ID and Ref adapters. 
This modular design balances fidelity and efficiency.</description></item><item><title>TurboOCR: a GPU-accelerated OCR server optimized for raw pixel input and high throughput</title><link>https://ramdi.fr/github-stars/turboocr-a-gpu-accelerated-ocr-server-optimized-for-raw-pixel-input-and-high-throughput/</link><pubDate>Tue, 05 May 2026 13:37:39 +0000</pubDate><guid>https://ramdi.fr/github-stars/turboocr-a-gpu-accelerated-ocr-server-optimized-for-raw-pixel-input-and-high-throughput/</guid><description>TurboOCR is a C++/CUDA OCR server leveraging TensorRT FP16 for high throughput and low latency, featuring a zero-decode pixel pipeline and multi-protocol API.</description></item><item><title>NVIDIA Warp: JIT-compiling Python for CUDA-powered differentiable physics</title><link>https://ramdi.fr/github-stars/nvidia-warp-jit-compiling-python-for-cuda-powered-differentiable-physics/</link><pubDate>Mon, 04 May 2026 10:23:03 +0000</pubDate><guid>https://ramdi.fr/github-stars/nvidia-warp-jit-compiling-python-for-cuda-powered-differentiable-physics/</guid><description>NVIDIA Warp lets you write Python functions JIT-compiled into CUDA kernels for GPU-accelerated differentiable physics and ML integration, simplifying GPU programming in Python.</description></item><item><title>AniGen: GPU-accelerated 3D animation generation with Python and CUDA</title><link>https://ramdi.fr/github-stars/anigen-gpu-accelerated-3d-animation-generation-with-python-and-cuda/</link><pubDate>Mon, 04 May 2026 10:23:02 +0000</pubDate><guid>https://ramdi.fr/github-stars/anigen-gpu-accelerated-3d-animation-generation-with-python-and-cuda/</guid><description>AniGen is a Linux-only Python project for 3D animation generation using NVIDIA GPUs and CUDA. 
It integrates PyTorch, spconv, and pytorch3d with a smooth setup script for complex dependencies.</description></item><item><title>DIMO: Distilling Diverse 3D Motion Priors for Arbitrary Object Motion Synthesis</title><link>https://ramdi.fr/github-stars/dimo-distilling-diverse-3d-motion-priors-for-arbitrary-object-motion-synthesis/</link><pubDate>Mon, 04 May 2026 10:23:02 +0000</pubDate><guid>https://ramdi.fr/github-stars/dimo-distilling-diverse-3d-motion-priors-for-arbitrary-object-motion-synthesis/</guid><description>DIMO distills motion priors from text-conditioned and multi-view video models into a shared latent space, enabling diverse 3D motion generation for arbitrary objects using 3D Gaussian splatting and 4D rendering.</description></item><item><title>Falcon-Perception: a minimal multimodal PyTorch engine for object detection, segmentation, and OCR</title><link>https://ramdi.fr/github-stars/falcon-perception-a-minimal-multimodal-pytorch-engine-for-object-detection-segmentation-and-ocr/</link><pubDate>Mon, 04 May 2026 10:23:02 +0000</pubDate><guid>https://ramdi.fr/github-stars/falcon-perception-a-minimal-multimodal-pytorch-engine-for-object-detection-segmentation-and-ocr/</guid><description>Falcon-Perception is a PyTorch engine for multimodal autoregressive Transformers handling detection, segmentation, and OCR with FlexAttention and efficient caching.</description></item><item><title>Lucebox Hub: hand-optimized CUDA kernels for efficient LLM inference on RTX 3090 and beyond</title><link>https://ramdi.fr/github-stars/lucebox-hub-hand-optimized-cuda-kernels-for-efficient-llm-inference-on-rtx-3090-and-beyond/</link><pubDate>Mon, 04 May 2026 10:23:02 +0000</pubDate><guid>https://ramdi.fr/github-stars/lucebox-hub-hand-optimized-cuda-kernels-for-efficient-llm-inference-on-rtx-3090-and-beyond/</guid><description>Lucebox Hub optimizes LLM inference on consumer GPUs using a megakernel CUDA approach and speculative decoding, achieving high throughput on RTX 3090 and 
newer NVIDIA GPUs.</description></item><item><title>OpenPose: real-time multi-person 2D pose estimation with constant-time body detection</title><link>https://ramdi.fr/github-stars/openpose-real-time-multi-person-2d-pose-estimation-with-constant-time-body-detection/</link><pubDate>Mon, 04 May 2026 10:23:02 +0000</pubDate><guid>https://ramdi.fr/github-stars/openpose-real-time-multi-person-2d-pose-estimation-with-constant-time-body-detection/</guid><description>OpenPose is a C++ library for real-time multi-person 2D pose estimation using Part Affinity Fields, enabling constant inference time for body detection regardless of person count.</description></item><item><title>Streaming 3D scene reconstruction with LingBot-Map’s geometric context transformer</title><link>https://ramdi.fr/github-stars/streaming-3d-scene-reconstruction-with-lingbot-maps-geometric-context-transformer/</link><pubDate>Mon, 04 May 2026 10:23:02 +0000</pubDate><guid>https://ramdi.fr/github-stars/streaming-3d-scene-reconstruction-with-lingbot-maps-geometric-context-transformer/</guid><description>LingBot-Map performs streaming 3D reconstruction from long image sequences at ~20 FPS using a geometric context transformer and paged KV cache attention for efficient memory management.</description></item><item><title>Cupid: feed-forward 3D reconstruction with joint camera pose estimation from single images</title><link>https://ramdi.fr/github-stars/cupid-feed-forward-3d-reconstruction-with-joint-camera-pose-estimation-from-single-images/</link><pubDate>Mon, 04 May 2026 10:23:01 +0000</pubDate><guid>https://ramdi.fr/github-stars/cupid-feed-forward-3d-reconstruction-with-joint-camera-pose-estimation-from-single-images/</guid><description>Cupid is a feed-forward 3D reconstruction model that jointly estimates camera pose and reconstructs 3D objects from single 2D images, outputting textured 3D meshes and radiance fields in seconds.</description></item><item><title>MR.ScaleMaster: heterogeneous multi-robot 
monocular SLAM fusion via Sim(3) optimization</title><link>https://ramdi.fr/github-stars/mr-scalemaster-heterogeneous-multi-robot-monocular-slam-fusion-via-sim-3-optimization/</link><pubDate>Mon, 04 May 2026 10:23:01 +0000</pubDate><guid>https://ramdi.fr/github-stars/mr-scalemaster-heterogeneous-multi-robot-monocular-slam-fusion-via-sim-3-optimization/</guid><description>MR.ScaleMaster fuses scale-ambiguous monocular SLAM trajectories from multiple robots using Sim(3) graph optimization, enabling heterogeneous SLAM frontends and consistent global maps.</description></item><item><title>DeepEP: Optimizing communication for large Mixture-of-Experts models with CUDA kernels</title><link>https://ramdi.fr/github-stars/deepep-optimizing-communication-for-large-mixture-of-experts-models-with-cuda-kernels/</link><pubDate>Sat, 02 May 2026 20:07:04 +0000</pubDate><guid>https://ramdi.fr/github-stars/deepep-optimizing-communication-for-large-mixture-of-experts-models-with-cuda-kernels/</guid><description>DeepEP is a CUDA-based communication library designed for Mixture-of-Experts models, delivering high-throughput GPU kernels with NVLink and RDMA support for efficient expert parallelism.</description></item></channel></rss>