<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Inference on Noureddine RAMDI</title><link>https://ramdi.fr/tags/inference/</link><description>Recent content in Inference on Noureddine RAMDI</description><generator>Hugo</generator><language>en</language><lastBuildDate>Sat, 23 May 2026 20:41:27 +0000</lastBuildDate><atom:link href="https://ramdi.fr/tags/inference/index.xml" rel="self" type="application/rss+xml"/><item><title>Bytez: unified serverless inference across 220,000 AI models with a single API</title><link>https://ramdi.fr/github-stars/bytez-unified-serverless-inference-across-220000-ai-models-with-a-single-api/</link><pubDate>Sat, 23 May 2026 20:41:14 +0000</pubDate><guid>https://ramdi.fr/github-stars/bytez-unified-serverless-inference-across-220000-ai-models-with-a-single-api/</guid><description>Bytez offers a unified API for over 220,000 AI models with serverless GPU orchestration, abstracting model diversity into a single inference platform accessible via one key.</description></item><item><title>Inside Mini-SGLang: A clear and modular Python LLM inference engine</title><link>https://ramdi.fr/github-stars/inside-mini-sglang-a-clear-and-modular-python-llm-inference-engine/</link><pubDate>Sat, 23 May 2026 20:41:14 +0000</pubDate><guid>https://ramdi.fr/github-stars/inside-mini-sglang-a-clear-and-modular-python-llm-inference-engine/</guid><description>Mini-SGLang is a modular Python reimplementation of the SGLang LLM inference engine with production features like Radix Cache, chunked prefill, overlap scheduling, and tensor parallelism.</description></item><item><title>Falcon-Perception: a minimal multimodal PyTorch engine for object detection, segmentation, and OCR</title><link>https://ramdi.fr/github-stars/falcon-perception-a-minimal-multimodal-pytorch-engine-for-object-detection-segmentation-and-ocr/</link><pubDate>Mon, 04 May 2026 10:23:02 
+0000</pubDate><guid>https://ramdi.fr/github-stars/falcon-perception-a-minimal-multimodal-pytorch-engine-for-object-detection-segmentation-and-ocr/</guid><description>Falcon-Perception is a PyTorch engine for multimodal autoregressive Transformers handling detection, segmentation, and OCR with FlexAttention and efficient caching.</description></item><item><title>Lucebox Hub: hand-optimized CUDA kernels for efficient LLM inference on RTX 3090 and beyond</title><link>https://ramdi.fr/github-stars/lucebox-hub-hand-optimized-cuda-kernels-for-efficient-llm-inference-on-rtx-3090-and-beyond/</link><pubDate>Mon, 04 May 2026 10:23:02 +0000</pubDate><guid>https://ramdi.fr/github-stars/lucebox-hub-hand-optimized-cuda-kernels-for-efficient-llm-inference-on-rtx-3090-and-beyond/</guid><description>Lucebox Hub optimizes LLM inference on consumer GPUs using a megakernel CUDA approach and speculative decoding, achieving high throughput on RTX 3090 and newer Nvidia GPUs.</description></item><item><title>MicroGPT-C: Coordinating tiny GPT-2 models in C for edge logical reasoning</title><link>https://ramdi.fr/github-stars/microgpt-c-coordinating-tiny-gpt-2-models-in-c-for-edge-logical-reasoning/</link><pubDate>Mon, 04 May 2026 10:23:02 +0000</pubDate><guid>https://ramdi.fr/github-stars/microgpt-c-coordinating-tiny-gpt-2-models-in-c-for-edge-logical-reasoning/</guid><description>MicroGPT-C uses a deterministic C scaffold to coordinate tiny GPT-2 models, achieving 90%+ accuracy on logic games with 8x memory compression and infinite sequence lengths.</description></item><item><title>TextGen: a portable zero-config local LLM runner with multi-backend and multimodal support</title><link>https://ramdi.fr/github-stars/textgen-a-portable-zero-config-local-llm-runner-with-multi-backend-and-multimodal-support/</link><pubDate>Mon, 04 May 2026 10:23:02 
+0000</pubDate><guid>https://ramdi.fr/github-stars/textgen-a-portable-zero-config-local-llm-runner-with-multi-backend-and-multimodal-support/</guid><description>TextGen offers a portable desktop app for local LLMs with zero telemetry and multi-backend support. Drop GGUF models in a folder and run with no complex setup. It features multimodal vision, file attachments, and an OpenAI-compatible API.</description></item><item><title>vLLM: Efficient large language model serving with paged attention and continuous batching</title><link>https://ramdi.fr/github-stars/vllm-efficient-large-language-model-serving-with-paged-attention-and-continuous-batching/</link><pubDate>Sat, 02 May 2026 20:07:04 +0000</pubDate><guid>https://ramdi.fr/github-stars/vllm-efficient-large-language-model-serving-with-paged-attention-and-continuous-batching/</guid><description>vLLM is a Python library for high-throughput LLM inference using paged attention and continuous batching. It supports quantization, distributed inference, and an OpenAI-compatible API.</description></item></channel></rss>