Inference on Noureddine RAMDI

https://ramdi.fr/tags/inference/
Recent content in Inference on Noureddine RAMDI
Sat, 23 May 2026 20:41:27 +0000

Bytez: unified serverless inference across 220,000 AI models with a single API
Sat, 23 May 2026 20:41:14 +0000
https://ramdi.fr/github-stars/bytez-unified-serverless-inference-across-220000-ai-models-with-a-single-api/
Bytez offers a unified API for over 220,000 AI models with serverless GPU orchestration, abstracting model diversity into a single inference platform accessible via one key.

Inside Mini-SGLang: A clear and modular Python LLM inference engine
Sat, 23 May 2026 20:41:14 +0000
https://ramdi.fr/github-stars/inside-mini-sglang-a-clear-and-modular-python-llm-inference-engine/
Mini-SGLang is a modular Python reimplementation of the SGLang LLM inference engine with production features like Radix Cache, chunked prefill, overlap scheduling, and tensor parallelism.

Falcon-Perception: a minimal multimodal PyTorch engine for object detection, segmentation, and OCR
Mon, 04 May 2026 10:23:02 +0000
https://ramdi.fr/github-stars/falcon-perception-a-minimal-multimodal-pytorch-engine-for-object-detection-segmentation-and-ocr/
Falcon-Perception is a PyTorch engine for multimodal autoregressive Transformers handling detection, segmentation, and OCR with FlexAttention and efficient caching.

Lucebox Hub: hand-optimized CUDA kernels for efficient LLM inference on RTX 3090 and beyond
Mon, 04 May 2026 10:23:02 +0000
https://ramdi.fr/github-stars/lucebox-hub-hand-optimized-cuda-kernels-for-efficient-llm-inference-on-rtx-3090-and-beyond/
Lucebox Hub optimizes LLM inference on consumer GPUs using a megakernel CUDA approach and speculative decoding, achieving high throughput on RTX 3090 and newer Nvidia GPUs.

MicroGPT-C: Coordinating tiny GPT-2 models in C for edge logical reasoning
Mon, 04 May 2026 10:23:02 +0000
https://ramdi.fr/github-stars/microgpt-c-coordinating-tiny-gpt-2-models-in-c-for-edge-logical-reasoning/
MicroGPT-C uses a deterministic C scaffold to coordinate tiny GPT-2 models, achieving 90%+ accuracy on logic games with 8x memory compression and infinite sequence lengths.

TextGen: a portable zero-config local LLM runner with multi-backend and multimodal support
Mon, 04 May 2026 10:23:02 +0000
https://ramdi.fr/github-stars/textgen-a-portable-zero-config-local-llm-runner-with-multi-backend-and-multimodal-support/
TextGen offers a portable desktop app for local LLMs with zero telemetry and multi-backend support. Drop GGUF models in a folder and run with no complex setup. It features multimodal vision, file attachments, and an OpenAI-compatible API.

vLLM: Efficient large language model serving with paged attention and continuous batching
Sat, 02 May 2026 20:07:04 +0000
https://ramdi.fr/github-stars/vllm-efficient-large-language-model-serving-with-paged-attention-and-continuous-batching/
vLLM is a Python library for high-throughput LLM inference using paged attention and continuous batching. It supports quantization, distributed inference, and an OpenAI-compatible API.