<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Quantization on Noureddine RAMDI</title><link>https://ramdi.fr/tags/quantization/</link><description>Recent content in Quantization on Noureddine RAMDI</description><generator>Hugo</generator><language>en</language><lastBuildDate>Sat, 23 May 2026 20:41:27 +0000</lastBuildDate><atom:link href="https://ramdi.fr/tags/quantization/index.xml" rel="self" type="application/rss+xml"/><item><title>vLLM Compressor: Practical quantization and compression for large language model inference</title><link>https://ramdi.fr/github-stars/vllm-compressor-practical-quantization-and-compression-for-large-language-model-inference/</link><pubDate>Sat, 23 May 2026 20:41:14 +0000</pubDate><guid>https://ramdi.fr/github-stars/vllm-compressor-practical-quantization-and-compression-for-large-language-model-inference/</guid><description>vLLM Compressor applies advanced quantization and compression techniques to large language models, enabling optimized inference without requiring full model definitions.</description></item><item><title>LiteRT-LM: Google's C++ library for efficient edge language model inference</title><link>https://ramdi.fr/github-stars/litert-lm-google-s-c-library-for-efficient-edge-language-model-inference/</link><pubDate>Mon, 04 May 2026 10:23:02 +0000</pubDate><guid>https://ramdi.fr/github-stars/litert-lm-google-s-c-library-for-efficient-edge-language-model-inference/</guid><description>LiteRT-LM is a Google AI Edge C++ library for performant language model inference on edge devices with multi-language API support and easy CLI usage.</description></item><item><title>Lucebox Hub: hand-optimized CUDA kernels for efficient LLM inference on RTX 3090 and beyond</title><link>https://ramdi.fr/github-stars/lucebox-hub-hand-optimized-cuda-kernels-for-efficient-llm-inference-on-rtx-3090-and-beyond/</link><pubDate>Mon, 04 May 2026 10:23:02 +0000</pubDate><guid>https://ramdi.fr/github-stars/lucebox-hub-hand-optimized-cuda-kernels-for-efficient-llm-inference-on-rtx-3090-and-beyond/</guid><description>Lucebox Hub optimizes LLM inference on consumer GPUs using a megakernel CUDA approach and speculative decoding, achieving high throughput on RTX 3090 and newer Nvidia GPUs.</description></item><item><title>A hands-on course for mastering large language models: fine-tuning, quantization, and tooling</title><link>https://ramdi.fr/github-stars/a-hands-on-course-for-mastering-large-language-models-fine-tuning-quantization-and-tooling/</link><pubDate>Sat, 02 May 2026 20:07:04 +0000</pubDate><guid>https://ramdi.fr/github-stars/a-hands-on-course-for-mastering-large-language-models-fine-tuning-quantization-and-tooling/</guid><description>Explore a comprehensive LLM course with practical notebooks on fine-tuning (QLoRA, DPO), quantization (GPTQ), and tools like AutoEval and LazyMergekit. Ideal for aspiring LLM engineers.</description></item><item><title>LlamaFactory: modular, extensible fine-tuning framework for large language models</title><link>https://ramdi.fr/github-stars/llamafactory-modular-extensible-fine-tuning-framework-for-large-language-models/</link><pubDate>Sat, 02 May 2026 20:07:04 +0000</pubDate><guid>https://ramdi.fr/github-stars/llamafactory-modular-extensible-fine-tuning-framework-for-large-language-models/</guid><description>LlamaFactory offers a modular Python framework for fine-tuning 100+ LLMs with diverse algorithms and optimizations, including LoRA, QLoRA, and reinforcement learning.</description></item></channel></rss>