Noureddine RAMDI / LlamaFactory: modular, extensible fine-tuning framework for large language models

Created Sat, 02 May 2026 20:07:04 +0000 Modified Sat, 23 May 2026 20:41:27 +0000

hiyouga/LlamaFactory

Fine-tuning large language models (LLMs) is a moving target, with new models and training methods appearing frequently. LlamaFactory addresses this with a modular framework that supports fine-tuning more than 100 LLMs, including LLaMA, Mistral, Mixtral-MoE, and Qwen3. It pairs that extensibility with two zero-code interfaces, a CLI and a Web UI, so it serves both researchers and developers who need to integrate fine-tuning into a pipeline.

what LlamaFactory does and how it’s built

LlamaFactory is a Python-based framework built to simplify, scale, and standardize the fine-tuning of large language models. At its core, it supports a range of training stages and objectives: supervised fine-tuning, reward modeling, Proximal Policy Optimization (PPO), Direct Preference Optimization (DPO), and Kahneman-Tversky Optimization (KTO). It also integrates more recent algorithms from the LLM training literature, such as GaLore, DoRA, and PiSSA.
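To make the preference-based methods above more concrete, here is a minimal sketch of the standard DPO objective for a single preference pair. It is the textbook formula, not LlamaFactory's implementation; the function name, argument names, and the default `beta` are assumptions chosen for illustration.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss computed from summed sequence log-probabilities.

    All names here are hypothetical; this is not LlamaFactory's API.
    """
    # How much more (or less) likely each answer is under the policy
    # than under the frozen reference model.
    chosen_shift = policy_chosen_logp - ref_chosen_logp
    rejected_shift = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_shift - rejected_shift)
    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy and the reference agree exactly, the margin is zero and the loss is log 2. The loss falls as the policy shifts probability toward the chosen answer and away from the rejected one.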

The architecture is modular: new models, training algorithms, and optimization techniques can be added as separate components, which keeps the framework adaptable as the field changes. Under the hood, it builds on PyTorch and CUDA for GPU acceleration and includes optimizations such as FlashAttention-2 for faster attention and Liger Kernel for fused, memory-efficient training kernels.

For model efficiency, LlamaFactory supports the parameter-efficient fine-tuning techniques LoRA and QLoRA, combined with various quantization strategies that reduce memory footprint and speed up training and inference. It also offers multiple inference interfaces: an OpenAI-style API, a Gradio UI, and a CLI, all of which can run on efficient backends such as vLLM or SGLang workers.
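The saving from LoRA is easiest to see with a little arithmetic. The sketch below is plain Python, not LlamaFactory code. It counts trainable parameters for a full weight matrix versus a rank-r adapter, and applies the usual LoRA update y = Wx + (alpha / r) · B(Ax) to a toy input. The function names and the example dimensions are assumptions for illustration.

```python
def lora_param_counts(d_in, d_out, rank):
    """Return (full fine-tuning params, LoRA adapter params) for one layer."""
    full = d_in * d_out             # every entry of W is trainable
    lora = rank * (d_in + d_out)    # A is rank x d_in, B is d_out x rank
    return full, lora


def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(x, W, A, B, alpha, rank):
    """y = W x + (alpha / rank) * B (A x); W stays frozen, A and B train."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]


if __name__ == "__main__":
    full, lora = lora_param_counts(4096, 4096, 8)
    print(f"full: {full}, lora: {lora}, ratio: {lora / full:.4%}")
```

QLoRA keeps the same adapter matrices but stores the frozen base weights in 4-bit precision, which is where the bitsandbytes dependency comes in.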

The project is actively maintained and advertises “Day-N Support”, meaning new models and methods are integrated within days of their release. Documentation is extensive, and cloud training options are available for users without local GPU resources.

modular design and rapid integration of new methods

What sets LlamaFactory apart is its clean separation of concerns and extensibility. The codebase is organized into modules for models, fine-tuning algorithms, resource optimizations, and inference backends. This makes adding new LLM architectures or training methods straightforward without disrupting existing functionality.
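One common way to get this kind of separation is a registry: each training method registers itself under a name, and the entry point only looks names up. The sketch below illustrates that general pattern and is not taken from LlamaFactory's codebase. `TRAINERS`, `register_trainer`, `run`, and the two stub trainers are all hypothetical.

```python
from typing import Callable, Dict

# Hypothetical registry mapping a stage name to the function that runs it.
TRAINERS: Dict[str, Callable[[dict], str]] = {}


def register_trainer(name: str):
    """Decorator that adds a training routine to the registry."""
    def decorator(fn: Callable[[dict], str]) -> Callable[[dict], str]:
        if name in TRAINERS:
            raise ValueError(f"trainer {name!r} is already registered")
        TRAINERS[name] = fn
        return fn
    return decorator


@register_trainer("sft")
def run_sft(config: dict) -> str:
    # Stub: a real routine would build the model, data, and optimizer here.
    return f"sft on {config['model']}"


@register_trainer("dpo")
def run_dpo(config: dict) -> str:
    return f"dpo on {config['model']}"


def run(stage: str, config: dict) -> str:
    """Dispatch to the registered trainer, failing loudly on unknown stages."""
    try:
        trainer = TRAINERS[stage]
    except KeyError:
        raise ValueError(
            f"unknown stage {stage!r}; available: {sorted(TRAINERS)}"
        ) from None
    return trainer(config)
```

Adding a new method then means adding one decorated function in its own module, without editing the dispatcher or the existing trainers.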

The tradeoff here is complexity: supporting 100+ models and multiple fine-tuning strategies makes dependency management tricky, especially on Windows, where users must manually install PyTorch with CUDA and specialized libraries such as bitsandbytes for quantization. The Docker image standardizes the environment for Linux users, but it pins specific CUDA and PyTorch versions and may not match every hardware or driver setup.

The code quality is solid, with clear interfaces and modular classes. Algorithms like GaLore and PiSSA are implemented alongside classical methods, allowing users to experiment with state-of-the-art approaches without hunting down separate repos. The integration of FlashAttention-2 and other kernel-level optimizations shows attention to performance bottlenecks.

While the UI options (CLI and Web UI) make it accessible, power users can dive into configuration files and extend training scripts. The design pattern favors convention over configuration but remains flexible for customization.
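For a sense of what configuration-driven use looks like, the sketch below generates a small LoRA fine-tuning config from Python. The key names and values are modeled on the example configs in the project repository and should be checked against the installed version. The model, dataset, and output path are placeholders, and `render_flat_yaml` is a hypothetical helper that handles only flat key-value pairs.

```python
from pathlib import Path


def render_flat_yaml(config):
    """Serialize a flat dict of scalars as simple 'key: value' YAML lines."""
    lines = []
    for key, value in config.items():
        # YAML booleans are conventionally lowercase.
        if isinstance(value, bool):
            text = "true" if value else "false"
        else:
            text = str(value)
        lines.append(f"{key}: {text}")
    return "\n".join(lines) + "\n"


# Assumed keys and placeholder values; verify against the project's examples.
LORA_SFT = {
    "model_name_or_path": "meta-llama/Meta-Llama-3-8B-Instruct",
    "stage": "sft",
    "do_train": True,
    "finetuning_type": "lora",
    "lora_rank": 8,
    "dataset": "alpaca_en_demo",
    "template": "llama3",
    "output_dir": "saves/llama3-8b/lora/sft",
    "per_device_train_batch_size": 1,
    "learning_rate": 0.0001,
    "num_train_epochs": 3.0,
}


def write_config(path, config=LORA_SFT):
    """Write the config to disk and return the path that was written."""
    target = Path(path)
    target.write_text(render_flat_yaml(config), encoding="utf-8")
    return target
```

A file written this way would then be passed to the CLI, for example with llamafactory-cli train lora_sft.yaml.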

quick start

The project includes detailed installation instructions, emphasizing the importance of environment setup. Here are the core steps to get started from source:

git clone --depth 1 https://github.com/hiyouga/LlamaFactory.git
cd LlamaFactory
pip install -e .
pip install -r requirements/metrics.txt

To install the optional metrics and DeepSpeed dependencies in a single step:

pip install -e . && pip install -r requirements/metrics.txt -r requirements/deepspeed.txt

On Windows, PyTorch with CUDA support must be installed manually, and QLoRA additionally requires a Windows-compatible bitsandbytes build. The commands below reinstall PyTorch from the CUDA 12.6 wheel index and verify that the GPU is visible:

pip uninstall torch torchvision torchaudio
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu126
python -c "import torch; print(torch.cuda.is_available())"
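The one-line check above can be extended into a small script that also reports whether bitsandbytes is installed. This is a sketch using only the standard library plus `torch.cuda.is_available()`. The function name `check_environment` and the report format are assumptions, not part of LlamaFactory.

```python
import importlib.util


def check_environment(packages=("torch", "bitsandbytes")):
    """Report which packages are importable and whether CUDA is usable."""
    report = {}
    for name in packages:
        # find_spec returns None when a top-level package is not installed.
        report[name] = importlib.util.find_spec(name) is not None

    report["cuda"] = False
    if report.get("torch"):
        try:
            import torch
            report["cuda"] = bool(torch.cuda.is_available())
        except (ImportError, OSError):
            # A broken install (for example, missing CUDA DLLs) lands here.
            report["cuda"] = False
    return report


if __name__ == "__main__":
    for key, ok in check_environment().items():
        print(f"{key}: {'ok' if ok else 'missing'}")
```

If torch reports ok but cuda reports missing, the usual cause is a CPU-only wheel, which the reinstall from the CUDA wheel index is meant to fix.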

The Docker image provides a ready-to-run environment for Linux users with CUDA 12.4, PyTorch 2.6.0, and FlashAttention 2.7.4:

docker run -it --rm --gpus=all --ipc=host hiyouga/llamafactory:latest

For running the Web UI, the project recommends using uv to create an isolated Python environment:

uv run llamafactory-cli webui

This setup covers most use cases, from quick experiments to production fine-tuning pipelines.

verdict

LlamaFactory is a solid, practitioner-oriented framework that balances breadth and depth in LLM fine-tuning. Its modular architecture and rapid integration of new models and algorithms make it a valuable tool for researchers and developers who need to stay current with fast-moving LLM research.

The tradeoff is the complexity of managing dependencies and environment setup, especially for Windows users and advanced quantization features. However, the comprehensive documentation and Docker support ease this burden considerably.

If you’re working with large language models and want a one-stop shop for fine-tuning with support for advanced methods like PPO, DPO, LoRA, and quantization, LlamaFactory is worth a look. Despite the zero-code interfaces, it is not a plug-and-play solution for casual users, but for those who want to dig into fine-tuning research and stay flexible as new models and techniques arrive, it offers a robust foundation.


→ GitHub Repo: hiyouga/LlamaFactory ⭐ 70,618 · Python