In-Place TTT from ByteDance Seed tackles a subtle but important limitation in modern large language model (LLM) inference: the static “train then deploy” paradigm. Transformer LLMs are typically trained once and then deployed with frozen weights, which limits their ability to adapt to new or evolving information during inference. This repo instead treats the model’s MLP down-projection weights as fast weights and updates them in place at test time, so the model can adjust dynamically while it processes and generates text.
What in-place test-time training for transformers means
In-Place TTT is a test-time training (TTT) method designed specifically for transformer-based LLMs. Instead of adding side modules or external memory to enable adaptation, it works entirely within the existing transformer architecture by updating the down-projection weights in the MLP blocks on the fly.
The core idea is to maintain “fast weights” — temporary, quickly updated model parameters — tied to the MLP down-projection layers. These weights are updated chunk-wise during inference, aligned with the next-token prediction loss signal, allowing the model to refine its internal representation as it processes long sequences.
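The chunk-wise update described above can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the repo's implementation: the function name `chunked_fast_weight_pass`, the chunk size, the learning rate, and the squared-error objective standing in for the next-token-aligned training signal are all hypothetical. What it shows is the control flow: each chunk is projected with the fast weights learned from earlier chunks, then those weights take one gradient step before the next chunk arrives, while the pretrained weights stay untouched.

```python
import numpy as np


def chunked_fast_weight_pass(hidden, targets, w_down, chunk_size=4, lr=0.05):
    """Toy chunk-wise fast-weight update for a down-projection matrix.

    hidden:  (seq_len, d_ff) activations feeding the down-projection
    targets: (seq_len, d_model) stand-in regression targets (an assumption;
             the source only says the update follows a next-token-aligned signal)
    w_down:  (d_ff, d_model) pretrained "slow" weights, never modified here
    """
    w_fast = w_down.copy()  # fast weights start from the slow weights
    outputs, losses = [], []

    for start in range(0, hidden.shape[0], chunk_size):
        h = hidden[start:start + chunk_size]
        y = targets[start:start + chunk_size]

        # Project the chunk with weights adapted on *earlier* chunks only.
        out = h @ w_fast
        outputs.append(out)

        # Mean squared error on this chunk, recorded for inspection.
        err = out - y
        losses.append(float(np.mean(err ** 2)))

        # One gradient step on 0.5 * mean-over-tokens of ||h W - y||^2.
        grad = h.T @ err / h.shape[0]
        w_fast = w_fast - lr * grad

    return np.concatenate(outputs, axis=0), w_fast, losses


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    hidden = rng.normal(size=(12, 8))
    targets = rng.normal(size=(12, 4))
    w_slow = rng.normal(size=(8, 4))

    out, w_fast, losses = chunked_fast_weight_pass(hidden, targets, w_slow)
    print("output shape:", out.shape)
    print("per-chunk loss:", losses)
```

Because the fast weights are a copy, discarding them at the end of a sequence returns the model to its pretrained state; the actual method may handle resets and the update rule differently.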
This chunk-wise updating supports very long context windows, scaling from 4K tokens up to 256K tokens, which is a challenging regime for standard transformers. The approach builds on VeOmni, a modular transformer training framework that supports Fully Sharded Data Parallel (FSDP2) training, checkpoint conversion, and evaluation pipelines.
The repo is implemented in Python, leveraging PyTorch with FlashAttention for efficient attention computation. It is designed to support large models such as Qwen3-8B and LLaMA-3.1-8B, providing full training, checkpoint management, and evaluation tooling.
Architectural elegance and tradeoffs
What sets In-Place TTT apart is its minimalistic but effective mechanism: updating the MLP down-projection weights directly within the transformer layers without additional architectural complexity. Most adaptive inference methods add external memory or side networks, or otherwise require architectural modifications. Here, the fast-weight updates happen “in place,” preserving the original model structure.
This design reduces overhead and complexity in deployment, as there’s no need to manage extra modules or memory buffers. The fast weights are updated in sync with the next-token prediction chunks, making the process tightly coupled with the core language modeling task.
The tradeoff is that this method demands careful chunk-wise scheduling to maintain stability and effectiveness. It also assumes access to the MLP down-projection weights and may not generalize easily to all transformer variants without adaptation.
The repo also integrates with VeOmni’s checkpoint conversion and RULER evaluation system, which supports rigorous testing of long-context reasoning capabilities.
From a code quality perspective, the repo is well-structured around the PyTorch ecosystem, with dependencies on FlashAttention for GPU-efficient attention computation and various data handling and training tools. The use of FSDP2 implies it targets distributed training at scale.
Quick start
To get started with In-Place TTT, follow these commands exactly as provided:
# Step 1: Install PyTorch and FlashAttention
pip3 install torch==2.8.0 torchvision==0.23.0 torchaudio==2.8.0 --index-url https://download.pytorch.org/whl/cu128
wget https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.3/flash_attn-2.8.3+cu12torch2.8cxx11abiTRUE-cp311-cp311-linux_x86_64.whl
pip3 install flash_attn-2.8.3+cu12torch2.8cxx11abiTRUE-cp311-cp311-linux_x86_64.whl
rm flash_attn-2.8.3+cu12torch2.8cxx11abiTRUE-cp311-cp311-linux_x86_64.whl
# Step 2: Install VeOmni from the validated commit
pip3 install "veomni @ git+https://github.com/ByteDance-Seed/VeOmni.git@9b91e164bea9e17f17ed490aab5e076c2335ca25"
# Step 3: Install remaining dependencies
pip3 install liger-kernel
pip3 install byted-wandb torchdata blobfile datasets diffusers tiktoken timm
pip3 install transformers==4.57.3
pip3 install opt_einsum einops
pip3 uninstall -y byted-wandb wandb
pip3 install byted-wandb
# Step 4: (Optional) Verify VeOmni installation
python3 - <<'PY'
import json, pathlib, veomni
p = pathlib.Path(veomni.__file__).resolve().parents[1] / "veomni-0.1.0.dist-info" / "direct_url.json"
print("veomni file:", veomni.__file__)
print("direct_url:", json.loads(p.read_text()) if p.exists() else "not found")
PY
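The FlashAttention wheel in step 1 encodes its build targets in the filename (CPython 3.11, Linux x86_64). A small stdlib-only check, sketched here, can confirm that the current interpreter matches before installing. The helper names `wheel_tags` and `interpreter_matches` are hypothetical, and the comparison is a simplification of real wheel-compatibility rules, not a replacement for pip's own resolution.

```python
import sys
import sysconfig

# Filename copied from the install commands above.
WHEEL = "flash_attn-2.8.3+cu12torch2.8cxx11abiTRUE-cp311-cp311-linux_x86_64.whl"


def wheel_tags(filename):
    """Split a wheel filename into name, version, and compatibility tags.

    Assumes the five-part form name-version-python-abi-platform
    (no optional build tag), which holds for the wheel above.
    """
    stem = filename[: -len(".whl")]
    name, version, python_tag, abi_tag, platform_tag = stem.split("-")
    return {
        "name": name,
        "version": version,
        "python": python_tag,
        "abi": abi_tag,
        "platform": platform_tag,
    }


def interpreter_matches(tags):
    """Rough check: same CPython minor version and same platform string."""
    current_python = f"cp{sys.version_info.major}{sys.version_info.minor}"
    current_platform = sysconfig.get_platform().replace("-", "_").replace(".", "_")
    return tags["python"] == current_python and tags["platform"] == current_platform


if __name__ == "__main__":
    tags = wheel_tags(WHEEL)
    print(tags)
    print("interpreter matches wheel:", interpreter_matches(tags))
```

If the check prints `False`, the prebuilt wheel is unlikely to install on the current interpreter, and a different FlashAttention release asset or a source build would be needed.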
Note that data preparation is not included in the repo — you need to supply your own processed datasets. The repo expects plaintext datasets loaded via VeOmni’s iterable dataset interface.
Training is launched with scripts like:
bash train.sh tasks/train_torch.py configs/pretrain/qwen3_longct.yaml \
--data.train_path /path/to/your_data \
--train.output_dir /path/to/your_output_dir
Verdict
In-Place TTT is a solid contribution for ML engineers and researchers interested in adaptive inference for transformers. It offers a neat architectural solution to the test-time training challenge without bloating the model with extra modules.
Its main appeal is to those who work with very long context windows and want transformer models to adapt dynamically during inference. The method is well integrated with modern tooling like VeOmni and FSDP2, which suits large-scale training environments.
Limitations include the complexity of managing chunk-wise updates and the dependency on specific model architectures. Also, the absence of data processing scripts means some setup work is required before training.
Overall, if you are exploring ways to break the static deployment mold of LLMs and want a clean, in-place fast-weight update method, this repo is worth a close look.
→ GitHub Repo: ByteDance-Seed/In-Place-TTT ⭐ 195 · Python