Continuous perception from video streams requires models that can track and interpret visual information over time. OmniStream addresses this with a multi-frame transformer that explicitly indexes patches across temporal frames, which makes it a plausible foundation for vision-language-action systems.
what OmniStream does: continuous stream perception with multi-frame transformers
OmniStream is a PyTorch implementation accompanying a research paper on continuous stream perception, reconstruction, and action. The core model is a multi-frame transformer that processes sequences of video frames or images. In the README example, frames enter the image processor as an array of shape (B×T, H, W, C), with batch and time flattened into the leading dimension, and are reshaped to (B, T, H, W, C) before the forward pass.
The model returns several outputs: the hidden states for each token (patch), a pooled CLS representation of the input, and a patch_start_idx. According to the README's inline comment, patch_start_idx is the index of the first patch token of each frame, which follows one CLS token and four register tokens. Downstream tasks rely on it to separate special tokens from patch tokens and to attribute patches to the frame they came from.
The repo integrates tightly with the HuggingFace transformers ecosystem (version 4.56.1), leveraging their model loading and image processing utilities. It requires PyTorch 2.6.0 with CUDA 12.4 support for GPU acceleration.
Currently, the codebase offers inference-only functionality using pre-trained weights hosted on HuggingFace. Training scripts and vision-language or vision-language-action (VLM/VLA) model extensions are marked as TODO.
This design positions OmniStream as a foundational vision encoder for agentic AI systems that must handle continuous visual streams rather than isolated images.
what makes OmniStream technically interesting: temporal patch indexing and multi-frame attention
The standout feature of OmniStream is how it manages temporal continuity across frames with patch-level indexing. Instead of treating each frame independently or flattening the time dimension, the model tracks the start index of patches for each frame inside the concatenated token sequence.
This patch_start_idx allows explicit temporal frame boundary awareness within the transformer architecture. It means that downstream models or tasks can identify which patches belong to which frames, enabling fine-grained temporal reasoning.
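To make the indexing concrete, here is a minimal sketch of how per-frame token ranges could be derived from a patch start index. It assumes, hypothetically, that every frame contributes the same number of tokens, laid out contiguously with the special tokens first. The repo's actual output layout may differ, and `frame_token_ranges` is an illustrative helper, not part of OmniStream.

```python
def frame_token_ranges(num_frames, tokens_per_frame, patch_start_idx):
    """Return (special_range, patch_range) per frame in a flattened sequence.

    Assumed layout (hypothetical): frame t occupies positions
    [t * tokens_per_frame, (t + 1) * tokens_per_frame), and its first
    patch_start_idx positions hold special tokens (e.g. CLS, registers).
    """
    if not 0 <= patch_start_idx <= tokens_per_frame:
        raise ValueError("patch_start_idx must lie within one frame's tokens")
    ranges = []
    for t in range(num_frames):
        start = t * tokens_per_frame
        special = range(start, start + patch_start_idx)
        patches = range(start + patch_start_idx, start + tokens_per_frame)
        ranges.append((special, patches))
    return ranges


# Toy example: 3 frames, 9 tokens each (5 special + 4 patches).
ranges = frame_token_ranges(num_frames=3, tokens_per_frame=9, patch_start_idx=5)
for t, (special, patches) in enumerate(ranges):
    print(f"frame {t}: special={list(special)} patches={list(patches)}")
```

With real model outputs, the same ranges could be used to slice the hidden-state tensor along its token axis, once the actual layout has been checked against the returned shapes.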
The multi-frame transformer architecture used here processes the entire video sequence holistically, attending across both spatial and temporal dimensions. This contrasts with models that process frames separately or rely on recurrent mechanisms.
There is a tradeoff in this choice: the model’s complexity and memory footprint grow with the number of frames processed simultaneously, which could limit scalability for very long video streams. However, the benefit is a more integrated spatiotemporal representation.
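A rough back-of-the-envelope sketch shows why frame count matters. It assumes a 16x16 patch size, 5 special tokens per frame, and full attention over all tokens of all frames. None of these are confirmed by the repo, and OmniStream may use a different patch size or a causal or windowed attention pattern that scales better.

```python
def tokens_per_frame(height, width, patch_size, num_special):
    """Patch tokens from a non-overlapping grid, plus per-frame special tokens."""
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be divisible by patch_size")
    return (height // patch_size) * (width // patch_size) + num_special


def attention_pairs(num_frames, per_frame_tokens):
    """Query-key pairs per head per layer under full attention over all frames."""
    total = num_frames * per_frame_tokens
    return total * total


# Hypothetical settings: 16x16 patches, 5 special tokens (1 CLS + 4 registers).
per_frame = tokens_per_frame(512, 512, patch_size=16, num_special=5)
for frames in (4, 8, 16, 32):
    # Columns: frame count, total tokens, query-key pairs.
    print(frames, frames * per_frame, attention_pairs(frames, per_frame))
```

Under these assumptions, doubling the number of frames doubles the token count but quadruples the number of attention pairs, which is the scaling concern for long streams.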
The repository’s codebase is relatively clean given its research focus. The main model resides in model.py, with inference examples showing how to load and run the model with HuggingFace’s AutoImageProcessor. The use of the transformers library means the model benefits from a standardized API and ecosystem support.
One limitation is that the repo currently lacks training and fine-tuning code, so applying the model beyond inference requires additional work. Also, the VLM/VLA integration is not yet implemented, which means this foundation still needs extension for full agentic vision-language-action systems.
quick start: install and run inference with pre-trained weights
The installation process is straightforward if you have a CUDA 12.4-compatible GPU and conda for environment management. The README provides exact commands:
git clone https://github.com/Go2Heart/OmniStream.git
cd OmniStream
conda create -n omnistream python=3.10 -y
conda activate omnistream
pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu124
pip install transformers==4.56.1
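After installing, it can be worth confirming that the environment matches the pinned versions. The following is a minimal sketch using only the standard library. `find_mismatches` is a hypothetical helper written for this article, not something shipped with the repo.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins taken from the install commands above.
PINS = {"torch": "2.6.0", "torchvision": "0.21.0", "transformers": "4.56.1"}


def find_mismatches(pins, get_version=version):
    """Return {package: (expected, installed)} for every pin that is not met."""
    mismatches = {}
    for name, expected in pins.items():
        try:
            installed = get_version(name)
        except PackageNotFoundError:
            installed = None
        # CUDA wheels carry a local suffix such as "2.6.0+cu124";
        # compare only the base version.
        base = installed.split("+")[0] if installed else None
        if base != expected:
            mismatches[name] = (expected, installed)
    return mismatches


if __name__ == "__main__":
    for name, (expected, installed) in find_mismatches(PINS).items():
        print(f"{name}: expected {expected}, found {installed}")
```

An empty result means every pinned package is present at the expected base version. It does not verify that a CUDA-capable GPU is actually visible.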
For inference, you can load the pre-trained model from HuggingFace and run a dummy input tensor representing a batch of 16 frames of 512x512 RGB images:
import numpy as np
import torch
from transformers import AutoImageProcessor

from model import OmnistreamMultiFrameTransformer

processor = AutoImageProcessor.from_pretrained("StreamFormer/OmniStream")
model = OmnistreamMultiFrameTransformer.from_pretrained("StreamFormer/OmniStream").to("cuda")
model.eval()

# Dummy input: 16 frames of 512x512 RGB
fake_pixel = np.random.randn(16, 512, 512, 3)  # BxT, H, W, C
fake_input = processor(images=fake_pixel, return_tensors="pt").to("cuda")  # BxT, H, W, C
fake_input["pixel_values"] = fake_input["pixel_values"].unsqueeze(0).float()  # B, T, H, W, C

# Inference only: no gradients needed
with torch.no_grad():
    output = model(**fake_input, return_dict=True)

print(output.keys())
print(output["last_hidden_state"].shape)  # last layer's hidden states
print(output["hidden_states"][-1].shape)  # last layer's hidden states
print(output["pooler_output"].shape)  # cls token
print(output["patch_start_idx"])  # index of the first patch of each frame (1x[cls] + 4x[reg])
This example demonstrates the core output tensors you get from the model, including the CLS token embeddings and the temporal patch indexing.
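The `unsqueeze(0)` step in the example treats all 16 frames as a single clip. For several equal-length clips, the flattened leading dimension would instead need to be regrouped into (B, T, ...). The helper below is a hypothetical sketch of that shape arithmetic in plain Python. Whether the released weights behave well with more than one clip per batch is not something the README states.

```python
def regroup_shape(flat_shape, num_frames):
    """Map a (B*T, ...) shape to (B, T, ...) for clips of num_frames frames each."""
    total, *rest = flat_shape
    if num_frames <= 0 or total % num_frames:
        raise ValueError(f"{total} frames cannot be split into clips of {num_frames}")
    return (total // num_frames, num_frames, *rest)


# One clip of 16 frames: same result as unsqueeze(0).
print(regroup_shape((16, 512, 512, 3), num_frames=16))
# Two clips of 16 frames each.
print(regroup_shape((32, 512, 512, 3), num_frames=16))
```

With a PyTorch tensor, the resulting tuple can be passed to `tensor.reshape(...)`, assuming the frames of each clip are stored contiguously in the flattened batch.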
verdict: a solid inference foundation for continuous vision transformers with room to grow
OmniStream offers a clean and focused implementation of a multi-frame transformer for continuous video stream perception, with an explicit temporal patch indexing mechanism that stands out. It’s a useful resource if you want to experiment with continuous spatiotemporal vision encoders or build on top of vision-language-action research.
The current limitations are clear: no training or fine-tuning code, no VLM/VLA integration yet, and a potential memory tradeoff when processing long sequences. If your goal is to run inference with pre-trained weights on video streams and explore the architecture, OmniStream fits well.
For practitioners looking to build agentic AI systems that need continuous perception from video inputs, this repo provides a practical starting point, especially if you are comfortable with the PyTorch and HuggingFace transformer stack.
Overall, OmniStream is worth exploring to understand how multi-frame transformers can segment and index patches across time, but it’s not yet a complete end-to-end solution for vision-language-action tasks.
→ GitHub Repo: Go2Heart/OmniStream ⭐ 92 · Python