Falcon-Perception: a minimal multimodal PyTorch engine for object detection, segmentation, and OCR

Falcon-Perception stands out by combining object detection, instance segmentation, and OCR into a single, dense autoregressive Transformer model that natively processes multimodal inputs. The key innovation is its FlexAttention-based hybrid attention masks, supporting bidirectional image attention alongside causal text, all compiled into efficient Triton kernels without custom CUDA code. This architecture enables continuous batching and high throughput on GPUs like Nvidia H100, with a paged key-value cache to handle very high-resolution images efficiently.

what Falcon-Perception does: a minimal, performant multimodal PyTorch inference engine

At its core, Falcon-Perception is a PyTorch-based inference engine for a dense autoregressive Transformer that integrates vision and language. Unlike typical pipelines that separate object detection, segmentation, and OCR into different models or stages, this repo implements a single Transformer model that directly attends to both image and text modalities.

The model is “natively multimodal,” meaning it uses a unified architecture to process images and natural language queries together. This setup enables tasks such as querying an image for objects, segmenting instances, or extracting text using OCR, all via natural language prompts.

Under the hood, the repo supports two backends: PyTorch on CUDA GPUs and MLX for Apple Silicon Macs, making it versatile across hardware. The inference engine is paged, using a continuous batching strategy alongside CUDA Graph capture for efficient execution. It handles large images by caching high-resolution image features and uses FlexAttention-based hybrid masks that combine bidirectional attention over images with causal attention over text.
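To make the continuous batching idea concrete, here is a minimal scheduling sketch in plain Python. It is not the repo's scheduler: the function name `run_continuous_batching`, the request format, and the one-token-per-step model are assumptions made purely for illustration. The point it shows is that a finished request frees its slot immediately, so a waiting request can join the batch on the very next step.

```python
from collections import deque


def run_continuous_batching(requests, max_batch_size):
    """Simulate decode steps for (request_id, tokens_to_generate) pairs.

    Hypothetical helper, not part of Falcon-Perception. Assumes every
    request needs at least one token. Returns, for each decode step,
    the sorted list of request ids that were active in that step.
    """
    waiting = deque(requests)
    active = {}  # request_id -> remaining decode steps
    schedule = []
    while waiting or active:
        # Admit waiting requests into any free batch slots.
        while waiting and len(active) < max_batch_size:
            request_id, tokens = waiting.popleft()
            active[request_id] = tokens
        schedule.append(sorted(active))
        # One decode step for every active request; drop finished ones.
        for request_id in list(active):
            active[request_id] -= 1
            if active[request_id] == 0:
                del active[request_id]
    return schedule


steps = run_continuous_batching([("a", 3), ("b", 1), ("c", 2)], max_batch_size=2)
for i, batch in enumerate(steps):
    print(i, batch)
# 0 ['a', 'b']
# 1 ['a', 'c']
# 2 ['a', 'c']
```

In this toy run request "c" takes over the slot freed by "b" after one step, so all three requests finish in 3 steps. A static batch of ("a", "b") followed by a separate batch for "c" would have needed 5.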

The OCR variant extends the base model with layout detection and per-region text extraction, supporting document analysis use cases.

The architecture leverages the latest PyTorch capabilities, including Triton-compiled fused kernels for attention, eliminating the need for custom CUDA kernels while maintaining high efficiency.

what makes Falcon-Perception technically interesting: flexattention hybrid masks and paged continuous batching

The standout technical feature of Falcon-Perception is its use of FlexAttention-based hybrid attention masks. Traditional Transformer attention mechanisms either use full bidirectional attention or causal masks for autoregressive decoding. Here, Falcon-Perception fuses these approaches by using hybrid masks that allow bidirectional attention over image tokens and causal attention over text tokens within the same model.

The masks are implemented with PyTorch’s flex_attention, which compiles the mask logic into fused Triton attention kernels, so no hand-written CUDA is needed. That keeps the code easier to maintain and lets the engine pick up upstream PyTorch optimizations, a clear win for both developer experience and performance.
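The hybrid rule can be sketched without a GPU. The snippet below is plain Python, not the repo's code; it borrows the `mask_mod(b, h, q_idx, kv_idx)` signature that `torch.nn.attention.flex_attention.create_block_mask` expects, and it assumes the simplest reading of the description: a query may see a key if the key is not in the future, or if both positions are image tokens. The helper name `make_hybrid_mask_mod` and the `is_image` list are hypothetical.

```python
def make_hybrid_mask_mod(is_image):
    """Build a mask function for a sequence where is_image[i] marks image tokens.

    Illustrative only: in a real flex_attention setup is_image would be a
    tensor and the indices would be tensors too, which is why the rule is
    written with `&` and `|` rather than `and` / `or`.
    """
    def mask_mod(b, h, q_idx, kv_idx):
        causal = kv_idx <= q_idx                         # text-style causal rule
        both_image = is_image[q_idx] & is_image[kv_idx]  # image tokens see each other
        return causal | both_image

    return mask_mod


# Three image tokens followed by two text tokens.
is_image = [True, True, True, False, False]
mask_mod = make_hybrid_mask_mod(is_image)
n = len(is_image)
mask = [[mask_mod(0, 0, q, kv) for kv in range(n)] for q in range(n)]

for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
# xxx..
# xxx..
# xxx..
# xxxx.
# xxxxx
```

The top-left 3x3 block is fully filled (bidirectional over the image), while the text rows keep the usual lower-triangular causal shape and can still look back at every image token.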

The repo also implements a paged inference engine that supports continuous batching. Instead of waiting for a whole batch to finish, the engine keeps the batch tightly packed: new queries join as earlier ones complete, which improves GPU utilization and reduces queueing latency. The key-value cache for attention is paged and supports high-resolution image features, enabling the model to handle large images that would otherwise overwhelm GPU memory.
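As a rough illustration of what "paged" means for a key-value cache, the sketch below tracks fixed-size pages and a per-sequence page table in plain Python. The class `PagedKVAllocator`, its methods, and the page size are hypothetical and only model the bookkeeping; the real engine's cache layout is not described in the text above.

```python
class PagedKVAllocator:
    """Toy page-table bookkeeping for a KV cache (no tensors involved)."""

    def __init__(self, num_pages, page_size):
        self.page_size = page_size
        self.free_pages = list(range(num_pages))
        self.page_tables = {}  # seq_id -> list of physical page ids
        self.lengths = {}      # seq_id -> number of cached tokens

    def append_tokens(self, seq_id, num_tokens):
        """Reserve enough pages to hold num_tokens more tokens for seq_id."""
        table = self.page_tables.get(seq_id, [])
        new_length = self.lengths.get(seq_id, 0) + num_tokens
        pages_needed = -(-new_length // self.page_size)  # ceiling division
        extra = pages_needed - len(table)
        if extra > len(self.free_pages):
            raise MemoryError("KV cache is out of free pages")
        table = table + [self.free_pages.pop() for _ in range(extra)]
        self.page_tables[seq_id] = table
        self.lengths[seq_id] = new_length

    def slot(self, seq_id, position):
        """Map a logical token position to (physical page id, offset in page)."""
        page = self.page_tables[seq_id][position // self.page_size]
        return page, position % self.page_size

    def release(self, seq_id):
        """Return all pages of a finished sequence to the free pool."""
        self.free_pages.extend(self.page_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


allocator = PagedKVAllocator(num_pages=4, page_size=16)
allocator.append_tokens("image-1", 40)   # 40 tokens need 3 pages of 16
allocator.append_tokens("image-1", 10)   # 50 tokens need a 4th page
pages_in_use = len(allocator.page_tables["image-1"])
allocator.release("image-1")
```

Because pages are allocated on demand and returned as soon as a sequence finishes, memory is not reserved up front for the longest possible sequence, which is what lets a paged cache coexist with continuous batching.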

On the performance side, the README reports first-run compilation and CUDA Graph capture taking roughly 10-30 seconds on an Nvidia H100 GPU. After that, subsequent inference runs achieve about 100ms for the prefill phase, 200ms for upsampling (which can be zero if cached), and 50ms for decoding around 10 tokens per instance. These numbers suggest the engine is well-optimized for real-time or near-real-time applications.
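A quick back-of-the-envelope sum of those reported phases, as a sketch: the constants are the approximate figures quoted above, while the function name and the assumption that the phases simply add up for a single request (one decode pass, no batching effects) are mine.

```python
# Approximate per-phase timings quoted from the README (H100, after warm-up).
PREFILL_MS = 100
UPSAMPLE_MS = 200   # drops to 0 when the upsampled features are cached
DECODE_MS = 50      # about 10 tokens for one instance


def estimate_latency_ms(upsample_cached):
    """Naive single-request estimate: the phases are assumed to run back to back."""
    upsample = 0 if upsample_cached else UPSAMPLE_MS
    return PREFILL_MS + upsample + DECODE_MS


print(estimate_latency_ms(upsample_cached=False))  # 350
print(estimate_latency_ms(upsample_cached=True))   # 150
```

Under these assumptions a single query lands at roughly 350ms uncached and 150ms with cached upsampling. Images with many instances would add decode time beyond this estimate.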

The codebase also supports a layout-aware OCR variant that adds document layout detection and text extraction per region, enhancing its utility for document analysis workflows.

The tradeoff here is the complexity of managing these hybrid attention masks and the paged cache, which adds architectural overhead. However, the benefit is a highly flexible and performant multimodal model inference engine that can be deployed on standard hardware without custom CUDA kernels.

quick start

installation

The package supports two backends: PyTorch (CUDA GPUs) and MLX (Apple Silicon Macs). A bare pip install auto-detects your platform, or you can pick an explicit extra.

| Install command | Backend | When to use |
|---|---|---|
| `pip install -e .` | Auto-detect | Mac -> MLX, Linux -> Torch |
| `pip install -e ".[torch]"` | PyTorch + CUDA | GPU server or explicit Torch on Mac |
| `pip install -e ".[mlx]"` | MLX | Apple Silicon Mac |
| `pip install -e ".[ocr]"` | Torch + transformers | Layout-aware OCR (needs a layout detection model) |
| `pip install -e ".[dev]"` | -- | Adds tensorboard, matplotlib, ipykernel |
| `pip install -e ".[server]"` | -- | Adds FastAPI / Uvicorn for the paged inference server |
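After installing, it can be handy to check which backend your environment would end up on. The sketch below mirrors the "Mac -> MLX, Linux -> Torch" rule from the table using only the standard library. It is an assumption about how such a check could look, not the package's actual detection logic, and `choose_backend` is a hypothetical helper.

```python
import importlib.util
import platform


def choose_backend(system, machine, has_torch, has_mlx):
    """Pick "mlx" on Apple Silicon when available, else "torch", else None."""
    on_apple_silicon = system == "Darwin" and machine == "arm64"
    if on_apple_silicon and has_mlx:
        return "mlx"
    if has_torch:
        return "torch"
    return None


# find_spec returns None when a top-level package is not installed.
backend = choose_backend(
    system=platform.system(),
    machine=platform.machine(),
    has_torch=importlib.util.find_spec("torch") is not None,
    has_mlx=importlib.util.find_spec("mlx") is not None,
)
print("selected backend:", backend)
```

Keeping the decision in a pure function makes it easy to test the rule without having either framework installed.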



verdict

Falcon-Perception is a well-crafted inference engine for practitioners needing a unified multimodal Transformer that can handle detection, segmentation, and OCR tasks with natural language queries. Its architecture is cutting-edge in terms of attention mechanism design and GPU execution optimization, thanks to PyTorch’s flex_attention and Triton kernel fusion.

That said, the complexity of the paged inference engine and hybrid attention masks means this repo is best suited for developers comfortable with advanced Transformer internals and GPU programming nuances. It’s not a plug-and-play solution for casual use but a solid foundation for building production-grade multimodal AI systems.

If your projects demand efficient, scalable multimodal inference on Nvidia GPUs or Apple Silicon, and you want to avoid custom CUDA kernel maintenance, Falcon-Perception is worth exploring. For broader AI workflows or training, additional tooling would be needed.

In summary, Falcon-Perception fills a niche for performant, flexible multimodal inference engines with a focus on practical deployment and developer experience.


→ GitHub Repo: tiiuae/Falcon-Perception ⭐ 605 · Python

Noureddine RAMDI / Falcon-Perception: a minimal multimodal PyTorch engine for object detection, segmentation, and OCR
