Streaming 3D scene reconstruction with LingBot-Map’s geometric context transformer

Streaming 3D reconstruction from long image sequences is hard: you want fast, stable inference without memory blowing up as the sequence grows to thousands of frames. LingBot-Map tackles this by combining a specialized transformer architecture with memory caching strategies to deliver real-time 3D scene reconstruction on sequences exceeding 10,000 frames.

What LingBot-Map does: streaming 3D scene reconstruction with a geometric context transformer

LingBot-Map is a feed-forward 3D foundation model, implemented in Python, that reconstructs scenes continuously from sequential image data. It processes long video-like sequences frame by frame, producing dense 3D reconstructions in real time.

At the core is the Geometric Context Transformer architecture. This transformer variant is tailored for 3D reconstruction tasks and integrates several key concepts:

Coordinate grounding: The model explicitly anchors features to 3D coordinates.
Dense geometric cues: It processes depth and pose information densely to maintain spatial consistency.
Long-range drift correction: Using anchor context tokens, pose-reference windows, and a trajectory memory module, the model corrects accumulated pose drift over long sequences.

The system runs inference at about 20 frames per second on images sized 518×378. It handles sequences exceeding 10,000 frames by employing a paged KV cache attention mechanism through FlashInfer, which supports efficient caching and retrieval of key-value pairs in the transformer’s attention layers. This prevents the typical memory explosion that occurs when naively caching all frames.

LingBot-Map supports both interactive visualization via the Viser viewer and offline batch rendering pipelines. It implements a keyframe interval strategy to reduce memory usage by caching only every N-th frame while still producing outputs for all frames.
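The keyframe interval idea can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not LingBot-Map's code: `StreamingCache`, `keyframe_interval`, and the string "features" are hypothetical stand-ins for whatever the model actually stores.

```python
from dataclasses import dataclass, field


@dataclass
class StreamingCache:
    """Toy cache that stores only every N-th frame (hypothetical, not the repo's API)."""
    keyframe_interval: int = 4
    cached: list = field(default_factory=list)

    def step(self, frame_idx: int, features: str) -> dict:
        # Every frame sees the cached keyframes as context and yields an output.
        output = {"frame": frame_idx, "context_size": len(self.cached)}
        # Only keyframes are written back, so the cache grows N times more slowly.
        if frame_idx % self.keyframe_interval == 0:
            self.cached.append((frame_idx, features))
        return output


cache = StreamingCache(keyframe_interval=4)
outputs = [cache.step(i, f"feat_{i}") for i in range(10)]
print(len(outputs), [idx for idx, _ in cache.cached])
```

Ten frames go in and ten outputs come out, but only the frames whose index is a multiple of the interval stay in memory.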

Technical strengths: paged KV cache attention and memory management for long sequences

The standout technical feature is LingBot-Map’s use of FlashInfer’s paged KV cache attention combined with a keyframe interval memory strategy. This approach lets the model maintain high throughput and low latency in streaming inference over very long sequences:

Paged KV cache attention: Instead of holding the key-value cache in one contiguous buffer that has to grow with the sequence, the cache is split into fixed-size pages that are allocated on demand, which cuts fragmentation and wasted memory.
Keyframe interval strategy: Instead of caching every frame, the model caches only selected keyframes (every N-th frame). Intermediate frames are predicted using cached keyframe information plus current input, balancing accuracy and memory demand.
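To make the paging idea concrete, here is a minimal pure-Python sketch of a paged cache. It is not FlashInfer's API or LingBot-Map's implementation: the class `PagedKVCache` and its methods are hypothetical, and eviction is included only to show why pages are convenient (the source does not say whether the model evicts pages).

```python
class PagedKVCache:
    """Toy paged cache: entries live in fixed-size pages drawn from a shared pool."""

    def __init__(self, num_pages: int, page_size: int):
        self.page_size = page_size
        self.pages = [[] for _ in range(num_pages)]  # physical storage
        self.free = list(range(num_pages))           # unused page ids
        self.page_table = []                         # logical order -> page id

    def append(self, kv):
        # Allocate a new page only when there is none yet or the last one is full.
        last_full = (
            self.page_table
            and len(self.pages[self.page_table[-1]]) == self.page_size
        )
        if not self.page_table or last_full:
            if not self.free:
                raise MemoryError("page pool exhausted")
            self.page_table.append(self.free.pop())
        self.pages[self.page_table[-1]].append(kv)

    def evict_oldest_page(self):
        # Dropping old context returns a whole page to the pool without copying.
        page_id = self.page_table.pop(0)
        self.pages[page_id].clear()
        self.free.append(page_id)

    def gather(self):
        # A reader walks the page table to visit entries in logical order.
        return [kv for pid in self.page_table for kv in self.pages[pid]]


cache = PagedKVCache(num_pages=4, page_size=2)
for token in range(5):
    cache.append(token)
cache.evict_oldest_page()
print(cache.gather())
```

The physical pages can sit anywhere in the pool; only the page table defines their order, so growing or trimming the cache never requires moving existing entries.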

This combination means LingBot-Map can sustain ~20 FPS inference over sequences longer than 10,000 frames on a single GPU, which is impressive given that attention cost grows quadratically with sequence length and a naive KV cache grows with every frame processed.
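A back-of-the-envelope estimate shows why caching every frame does not scale. The numbers below are assumptions for illustration only: the article does not state the patch size, layer count, or model width, so 14-pixel patches, 24 layers, a width of 1024, and fp16 storage are hypothetical values. The point is the linear scaling with cached frames, not the absolute figures.

```python
def kv_cache_bytes(frames, tokens_per_frame, layers, hidden_dim, bytes_per_value=2):
    # Keys and values (the factor 2) are stored per layer, per token.
    return 2 * layers * frames * tokens_per_frame * hidden_dim * bytes_per_value


# Hypothetical model dimensions, chosen only to make the arithmetic concrete.
tokens_per_frame = (518 // 14) * (378 // 14)  # patches per 518x378 image
every_frame = kv_cache_bytes(10_000, tokens_per_frame, layers=24, hidden_dim=1024)
keyframes_only = kv_cache_bytes(10_000 // 10, tokens_per_frame, layers=24, hidden_dim=1024)

print(f"tokens per frame:  {tokens_per_frame}")
print(f"cache every frame: {every_frame / 2**30:.1f} GiB")
print(f"cache every 10th:  {keyframes_only / 2**30:.1f} GiB")
```

Under these assumptions, a keyframe interval of 10 shrinks the cache by exactly a factor of 10, which is the kind of saving the keyframe strategy is after.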

The codebase is well-structured in Python, leveraging PyTorch for model components and CUDA kernels compiled JIT via FlashInfer for performance-critical attention operations. If FlashInfer is not installed, the code falls back to PyTorch's native attention (SDPA), at reduced efficiency.
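A fallback of this kind is commonly implemented as an availability check at startup. The sketch below shows one way to do it with the standard library; `pick_attention_backend` is a hypothetical helper, not a function from the LingBot-Map repo, and it assumes the `flashinfer-python` package is importable under the module name `flashinfer`.

```python
import importlib.util


def pick_attention_backend() -> str:
    """Return "flashinfer" when the package is importable, else "sdpa"."""
    # find_spec checks availability without actually importing the package.
    if importlib.util.find_spec("flashinfer") is not None:
        return "flashinfer"
    return "sdpa"


backend = pick_attention_backend()
print(f"attention backend: {backend}")
```

Checking once up front keeps the per-frame inference path free of import logic and makes it easy to log which backend is in use.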

The architectural design around the Geometric Context Transformer is also notable. It tightly couples pose and geometric information with learned memory modules to correct drift and maintain spatial coherence in the reconstruction. This design is quite opinionated but effective for the streaming 3D reconstruction use case.

Quick start: running the interactive demo

The repo provides a clear installation and quick start setup:

conda create -n lingbot-map python=3.10 -y
conda activate lingbot-map

pip install torch==2.8.0 torchvision==0.23.0 --index-url https://download.pytorch.org/whl/cu128

pip install -e .

pip install --index-url https://pypi.org/simple flashinfer-python

pip install -e ".[vis]"

Once installed, you can launch an interactive demo viewer with:

python demo.py --model_path /path/to/lingbot-map-long.pt \
    --image_folder example/courthouse --mask_sky

This command starts a Viser-powered interactive visualization at http://localhost:8080, showing the model’s reconstruction of the example courthouse scene.

The README also documents options for offline batch rendering suitable for longer sequences, which can be useful for production or evaluation scenarios.

Verdict: who should look at LingBot-Map and its tradeoffs

LingBot-Map is a solid choice if you’re working on real-time or streaming 3D reconstruction from video or image sequences and need to handle long sequences without memory bottlenecks. Its paged KV cache attention with keyframe interval caching is a clever engineering solution to the problem of attention cost and cache memory growing with sequence length.

The code is Python-based, focusing on inference with a well-defined pipeline and visualization support. It’s less suited if you want to train models from scratch or fine-tune them, as training scripts are not the main focus.

It requires a CUDA 12.8 environment with PyTorch 2.8.0 for best compatibility, mostly due to dependencies like NVIDIA Kaolin and FlashInfer. The fallback to PyTorch native attention is a helpful but less performant backup.

Overall, LingBot-Map’s combination of geometric context-aware transformer architecture with efficient streaming inference techniques makes it worth exploring if you care about 3D reconstruction throughput and memory efficiency in demanding sequence lengths.

→ GitHub Repo: Robbyant/lingbot-map ⭐ 5,673 · Python

Noureddine RAMDI / Streaming 3D scene reconstruction with LingBot-Map’s geometric context transformer
