Noureddine RAMDI / How video-use turns AI agents into transcript-driven video editors

Created Mon, 04 May 2026 10:23:02 +0000 Modified Sat, 23 May 2026 20:41:27 +0000

browser-use/video-use

Video editing traditionally means working with massive amounts of raw frames and audio data, a heavy bottleneck for AI-driven automation. video-use takes a different path: it treats the audio transcript as the primary editing surface, shrinking what the model has to read from tens of millions of tokens' worth of video frames to a few kilobytes of structured text. This mirrors browser-use’s preference for the DOM as a structured source over screenshots, applied here to video editing. The result is an AI-powered pipeline that edits video by reasoning over transcripts rather than raw pixels.

What video-use does and its architecture

video-use is an open-source Python tool designed to turn AI coding agents into practical video editors. It integrates tightly with ElevenLabs Scribe, an audio transcription service that provides word-level timestamps and speaker diarization. This transcription output — about 12KB per source — becomes the core data the AI agent reasons over, instead of processing raw video frames that could amount to 45 million tokens.
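To make the idea of a "packed" transcript concrete, here is a minimal sketch of how word-level timestamps and speaker labels could be collapsed into compact phrase lines. The Word fields, the 0.5 second silence threshold, and the output line format are all assumptions for illustration; they are not taken from video-use or from the Scribe response schema.

```python
from dataclasses import dataclass


@dataclass
class Word:
    # Hypothetical shape of one transcribed word (not the real Scribe schema).
    text: str
    start: float  # seconds
    end: float    # seconds
    speaker: str


def _format(phrase):
    """Render one phrase as '[start-end] speaker: text'."""
    text = " ".join(w.text for w in phrase)
    return f"[{phrase[0].start:.2f}-{phrase[-1].end:.2f}] {phrase[0].speaker}: {text}"


def pack_transcript(words, max_gap=0.5):
    """Group words into phrases, breaking on a speaker change or a
    silence longer than max_gap seconds (an assumed threshold)."""
    lines = []
    current = []
    for w in words:
        if current and (
            w.speaker != current[-1].speaker
            or w.start - current[-1].end > max_gap
        ):
            lines.append(_format(current))
            current = []
        current.append(w)
    if current:
        lines.append(_format(current))
    return "\n".join(lines)
```

Each output line keeps just enough timing for an agent to refer back to a cut point, which is how a long recording can shrink to a few kilobytes of text.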

The pipeline follows a clear sequence: first, the audio is transcribed and packed into a structured format; then the large language model (LLM) reasons about editing decisions based on this transcript; next, an Edit Decision List (EDL) is produced; ffmpeg executes the actual rendering of video cuts and transitions; finally, a self-evaluation loop runs to catch any jarring visual jumps or audio pops before the edit is finalized.
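The step from EDL to ffmpeg can be sketched as follows. This is an illustrative example under stated assumptions, not the project's actual renderer: the Cut structure, the segment file names, and segments.txt are hypothetical, and the sketch only builds the command lines rather than running them. It uses ffmpeg's -ss and -t input options to trim each segment and the concat demuxer to join them.

```python
from dataclasses import dataclass


@dataclass
class Cut:
    # Hypothetical EDL entry: keep source[start:end], times in seconds.
    source: str
    start: float
    end: float


def build_render_plan(edl, final_path="edit/final.mp4"):
    """Return (commands, concat_list) for a list of cuts.

    commands is a list of ffmpeg argument lists; concat_list is the text
    to write to segments.txt (an assumed name) before the last command.
    """
    commands = []
    list_lines = []
    for i, cut in enumerate(edl):
        duration = cut.end - cut.start
        if duration <= 0:
            raise ValueError(f"cut {i} has non-positive duration")
        segment = f"seg_{i:03d}.mp4"
        # Seek and limit duration on the input, then re-encode the segment.
        commands.append([
            "ffmpeg", "-y",
            "-ss", f"{cut.start:.3f}",
            "-t", f"{duration:.3f}",
            "-i", cut.source,
            segment,
        ])
        list_lines.append(f"file '{segment}'")
    concat_list = "\n".join(list_lines) + "\n"
    # Join the segments with the concat demuxer without re-encoding.
    commands.append([
        "ffmpeg", "-y", "-f", "concat", "-safe", "0",
        "-i", "segments.txt", "-c", "copy", final_path,
    ])
    return commands, concat_list
```

Keeping the plan as plain data makes it easy to inspect or log before anything is rendered; each argument list could then be passed to subprocess.run.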

This architecture results in a significant efficiency gain because the AI spends its compute budget on reasoning over structured text rather than sifting through noisy frame data. The system only generates visual composites (PNGs) on demand at decision points, avoiding the cost of full-frame dumps.

The pipeline records session state, such as metadata about the edit and its outputs, in a project.md file so that a later session can pick up where the previous one left off and edits can be refined iteratively.
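As a rough illustration of this kind of session memory, the sketch below appends a dated section to a project.md file. The section layout and the append_session_note helper are hypothetical; the actual structure video-use writes to project.md is not described here.

```python
from pathlib import Path


def append_session_note(project_dir, heading, notes):
    """Append one section to project.md in project_dir and return its path.

    The '## heading' plus bullet layout is an assumed format, chosen only
    so that a later session can read back what an earlier one decided.
    """
    path = Path(project_dir) / "project.md"
    lines = [f"## {heading}", *[f"- {note}" for note in notes], ""]
    # Open in append mode so earlier sessions are never overwritten.
    with path.open("a", encoding="utf-8") as handle:
        handle.write("\n".join(lines) + "\n")
    return path
```

Appending rather than rewriting keeps the file usable as a running log across sessions.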

The entire stack is Python-based, using ElevenLabs Scribe for transcription, LLMs for reasoning (compatible with Claude Code, Codex, Hermes, Openclaw), and ffmpeg for rendering. The codebase organizes editing logic into scripts under the helpers/ directory, reflecting a modular design.

Technical strengths and tradeoffs

The most notable strength of video-use is its transcript-as-surface editing paradigm. By operating on structured text rather than raw frames, it drastically reduces input size: from an estimated 45 million tokens of video frames to roughly 12KB of packed transcript data per source. This reduction cuts memory and compute requirements and sharpens the AI’s reasoning focus.
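The two figures are quoted in different units (tokens versus kilobytes), so a back-of-the-envelope conversion helps compare them. The sketch below assumes roughly 4 bytes of English text per token, a common rule of thumb that is not a number from the project; the 45 million and 12KB figures are the ones quoted above.

```python
def reduction_factor(frame_tokens, transcript_bytes, bytes_per_token=4):
    """Estimate transcript tokens and the frame-to-transcript token ratio.

    bytes_per_token is an assumed rule of thumb, not a measured value.
    """
    transcript_tokens = transcript_bytes // bytes_per_token
    return transcript_tokens, frame_tokens / transcript_tokens


# Figures quoted in the article: ~45 million frame tokens, ~12KB transcript.
tokens, ratio = reduction_factor(45_000_000, 12 * 1024)
```

Under that assumption the transcript comes to a few thousand tokens, and the reduction is on the order of ten thousand times.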

The integration with ElevenLabs Scribe is key because it provides precise word-level timestamps and speaker diarization, enabling the system to make editing decisions with fine temporal granularity and contextual speaker awareness.
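To show why word-level timestamps matter, here is a small sketch that turns a word-level decision ("drop these words") into the time ranges to keep. The tuple layout and the keep_ranges helper are assumptions made for illustration, not part of video-use.

```python
def keep_ranges(words, drop):
    """Convert word-level drop decisions into time ranges to keep.

    words: list of (text, start, end) tuples with times in seconds.
    drop:  set of word indices to remove (for example filler words).
    Consecutive kept words are merged into a single range.
    """
    ranges = []
    for i, (_text, start, end) in enumerate(words):
        if i in drop:
            continue
        if ranges and (i - 1) not in drop:
            # Previous word was kept: extend the current range.
            ranges[-1][1] = end
        else:
            # First kept word, or first kept word after a dropped one.
            ranges.append([start, end])
    return [tuple(r) for r in ranges]
```

For example, dropping a single "um" between two phrases yields two ranges whose boundaries fall exactly on word edges, which is the granularity a sentence-level transcript could not provide.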

Another strong point is the self-evaluation loop. After rendering, the system checks for visual and audio discontinuities at cut boundaries, allowing up to three re-render attempts to smooth out artifacts. This is a practical solution to the common problem of jump cuts or audio pops in automated editing.
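The control flow of such a loop can be sketched as below. The render and find_issues callables are hypothetical placeholders for the real rendering and boundary checks; only the limit of three re-renders comes from the description above.

```python
from typing import Callable, List, Tuple


def render_with_self_eval(
    render: Callable[[int], str],
    find_issues: Callable[[str], List[str]],
    max_rerenders: int = 3,
) -> Tuple[str, List[str], int]:
    """Render once, then re-render while issues remain, up to max_rerenders.

    render(attempt) returns the path of the rendered file;
    find_issues(path) returns descriptions of problems at cut boundaries.
    Returns (last output path, remaining issues, number of re-renders).
    """
    output = render(0)
    issues = find_issues(output)
    rerenders = 0
    while issues and rerenders < max_rerenders:
        rerenders += 1
        output = render(rerenders)
        issues = find_issues(output)
    return output, issues, rerenders
```

Returning the remaining issues instead of raising lets the caller decide whether an imperfect render after the last attempt is still acceptable.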

The architecture also benefits from the separation of concerns: transcription, reasoning, rendering, and evaluation are discrete steps. This makes debugging and extension easier.

On the downside, this approach relies heavily on the quality of the transcription. If ElevenLabs Scribe mis-transcribes audio or speaker diarization is off, the editing logic might produce suboptimal cuts. Also, the system doesn’t handle raw visual content analysis beyond the rendered PNG composites at decision points, which could limit its effectiveness for visually complex edits.

Finally, the tool depends on external LLMs and ElevenLabs API keys, which means usage costs and API rate limits are considerations for production use.

Quick start

The repo's quick start takes the form of a prompt that you give to your coding agent, which then carries out the setup itself:

Set up https://github.com/browser-use/video-use for me.

Read install.md first to install this repo, wire up ffmpeg, register the skill with whichever agent you're running under, and set up the ElevenLabs API key — ask me to paste it when you need it. Then read SKILL.md for daily usage, and always read helpers/ because that's where the editing scripts live. After install, don't transcribe anything on your own — just tell me it's ready and wait for me to drop footage into a folder.

After initial setup, you change into a folder of raw takes and start an agent CLI session there, such as Claude Code, Codex, or Hermes:

cd /path/to/your/videos
claude    # or codex, hermes, etc.

Within the session, you can then issue commands like:

edit these into a launch video

The agent inventories the sources, proposes an editing strategy, waits for your approval, then writes the final video to edit/final.mp4 alongside the sources. All outputs are stored in that edit/ directory inside your videos folder, keeping the skill directory clean.

Manual installation instructions are also provided for those who prefer to set up dependencies and API keys by hand.

Verdict

video-use offers a thoughtful approach to AI video editing by focusing on transcript-driven decisions rather than raw video frames. This design is particularly relevant for projects where audio content and dialogue are central, such as interviews, tutorials, or presentations.

The tradeoff is clear: it won’t replace manual editing for visually complex or stylistically nuanced videos where visual cues dominate. Its reliance on transcription quality and external API services means users need to manage these dependencies carefully.

However, for developers looking to experiment with LLM-driven video editing pipelines or automate straightforward edits based on dialogue, video-use provides a surprisingly clean and modular codebase. The self-evaluation loop and structured pipeline reflect practical engineering that anticipates real-world video editing challenges.

If your workflows involve frequent video edits tied closely to spoken content, and you want to explore AI-assisted editing without drowning in raw frame data, video-use is worth a close look.


→ GitHub Repo: browser-use/video-use ⭐ 6,169 · Python