Noureddine RAMDI / ElatoAI: running real-time voice AI agents on $5 ESP32 microcontrollers via edge function streaming

Created Mon, 04 May 2026 10:23:01 +0000 Modified Sat, 23 May 2026 20:41:27 +0000


Running real-time voice AI agents on an ultra-low-cost microcontroller might sound impossible. ElatoAI pulls it off by keeping AI inference off the device: the ESP32 streams compressed audio to edge functions, which relay it to hosted AI services over modern web infrastructure. This opens the door to sub-$10 voice AI devices with capabilities typically reserved for cloud servers.

what elatoai does and its architecture

ElatoAI is an open-source framework for building real-time voice AI agents on ESP32 microcontrollers — those $5 chips known for IoT projects but limited in memory and compute. The key idea is to treat the ESP32 as a dedicated audio I/O and codec device rather than running AI models locally.

Under the hood, ElatoAI captures audio on the ESP32, encodes it with the Opus codec at 12 kbps and a 24 kHz sampling rate, then streams it over secure WebSocket connections to edge functions running on Deno or Cloudflare Workers. These edge functions act as proxies, routing audio to more than 100 speech-to-text (STT), large language model (LLM), and text-to-speech (TTS) APIs, including OpenAI’s Realtime API, Gemini Live, xAI Grok, ElevenLabs, and Hume AI.
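The proxy role described above can be sketched in a few lines. This is a minimal illustration and not ElatoAI's actual code: the names DuplexChannel, relay, and MemoryChannel are hypothetical, and the in-memory channel stands in for the real WebSocket connections so the example runs without a network.

```typescript
// Hypothetical sketch of an edge-function relay; not taken from the ElatoAI repo.
// A frame is either binary audio (e.g. an Opus packet) or a text event.
type AudioFrame = Uint8Array | string;

// Minimal duplex abstraction; a real deployment would wrap WebSocket objects.
interface DuplexChannel {
  send(frame: AudioFrame): void;
  onMessage(handler: (frame: AudioFrame) => void): void;
}

// Forward every frame from the device to the provider, and every frame back.
function relay(device: DuplexChannel, provider: DuplexChannel): void {
  device.onMessage((frame) => provider.send(frame));
  provider.onMessage((frame) => device.send(frame));
}

// In-memory channel used only to exercise the relay locally.
class MemoryChannel implements DuplexChannel {
  sent: AudioFrame[] = [];
  private handlers: Array<(frame: AudioFrame) => void> = [];

  send(frame: AudioFrame): void {
    this.sent.push(frame);
  }

  onMessage(handler: (frame: AudioFrame) => void): void {
    this.handlers.push(handler);
  }

  // Simulate a frame arriving from the remote side of this channel.
  receive(frame: AudioFrame): void {
    for (const handler of this.handlers) handler(frame);
  }
}

const deviceSide = new MemoryChannel();
const providerSide = new MemoryChannel();
relay(deviceSide, providerSide);

deviceSide.receive(new Uint8Array([1, 2, 3])); // stand-in for an Opus packet from the ESP32
providerSide.receive("transcript: hello");     // stand-in for an event from the AI API

console.log(providerSide.sent.length, deviceSide.sent.length);
```

Keeping the relay this thin is what lets the device firmware stay unchanged when the service behind the edge function changes.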

The architecture includes a Next.js frontend deployed on Vercel, which provides the user interface and interacts with Supabase for authentication and conversation storage. Firmware updates for the ESP32 devices are delivered over-the-air (OTA), ensuring easy maintenance.

This design decouples AI inference from the device's hardware constraints. The microcontroller only handles audio capture, Opus encoding and decoding, and a secure WebSocket connection, so it needs neither external PSRAM nor heavy local compute.

The system supports uninterrupted conversations of up to 20 minutes with global round-trip latency under 2 seconds. The cold start when first connecting to an edge server takes around 3-4 seconds.

technical strengths and tradeoffs

ElatoAI’s standout technical feature is its audio streaming pipeline, which uses the Opus codec at a low bitrate (12 kbps) with a comparatively high sample rate (24 kHz). This combination keeps speech clear while minimizing bandwidth and processing overhead, which is critical for a microcontroller streaming over the internet.
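To put those numbers in perspective, here is a small back-of-the-envelope calculation. The bitrate, sample rate, and 20-minute session length come from the figures above; the 16-bit mono PCM input and the 20 ms Opus frame duration are assumptions made for illustration, and the results ignore WebSocket and TLS overhead.

```typescript
// Rough audio budget; values marked "assumption" are not from the project.
const SAMPLE_RATE_HZ = 24000;   // from the article
const OPUS_BITRATE_BPS = 12000; // from the article
const SESSION_MINUTES = 20;     // from the article
const BITS_PER_SAMPLE = 16;     // assumption: 16-bit mono PCM before encoding
const FRAME_MS = 20;            // assumption: a commonly used Opus frame duration

function audioBudget() {
  const rawPcmBps = SAMPLE_RATE_HZ * BITS_PER_SAMPLE;             // uncompressed bit rate
  const compressionRatio = rawPcmBps / OPUS_BITRATE_BPS;          // raw vs. encoded
  const samplesPerFrame = (SAMPLE_RATE_HZ * FRAME_MS) / 1000;     // PCM samples per Opus frame
  const bytesPerFrame = (OPUS_BITRATE_BPS * FRAME_MS) / 1000 / 8; // average encoded payload
  const sessionBytes = (OPUS_BITRATE_BPS / 8) * SESSION_MINUTES * 60; // one direction only
  return { rawPcmBps, compressionRatio, samplesPerFrame, bytesPerFrame, sessionBytes };
}

const budget = audioBudget();
console.log(budget);
// rawPcmBps: 384000, compressionRatio: 32, samplesPerFrame: 480,
// bytesPerFrame: 30, sessionBytes: 1800000
```

Under these assumptions, raw audio would need 384 kbps, so Opus at 12 kbps is a 32x reduction, and a full 20-minute session moves roughly 1.8 MB of audio payload in each direction.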

Routing audio through secure WebSockets to edge functions is a clean architectural choice. It allows integration with a wide variety of AI models and APIs without exposing the device or complicating firmware. The edge functions handle the orchestration and proxying, which eases adding or swapping AI providers.
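One way to picture why swapping providers is cheap in this design: the edge function can look up the upstream endpoint from a small registry, so the device never needs to know which service is in use. The sketch below is hypothetical. The provider IDs, the `.example` URLs, and the environment variable names are placeholders, not real endpoints or ElatoAI configuration.

```typescript
// Hypothetical provider registry for an edge function; all entries are placeholders.
interface ProviderConfig {
  url: string;       // upstream WebSocket endpoint (placeholder, not a real API URL)
  apiKeyEnv: string; // name of the env var holding the key (hypothetical)
}

const PROVIDERS = new Map<string, ProviderConfig>([
  ["provider-a", { url: "wss://provider-a.example/realtime", apiKeyEnv: "PROVIDER_A_API_KEY" }],
  ["provider-b", { url: "wss://provider-b.example/live", apiKeyEnv: "PROVIDER_B_API_KEY" }],
]);

// Resolve a provider by ID, failing loudly on unknown names.
function resolveProvider(id: string): ProviderConfig {
  const config = PROVIDERS.get(id);
  if (config === undefined) {
    throw new Error(`Unknown provider: ${id}`);
  }
  return config;
}

const active = resolveProvider("provider-b");
console.log(active.url);
```

Adding a provider then means adding one registry entry on the server side, with no firmware change.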

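From a code perspective, the project uses TypeScript for the edge functions and the Next.js frontend, which improves maintainability and developer experience. The ESP32 firmware, as is usual for this class of microcontroller, is written in C++ rather than TypeScript. The codebase is modular and documented, making it easier to contribute or extend.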

The tradeoff here is reliance on network connectivity and cloud infrastructure. Without stable internet and responsive edge servers, the experience degrades. The 3-4 second cold start is also noticeable, though acceptable for many real-world applications.

Another consideration is that the ESP32 itself doesn’t run any AI inference, so this isn’t a standalone embedded AI system. It’s more a voice I/O endpoint that acts as a gateway to cloud AI services.

explore the project

No quickstart commands are covered here, so the best way to approach ElatoAI is to explore the repository and its documentation.

The README provides an overview of the architecture and key concepts like the multi-model STT/LLM/TTS pipeline and the use of Opus codec streaming.

Look into the firmware/ directory for the ESP32 code handling audio capture, Opus encoding, and WebSocket communication.

The edge-functions/ folder contains the Deno and Cloudflare Worker scripts that proxy audio and messages to various AI APIs.

The frontend/ directory holds the Next.js app for user interaction and conversation management.

The docs also mention OTA firmware updates and Supabase integration for authentication and conversation storage, which are worth understanding for end-to-end deployment.

verdict

ElatoAI is a solid example of how to build voice AI experiences on extremely constrained hardware by offloading AI inference to edge infrastructure. It shines in scenarios where you want a device costing $5 or less to participate in real-time AI-powered conversations with sub-2-second latency and long continuous sessions.

The tradeoff is clear: it requires reliable networking and cloud services, so it’s not suitable for offline or isolated environments. And because the ESP32 is just an audio gateway, the AI capabilities depend entirely on the hosted models behind the edge functions.

For developers building voice AI toys, companions, or embedded voice devices where cost and power budgets are tight, ElatoAI offers a practical, well-architected starting point. Its modular TypeScript codebase and use of modern cloud and web tech mean it can be extended or adapted fairly easily.

It’s worth understanding even if you don’t adopt it wholesale, especially if you’re working with embedded voice interfaces or edge AI architectures.


→ GitHub Repo: akdeb/ElatoAI ⭐ 1,690 · TypeScript