Orion: Direct access to Apple Neural Engine for on-device LLM training

Orion takes a different approach to running and training language models on Apple Silicon: it bypasses Apple’s official CoreML stack entirely and talks directly to the Apple Neural Engine (ANE) through private, undocumented frameworks. This opens up hardware that the official stack uses for inference only, allowing both inference and fine-tuning of small language models such as GPT-2 124M and Stories110M entirely on-device, with no GPU or cloud dependency.

The architecture behind direct ANE access and on-device training

At its core, Orion is an Objective-C runtime designed specifically to interact with the ANE via private Apple frameworks such as _ANEClient and _ANECompiler, with MIL as the intermediate representation (IR). Instead of relying on CoreML, which abstracts and limits access to the Neural Engine, Orion compiles model graphs directly into optimized MIL programs that run on the ANE.

The architecture has several distinct layers:

CLI layer: Provides command-line interfaces to interact with model configs and control training or inference runs.
Compiler: Transforms the higher-level graph IR into MIL, optimizing it for the ANE hardware.
Core runtime: Manages program caching, execution, and efficient input/output handling using IOSurface-backed fp16 tensor layouts.

A key innovation is Orion’s use of delta compilation. Normally, training or fine-tuning on the ANE requires recompiling programs every time the weights change, which is slow and runs into Apple’s limit of roughly 119 compiles per process. Delta compilation sidesteps this by patching the weight blobs (BLOBFILE) directly, reloading updated weights in about 494 ms instead of a full 4,200 ms recompilation. That cuts the per-update reload cost by roughly 8.5x (4,200 / 494), which is what makes on-device fine-tuning practical.
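The idea behind delta compilation can be illustrated with a toy model. The sketch below is not Orion's code: `ToyProgram`, `ToyRuntime`, and `simulate` are hypothetical names, and the "compile" step only counts calls and adds the timings quoted above. It shows why patching a separate fp16 weight buffer keeps a long run inside the compile cap, while recompiling on every step does not.

```python
import struct

COMPILE_LIMIT = 119      # per-process compile cap quoted in the article
FULL_COMPILE_MS = 4200   # full recompilation time quoted in the article
DELTA_RELOAD_MS = 494    # weight-blob reload time quoted in the article


class ToyProgram:
    """Hypothetical stand-in for a compiled program plus its weight blob."""

    def __init__(self, weights):
        # Weights live in their own fp16 byte buffer, apart from the "code".
        self.blob = bytearray(struct.pack(f"<{len(weights)}e", *weights))

    def patch_weights(self, weights):
        """Overwrite the blob in place; the compiled program is untouched."""
        new = struct.pack(f"<{len(weights)}e", *weights)
        if len(new) != len(self.blob):
            raise ValueError("delta patching requires an unchanged weight shape")
        self.blob[:] = new

    def weights(self):
        return list(struct.unpack(f"<{len(self.blob) // 2}e", self.blob))


class ToyRuntime:
    """Counts compiles against the cap and accumulates simulated time."""

    def __init__(self):
        self.compiles = 0
        self.elapsed_ms = 0

    def compile(self, weights):
        if self.compiles >= COMPILE_LIMIT:
            raise RuntimeError("compile budget exhausted for this process")
        self.compiles += 1
        self.elapsed_ms += FULL_COMPILE_MS
        return ToyProgram(weights)

    def delta_reload(self, program, weights):
        # No compile is consumed: only the weight bytes change.
        program.patch_weights(weights)
        self.elapsed_ms += DELTA_RELOAD_MS


def simulate(steps, use_delta):
    """Run `steps` weight updates and return (compiles, elapsed_ms, weights)."""
    rt = ToyRuntime()
    prog = rt.compile([0.0, 0.0])
    for step in range(1, steps + 1):
        new_weights = [0.5 * step, -0.25 * step]
        if use_delta:
            rt.delta_reload(prog, new_weights)
        else:
            prog = rt.compile(new_weights)
    return rt.compiles, rt.elapsed_ms, prog.weights()
```

In this toy, `simulate(200, use_delta=True)` finishes with a single compile, while `simulate(200, use_delta=False)` raises once the cap is reached. The real BLOBFILE layout and reload path are not described in the article, so the byte buffer here is purely illustrative.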

Training uses a hybrid approach: the ANE runs the forward and backward passes, while the CPU runs the Adam optimizer steps that update the weights. According to the README, this division of labor keeps training stable over 1,000+ steps without NaNs or memory leaks.
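A minimal sketch of that split, under stated assumptions: the accelerator side is replaced by a hypothetical `fake_accelerator_grads` function that works on fp16-rounded weights, and the CPU side keeps full-precision master weights and runs a textbook Adam update. None of these names come from Orion, and the article does not say whether Orion keeps master weights this way.

```python
import math
import struct


def to_fp16(x):
    """Round a Python float to the nearest fp16 value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def fake_accelerator_grads(weights_fp16, target):
    """Hypothetical stand-in for the forward and backward passes.

    The loss is sum((w - t)^2), so the gradient is 2 * (w - t),
    rounded to fp16 to mimic a half-precision device.
    """
    return [to_fp16(2.0 * (w - t)) for w, t in zip(weights_fp16, target)]


class CpuAdam:
    """Standard Adam with bias correction; all state stays on the CPU."""

    def __init__(self, n, lr=0.05, beta1=0.9, beta2=0.999, eps=1e-8):
        self.lr, self.beta1, self.beta2, self.eps = lr, beta1, beta2, eps
        self.m = [0.0] * n  # first-moment estimates
        self.v = [0.0] * n  # second-moment estimates
        self.t = 0          # step counter used for bias correction

    def step(self, master, grads):
        self.t += 1
        for i, g in enumerate(grads):
            self.m[i] = self.beta1 * self.m[i] + (1 - self.beta1) * g
            self.v[i] = self.beta2 * self.v[i] + (1 - self.beta2) * g * g
            m_hat = self.m[i] / (1 - self.beta1 ** self.t)
            v_hat = self.v[i] / (1 - self.beta2 ** self.t)
            master[i] -= self.lr * m_hat / (math.sqrt(v_hat) + self.eps)


def train(steps=300):
    target = [1.5, -0.75]
    master = [0.0, 0.0]  # full-precision weights held on the CPU
    opt = CpuAdam(len(master))
    for _ in range(steps):
        device_weights = [to_fp16(w) for w in master]            # "upload" as fp16
        grads = fake_accelerator_grads(device_weights, target)   # device side
        opt.step(master, grads)                                  # CPU side
        if not all(math.isfinite(w) for w in master):
            raise FloatingPointError("non-finite weight encountered")
    return master
```

Keeping the optimizer state in full precision on the CPU means the half-precision device only ever sees rounded copies of the weights, which is one common way to keep such a loop numerically stable.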

Why Orion’s approach stands out and its tradeoffs

Orion’s main technical strength is its direct use of private Apple frameworks to access the ANE. This is rare because Apple does not expose these APIs publicly, and most ML tooling on macOS or iOS uses CoreML or other official APIs. By going under the hood, Orion offers capabilities that CoreML does not support, such as on-device training of LLMs.

The delta compilation technique is another highlight. Reloading weights without recompilation avoids the costly compile time and the hard cap on compile calls. This is a clever workaround that makes training feasible on hardware designed primarily for inference.

The README reports the following benchmark figures:

170+ tokens per second inference
3.8x faster training via delta compile
~19 TFLOPS fp16 and 38 TOPS INT8 compute performance
72 programs compiled once at startup in ~4.5 seconds
Stable training over 22 minutes with 1,000+ steps, loss decreasing steadily, zero NaNs, and no memory leaks
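A few derived numbers help put these figures side by side. The short calculation below uses only values quoted in this article. The function names are made up for illustration, and two assumptions are flagged in the comments: that the startup compiles count against the same per-process cap, and that the 22-minute run covers only training steps.

```python
# Figures quoted in the article (originally from the project's README).
RECOMPILE_MS = 4200      # full recompilation
DELTA_RELOAD_MS = 494    # delta weight reload
COMPILE_LIMIT = 119      # approximate per-process compile cap
STARTUP_PROGRAMS = 72    # programs compiled once at startup
STARTUP_SECONDS = 4.5    # approximate startup compile time
TRAIN_MINUTES = 22       # reported stable training run
TRAIN_STEPS = 1000       # "1,000+" steps, so treated as a lower bound


def reload_speedup():
    """Speedup of a delta reload over a full recompilation."""
    return RECOMPILE_MS / DELTA_RELOAD_MS


def startup_ms_per_program():
    """Average compile time per program during startup."""
    return STARTUP_SECONDS * 1000 / STARTUP_PROGRAMS


def remaining_compile_budget():
    """Compiles left after startup, ASSUMING startup counts against the cap."""
    return COMPILE_LIMIT - STARTUP_PROGRAMS


def max_mean_seconds_per_step():
    """Upper bound on mean step time, ASSUMING the run is all training steps."""
    return TRAIN_MINUTES * 60 / TRAIN_STEPS


if __name__ == "__main__":
    print(f"reload speedup:        {reload_speedup():.1f}x")
    print(f"startup per program:   {startup_ms_per_program():.1f} ms")
    print(f"compiles left (est.):  {remaining_compile_budget()}")
    print(f"mean step time (max):  {max_mean_seconds_per_step():.2f} s")
```

The 8.5x ratio applies to the weight-reload step alone, whereas the 3.8x figure in the list is stated for training, so the two numbers measure different things and are not in conflict.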

However, there are clear tradeoffs and limitations:

Using private frameworks means the code is fragile against OS updates or changes in Apple’s internal APIs.
The project targets small LLMs only (GPT-2 124M, Stories110M), not large-scale models requiring more memory.
The hybrid CPU-ANE training loop is complex and bespoke, potentially limiting portability or ease of integration.
Running on macOS 15+ with Apple Silicon M1 or later restricts the hardware environment.

The codebase is surprisingly clean for such a low-level project, with clear separation between compilation, runtime, and CLI layers. The use of IOSurface to handle fp16 tensors efficiently shows attention to performance and memory management.

Installation prerequisites and how to get started

The README provides explicit requirements rather than step-by-step install commands:

## Requirements

- macOS 15+ (Sequoia) on Apple Silicon (M1 or later)
- Xcode Command Line Tools
- Python 3.10+ with `torch`, `transformers` (weight conversion only — not needed at runtime)

This setup reflects the specialized nature of the project: it requires a recent macOS release and Apple Silicon hardware, along with developer tools to build and run the Objective-C runtime. Python and the ML libraries are needed only for the initial weight conversion, not at runtime.
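Since the README lists requirements rather than install steps, a quick preflight check can be useful before building. The script below is a sketch, not part of Orion: `check_requirements` is a hypothetical helper that uses only the Python standard library, and it treats a successful `xcode-select -p` as a sign that the Command Line Tools are installed, which is an assumption.

```python
import importlib.util
import platform
import shutil
import subprocess
import sys


def _xcode_clt_present():
    """Best-effort check: `xcode-select -p` exits 0 when a developer dir is set."""
    if shutil.which("xcode-select") is None:
        return False
    result = subprocess.run(["xcode-select", "-p"], capture_output=True)
    return result.returncode == 0


def check_requirements():
    """Return a dict mapping each requirement to True or False."""
    mac_version = platform.mac_ver()[0]  # empty string when not on macOS
    major = int(mac_version.split(".")[0]) if mac_version else 0
    on_macos = platform.system() == "Darwin"
    return {
        "macos_15_plus": on_macos and major >= 15,
        # A native Python on Apple Silicon reports "arm64"; a Python running
        # under Rosetta reports "x86_64" and would fail this check.
        "apple_silicon": on_macos and platform.machine() == "arm64",
        "xcode_clt": _xcode_clt_present(),
        "python_3_10_plus": sys.version_info >= (3, 10),
        # Only needed for weight conversion, not at runtime.
        "torch": importlib.util.find_spec("torch") is not None,
        "transformers": importlib.util.find_spec("transformers") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_requirements().items():
        print(f"{'ok     ' if ok else 'MISSING'}  {name}")
```

Missing `torch` or `transformers` only matters for converting weights, so a machine that fails just those two checks can still run models that were converted elsewhere.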

To explore the project, start by examining the README.md for detailed explanations of the CLI commands and model config formats. The code is organized to separate the compiler and runtime logic, which helps in understanding how the MIL IR is generated and executed on the ANE.

Verdict: who should look at Orion?

Orion is a niche but technically fascinating project for developers interested in low-level ML runtime engineering on Apple Silicon. If you want to experiment with on-device training of small LLMs and don’t mind working with private, undocumented Apple frameworks, this repo offers valuable insights and a working proof of concept.

The tradeoffs are clear: fragility against OS changes and hardware constraints limit its use to research or experimental projects rather than production deployments. Still, the delta compilation approach and direct ANE access are worth understanding for anyone working on ML acceleration or runtime optimizations on Apple devices.

For practitioners who want to push the boundaries of what the Apple Neural Engine can do beyond inference, Orion provides a rare, hands-on example with solid engineering and measurable performance gains. Just be prepared to dive deep into Apple’s private APIs and accept the maintenance burden that comes with it.

In short, Orion is not for casual users or large-scale model training, but it is a solid resource for specialized developers exploring the frontier of on-device ML training on Apple Silicon.


→ GitHub Repo: mechramc/Orion ⭐ 85 · Objective-C

Noureddine RAMDI / Orion: Direct access to Apple Neural Engine for on-device LLM training
