Proof that AGI-RUN, our on-device inference engine, beats SOTA, and what that speed buys an on-device agent.

Adam Awale, On-device research, AGI Inc., PhD candidate.

TL;DR

We're introducing AGI-RUN, AGI's own on-device inference engine. On a flagship Android phone, across six models, it averages ~5× faster prefill than each model's own native engine (up to 6.9×), and 9–32× faster than today's on-device "agent SDK" companies. Against the full field of SOTA inference frameworks on open source datasets, it wins 58 of 60 cases at a geomean of 3.3×, up to 9.9× on individual cells.
An on-device agent has a fixed time budget to respond, and a faster engine allows us to parse more context (the screen, the recent actions, the instructions) for the same budget.
The speedup comes from how we map the workload onto the chip, not from another round of quantization. It's the same cost-model-driven, memory-efficient mapping from our last PNM post, which is why it generalizes across models. A more technical deep dive is coming in the next post.

Inference engines

An inference engine is the software that runs a model on a device: it takes the model and maps it onto the chips the device actually has, namely the CPU, the GPU, and increasingly the NPU. We built our own, AGI-RUN, and benchmarked it on a Samsung Galaxy S25 Ultra across six small vision-language models from the Qwen, Gemma, and Liquid (LFM) families.

We compared it three ways:

Against each model's own native engine
Against the on-device SDK startups
Against the full field of SOTA frameworks

Most of the numbers below measure prefill. We focus there because edge workloads are prefill-heavy by design: on-device, decode is slow, so the workloads that matter read a lot and generate little.

Results

AGI-RUN vs. models' own native engines

Figure: bar chart of synthetic prefill throughput per model (tok/s, geomean over 64/256/1024), AGI-RUN vs. MNN (Qwen), LiteRT (Gemma), and LeapSDK (LFM), with AGI-RUN's speedup labeled.

The hardest test is each model's native engine, the one its makers built or expect it to run on: MNN (Alibaba) for Qwen, LiteRT (Google) for Gemma, LeapSDK (Liquid AI) for LFM. We beat every one. AGI-RUN is ~5× faster on average, up to 6.9×, and the advantage grows with model size (4.8× to 6.9× on the larger Qwen and Gemma variants), because bigger models are more memory-bound, and our approach is built to use that memory efficiently.
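To make the "geomean over 64/256/1024" figure concrete, here is a minimal sketch of how such a summary could be computed. It assumes the three values are prompt lengths in tokens and that one throughput is measured per length; the throughput numbers are placeholders chosen for easy arithmetic, not measurements from this post.

```python
import math


def geomean(values):
    """Geometric mean of positive numbers, computed in log space."""
    values = list(values)
    if not values:
        raise ValueError("need at least one value")
    if any(v <= 0 for v in values):
        raise ValueError("values must be positive")
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Hypothetical prefill throughputs (tok/s), keyed by assumed prompt length.
# Placeholder numbers for illustration only.
ours = {64: 400.0, 256: 800.0, 1024: 1600.0}
native = {64: 50.0, 256: 200.0, 1024: 800.0}

ours_gm = geomean(ours.values())
native_gm = geomean(native.values())

# The ratio of geomeans equals the geomean of the per-length ratios,
# so no single prompt length dominates the summary.
speedup = ours_gm / native_gm
per_length = geomean(ours[n] / native[n] for n in ours)

print(f"ours {ours_gm:.0f} tok/s, native {native_gm:.0f} tok/s, speedup {speedup:.1f}x")
```

A geometric mean is a natural choice here because throughput at different lengths can differ by an order of magnitude, and an arithmetic mean would let the fastest length dominate.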

AGI-RUN vs. the on-device inference companies

Figure: synthetic prefill throughput (tok/s, geomean over 64/256/1024), pooled by model family, AGI-RUN vs. Cactus, Nexa, RunAnywhere, and Zetic, with AGI-RUN's speedup over the best company labeled per family. "n/s" = the SDK doesn't run that family.

Next, the startups selling drop-in inference SDKs to app developers. The gap is not close (tokens/sec, higher is better):

Model	AGI-RUN (tok/s)	Best company (tok/s)	AGI-RUN lead
Qwen3.5-2B	632	Cactus 70	9×
Qwen3.5-4B	267	Cactus 28	10×
Gemma4-E2B	744	Zetic 23	32×
Gemma4-E4B-it	453	Cactus 15	30×
LFM2.5-VL-450M	3,292	Nexa 355	9×
LFM2-VL-1.6B	1,646	Nexa 65	26×
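The lead column is simply AGI-RUN's throughput divided by the best SDK's throughput. A short sketch that recomputes it from the figures in the table above (the variable and function names are ours, for illustration):

```python
# (model, AGI-RUN tok/s, best SDK, best SDK tok/s), copied from the table above.
rows = [
    ("Qwen3.5-2B", 632, "Cactus", 70),
    ("Qwen3.5-4B", 267, "Cactus", 28),
    ("Gemma4-E2B", 744, "Zetic", 23),
    ("Gemma4-E4B-it", 453, "Cactus", 15),
    ("LFM2.5-VL-450M", 3292, "Nexa", 355),
    ("LFM2-VL-1.6B", 1646, "Nexa", 65),
]


def lead(ours_tps, best_tps):
    """Speedup of AGI-RUN over the best competing SDK."""
    return ours_tps / best_tps


leads = {model: lead(ours, best) for model, ours, _sdk, best in rows}

for model, ratio in leads.items():
    print(f"{model:16s} {ratio:5.1f}x")
```

Since the published throughputs are rounded, a recomputed ratio can land a point away from the listed lead: the LFM2-VL-1.6B row comes out near 25× rather than 26×, which is consistent with rounding in the inputs.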

Two things stand out:

The 9–32× margin: these SDKs do most of their work at the kernel level, with no split of work across the NPU and CPU and no real memory-placement strategy.
The coverage holes: Nexa runs only the LFM family, Zetic only Qwen and Gemma4-E2B, and the others drop models entirely (marked n/s).

We run all six, and the reason is architectural. Most engines dispatch each operation to a per-operator kernel library, so the day a model ships an operation that library doesn't have, it simply won't run. We build on low-level hardware intrinsics, the device's own instruction set, and let our cost models map the whole workload onto it. There is no unsupported-op wall, so we cover every model instead of only the ones someone hand-tuned for.

AGI-RUN vs. the full SOTA field

Across the full field of SOTA frameworks (llama.cpp on CPU, GPU, and NPU; MNN; MLC-LLM; LiteRT; mllm), run over six open source text and image datasets, AGI-RUN wins 58 of 60 model-and-workload cases, at a geomean of 3.3× over the best rival in each one, up to 9.9×. Besides llama.cpp and MNN, it is the only engine that runs every case; the others stall on model families they weren't hand-tuned for. The full per-case table and the method behind these numbers are coming soon in a technical deep dive.

Figure: win grid of AGI-RUN's speedup over the fastest competing engine in each of 60 model-and-workload cases (6 models × 10 workloads; 58 wins, 2 losses). Blue = AGI-RUN faster (≥1×); yellow = the two cases that go to LiteRT, both Gemma4-E2B image workloads.
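A win grid like this can be summarized in a few lines. The sketch below is one plausible way to do it, not the benchmark harness behind the numbers: it assumes each case is scored against the fastest rival that actually runs it, and that the geomean is taken over all scored cases, losses included. The engine names, cases, and throughputs are hypothetical.

```python
import math


def best_rival_speedups(ours, rivals):
    """Return {case: ours / fastest rival} for every case some rival runs."""
    speedups = {}
    for case, our_tps in ours.items():
        # A missing key or None marks an engine that does not run this case.
        candidates = [r[case] for r in rivals.values() if r.get(case) is not None]
        if candidates:
            speedups[case] = our_tps / max(candidates)
    return speedups


def summarize(speedups):
    """Return (wins, total cases, geomean speedup)."""
    wins = sum(1 for s in speedups.values() if s >= 1.0)
    gm = math.exp(sum(math.log(s) for s in speedups.values()) / len(speedups))
    return wins, len(speedups), gm


# Hypothetical throughputs (tok/s) keyed by (model, workload).
ours = {
    ("model-a", "text"): 600.0,
    ("model-a", "image"): 90.0,
    ("model-b", "text"): 300.0,
    ("model-b", "image"): 120.0,
}
rivals = {
    "engine-x": {("model-a", "text"): 150.0, ("model-a", "image"): 100.0,
                 ("model-b", "text"): 75.0},
    "engine-y": {("model-a", "text"): 100.0, ("model-b", "text"): None,
                 ("model-b", "image"): 60.0},
}

speedups = best_rival_speedups(ours, rivals)
wins, total, gm = summarize(speedups)
print(f"{wins} of {total} cases won, geomean {gm:.2f}x")
```

Scoring against the per-case best rival is the strictest comparison: a different engine can be the reference in every cell.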

Where LiteRT is faster. Two of the sixty cases go to Google's LiteRT, both on the smallest model, Gemma4-E2B, running image workloads (VQAv2 and TextVQA).

LiteRT is Google's own engine for Gemma, and Google's engineers hand-tuned a path for Gemma's per-layer image embeddings with resources few teams can match. That advantage does not survive the jump to the larger Gemma4-E4B: there the same two image workloads become memory-bound, and our placement is faster again (1.3× and 1.1×).

A kernel hand-tuned for one configuration beats us on exactly that configuration, and nowhere else.

Why it matters for AGI

An on-device agent runs against a latency budget: the user waits only so long for it to act, and everything the agent reads has to fit inside that window. Model throughput sets the ceiling on how much the agent can take in before it violates the budget.

Doubling the engine's speed doubles that ceiling: the agent can hold more of the screen, the action history, and the state context. The same headroom can instead go to a bigger model: more parameters at the same wait time.
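As a back-of-the-envelope illustration of the budget argument, the sketch below turns a prefill throughput into a context ceiling. It uses the Qwen3.5-2B figures from the SDK comparison (632 tok/s for AGI-RUN, 70 tok/s for the best SDK). The 2-second budget and the 0.5 seconds reserved for decode and other overhead are assumptions picked for the example, and the model treats throughput as constant across prompt lengths, which real engines only approximate.

```python
def prefill_token_ceiling(throughput_tok_s, budget_s, reserved_s=0.0):
    """Most prompt tokens that can be prefilled inside the latency budget.

    reserved_s is time set aside for everything that is not prefill
    (decode, tool calls, UI), so it is subtracted from the budget first.
    """
    usable_s = budget_s - reserved_s
    if usable_s <= 0:
        return 0
    return int(throughput_tok_s * usable_s)


BUDGET_S = 2.0    # hypothetical end-to-end latency budget
RESERVED_S = 0.5  # hypothetical time reserved for decode and overhead

fast = prefill_token_ceiling(632, BUDGET_S, RESERVED_S)
slow = prefill_token_ceiling(70, BUDGET_S, RESERVED_S)
doubled = prefill_token_ceiling(2 * 632, BUDGET_S, RESERVED_S)

print(f"632 tok/s -> {fast} tokens, 70 tok/s -> {slow} tokens")
print(f"doubling throughput -> {doubled} tokens")
```

Under these assumptions the ceiling scales linearly with throughput, which is the sense in which doubling the speed doubles what the agent can read.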

Our unique mapping allows us to hold this speed increase across six models and ten workloads. As we covered in our last post, AGI-RUN uses cost models to place each workload across the CPU, GPU, and NPU and to map memory efficiently, instead of hand-tuning a kernel per model.
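For readers unfamiliar with cost-model-driven placement, here is a deliberately tiny toy version of the general idea. It is not AGI-RUN's cost model: it uses a textbook roofline-style estimate (an operation is bound by either compute or memory traffic, whichever is slower) plus a fixed launch overhead, and every device figure and operation size is invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Device:
    name: str
    peak_flops: float         # FLOP/s (hypothetical)
    mem_bandwidth: float      # bytes/s (hypothetical)
    launch_overhead_s: float  # fixed cost to dispatch one op (hypothetical)


@dataclass(frozen=True)
class Op:
    name: str
    flops: float
    bytes_moved: float


def estimated_time(op, dev):
    """Roofline-style estimate: the slower of compute and memory, plus overhead."""
    compute_s = op.flops / dev.peak_flops
    memory_s = op.bytes_moved / dev.mem_bandwidth
    return dev.launch_overhead_s + max(compute_s, memory_s)


def place(ops, devices):
    """Assign each op to the device with the lowest estimated time."""
    return {
        op.name: min(devices, key=lambda d: estimated_time(op, d)).name
        for op in ops
    }


# Invented devices: the NPU has far more compute but costs more to launch.
devices = [
    Device("cpu", peak_flops=1e11, mem_bandwidth=5e10, launch_overhead_s=1e-6),
    Device("npu", peak_flops=1e13, mem_bandwidth=5e10, launch_overhead_s=5e-5),
]
ops = [
    Op("big_matmul", flops=1e10, bytes_moved=1e7),
    Op("tiny_norm", flops=1e4, bytes_moved=1e4),
]

placement = place(ops, devices)
print(placement)
```

Even this toy shows why a single rule such as "put everything on the NPU" loses: the large matmul goes to the NPU, while the tiny normalization stays on the CPU because dispatch overhead outweighs the compute saved. A real cost model would also account for data movement between devices and for memory layout, which this sketch ignores.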

The full setup, the per-case data, and how AGI-RUN produces these numbers are coming soon in a technical deep dive.

References

Models

Qwen (Alibaba): https://github.com/QwenLM
Gemma (Google): https://ai.google.dev/gemma
LFM / Liquid Foundation Models (Liquid AI): https://www.liquid.ai/

Datasets

LongBench (Bai et al., THUDM/Tsinghua, ACL 2024): https://arxiv.org/abs/2308.14508
2WikiMultihopQA (Ho et al., COLING 2020): https://arxiv.org/abs/2011.01060
TriviaQA (Joshi et al., UW/AI2, ACL 2017): https://arxiv.org/abs/1705.03551
DroidTask (in AutoDroid; Wen, Li et al., Tsinghua/AIR, MobiCom 2024): https://arxiv.org/abs/2308.15272
Persona-Chat (Zhang et al., MILA/FAIR, ACL 2018): https://arxiv.org/abs/1801.07243
VQAv2 (Goyal et al., CVPR 2017): https://arxiv.org/abs/1612.00837 · https://visualqa.org/
TextVQA (Singh et al., FAIR, CVPR 2019): https://arxiv.org/abs/1904.08920 · https://textvqa.org/

Inference engines & frameworks

llama.cpp (Georgi Gerganov / ggml-org): https://github.com/ggml-org/llama.cpp
MNN (Alibaba): https://github.com/alibaba/MNN
MLC-LLM (MLC project / Tianqi Chen, CMU / Apache TVM): https://github.com/mlc-ai/mlc-llm
LiteRT / LiteRT-LM (Google AI Edge): https://github.com/google-ai-edge/LiteRT · https://github.com/google-ai-edge/LiteRT-LM
mllm (UbiquitousLearning, BUPT/PKU): https://github.com/UbiquitousLearning/mllm
LeapSDK (Liquid AI): https://leap.liquid.ai

Hardware

Qualcomm Snapdragon 8 Elite for Galaxy / Samsung Galaxy S25 Ultra: https://www.qualcomm.com/snapdragon/device-finder/samsung-galaxy-s25-ultra
Qualcomm Hexagon NPU: https://www.qualcomm.com/processors/hexagon

AGI-RUN: Benchmarking AGI's on-device inference engine against the field, up to 9.9× faster than SOTA