Cactus

A hybrid edge-cloud AI engine for mobile devices & wearables.

Fast & accurate: fastest inference on ARM CPUs; Cactus Quants at 4-bit matches f16 accuracy
Low RAM: zero-copy memory mapping ensures 10x lower RAM use than other engines
Multimodal: one engine for speech, vision, and language models
Cloud fallback: automatically routes requests to cloud models when needed
Model-agnostic: custom PyTorch models can be exported to the Cactus runtime

┌─────────────────┐
│  Cactus Engine  │ ←── OpenAI-compatible APIs for text, speech, and vision.
└─────────────────┘     
         │
┌─────────────────┐
│  Cactus Graph   │ ←── Zero-copy computation graph ensures 10x lower RAM 
└─────────────────┘     
         │
┌─────────────────┐
│ Cactus Kernels  │ ←── Fastest ARM SIMD kernels (Apple, Samsung, Pixel, etc)
└─────────────────┘     
         │
┌─────────────────┐
│ Cactus Quants   │ ←── Cactus Quants at 4-bit uniform matches f16.
└─────────────────┘  
         │
┌─────────────────┐
│Cactus Transpiler│ ←── Transpiles custom PyTorch model to Cactus.
└─────────────────┘
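
To make the zero-copy idea concrete, here is a minimal sketch of memory-mapping a weight file on a POSIX system. It is not the Cactus implementation: `MappedWeights`, `map_weights`, and `unmap_weights` are hypothetical names, and the file is assumed to be a flat array of fp32 values.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Read-only view over a weight file (hypothetical type, for illustration only).
// The pointer aliases pages owned by the OS, so no heap copy of the file is made.
struct MappedWeights {
    const float* data = nullptr;  // first element of the mapped file
    std::size_t count = 0;        // number of floats in the file
    std::size_t bytes = 0;        // mapping length, needed by munmap
};

// Map `path` into memory. Returns an empty view on failure.
inline MappedWeights map_weights(const char* path) {
    assert(path != nullptr);
    MappedWeights w;
    int fd = open(path, O_RDONLY);
    if (fd < 0) return w;

    struct stat st;
    if (fstat(fd, &st) == 0 && st.st_size > 0) {
        std::size_t len = static_cast<std::size_t>(st.st_size);
        void* p = mmap(nullptr, len, PROT_READ, MAP_PRIVATE, fd, 0);
        if (p != MAP_FAILED) {
            w.data = static_cast<const float*>(p);
            w.bytes = len;
            w.count = len / sizeof(float);
        }
    }
    close(fd);  // the mapping stays valid after the descriptor is closed
    return w;
}

// Release the mapping and reset the view.
inline void unmap_weights(MappedWeights& w) {
    if (w.data != nullptr) {
        munmap(const_cast<float*>(w.data), w.bytes);
    }
    w = MappedWeights{};
}
```

Pages are loaded lazily when first touched and can be evicted by the OS under memory pressure, which is the usual reason a mapped model can show lower resident RAM than one read into a heap buffer.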

Quick Demo (Mac)

Step 1: brew install cactus-compute/cactus/cactus
Step 2: cactus run

Cactus Engine

#include "cactus_engine.h"

cactus_model_t model = cactus_init(
    "path/to/weight/folder",
    "path to txt or dir of txts for auto-rag",
    false
);

const char* messages = R"([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "My name is Henry Ndubuaku"}
])";

const char* options = R"({
    "max_tokens": 50,
    "stop_sequences": ["<|im_end|>"]
})";

char response[4096];
int result = cactus_complete(
    model,            // model handle
    messages,         // JSON chat messages
    response,         // response buffer
    sizeof(response), // buffer size
    options,          // generation options
    nullptr,          // tools JSON
    nullptr,          // streaming callback
    nullptr,          // user data
    nullptr,          // pcm audio buffer
    0                 // pcm buffer size
);
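
Because `messages` is passed as a raw JSON string, user text containing quotes or newlines has to be escaped before it is spliced in. The sketch below shows one way to build that string in plain C++. `json_escape` and `build_messages` are hypothetical helpers, not part of `cactus_engine.h`, and they handle only the escapes JSON requires.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>
#include <utility>
#include <vector>

// Escape a string so it can sit inside a JSON string literal.
inline std::string json_escape(const std::string& in) {
    std::string out;
    out.reserve(in.size() + 8);
    for (unsigned char c : in) {
        switch (c) {
            case '"':  out += "\\\""; break;
            case '\\': out += "\\\\"; break;
            case '\n': out += "\\n";  break;
            case '\r': out += "\\r";  break;
            case '\t': out += "\\t";  break;
            default:
                if (c < 0x20) {  // remaining control characters
                    char buf[8];
                    std::snprintf(buf, sizeof(buf), "\\u%04x", static_cast<unsigned>(c));
                    out += buf;
                } else {
                    out += static_cast<char>(c);
                }
        }
    }
    return out;
}

// Hypothetical helper: build the messages array from (role, content) pairs.
inline std::string build_messages(
    const std::vector<std::pair<std::string, std::string>>& turns) {
    assert(!turns.empty());
    std::string json = "[";
    for (std::size_t i = 0; i < turns.size(); ++i) {
        if (i > 0) json += ",";
        json += "{\"role\":\"" + json_escape(turns[i].first) +
                "\",\"content\":\"" + json_escape(turns[i].second) + "\"}";
    }
    json += "]";
    return json;
}
```

The result's `c_str()` can then be passed as the `messages` argument, provided the `std::string` outlives the call.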

Example response from Gemma3-270m

{
    "success": true,        // generation succeeded
    "error": null,          // error details if failed
    "cloud_handoff": false, // true if cloud model used
    "response": "Hi there!",
    "function_calls": [],   // parsed tool calls
    "confidence": 0.8193,   // model confidence
    "time_to_first_token_ms": 45.23,
    "total_time_ms": 163.67,
    "prefill_tps": 1621.89,
    "decode_tps": 168.42,
    "ram_usage_mb": 245.67,
    "prefill_tokens": 28,
    "decode_tokens": 50,
    "total_tokens": 78
}

Cactus Graph

#include "cactus_graph.h"

CactusGraph graph;
auto a = graph.input({2, 3}, Precision::FP16);
auto b = graph.input({3, 4}, Precision::INT8);

auto x1 = graph.matmul(a, b, false);
auto x2 = graph.transpose(x1);
auto result = graph.matmul(b, x2, true);

float a_data[6] = {1.1f, 2.3f, 3.4f, 4.2f, 5.7f, 6.8f};
float b_data[12] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12};

graph.set_input(a, a_data, Precision::FP16);
graph.set_input(b, b_data, Precision::INT8);

graph.execute();
void* output_data = graph.get_output(result);

graph.hard_reset();
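
When checking graph output, a plain fp32 reference can be handy. The sketch below does not use the Cactus API: `matmul_ref` and `transpose_ref` are hypothetical helpers, and it assumes `graph.matmul(a, b, false)` is an ordinary row-major product with no pre-transposed operand. For the inputs above, the first matmul maps shapes {2, 3} and {3, 4} to {2, 4}.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Row-major reference matmul: (m x k) * (k x n) -> (m x n), all in fp32.
inline std::vector<float> matmul_ref(const float* lhs, const float* rhs,
                                     std::size_t m, std::size_t k, std::size_t n) {
    assert(lhs != nullptr && rhs != nullptr);
    std::vector<float> out(m * n, 0.0f);
    for (std::size_t i = 0; i < m; ++i) {
        for (std::size_t p = 0; p < k; ++p) {
            const float l = lhs[i * k + p];
            for (std::size_t j = 0; j < n; ++j) {
                out[i * n + j] += l * rhs[p * n + j];
            }
        }
    }
    return out;
}

// Row-major transpose: (rows x cols) -> (cols x rows).
inline std::vector<float> transpose_ref(const float* src,
                                        std::size_t rows, std::size_t cols) {
    assert(src != nullptr);
    std::vector<float> out(rows * cols);
    for (std::size_t i = 0; i < rows; ++i) {
        for (std::size_t j = 0; j < cols; ++j) {
            out[j * rows + i] = src[i * cols + j];
        }
    }
    return out;
}
```

Calling `matmul_ref(a_data, b_data, 2, 3, 4)` gives the fp32 value of `x1`. Results from FP16 and INT8 inputs should be compared against it with a tolerance rather than for exact equality.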

Benchmarks

LLM: Gemma-4-E2B-CQ4 (CPU, no speculative decoding), 1k-token prefill tps / 100-token decode tps
VLM: Gemma-4-E2B-CQ4 (NPU prefill, CPU decode), 256px input, latency / decode tps
Transcribe: Parakeet-TDT-0.6B-CQ4 (NPU prefill, CPU decode), 20s audio, latency / decode tps
tps = tokens per second. A missing latency (-) means the device has no NPU support.

Device	LLM	VLM	Transcribe	RAM
Mac M4 Pro	324 / 39	1.2s / 48	0.2s / 10.6M	1385 MB
iPad/Mac M3	-	-	-	-
iPhone 17 Pro	-	-	-	-
iPhone 13 Mini	-	-	-	-
Galaxy S26	248 / 21	- / 16	- / 5.7M	-
Galaxy A17 5G	-	-	-	-
Pixel 10 Pro	-	-	-	-
Pixel 6a	-	-	-	-
Raspberry Pi 5	-	-	-	-
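
To turn the LLM column into wall-clock time, divide token counts by the reported rates. The sketch below assumes "1k" means 1000 tokens and that throughput is steady over the run. The helper names are hypothetical. For the Mac M4 Pro row (324 / 39), prefill comes to roughly 1000 / 324 ≈ 3.09 s and decode to roughly 100 / 39 ≈ 2.56 s.

```cpp
#include <cassert>

// Seconds needed to process `tokens` at a steady `tokens_per_second`.
inline double seconds_for_tokens(double tokens, double tokens_per_second) {
    assert(tokens_per_second > 0.0);
    return tokens / tokens_per_second;
}

// End-to-end estimate: prefill the prompt, then decode the reply.
inline double estimate_total_seconds(double prompt_tokens, double prefill_tps,
                                     double reply_tokens, double decode_tps) {
    return seconds_for_tokens(prompt_tokens, prefill_tps) +
           seconds_for_tokens(reply_tokens, decode_tps);
}
```

This ignores model load time and any warm-up effects, so it is best read as a lower bound on what a user would see.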

Supported Models

Any HuggingFace model can be converted using cactus convert [HF-Name], though this is experimental.
The Liquid, Gemma, Whisper, Parakeet, and Qwen model families are the most thoroughly tested.
Some models have been pre-uploaded; to fetch one, run cactus download [HF-Name].
cactus run [HF-Name] also downloads or converts the model first if it is not found locally.

Learn More

Reference	Language	Description
Cactus Engine	C	Chat completion, streaming, tool calling, transcription, embeddings, RAG, vision, vector index, cloud handoff
Cactus Graph	C++	Tensor operations, matrix multiplication, attention, normalization, activation functions
Cactus Kernels	C++	ARM NEON SIMD kernels for matmul, attention, convolution, quantization, DSP, image processing
Cactus Quants	C++	Rotation-and-codebook quantization from 4-bit to 1-bit for all weight tensors
Cactus Hybrid	C/Python	Route hard queries to the cloud automatically based on local model confidence
Cactus Transpiler	Python	Convert any PyTorch model to a Cactus runtime graph for on-device inference
Python Package	Python	Python package and CLI
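
The confidence-based routing described for Cactus Hybrid can be pictured with a small sketch. This is not the Cactus routing logic, which this README does not spell out: `LocalResult`, `RoutedResult`, and `route` are hypothetical, the two backends are passed in as plain callables, and the threshold is chosen by the caller.

```cpp
#include <cassert>
#include <functional>
#include <string>

// Outcome of one local generation attempt (hypothetical mirror of the
// "response" and "confidence" fields in the engine's JSON result).
struct LocalResult {
    std::string text;
    double confidence;
};

// Final answer plus a flag saying whether the cloud produced it.
struct RoutedResult {
    std::string text;
    bool cloud_handoff;
};

// Try the local model first; hand off to the cloud only when the local
// confidence falls below `threshold`.
inline RoutedResult route(
    const std::string& prompt,
    const std::function<LocalResult(const std::string&)>& local,
    const std::function<std::string(const std::string&)>& cloud,
    double threshold) {
    assert(threshold >= 0.0 && threshold <= 1.0);
    LocalResult r = local(prompt);
    if (r.confidence >= threshold) {
        return {r.text, false};
    }
    return {cloud(prompt), true};
}
```

With the example confidence of 0.8193 shown earlier, a threshold of 0.5 would keep the answer local, while a threshold of 0.9 would trigger a handoff.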

Bindings

Using this repo

┌────────────────────────────────────────────────────────────────────────────────┐
│                                                                                │
│ Step 0: if on Linux (Ubuntu/Debian)                                            │
│ sudo apt-get install python3.12 python3.12-venv python3-pip cmake              │
│   build-essential libcurl4-openssl-dev                                         │
│                                                                                │
│ Step 1: clone and setup                                                        │
│ git clone https://github.com/cactus-compute/cactus && cd cactus                │
│ source ./setup                                                                 │
│                                                                                │
│ Step 2: use the commands                                                       │
│────────────────────────────────────────────────────────────────────────────────│
│                                                                                │
│  cactus auth                         manage cloud API key                      │
│    --status                          show key status                           │
│    --clear                           remove saved key                          │
│                                                                                │
│  cactus run <model|path>             run a model (downloads if needed)         │
│    --bits 1|2|3|4                    CQ quantization (default: 4)              │
│    --platform cpu|apple              target accelerator (default: cpu)         │
│    --image <path>                    image file for VLM inference              │
│    --audio <path>                    audio file for audio chat                 │
│    --system <prompt>                 system prompt                             │
│    --prompt <text>                   send prompt immediately                   │
│    --thinking                        enable thinking/reasoning mode            │
│    --token <token>                   HuggingFace token (gated models)          │
│    --reconvert                       force local convert+transpile fallback    │
│                                                                                │
│  cactus transcribe [model]           transcribe audio with a model             │
│    --file <audio.wav>                audio file to transcribe (required)       │
│    --language <code>                 language code (default: en)               │
│    --token <token>                   HuggingFace token (gated models)          │
│    --reconvert                       force reconversion from source            │
│                                                                                │
│  cactus download <model>             download a pre-built bundle               │
│    --bits 1|2|3|4                    CQ quantization (default: 4)              │
│    --platform cpu|apple              target accelerator (default: cpu)         │
│    --token <token>                   HuggingFace token                         │
│                                                                                │
│  cactus convert <model> [dir]        convert HuggingFace weights to CQ         │
│    --bits 1|2|3|4                    CQ quantization (default: 4)              │
│    --token <token>                   HuggingFace token                         │
│    --reconvert                       force build from source                   │
│    --lora <path>                     merge a LoRA adapter before converting    │
│                                                                                │
│  cactus transpile <model>            build a runnable bundle from CQ weights   │
│    --weights-dir <path>              path to CQ weights (default: lookup)      │
│    --task <auto|...>                 force task type (default: auto)           │
│    --artifact-dir <path>             bundle output (default: weights/<model>)  │
│                                                                                │
│  cactus serve [model]                OpenAI-compatible local HTTP server       │
│    --host <addr>                     bind address (default: 127.0.0.1)         │
│    --port <port>                     port (default: 8080)                      │
│                                                                                │
│  cactus list                         list local converted weights and bundles  │
│                                                                                │
│  cactus build                        build cactus libraries                    │
│    --apple                           Apple (iOS/macOS)                         │
│    --android                         Android                                   │
│    --python                          shared lib for Python FFI                 │
│                                                                                │
│  cactus test                         run the test suite                        │
│    --component <name>                kernels | graph | engine | all            │
│                                      (default: all)                            │
│    --model <hf-id>                   default: LiquidAI/LFM2-VL-450M            │
│    --transcription-model <hf-id>     default: openai/whisper-base              │
│    --suite <name>                    run a single test suite (resolved         │
│                                      across components; e.g. performance       │
│                                      → kernels + graph, llm → engine)          │
│    --list                            list components and suites                │
│    --ios                             run on connected iPhone                   │
│    --android                         run on connected Android                  │
│    --enable-telemetry                send cloud telemetry (off by default)     │
│                                                                                │
│  cactus clean                        delete build artifacts                    │
│  cactus --help                       show this help                            │
│                                                                                │
└────────────────────────────────────────────────────────────────────────────────┘

Maintaining Organisations

Citation

If you use Cactus in your research, please cite it as follows:

@software{cactus,
  title        = {Cactus: AI Inference Engine for Phones & Wearables},
  author       = {Ndubuaku, Henry and Cactus Team},
  url          = {https://github.com/cactus-compute/cactus},
  year         = {2025}
}

NB: Scroll all the way up and click the shield links for resources!
