A hybrid edge-cloud AI engine for mobile devices & wearables.
- Fast & accurate: fastest inference on ARM CPUs; Cactus Quants at 4-bit matches f16
- Low RAM: zero-copy memory mapping ensures 10x lower RAM use than other engines
- Multimodal: one engine for speech, vision, and language models
- Cloud fallback: automatically route requests to cloud models if needed
- Model-agnostic: custom PyTorch models can be exported to the Cactus runtime
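
As an illustration of the zero-copy memory mapping mentioned in the list above, here is a minimal sketch of mapping a weight file read-only on a POSIX system. It is not taken from the Cactus source: `MappedFile`, `map_file`, and `unmap_file` are hypothetical names, and the real loader may work differently. The idea is only that the OS pages the file in on demand, so the weights are never copied into a separate heap buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Hypothetical helper type (not part of the Cactus API): a read-only view of
// a file that is backed directly by the OS page cache.
struct MappedFile {
    const unsigned char* data = nullptr;
    std::size_t size = 0;
};

// Maps `path` read-only. Pages are loaded lazily on first access, so no heap
// copy of the file contents is made. Returns an empty view on failure.
inline MappedFile map_file(const std::string& path) {
    MappedFile view;
    int fd = open(path.c_str(), O_RDONLY);
    if (fd < 0) return view;
    struct stat st;
    if (fstat(fd, &st) == 0 && st.st_size > 0) {
        std::size_t len = static_cast<std::size_t>(st.st_size);
        void* p = mmap(nullptr, len, PROT_READ, MAP_PRIVATE, fd, 0);
        if (p != MAP_FAILED) {
            view.data = static_cast<const unsigned char*>(p);
            view.size = len;
        }
    }
    close(fd);  // the mapping stays valid after the descriptor is closed
    return view;
}

// Releases the mapping and resets the view to empty.
inline void unmap_file(MappedFile& view) {
    if (view.data) munmap(const_cast<unsigned char*>(view.data), view.size);
    view = MappedFile{};
}
```

A caller would check `view.data != nullptr` after `map_file`, read tensors straight out of `view.data`, and call `unmap_file` when the model is released.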
┌─────────────────┐
│ Cactus Engine │ ←── OpenAI-compatible APIs for text, speech, and vision.
└─────────────────┘
│
┌─────────────────┐
│ Cactus Graph │ ←── Zero-copy computation graph ensures 10x lower RAM
└─────────────────┘
│
┌─────────────────┐
│ Cactus Kernels │ ←── Fastest ARM SIMD kernels (Apple, Samsung, Pixel, etc)
└─────────────────┘
│
┌─────────────────┐
│ Cactus Quants │ ←── Cactus Quants at 4-bit uniform matches f16.
└─────────────────┘
│
┌─────────────────┐
│Cactus Transpiler│ ←── Transpiles custom PyTorch model to Cactus.
└─────────────────┘
- Step 1: `brew install cactus-compute/cactus/cactus`
- Step 2: `cactus run`

#include "cactus_engine.h"
cactus_model_t model = cactus_init(
"path/to/weight/folder",
"path to txt or dir of txts for auto-rag",
false
);
const char* messages = R"([
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "My name is Henry Ndubuaku"}
])";
const char* options = R"({
"max_tokens": 50,
"stop_sequences": ["<|im_end|>"]
})";
char response[4096];
int result = cactus_complete(
model, // model handle
messages, // JSON chat messages
response, // response buffer
sizeof(response), // buffer size
options, // generation options
nullptr, // tools JSON
nullptr, // streaming callback
nullptr, // user data
nullptr, // pcm audio buffer
0 // pcm buffer size
);

Example response from Gemma3-270m:
{
"success": true, // generation succeeded
"error": null, // error details if failed
"cloud_handoff": false, // true if cloud model used
"response": "Hi there!",
"function_calls": [], // parsed tool calls
"confidence": 0.8193, // model confidence
"time_to_first_token_ms": 45.23,
"total_time_ms": 163.67,
"prefill_tps": 1621.89,
"decode_tps": 168.42,
"ram_usage_mb": 245.67,
"prefill_tokens": 28,
"decode_tokens": 50,
"total_tokens": 78
}

#include "cactus_graph.h"
CactusGraph graph;
auto a = graph.input({2, 3}, Precision::FP16);
auto b = graph.input({3, 4}, Precision::INT8);
auto x1 = graph.matmul(a, b, false);
auto x2 = graph.transpose(x1);
auto result = graph.matmul(b, x2, true);
float a_data[6] = {1.1f, 2.3f, 3.4f, 4.2f, 5.7f, 6.8f};
float b_data[12] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12};
graph.set_input(a, a_data, Precision::FP16);
graph.set_input(b, b_data, Precision::INT8);
graph.execute();
void* output_data = graph.get_output(result);
graph.hard_reset();

- LLM: Gemma-4-E2B-CQ4 (CPU, no speculative decode), 1k-prefill tps / 100-decode tps
- VLM: Gemma-4-E2B-CQ4 (NPU prefill, CPU decode), 256px input, latency / decode tps
- Transcribe: Parakeet-TDT-0.6B-CQ4 (NPU prefill, CPU decode), 20s audio, latency / decode tps
- Missing latency == no NPU support for device
| Device | LLM | VLM | Transcribe | RAM |
|---|---|---|---|---|
| Mac M4 Pro | 324 / 39 | 1.2s / 48 | 0.2s / 10.6M | 1385 MB |
| iPad/Mac M3 | - | - | - | - |
| iPhone 17 Pro | - | - | - | - |
| iPhone 13 Mini | - | - | - | - |
| Galaxy S26 | 248 / 21 | - / 16 | - / 5.7M | - |
| Galaxy A17 5G | - | - | - | - |
| Pixel 10 Pro | - | - | - | - |
| Pixel 6a | - | - | - | - |
| Raspberry Pi 5 | - | - | - | - |
- Any HuggingFace model can be converted using `cactus convert [HF-Name]`, though this is experimental.
- The Liquid, Gemma, Whisper, Parakeet, and Qwen model families are especially well tested.
- Some models have been pre-uploaded here; just run `cactus download [HF-Name]`. `cactus run [HF-Name]` also downloads or converts the model first if it is not found.
| Reference | Language | Description |
|---|---|---|
| Cactus Engine | C | Chat completion, streaming, tool calling, transcription, embeddings, RAG, vision, vector index, cloud handoff |
| Cactus Graph | C++ | Tensor operations, matrix multiplication, attention, normalization, activation functions |
| Cactus Kernels | C++ | ARM NEON SIMD kernels for matmul, attention, convolution, quantization, DSP, image processing |
| Cactus Quants | C++ | Rotation-and-codebook quantization from 4-bit to 1-bit for all weight tensors |
| Cactus Hybrid | C/Python | Route hard queries to the cloud automatically based on local model confidence |
| Cactus Transpiler | Python | Convert any PyTorch model to a Cactus runtime graph for on-device inference |
| Python Package | Python | Python package and CLI |
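
The Cactus Hybrid row describes routing based on local model confidence. A minimal sketch of that decision, assuming a simple threshold rule, could look like the following. `LocalResult`, `Route`, and `choose_route` are hypothetical names, and the 0.7 default threshold is an illustrative assumption rather than a documented Cactus value; the engine's actual policy may weigh other signals.

```cpp
#include <cassert>
#include <string>

// Hypothetical summary of a local completion. `success` and `confidence`
// mirror the fields of the same name in the example response.
struct LocalResult {
    bool success;
    double confidence;
    std::string response;
};

enum class Route { Local, Cloud };

// Keeps the local answer when generation succeeded and its confidence clears
// the threshold; otherwise hands the request off to a cloud model.
// The default threshold is an assumption chosen for illustration.
inline Route choose_route(const LocalResult& result, double threshold = 0.7) {
    assert(threshold >= 0.0 && threshold <= 1.0);
    if (!result.success) return Route::Cloud;
    return result.confidence >= threshold ? Route::Local : Route::Cloud;
}
```

With the example response shown earlier (confidence 0.8193), this rule keeps the answer on-device; a lower-confidence or failed generation would be sent to the cloud.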
┌────────────────────────────────────────────────────────────────────────────────┐
│ │
│ Step 0: if on Linux (Ubuntu/Debian) │
│ sudo apt-get install python3.12 python3.12-venv python3-pip cmake │
│ build-essential libcurl4-openssl-dev │
│ │
│ Step 1: clone and setup │
│ git clone https://github.com/cactus-compute/cactus && cd cactus │
│ source ./setup │
│ │
│ Step 2: use the commands │
│────────────────────────────────────────────────────────────────────────────────│
│ │
│ cactus auth manage cloud API key │
│ --status show key status │
│ --clear remove saved key │
│ │
│ cactus run <model|path> run a model (downloads if needed) │
│ --bits 1|2|3|4 CQ quantization (default: 4) │
│ --platform cpu|apple target accelerator (default: cpu) │
│ --image <path> image file for VLM inference │
│ --audio <path> audio file for audio chat │
│ --system <prompt> system prompt │
│ --prompt <text> send prompt immediately │
│ --thinking enable thinking/reasoning mode │
│ --token <token> HuggingFace token (gated models) │
│ --reconvert force local convert+transpile fallback │
│ │
│ cactus transcribe [model] transcribe audio with a model │
│ --file <audio.wav> audio file to transcribe (required) │
│ --language <code> language code (default: en) │
│ --token <token> HuggingFace token (gated models) │
│ --reconvert force reconversion from source │
│ │
│ cactus download <model> download a pre-built bundle │
│ --bits 1|2|3|4 CQ quantization (default: 4) │
│ --platform cpu|apple target accelerator (default: cpu) │
│ --token <token> HuggingFace token │
│ │
│ cactus convert <model> [dir] convert HuggingFace weights to CQ │
│ --bits 1|2|3|4 CQ quantization (default: 4) │
│ --token <token> HuggingFace token │
│ --reconvert force build from source │
│ --lora <path> merge a LoRA adapter before converting │
│ │
│ cactus transpile <model> build a runnable bundle from CQ weights │
│ --weights-dir <path> path to CQ weights (default: lookup) │
│ --task <auto|...> force task type (default: auto) │
│ --artifact-dir <path> bundle output (default: weights/<model>) │
│ │
│ cactus serve [model] OpenAI-compatible local HTTP server │
│ --host <addr> bind address (default: 127.0.0.1) │
│ --port <port> port (default: 8080) │
│ │
│ cactus list list local converted weights and bundles │
│ │
│ cactus build build cactus libraries │
│ --apple Apple (iOS/macOS) │
│ --android Android │
│ --python shared lib for Python FFI │
│ │
│ cactus test run the test suite │
│ --component <name> kernels | graph | engine | all │
│ (default: all) │
│ --model <hf-id> default: LiquidAI/LFM2-VL-450M │
│ --transcription-model <hf-id> default: openai/whisper-base │
│ --suite <name> run a single test suite (resolved │
│ across components; e.g. performance │
│ → kernels + graph, llm → engine) │
│ --list list components and suites │
│ --ios run on connected iPhone │
│ --android run on connected Android │
│ --enable-telemetry send cloud telemetry (off by default) │
│ │
│ cactus clean delete build artifacts │
│ cactus --help show this help │
│ │
└────────────────────────────────────────────────────────────────────────────────┘
- Cactus Compute, Inc. (YC S25)
- UCLA's BruinAI
- Char (YC S25)
- Yale's AI Society
- National University of Singapore's AI Society
- UC Irvine's AI@UCI
- Imperial College's AI Society
- University of Pennsylvania's AI@Penn
- University of Michigan Ann-Arbor MSAIL
- University of Colorado Boulder's AI Club
If you use Cactus in your research, please cite it as follows:
@software{cactus,
title = {Cactus: AI Inference Engine for Phones & Wearables},
author = {Ndubuaku, Henry and Cactus Team},
url = {https://github.com/cactus-compute/cactus},
year = {2025}
}

N/B: Scroll all the way up and click the shields link for resources!
