Cactus

Docs Website GitHub HuggingFace Reddit Blog

A hybrid edge-cloud AI engine for mobile devices & wearables.

  • Fast & accurate: fastest inference on ARM CPUs; 4-bit Cactus Quants match f16
  • Low RAM: zero-copy memory mapping ensures 10x lower RAM use than other engines
  • Multimodal: one engine for speech, vision, and language models
  • Cloud fallback: automatically routes requests to cloud models when needed
  • Model-agnostic: custom PyTorch models can be exported to the Cactus runtime
┌─────────────────┐
│  Cactus Engine  │ ←── OpenAI-compatible APIs for text, speech, and vision.
└─────────────────┘     
         │
┌─────────────────┐
│  Cactus Graph   │ ←── Zero-copy computation graph ensures 10x lower RAM 
└─────────────────┘     
         │
┌─────────────────┐
│ Cactus Kernels  │ ←── Fastest ARM SIMD kernels (Apple, Samsung, Pixel, etc)
└─────────────────┘     
         │
┌─────────────────┐
│ Cactus Quants   │ ←── 4-bit uniform Cactus Quants match f16.
└─────────────────┘  
         │
┌─────────────────┐
│Cactus Transpiler│ ←── Transpiles custom PyTorch models to Cactus.
└─────────────────┘
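
The "zero-copy memory mapping" mentioned above usually means mapping a weight file into the process address space instead of reading it into heap buffers. The following is a minimal POSIX sketch of that general technique, not code from the Cactus source; `MappedFile`, `map_file`, and `unmap_file` are hypothetical names chosen for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Read-only view over a file's bytes. Hypothetical helper type, not a Cactus API.
struct MappedFile {
    const unsigned char* data = nullptr;
    std::size_t size = 0;
};

// Map a file into the address space without copying it into heap memory.
// The OS loads pages lazily on first access and can evict them under memory
// pressure, because they are backed by the file itself.
inline bool map_file(const char* path, MappedFile& out) {
    assert(path != nullptr);
    const int fd = open(path, O_RDONLY);
    if (fd < 0) return false;

    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size == 0) {
        close(fd);
        return false;
    }

    void* addr = mmap(nullptr, static_cast<std::size_t>(st.st_size),
                      PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);  // the mapping stays valid after the descriptor is closed
    if (addr == MAP_FAILED) return false;

    out.data = static_cast<const unsigned char*>(addr);
    out.size = static_cast<std::size_t>(st.st_size);
    return true;
}

// Release the mapping and reset the view.
inline void unmap_file(MappedFile& f) {
    if (f.data != nullptr) {
        munmap(const_cast<unsigned char*>(f.data), f.size);
        f.data = nullptr;
        f.size = 0;
    }
}
```

A caller would invoke `map_file` once per weight file, read tensors directly through `data`, and call `unmap_file` when the model is released. How much RAM this saves in practice depends on the access pattern and the platform.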

Quick Demo (Mac)

  • Step 1: brew install cactus-compute/cactus/cactus
  • Step 2: cactus run

Cactus Engine

#include "cactus_engine.h"

cactus_model_t model = cactus_init(
    "path/to/weight/folder",
    "path to txt or dir of txts for auto-rag",
    false
);

const char* messages = R"([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "My name is Henry Ndubuaku"}
])";

const char* options = R"({
    "max_tokens": 50,
    "stop_sequences": ["<|im_end|>"]
})";

char response[4096];
int result = cactus_complete(
    model,            // model handle
    messages,         // JSON chat messages
    response,         // response buffer
    sizeof(response), // buffer size
    options,          // generation options
    nullptr,          // tools JSON
    nullptr,          // streaming callback
    nullptr,          // user data
    nullptr,          // pcm audio buffer
    0                 // pcm buffer size
);

Example response from Gemma3-270m

{
    "success": true,        // generation succeeded
    "error": null,          // error details if failed
    "cloud_handoff": false, // true if cloud model used
    "response": "Hi there!",
    "function_calls": [],   // parsed tool calls
    "confidence": 0.8193,   // model confidence
    "time_to_first_token_ms": 45.23,
    "total_time_ms": 163.67,
    "prefill_tps": 1621.89,
    "decode_tps": 168.42,
    "ram_usage_mb": 245.67,
    "prefill_tokens": 28,
    "decode_tokens": 50,
    "total_tokens": 78
}
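
Since the response arrives as JSON text in a caller-supplied buffer, a caller will typically want to pull out a few fields such as `success`, `cloud_handoff`, and `confidence`. The sketch below does this with plain string searching, assuming a flat object shaped like the example above. It is deliberately naive, and `raw_value`, `read_bool`, `read_number`, `CompletionSummary`, and `summarize` are hypothetical helpers, not part of the Cactus API. A real JSON parser is the better choice in production code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <string>

// Locate the raw text that follows `"key":` in a flat JSON object.
// Naive on purpose: no nesting support, and only safe for bool/number values.
inline bool raw_value(const std::string& json, const std::string& key, std::string& out) {
    assert(!key.empty());
    const std::string needle = "\"" + key + "\"";
    std::size_t pos = json.find(needle);
    if (pos == std::string::npos) return false;
    pos = json.find(':', pos + needle.size());
    if (pos == std::string::npos) return false;
    const std::size_t start = json.find_first_not_of(" \t\r\n", pos + 1);
    if (start == std::string::npos) return false;
    const std::size_t end = json.find_first_of(",}\r\n", start);
    out = json.substr(start, end == std::string::npos ? std::string::npos : end - start);
    return !out.empty();
}

// Read a boolean field, returning `fallback` if it is missing or malformed.
inline bool read_bool(const std::string& json, const std::string& key, bool fallback) {
    std::string v;
    if (!raw_value(json, key, v)) return fallback;
    if (v.compare(0, 4, "true") == 0) return true;
    if (v.compare(0, 5, "false") == 0) return false;
    return fallback;
}

// Read a numeric field, returning `fallback` if it is missing or malformed.
inline double read_number(const std::string& json, const std::string& key, double fallback) {
    std::string v;
    if (!raw_value(json, key, v)) return fallback;
    char* end = nullptr;
    const double parsed = std::strtod(v.c_str(), &end);
    return end == v.c_str() ? fallback : parsed;
}

// The handful of fields an application might log or branch on.
struct CompletionSummary {
    bool success;
    bool cloud_handoff;
    double confidence;
    double decode_tps;
};

inline CompletionSummary summarize(const std::string& response) {
    return CompletionSummary{
        read_bool(response, "success", false),
        read_bool(response, "cloud_handoff", false),
        read_number(response, "confidence", 0.0),
        read_number(response, "decode_tps", 0.0),
    };
}
```

Calling `summarize(response)` on the buffer filled by `cactus_complete` would give the application a small struct to inspect before it touches the generated text.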

Cactus Graph

#include "cactus_graph.h"

CactusGraph graph;
auto a = graph.input({2, 3}, Precision::FP16);
auto b = graph.input({3, 4}, Precision::INT8);

auto x1 = graph.matmul(a, b, false);
auto x2 = graph.transpose(x1);
auto result = graph.matmul(b, x2, true);

float a_data[6] = {1.1f, 2.3f, 3.4f, 4.2f, 5.7f, 6.8f};
float b_data[12] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12};

graph.set_input(a, a_data, Precision::FP16);
graph.set_input(b, b_data, Precision::INT8);

graph.execute();
void* output_data = graph.get_output(result);

graph.hard_reset(); 
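
To make the shapes in the graph above concrete, here is a plain C++ reference for its first two operations: multiplying the 2x3 input by the 3x4 input gives a 2x4 matrix, and transposing that gives 4x2. This is an independent sketch using full-precision floats, not the Cactus implementation. The `Matrix` struct and the free functions `matmul` and `transpose` are hypothetical stand-ins. The sketch does not model FP16/INT8 precision or the boolean flag passed to `graph.matmul`, whose meaning is not described here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Dense row-major matrix. Hypothetical helper type for illustration only.
struct Matrix {
    std::size_t rows;
    std::size_t cols;
    std::vector<float> data;
};

// Reference matrix product: (m x k) * (k x n) -> (m x n).
inline Matrix matmul(const Matrix& lhs, const Matrix& rhs) {
    assert(lhs.cols == rhs.rows && "inner dimensions must agree");
    Matrix out{lhs.rows, rhs.cols, std::vector<float>(lhs.rows * rhs.cols, 0.0f)};
    for (std::size_t i = 0; i < lhs.rows; ++i) {
        for (std::size_t k = 0; k < lhs.cols; ++k) {
            const float a = lhs.data[i * lhs.cols + k];
            for (std::size_t j = 0; j < rhs.cols; ++j) {
                out.data[i * rhs.cols + j] += a * rhs.data[k * rhs.cols + j];
            }
        }
    }
    return out;
}

// Reference transpose: (m x n) -> (n x m).
inline Matrix transpose(const Matrix& m) {
    Matrix out{m.cols, m.rows, std::vector<float>(m.data.size())};
    for (std::size_t i = 0; i < m.rows; ++i) {
        for (std::size_t j = 0; j < m.cols; ++j) {
            out.data[j * m.rows + i] = m.data[i * m.cols + j];
        }
    }
    return out;
}
```

Feeding in the same `a_data` and `b_data` values yields a 2x4 product whose top-left entry is 1.1·1 + 2.3·5 + 3.4·9 = 43.2, which is a handy sanity check when comparing against low-precision output.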

Benchmarks

  • LLM: Gemma-4-E2B-CQ4 (CPU, no speculative decode), 1k-prefill tps / 100-decode tps
  • VLM: Gemma-4-E2B-CQ4 (NPU prefill, CPU decode), 256px input, latency / decode tps
  • Transcribe: Parakeet-TDT-0.6B-CQ4 (NPU prefill, CPU decode), 20s audio, latency / decode tps
  • Missing latency == no NPU support for device
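| Device         | LLM      | VLM       | Transcribe   | RAM     |
|----------------|----------|-----------|--------------|---------|
| Mac M4 Pro     | 324 / 39 | 1.2s / 48 | 0.2s / 10.6M | 1385 MB |
| iPad/Mac M3    | -        | -         | -            | -       |
| iPhone 17 Pro  | -        | -         | -            | -       |
| iPhone 13 Mini | -        | -         | -            | -       |
| Galaxy S26     | 248 / 21 | - / 16    | - / 5.7M     | -       |
| Galaxy A17 5G  | -        | -         | -            | -       |
| Pixel 10 Pro   | -        | -         | -            | -       |
| Pixel 6a       | -        | -         | -            | -       |
| Raspberry Pi 5 | -        | -         | -            | -       |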

Supported Models

  • Any HuggingFace model can be converted using cactus convert [HF-Name], though this is experimental.
  • The Liquid, Gemma, Whisper, Parakeet, and Qwen model families are the most thoroughly tested.
  • Some models have been pre-uploaded; just run cactus download [HF-Name].
  • cactus run [HF-Name] first downloads or converts the model if it is not found locally.

Learn More

| Reference         | Language | Description |
|-------------------|----------|-------------|
| Cactus Engine     | C        | Chat completion, streaming, tool calling, transcription, embeddings, RAG, vision, vector index, cloud handoff |
| Cactus Graph      | C++      | Tensor operations, matrix multiplication, attention, normalization, activation functions |
| Cactus Kernels    | C++      | ARM NEON SIMD kernels for matmul, attention, convolution, quantization, DSP, image processing |
| Cactus Quants     | C++      | Rotation-and-codebook quantization from 4-bit to 1-bit for all weight tensors |
| Cactus Hybrid     | C/Python | Route hard queries to the cloud automatically based on local model confidence |
| Cactus Transpiler | Python   | Convert any PyTorch model to a Cactus runtime graph for on-device inference |
| Python Package    | Python   | Python package and CLI |

Bindings

Using this repo

┌────────────────────────────────────────────────────────────────────────────────┐
│                                                                                │
│ Step 0: if on Linux (Ubuntu/Debian)                                            │
│ sudo apt-get install python3.12 python3.12-venv python3-pip cmake              │
│   build-essential libcurl4-openssl-dev                                         │
│                                                                                │
│ Step 1: clone and setup                                                        │
│ git clone https://github.com/cactus-compute/cactus && cd cactus                │
│ source ./setup                                                                 │
│                                                                                │
│ Step 2: use the commands                                                       │
│────────────────────────────────────────────────────────────────────────────────│
│                                                                                │
│  cactus auth                         manage cloud API key                      │
│    --status                          show key status                           │
│    --clear                           remove saved key                          │
│                                                                                │
│  cactus run <model|path>             run a model (downloads if needed)         │
│    --bits 1|2|3|4                    CQ quantization (default: 4)              │
│    --platform cpu|apple              target accelerator (default: cpu)         │
│    --image <path>                    image file for VLM inference              │
│    --audio <path>                    audio file for audio chat                 │
│    --system <prompt>                 system prompt                             │
│    --prompt <text>                   send prompt immediately                   │
│    --thinking                        enable thinking/reasoning mode            │
│    --token <token>                   HuggingFace token (gated models)          │
│    --reconvert                       force local convert+transpile fallback    │
│                                                                                │
│  cactus transcribe [model]           transcribe audio with a model             │
│    --file <audio.wav>                audio file to transcribe (required)       │
│    --language <code>                 language code (default: en)               │
│    --token <token>                   HuggingFace token (gated models)          │
│    --reconvert                       force reconversion from source            │
│                                                                                │
│  cactus download <model>             download a pre-built bundle               │
│    --bits 1|2|3|4                    CQ quantization (default: 4)              │
│    --platform cpu|apple              target accelerator (default: cpu)         │
│    --token <token>                   HuggingFace token                         │
│                                                                                │
│  cactus convert <model> [dir]        convert HuggingFace weights to CQ         │
│    --bits 1|2|3|4                    CQ quantization (default: 4)              │
│    --token <token>                   HuggingFace token                         │
│    --reconvert                       force build from source                   │
│    --lora <path>                     merge a LoRA adapter before converting    │
│                                                                                │
│  cactus transpile <model>            build a runnable bundle from CQ weights   │
│    --weights-dir <path>              path to CQ weights (default: lookup)      │
│    --task <auto|...>                 force task type (default: auto)           │
│    --artifact-dir <path>             bundle output (default: weights/<model>)  │
│                                                                                │
│  cactus serve [model]                OpenAI-compatible local HTTP server       │
│    --host <addr>                     bind address (default: 127.0.0.1)         │
│    --port <port>                     port (default: 8080)                      │
│                                                                                │
│  cactus list                         list local converted weights and bundles  │
│                                                                                │
│  cactus build                        build cactus libraries                    │
│    --apple                           Apple (iOS/macOS)                         │
│    --android                         Android                                   │
│    --python                          shared lib for Python FFI                 │
│                                                                                │
│  cactus test                         run the test suite                        │
│    --component <name>                kernels | graph | engine | all            │
│                                      (default: all)                            │
│    --model <hf-id>                   default: LiquidAI/LFM2-VL-450M            │
│    --transcription-model <hf-id>     default: openai/whisper-base              │
│    --suite <name>                    run a single test suite (resolved         │
│                                      across components; e.g. performance       │
│                                      → kernels + graph, llm → engine)          │
│    --list                            list components and suites                │
│    --ios                             run on connected iPhone                   │
│    --android                         run on connected Android                  │
│    --enable-telemetry                send cloud telemetry (off by default)     │
│                                                                                │
│  cactus clean                        delete build artifacts                    │
│  cactus --help                       show this help                            │
│                                                                                │
└────────────────────────────────────────────────────────────────────────────────┘

Maintaining Organisations

  1. Cactus Compute, Inc. (YC S25)
  2. UCLA's BruinAI
  3. Char (YC S25)
  4. Yale's AI Society
  5. National University of Singapore's AI Society
  6. UC Irvine's AI@UCI
  7. Imperial College's AI Society
  8. University of Pennsylvania's AI@Penn
  9. University of Michigan Ann-Arbor MSAIL
  10. University of Colorado Boulder's AI Club

Citation

If you use Cactus in your research, please cite it as follows:

@software{cactus,
  title        = {Cactus: AI Inference Engine for Phones & Wearables},
  author       = {Ndubuaku, Henry and Cactus Team},
  url          = {https://github.com/cactus-compute/cactus},
  year         = {2025}
}

N.B. Scroll to the top and click the shield links for resources!