devtake.dev

Cactus Compute distilled Gemini into a 26M tool-calling model. The trick: no feed-forward layers.

Needle is a 26M-parameter function caller distilled from Gemini 3.1 Flash-Lite. The Simple Attention Network drops MLPs and runs at 6,000 tok/s prefill on edge silicon.

Dieter Morelli · 4 min read · 4 sources
Cactus Compute YouTube thumbnail showing the team behind Needle
Image via Cactus Compute

Cactus Compute posted Needle, a 26-million-parameter function-calling model distilled from Gemini 3.1 Flash-Lite, to Hacker News on May 12. The post hit 415 points and 144 comments inside 24 hours, and the Hugging Face weights shipped under MIT alongside the dataset-generation code.

The architectural claim is the part the comments fixated on. The team threw out the feed-forward layer entirely. What’s left is a 12-layer encoder, an 8-layer decoder, cross-attention between them for tool routing, and a contrastive head that picks the relevant tool from a list. The architecture doc calls it a Simple Attention Network, and Cactus is betting that “simple” is the right word for sub-50M models targeting on-device function calling.
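
Cactus doesn’t spell out the head’s internals in the announcement, so the sketch below is only a minimal illustration of the contrastive tool-selection idea, not Needle’s code. It assumes the network produces one pooled vector for the user request and one per candidate tool, then scores them by cosine similarity; the shapes and the pooling are placeholders.

```python
import torch
import torch.nn.functional as F

def pick_tool(query_repr: torch.Tensor, tool_reprs: torch.Tensor) -> int:
    """Illustrative contrastive tool selection.

    query_repr: (d,) pooled representation of the user request.
    tool_reprs: (num_tools, d) pooled representations of each tool schema.
    Returns the index of the highest-scoring tool.
    """
    q = F.normalize(query_repr, dim=-1)   # unit-normalize so scores are cosines
    t = F.normalize(tool_reprs, dim=-1)
    scores = t @ q                        # (num_tools,) similarity per tool
    return int(scores.argmax())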

What’s in the box

  • Model size: 26 million parameters. For comparison, Gemma 3 270M is roughly 10x larger; Qwen-0.6B is 23x larger; LFM2.5-350m is 13x larger.
  • Training budget: 200 billion tokens of pretraining on 16 TPU v6e chips over 27 hours. Post-training is 2 billion synthetic function-calling tokens generated with Gemini, covering 15 categories including timers, messaging, navigation, and smart-home controls; that pass takes 45 minutes.
  • Throughput on Cactus’s own runtime: 6,000 tokens/sec prefill and 1,200 tokens/sec decode. Cactus doesn’t publish the silicon used for those numbers, but the company’s broader work targets smartphones, watches, and edge boxes, and the figures are within range of phone-class accelerators.
  • License: MIT. Both the model weights and the dataset-generation pipeline are open. Weights are on Hugging Face at Cactus-Compute/needle.
  • Benchmark claim: Cactus says Needle beats FunctionGemma-270m, Qwen-0.6B, Granite-350m, and LFM2.5-350m on single-shot function calling for personal AI tasks. The repo notes the comparison is single-shot only, and the four reference models retain broader general capability.

Why no feed-forward

Transformer feed-forward networks chew about two-thirds of the per-layer parameter budget. Cactus’s argument, in the architecture doc, is that softmax inside attention already provides “data-dependent routing,” which is the thing FFN layers are mostly doing on a function-calling workload anyway. Strip the FFN, route through cross-attention, and depth becomes a cheaper substitute for width.
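
The two-thirds figure follows from standard layer accounting. A back-of-envelope check, assuming the usual four d × d attention projections and a 4x FFN expansion, with biases and norm parameters ignored:

```python
# Rough per-layer parameter split for a standard transformer layer.
d = 512                      # arbitrary model width, for illustration only
attn_params = 4 * d * d      # Q, K, V, and output projections
ffn_params = 2 * 4 * d * d   # up-projection and down-projection at 4x width
print(ffn_params / (attn_params + ffn_params))  # ~0.67, i.e. two-thirds
```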

The substitutions are deliberate. Standard residual connections are replaced with gated residuals initialized at 0.5, so each layer learns its own contribution strength. ZCRMSNorm sits in for the usual LayerNorm and starts close to identity, which pairs with the gated residuals for stable training without warmup. The Muon optimizer enforces weight orthogonality across the deep linear stack, preventing the representation collapse that hits FFN-less networks if you just train them naively.
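
The announcement doesn’t give ZCRMSNorm’s exact formulation, so the sketch below stands in a plain RMSNorm and only shows the shape of an attention-only block with a gated residual initialized at 0.5. Pre-norm placement, head count, and width are assumptions for illustration.

```python
import torch
import torch.nn as nn

class AttentionOnlyBlock(nn.Module):
    """Sketch of an FFN-less block: norm, self-attention, gated residual."""

    def __init__(self, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        self.norm = nn.RMSNorm(d_model)  # stand-in for ZCRMSNorm
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Gated residual: each layer learns how strongly to mix in its output.
        self.gate = nn.Parameter(torch.tensor(0.5))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        return x + self.gate * attn_out
```

Stacking blocks like this is what makes depth the substitute for width; per the architecture doc, Muon then keeps the deep linear stack from collapsing.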

Whether this generalizes past function calling is the open research question. The HN comment thread is split: half of the engineers there read the design as a clean specialization for a constrained task, and half read it as a hint that the FFN/MLP-heavy transformer template is a poor fit for retrieval-shaped problems in general.

What’s missing from the announcement

  • A side-by-side benchmark table with absolute scores. The README says “outperforms” four reference models but doesn’t print the per-task numbers. Reproducing the comparison currently requires running the included evaluation scripts.
  • The hardware used to clock the 6,000/1,200 throughput numbers. A 26M-param model behaves very differently on a Snapdragon NPU, a Tensor G5, a Pixel-class TPU shim, and a laptop CPU. Cactus’s commercial runtime presumably benchmarks well; reproducing the speed independently is the test that matters.
  • The license on the synthetic function-calling dataset. The repo says the dataset-generation code is open. The license on the resulting data, given it was generated with Gemini, is the question most production teams will need to answer before pulling Needle into a shipped product.

What this means for you

If you’re building a phone or watch assistant and you’ve been waiting for a tool-caller small enough to run on the device without the cloud round trip, Needle is the first 26M-class option that claims it can hold its own at function calling against models ten times its size. The MIT license removes the friction the closed competitors carry. The next step is reproducing Cactus’s benchmark on your own evaluation set before you trust it on user traffic.
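
That reproduction doesn’t depend on Cactus’s tooling. A minimal single-shot scoring loop could look like the sketch below; call_model, the record fields, and the exact-match criterion are assumptions for illustration, not the evaluation scripts shipped in the repo.

```python
import json

def exact_match(pred: dict, gold: dict) -> bool:
    # Single-shot scoring: tool name and full argument dict must both match.
    return (pred.get("name") == gold["name"]
            and pred.get("arguments") == gold["arguments"])

def evaluate(call_model, examples):
    """call_model(prompt, tools) returns the model's raw JSON string.
    examples is your own list of {"prompt", "tools", "gold"} records."""
    hits = 0
    for ex in examples:
        try:
            pred = json.loads(call_model(ex["prompt"], ex["tools"]))
        except json.JSONDecodeError:
            continue  # malformed output counts as a miss
        hits += exact_match(pred, ex["gold"])
    return hits / len(examples)
```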

If you’re a researcher, the Simple Attention Network ablation is the more interesting artifact. The dataset-generation pipeline is open, the optimizer choices are documented, and a 27-hour pretraining run on 16 TPU v6e chips is small enough for a university group to repeat. The question of whether dropping the FFN works outside this narrow task is the thesis someone is going to write next quarter.

If you follow the on-device AI race, Cactus is the operator to watch. The company runs a Y Combinator-backed runtime that already ships voice-agent hackathon kits with Google DeepMind, and Needle is the first time it’s published a model that competes on a public benchmark category. Apple, Google, and Qualcomm have all been telegraphing on-device function calling as a next-two-years milestone; a startup just shipped a working version under MIT.

Quick reference

function calling
An LLM pattern where the model outputs a structured call to an external tool (a JSON object naming the tool and its arguments) instead of a free-text response.
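
For example, a request to “set a ten-minute timer for the pasta” might come back as a structured call rather than prose. The tool name and fields below are purely illustrative:

```python
# Hypothetical single-shot tool call emitted by the model.
tool_call = {
    "name": "set_timer",
    "arguments": {"duration_minutes": 10, "label": "pasta"},
}
```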
