Needle: 26M Parameters — Distilling Gemini's Function Calling Into Watches and Glasses

May 13, 2026

Updated May 13, 2026

BLUF: Needle is an open-source 26M-parameter model from Cactus Compute, purpose-built for AI agent tool calling (function calling). Its most counterintuitive design choice is to drop the FFN entirely: the MLP that typically accounts for about 2/3 of a Transformer layer’s parameters is removed, leaving only attention. The rationale is that tool calling is fundamentally “retrieval + alignment + JSON assembly”, none of which needs the per-position feature transformation an FFN provides. The result is a 26M-parameter model that runs on phones, watches, and glasses and is reported to outperform models more than 10x its size on single-turn tool calling.

GitHub: cactus-compute/needle
Weights: Hugging Face / Cactus-Compute/needle
License: MIT, weights and data generation pipeline fully open

Core Positioning: Agents That Actually Run on Tiny Devices

Needle uses knowledge distillation to compress Gemini 3.1 Flash Lite’s tool-calling ability into a very small Simple Attention Network (SAN). Reported throughput on the Cactus inference engine:

Prefill speed: 6,000 tokens/s
Decode speed: 1,200 tokens/s

At these rates, the model’s share of a complete tool call on a phone (receive the voice request → parse intent → select a tool → assemble the JSON arguments) is measured in milliseconds. That is the prerequisite for AI agents that are actually usable in watch and glasses form factors.
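To make the latency claim concrete, here is a rough estimate built only from the two throughput figures above. The prompt and output sizes are hypothetical, and the source does not say which device the throughput was measured on, so treat this as a sketch of the arithmetic rather than a benchmark.

```python
PREFILL_TOKENS_PER_S = 6000  # reported prefill throughput
DECODE_TOKENS_PER_S = 1200   # reported decode throughput


def tool_call_latency_ms(prompt_tokens, output_tokens):
    """Estimate model time in milliseconds: prefill the prompt, then decode the call."""
    prefill_ms = 1000 * prompt_tokens / PREFILL_TOKENS_PER_S
    decode_ms = 1000 * output_tokens / DECODE_TOKENS_PER_S
    return prefill_ms + decode_ms


# Hypothetical sizes: 300 tokens of tool definitions plus query, 30 tokens of JSON.
print(tool_call_latency_ms(300, 30))  # 75.0
```

The estimate covers model time only. Speech recognition, tokenization, and executing the tool itself would add to the end-to-end latency.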

The Counterintuitive Design: Drop the FFN

In a standard Transformer layer, roughly 1/3 of the parameters sit in attention and roughly 2/3 in the FFN. Needle deletes that 2/3 entirely.
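The 1/3 to 2/3 split follows from a short count. The derivation below assumes the textbook layout, which the post does not spell out: four d×d attention projections (Q, K, V, O), an FFN made of two matrices with a 4x hidden width, and no biases, norms, or embeddings.

```latex
P_{\text{attn}} = 4d^2, \qquad
P_{\text{ffn}} = 2 \cdot d \cdot 4d = 8d^2, \qquad
\frac{P_{\text{ffn}}}{P_{\text{attn}} + P_{\text{ffn}}} = \frac{8d^2}{12d^2} = \frac{2}{3}
```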

Rationale: The entire logic of tool calling is “align and copy”:

Match user query to tool names (retrieval)
Extract parameter values from query (copy/extract)
Assemble into structured JSON (assembly)

Attention handles all three of these well, so the FFN’s “per-position nonlinear feature transformation” is redundant here. Dropping it means fewer parameters, less pressure on memory bandwidth, and therefore faster inference on edge devices.
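The three steps can be illustrated with a deliberately naive, rule-based toy. This is not how Needle works internally. It only shows that each step is a matching or copying operation. The tool definitions, the `pattern` field, and all function names are hypothetical.

```python
import json
import re

# Hypothetical tool definitions; the "pattern" field is an illustration device only.
TOOLS = [
    {"name": "set_timer",
     "description": "set a timer for a number of minutes",
     "pattern": r"(?P<minutes>\d+)\s*minutes?"},
    {"name": "send_message",
     "description": "send a text message to a contact",
     "pattern": r"to (?P<contact>\w+)"},
]


def select_tool(query, tools):
    """Retrieval: pick the tool whose description shares the most words with the query."""
    words = set(query.lower().split())
    return max(tools, key=lambda t: len(words & set(t["description"].split())))


def extract_arguments(query, tool):
    """Copy: lift argument values out of the query verbatim."""
    match = re.search(tool["pattern"], query)
    return match.groupdict() if match else {}


def build_call(query, tools=TOOLS):
    """Assembly: emit the structured JSON call."""
    tool = select_tool(query, tools)
    return json.dumps({"name": tool["name"], "arguments": extract_arguments(query, tool)})


print(build_call("set a timer for 10 minutes"))
# {"name": "set_timer", "arguments": {"minutes": "10"}}
```

In the model, the word-overlap score and the regular expression are replaced by learned attention: the decoder attends to the tool definitions to choose a name and to the query to copy values.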

Architecture Details

Structure: encoder-decoder with 12 encoder layers and 8 decoder layers, d=512, 8 attention heads with 4 KV heads, and an 8192-token BPE vocabulary. The encoder reads the complete tool definitions bidirectionally and the decoder accesses them through cross-attention, so input tokens are kept out of the decoder’s KV cache.
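As a sanity check, the stated dimensions can be turned into a back-of-the-envelope parameter count. Several things here are assumptions not stated in the post: a head dimension of d/8 = 64, grouped-query attention in both self- and cross-attention, one self-attention block per encoder layer, self- plus cross-attention per decoder layer, input and output embeddings that share one matrix, and no biases. Norm, gate, and contrastive-head parameters are ignored.

```python
D_MODEL = 512
N_HEADS = 8
N_KV_HEADS = 4
VOCAB = 8192
ENC_LAYERS = 12
DEC_LAYERS = 8


def attention_params(d_model, n_heads, n_kv_heads):
    """Parameters of one attention block: Q and O are full width, K and V are grouped."""
    head_dim = d_model // n_heads
    q = d_model * n_heads * head_dim
    k = d_model * n_kv_heads * head_dim
    v = d_model * n_kv_heads * head_dim
    o = n_heads * head_dim * d_model
    return q + k + v + o


def needle_param_estimate():
    """Sum attention blocks and one shared embedding matrix (assumed layout)."""
    attn = attention_params(D_MODEL, N_HEADS, N_KV_HEADS)
    encoder = ENC_LAYERS * attn        # self-attention only
    decoder = DEC_LAYERS * 2 * attn    # self-attention plus cross-attention
    embedding = VOCAB * D_MODEL        # assumed shared between input and output
    return encoder + decoder + embedding


print(needle_param_estimate())  # 26214400
```

Under these assumptions the total comes to about 26.2M, which is consistent with the headline figure. The agreement does not by itself confirm the assumed layout.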

Engineering techniques (each with explicit motivation):

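Technique	Purpose
Gated Residual	Learnable sigmoid-gated residual with the gate initialized to 0.5, which keeps a gradient path open
ZCRMSNorm	Zero-centered RMSNorm with γ initialized to 0, so the learned scale starts as the identity
CLIP-style contrastive head	Retrieves the top-k candidates from a large tool set before fine-grained decoding
Muon + AdamW dual optimizer	Muon orthogonalizes the updates to the Q/K/V/O projections with Newton-Schulz iterations, intended to prevent representation collapse when there is no FFN
INT4 QAT	Fake quantization every 100 steps, which acts as a regularizer and closes the gap between training and quantized deployment
Token-level loss weighting	Parameter values 4x, tool names 2x, keys 1.5x, structure tokens 1x

Muon is the safeguard for the FFN-free design: without an FFN the attention projection matrices degrade easily, and the Newton-Schulz iteration is there to keep them well-conditioned and prevent representation collapse.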
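The Gated Residual and ZCRMSNorm rows can be sketched in a few lines. Both formulas are assumptions, since the post gives only the one-line descriptions: the residual is taken to be x + sigmoid(g)·f(x) with g starting at 0 (so the gate starts at 0.5), and the norm is taken to scale by (1 + γ) so that γ = 0 gives unit gain.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def zc_rmsnorm(x, gamma, eps=1e-6):
    """RMS-normalize x, then scale by (1 + gamma); gamma = 0 leaves unit gain."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [(v / rms) * (1.0 + g) for v, g in zip(x, gamma)]


def gated_residual(x, sublayer_out, gate_logit=0.0):
    """Add the sublayer output through a sigmoid gate; logit 0 gives a gate of 0.5."""
    gate = sigmoid(gate_logit)
    return [xi + gate * si for xi, si in zip(x, sublayer_out)]


print(gated_residual([1.0, 2.0], [2.0, 2.0]))  # [2.0, 3.0]
```

Starting both parameters at zero means the network begins training with plain normalization and a half-open gate, and learns any other scale or gate value from there.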
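The Newton-Schulz step named in the Muon row can be shown on a tiny matrix. The sketch uses the textbook cubic iteration X ← 1.5X − 0.5·X·XᵀX, not Muon's own tuned polynomial, and it is not taken from the Needle code. It only demonstrates how repeated matrix products push a matrix toward an orthogonal one without computing an SVD.

```python
import math


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def transpose(a):
    return [list(row) for row in zip(*a)]


def newton_schulz_orthogonalize(g, steps=12):
    """Drive the singular values of g toward 1 using only matrix products."""
    fro = math.sqrt(sum(v * v for row in g for v in row))
    x = [[v / fro for v in row] for row in g]  # singular values now at most 1
    for _ in range(steps):
        x_xtx = matmul(x, matmul(transpose(x), x))
        x = [[1.5 * x[i][j] - 0.5 * x_xtx[i][j] for j in range(len(x[0]))]
             for i in range(len(x))]
    return x


def orthogonality_error(x):
    """Largest absolute entry of (x^T x - I)."""
    xtx = matmul(transpose(x), x)
    n = len(xtx)
    return max(abs(xtx[i][j] - (1.0 if i == j else 0.0)) for i in range(n) for j in range(n))


q = newton_schulz_orthogonalize([[3.0, 1.0], [0.0, 2.0]])
print(orthogonality_error(q) < 1e-9)  # True
```

In Muon the input matrix is the momentum-smoothed gradient of a weight matrix, so it is the update that gets orthogonalized before being applied.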
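For the INT4 QAT row, the following is a minimal sketch of fake quantization: weights are rounded to a 16-level grid and immediately converted back to floats. The symmetric per-tensor scale and the choice to snap the weights themselves every 100 steps are assumptions. The post only says that pseudo-quantization happens every 100 steps.

```python
def fake_quant_int4(weights):
    """Round weights to a symmetric 16-level grid and return the dequantized floats."""
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0.0:
        return list(weights)
    scale = max_abs / 7  # map the largest magnitude to level 7
    return [max(-8, min(7, round(w / scale))) * scale for w in weights]


def maybe_fake_quant(weights, step, every=100):
    """Apply fake quantization on every `every`-th step, otherwise pass through."""
    return fake_quant_int4(weights) if step % every == 0 else list(weights)


print(fake_quant_int4([1.75, -0.5, 0.25, 0.1]))  # [1.75, -0.5, 0.25, 0.0]
```

Values that already sit on the grid survive unchanged, while 0.1 falls below half a step (0.125) and rounds to zero. Exposing the model to that rounding during training is what narrows the gap to INT4 deployment.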
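The token-level loss weighting row amounts to a weighted average of per-token losses. The weights below come from the table. The token-kind labels and the normalization by the sum of weights are assumptions made for the sketch.

```python
# Weights from the table; the kind labels themselves are hypothetical names.
LOSS_WEIGHTS = {"value": 4.0, "tool_name": 2.0, "key": 1.5, "structure": 1.0}


def weighted_token_loss(token_nlls, token_kinds):
    """Weighted mean of per-token negative log-likelihoods."""
    weights = [LOSS_WEIGHTS[kind] for kind in token_kinds]
    total = sum(w * nll for w, nll in zip(weights, token_nlls))
    return total / sum(weights)


# One token of each kind: '{' (structure), a tool name, a key, an argument value.
loss = weighted_token_loss([0.5, 1.0, 2.0, 0.25], ["structure", "tool_name", "key", "value"])
print(round(loss, 4))  # 0.7647
```

A mistake on an argument value therefore costs four times as much as a mistake on a brace or comma, which steers capacity toward the tokens that decide whether the call is correct.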

Training Scale

Pre-training: 16× TPU v6e, PleIAs/SYNTH data, 200B tokens, 27 hours.

Post-training: 2B tokens of Gemini-synthesized single-turn function call data covering 15 categories (timers, messages, navigation, smart home, etc.), 45 minutes.
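The training figures imply the following throughput. This is plain arithmetic on the numbers above. The post does not say what hardware the post-training run used, so only the pre-training figure is divided per chip.

```python
PRETRAIN_TOKENS = 200e9
PRETRAIN_HOURS = 27
PRETRAIN_CHIPS = 16
POSTTRAIN_TOKENS = 2e9
POSTTRAIN_MINUTES = 45
PARAMS = 26e6


def tokens_per_second(tokens, seconds):
    return tokens / seconds


pretrain_tps = tokens_per_second(PRETRAIN_TOKENS, PRETRAIN_HOURS * 3600)
posttrain_tps = tokens_per_second(POSTTRAIN_TOKENS, POSTTRAIN_MINUTES * 60)

print(f"pre-training: {pretrain_tps:,.0f} tokens/s total, "
      f"{pretrain_tps / PRETRAIN_CHIPS:,.0f} per chip")
print(f"post-training: {posttrain_tps:,.0f} tokens/s")
print(f"pre-training tokens per parameter: {PRETRAIN_TOKENS / PARAMS:,.0f}")
```

That works out to about 2.06 million tokens per second during pre-training, or roughly 129 thousand per chip, and close to 7,700 training tokens per parameter.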

Performance

On single-turn tool calling, Needle (26M) outperforms:

FunctionGemma-270m (~10× Needle’s parameters)
Qwen-0.6B (~23× Needle’s parameters)
Granite-350m, LFM2.5-350m

The authors acknowledge that these larger models are more capable in multi-turn dialogue. Needle’s advantage is confined to the narrow task of single-turn tool calling.

Quick Start

git clone https://github.com/cactus-compute/needle
cd needle
pip install -e .
needle playground
# Open http://localhost:7860 to test with your own tool set and fine-tune in one click

Runs on Mac/PC, no GPU cloud resources required.

Why This Matters

Needle’s significance is not that it is “another small model”. It supports an architectural hypothesis: when a task’s boundaries are clear enough, redundant components of the general-purpose Transformer can be removed aggressively. Tool calling does not need the FFN, and future specialized edge models can apply the same logic: identify the operations a task really depends on and keep only the network structures that implement them.

The implication for agent architecture is a three-tier division of labor: small models handle perception I/O (eyes, ears, voice), ultra-small specialists such as Needle route tool calls, and large models are invoked only for deep reasoning. With that split, the whole agent can realistically run on local devices.

© 2026 Author: Mycelium Protocol. Licensed under CC BY 4.0 — free to share and adapt with attribution. You must credit the author and link to the original; removing attribution and republishing as original is not permitted.
