MiniCPM-o 实测：面壁智能小钢炮，端侧全双工 Omni 模型选型指南

MiniCPM-o in Practice: A Compact Omni Model and Edge-Side Selection Guide for Small Teams

Tech-Experiment

May 20, 2026

Updated May 20, 2026

🇨🇳 中文

BLUF：MiniCPM-o 4.5 是目前开源 Omni 模型里中文友好、端侧可部署、全双工实测最接近产品级的选手。对于提供端侧 AI 和 AI Agent 服务的小团队，它是值得优先验证的“小钢炮”。

为什么值得关注这个模型？

我今天在官方 Demo（minicpmo45.modelbest.cn）上实测了 MiniCPM-o 4.5，重点测试了两项能力：

全双工视频通话：摄像头画面实时传入，模型边看边说，延迟体感流畅，接近产品级
全双工音频通话：打断、接续、情绪感知均有体现，话轮切换（turn-taking）节奏自然

这是我测试过的开源 Omni 模型里体验最完整的一个。

Audio Full-Duplex 实测（中文对话，延迟 TTFS 1280ms，推理 352ms）：

MiniCPM-o 4.5 Audio Full-Duplex 实测截图

Omni Full-Duplex 实测 — 中文通话（模型实时看到摄像头画面并用中文回复，背景已隐私模糊）：

MiniCPM-o 4.5 Omni Full-Duplex 中文通话实测

Omni Full-Duplex 实测 — 英文通话（切换 English Call preset，模型主动描述画面内容并用英文对话）：

MiniCPM-o 4.5 Omni Full-Duplex 英文通话实测

核心能力：9B 参数，四模态端到端

模型架构（MiniCPM-o 4.5）：

组件	来源模型
视觉编码器	SigLip2
音频编码器	Whisper-medium
音频解码器	CosyVoice2
LLM 主干	Qwen3-8B
总参数	9B 端到端

关键指标：

OpenCompass 综合分 77.6，超过 GPT-4o，接近 Gemini 2.5 Flash
视觉输入：最高 1.8M 像素，10fps 实时视频流
语言：支持 30+ 语言，中文原生支持
模态切换延迟：< 0.1ms
推理延迟：A100 下 ~0.9s，开启 torch.compile 优化后 ~0.5s

开发者集成指南

1. 快速启动（Docker 推荐）

# 官方 Docker 镜像，28GB+ NVIDIA VRAM
docker pull openbmb/minicpmo:latest
docker run --gpus all -p 8000:8000 openbmb/minicpmo:latest

2. PyTorch 直接调用

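import librosa
import torch
from transformers import AutoModel, AutoTokenizer

# 注意：该 ID 对应 MiniCPM-o 2.6 权重；使用 4.5 时请替换为 OpenBMB 在 Hugging Face 上
# 公布的仓库 ID，并确认下列参数仍然适用。
MODEL_ID = "openbmb/MiniCPM-o-2_6"

model = AutoModel.from_pretrained(
    MODEL_ID,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
).eval().to(device="cuda")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model.init_tts()  # 请求语音输出前先加载语音解码器

# 单轮语音对话，写法参照 MiniCPM-o 2.6 模型卡；流式（全双工）接口请参考官方仓库。
# "question.wav" 为占位路径；音频以 16 kHz 单声道数组传入。
audio, _ = librosa.load("question.wav", sr=16000, mono=True)
response = model.chat(
    msgs=[{"role": "user", "content": [audio]}],
    tokenizer=tokenizer,
    use_tts_template=True,
    generate_audio=True,
    output_audio_path="reply.wav",
)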

3. 高吞吐部署（vLLM / SGLang）

# vLLM 部署，支持多并发（模型含自定义代码，需要 --trust-remote-code）
python -m vllm.entrypoints.openai.api_server \
    --model openbmb/MiniCPM-o-2_6 \
    --trust-remote-code \
    --dtype bfloat16 \
    --max-model-len 8192

4. 低资源部署（llama.cpp / Ollama）

CPU 或消费级 GPU 可用量化版本：

# GGUF 格式，支持 Mac M 系列 / Windows
# 标签名仅为示意，请以 Ollama 模型库中的实际标签为准
ollama pull minicpm-o:4bit
ollama run minicpm-o:4bit

Int4 量化约需 8–12GB 内存，M3 Max 64GB 可流畅运行。

5. 桌面客户端

官方提供 Windows & macOS 桌面 App（基于 llama.cpp-omni），开箱即用，无需编程。

Omni 模型横向选型对比

面向小团队的选型矩阵（端侧 AI + AI Agent 场景）：

模型	开源	实时全双工	中文	端侧部署	延迟	成本
MiniCPM-o 4.5	✅	✅ 视频+音频	✅ 原生	✅ GGUF/Ollama	~0.5s	自托管
GPT-4o Realtime	❌	✅ 音频	✅	❌ 仅 API	竞争性	~$0.10/min
Gemini Live 2.5	❌	✅ 音频	✅	❌ 仅 API	0.63s TTFA	$0.011/min
Qwen2.5-Omni	✅	✅ 全模态	✅ 原生	✅ Int4/AWQ	实时	自托管
Moshi	✅	✅ 纯语音	⚠️ 待测	✅	160ms	自托管
InternVL3.5	✅	❌ 仅视觉	✅	部分	—	自托管

选型建议：

端侧/私有化 + 中文全双工：首选 MiniCPM-o 4.5，其次 Qwen2.5-Omni
最低成本云端 API：Gemini Live（$0.011/min，24 语言）
最强工具调用集成：GPT-4o Realtime（Function Calling in Realtime API）
极低延迟纯语音：Moshi（160ms，CC-BY 4.0）
视觉理解超大模型：InternVL3.5 241B（接近 GPT-5 水平，无实时语音）

对小团队的实践建议

作为提供端侧 AI 和 AI Agent 服务的小团队，MiniCPM-o 的核心价值在于：

低门槛验证产品原型：Mac M3 Max 本地跑 4-bit 量化版，无需 GPU 服务器
全双工视频能力商用：视频客服、远程辅导、AI 陪伴类场景直接可用
完整部署路径覆盖：PyTorch、vLLM、Ollama、桌面 App，不同阶段按需切换
中文优先：中文原生支持；相比之下，Moshi 等欧美模型的中文表现在本次对比中尚未测试

建议路径：

先在 minicpmo45.modelbest.cn 试用官方 Demo 验证场景可行性
用 Ollama 4-bit 版在本地 Mac 搭建原型
产品化阶段迁移至 vLLM 多并发部署

© 2026 Author: Mycelium Protocol. 本文采用 CC BY 4.0 授权——欢迎转载和引用，须注明作者姓名及原文链接，不得去除署名后以原创发布。

🇬🇧 English

BLUF: In my testing, MiniCPM-o 4.5 came closest to production quality among open-source omni models that are Chinese-friendly, edge-deployable, and full-duplex. For small teams building edge AI or AI Agent products, it’s the compact powerhouse worth validating first.

Why Pay Attention to This Model?

I tested MiniCPM-o 4.5 on the official demo (minicpmo45.modelbest.cn) and focused on two key capabilities:

Full-duplex video call: Live camera feed streamed in, model responds while watching — latency felt smooth and near production quality
Full-duplex audio call: Interruption handling, continuation, and emotional awareness were all present; the turn-taking rhythm felt natural

This is the most complete open-source Omni model experience I’ve tested.

Audio Full-Duplex test (Chinese conversation, TTFS 1280ms, inference 352ms):

MiniCPM-o 4.5 Audio Full-Duplex test screenshot

Omni Full-Duplex — Chinese call (model sees live camera feed and responds in Chinese; background privacy-blurred):

MiniCPM-o 4.5 Omni Full-Duplex Chinese call

Omni Full-Duplex — English call (switched to English Call preset; model proactively describes the scene and converses in English):

MiniCPM-o 4.5 Omni Full-Duplex English call

Core Capabilities: 9B Parameters, Four-Modality End-to-End

Model architecture (MiniCPM-o 4.5):

Component	Source Model
Vision Encoder	SigLip2
Audio Encoder	Whisper-medium
Audio Decoder	CosyVoice2
LLM Backbone	Qwen3-8B
Total Parameters	9B end-to-end

Key metrics:

OpenCompass aggregate score 77.6 — outperforms GPT-4o, approaches Gemini 2.5 Flash
Vision input: up to 1.8M pixels, 10fps real-time video stream
Languages: 30+, native Chinese support
Modality-switching latency: < 0.1ms
Inference latency: ~0.9s on A100, ~0.5s with torch.compile
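
To put these numbers side by side, here is a small arithmetic sketch. It uses only figures quoted in this article. Reading the gap between TTFS and inference time as "everything outside model inference" (network, buffering, audio playback) is my assumption, not something the demo reports.

```python
# Figures quoted in this article.
FPS = 10                   # real-time video stream rate
A100_LATENCY_S = 0.9       # inference latency on A100
COMPILED_LATENCY_S = 0.5   # same, with torch.compile
TTFS_MS = 1280             # measured in the audio full-duplex demo
INFERENCE_MS = 352         # measured in the audio full-duplex demo


def frame_interval_ms(fps: float) -> float:
    """Time between consecutive video frames, in milliseconds."""
    return 1000.0 / fps


def speedup(before_s: float, after_s: float) -> float:
    """How many times faster the optimized path is."""
    return before_s / after_s


def non_inference_ms(ttfs_ms: int, inference_ms: int) -> int:
    """Part of TTFS not explained by inference (assumed: network, buffering, playback)."""
    return ttfs_ms - inference_ms


if __name__ == "__main__":
    print(f"frame interval at {FPS} fps: {frame_interval_ms(FPS):.0f} ms")
    print(f"torch.compile speedup: {speedup(A100_LATENCY_S, COMPILED_LATENCY_S):.1f}x")
    print(f"TTFS outside inference: {non_inference_ms(TTFS_MS, INFERENCE_MS)} ms")
```

At 10fps a new frame arrives every 100 ms, torch.compile gives roughly a 1.8x speedup, and about 928 ms of the measured TTFS sits outside inference. If that reading holds, the hosted demo's end-to-end latency is dominated by the pipeline around the model rather than the model itself.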

Developer Integration Guide

1. Quick Start (Docker Recommended)

docker pull openbmb/minicpmo:latest
docker run --gpus all -p 8000:8000 openbmb/minicpmo:latest

Requires 28GB+ NVIDIA VRAM.

2. Direct PyTorch

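import librosa
import torch
from transformers import AutoModel, AutoTokenizer

# NOTE: this ID is the MiniCPM-o 2.6 checkpoint; for 4.5, substitute the repo ID
# listed on the OpenBMB Hugging Face page and check that the arguments still apply.
MODEL_ID = "openbmb/MiniCPM-o-2_6"

model = AutoModel.from_pretrained(
    MODEL_ID,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
).eval().to(device="cuda")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model.init_tts()  # load the speech decoder before requesting audio output

# Single-turn audio chat, following the MiniCPM-o 2.6 model card; see the official
# repo for the streaming (full-duplex) interface.
# "question.wav" is a placeholder path; audio is passed as a 16 kHz mono array.
audio, _ = librosa.load("question.wav", sr=16000, mono=True)
response = model.chat(
    msgs=[{"role": "user", "content": [audio]}],
    tokenizer=tokenizer,
    use_tts_template=True,
    generate_audio=True,
    output_audio_path="reply.wav",
)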

3. High-Throughput (vLLM / SGLang)

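# The model ships custom code, so --trust-remote-code is required
python -m vllm.entrypoints.openai.api_server \
    --model openbmb/MiniCPM-o-2_6 \
    --trust-remote-code \
    --dtype bfloat16 \
    --max-model-len 8192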
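
Once the server is up, it exposes an OpenAI-compatible HTTP API. Below is a minimal text-only client sketch using only the Python standard library. It assumes the server runs on vLLM's default port 8000 on localhost and that the model name matches the `--model` value; `build_chat_request` and `chat` are illustrative helper names. Audio and video inputs are not covered here.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # assumption: vLLM default port on the local machine
MODEL_ID = "openbmb/MiniCPM-o-2_6"     # must match the --model value given to the server


def build_chat_request(prompt: str, max_tokens: int = 128) -> dict:
    """Build the JSON body for a single-turn chat completion."""
    return {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }


def chat(prompt: str, timeout_s: float = 60.0) -> str:
    """POST the request to /chat/completions and return the assistant's text."""
    payload = json.dumps(build_chat_request(prompt)).encode("utf-8")
    req = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout_s) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Calling `chat("Describe this model in one sentence.")` with the server running should return the reply text. Because the API follows the OpenAI schema, the official `openai` client pointed at the same base URL is a drop-in alternative.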

4. Low-Resource (llama.cpp / Ollama)

# Tag name is illustrative; confirm the exact model tag in the Ollama library
ollama pull minicpm-o:4bit
ollama run minicpm-o:4bit

Int4 quantization requires ~8–12GB RAM — runs smoothly on M3 Max 64GB.
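
As a sanity check on these memory figures, here is a back-of-the-envelope sketch for the weights alone. It assumes all 9B parameters are stored at the same bit width, which is a simplification: real GGUF files mix precisions, and the KV cache, encoders, and runtime buffers come on top.

```python
def weight_memory_gb(params_billion: float, bits_per_param: float) -> float:
    """Memory for the weights alone, in decimal GB (1 GB = 1e9 bytes)."""
    total_bytes = params_billion * 1e9 * bits_per_param / 8
    return total_bytes / 1e9


PARAMS_B = 9.0  # total parameters, from the architecture table

if __name__ == "__main__":
    print(f"bf16 weights: {weight_memory_gb(PARAMS_B, 16):.1f} GB")
    print(f"int4 weights: {weight_memory_gb(PARAMS_B, 4):.1f} GB")
```

This gives 18.0 GB for bf16 and 4.5 GB for int4. Both sit below the quoted requirements (28GB+ VRAM for the Docker image, 8–12GB RAM for Int4), which is consistent with the rest being runtime overhead rather than weights.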

5. Desktop App

Official Windows & macOS desktop app (llama.cpp-omni based) — no coding required, out of the box.

Omni Model Selection Matrix

For small teams building edge AI + AI Agent products:

Model	Open Source	Real-time Duplex	Chinese	Edge Deploy	Latency	Cost
MiniCPM-o 4.5	✅	✅ Video+Audio	✅ Native	✅ GGUF/Ollama	~0.5s	Self-hosted
GPT-4o Realtime	❌	✅ Audio	✅	❌ API only	Competitive	~$0.10/min
Gemini Live 2.5	❌	✅ Audio	✅	❌ API only	0.63s TTFA	$0.011/min
Qwen2.5-Omni	✅	✅ Full modal	✅ Native	✅ Int4/AWQ	Real-time	Self-hosted
Moshi	✅	✅ Voice only	⚠️ Unknown	✅	160ms	Self-hosted
InternVL3.5	✅	❌ Vision only	✅	Partial	—	Self-hosted
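
If you want to apply the matrix programmatically, a sketch like the following encodes the rows as data and filters on hard requirements. The `Candidate` fields and the `shortlist` helper are my own naming for illustration. The latency and cost columns are left out because several of their cells are qualitative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    open_source: bool
    duplex: str       # as listed in the matrix: "video+audio", "audio", ...
    chinese: str      # "native", "yes", or "unknown"
    edge_deploy: str  # "yes", "partial", or "no"


# Rows transcribed from the selection matrix above.
MATRIX = [
    Candidate("MiniCPM-o 4.5", True, "video+audio", "native", "yes"),
    Candidate("GPT-4o Realtime", False, "audio", "yes", "no"),
    Candidate("Gemini Live 2.5", False, "audio", "yes", "no"),
    Candidate("Qwen2.5-Omni", True, "full modal", "native", "yes"),
    Candidate("Moshi", True, "voice only", "unknown", "yes"),
    Candidate("InternVL3.5", True, "none", "yes", "partial"),
]


def shortlist(matrix, *, open_source=False, native_chinese=False, edge=False):
    """Return names of candidates meeting every requested requirement, in table order."""
    picked = []
    for c in matrix:
        if open_source and not c.open_source:
            continue
        if native_chinese and c.chinese != "native":
            continue
        if edge and c.edge_deploy != "yes":
            continue
        picked.append(c.name)
    return picked


if __name__ == "__main__":
    print(shortlist(MATRIX, open_source=True, native_chinese=True, edge=True))
```

With all three requirements switched on, the shortlist is MiniCPM-o 4.5 and Qwen2.5-Omni. Treating "partial" edge support as a miss is a deliberate strict choice; relax that check if partial support is acceptable for your case.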

Selection guidance:

Edge/private + Chinese full-duplex: MiniCPM-o 4.5 first, then Qwen2.5-Omni
Lowest cost cloud API: Gemini Live ($0.011/min, 24 languages)
Strongest tool-calling integration: GPT-4o Realtime (Function Calling in Realtime API)
Minimum latency pure voice: Moshi (160ms, CC-BY 4.0)
Largest vision model: InternVL3.5 241B (near GPT-5 on vision benchmarks, no real-time speech)
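
The per-minute prices in the matrix translate into monthly bills as sketched below. The two API prices come from the table. The self-hosted server cost is a purely hypothetical input for the break-even calculation, so plug in your own number.

```python
# Per-minute API prices from the selection matrix (USD).
PRICE_PER_MIN_USD = {
    "GPT-4o Realtime": 0.10,
    "Gemini Live 2.5": 0.011,
}


def monthly_cost_usd(price_per_min: float, minutes_per_day: float, days: int = 30) -> float:
    """API bill for a month of conversation at a steady daily volume."""
    return price_per_min * minutes_per_day * days


def break_even_minutes(server_cost_per_month_usd: float, price_per_min: float) -> float:
    """Monthly minutes above which a fixed-cost self-hosted server is cheaper than the API."""
    return server_cost_per_month_usd / price_per_min


if __name__ == "__main__":
    minutes_per_day = 1000          # example volume, not a measurement
    hypothetical_server_usd = 1500  # HYPOTHETICAL monthly GPU server cost
    for name, price in PRICE_PER_MIN_USD.items():
        bill = monthly_cost_usd(price, minutes_per_day)
        even = break_even_minutes(hypothetical_server_usd, price)
        print(f"{name}: ${bill:,.0f}/month, break-even at {even:,.0f} min/month")
```

At 1,000 minutes a day, the roughly 9x price gap between the two APIs becomes about $3,000 versus $330 a month. Self-hosting only pays off once your volume clears the break-even point for your actual server cost.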

Practical Advice for Small Teams

As a small team delivering edge AI and AI Agent services, MiniCPM-o’s core value is:

Low-barrier prototype validation: Run 4-bit quantized locally on Mac M3 Max — no GPU server needed
Full-duplex video for commercial use: AI customer service, remote tutoring, companion AI — ready now
Full deployment-path coverage: PyTorch, vLLM, Ollama, and the desktop app, so you can switch tooling as each stage requires
Chinese-first: Chinese is natively supported, whereas the Chinese quality of Western alternatives such as Moshi is still untested in this comparison

Recommended path:

Try the official demo at minicpmo45.modelbest.cn to validate your target scenario
Set up a local prototype with Ollama 4-bit on Mac
Migrate to vLLM multi-concurrent deployment when productizing

Source: GitHub — OpenBMB/MiniCPM-o-Demo

© 2026 Author: Mycelium Protocol. Licensed under CC BY 4.0 — free to share and adapt with attribution. You must credit the author and link to the original; removing attribution and republishing as original is not permitted.
