Computer Vision Expert (SOTA 2026)

Role: Advanced Vision Systems Architect & Spatial Intelligence Expert

Purpose

To provide expert guidance on designing, implementing, and optimizing state-of-the-art computer vision pipelines, from real-time object detection with YOLO26 to foundation-model-based segmentation with SAM 3 and visual reasoning with VLMs.

When to Use

- Designing high-performance real-time detection systems (YOLO26).
- Implementing zero-shot or text-guided segmentation tasks (SAM 3).
- Building spatial awareness, depth estimation, or 3D reconstruction systems.
- Optimizing vision models for edge device deployment (ONNX, TensorRT, NPU).
- Bridging classical geometry (calibration) with modern deep learning.

Capabilities

1. Unified Real-Time Detection (YOLO26)

- NMS-Free Architecture: Mastery of end-to-end inference without Non-Maximum Suppression (NMS), which reduces latency and pipeline complexity.
- Edge Deployment: Optimization for low-power hardware through removal of Distribution Focal Loss (DFL) and use of the MuSGD optimizer.
- Improved Small-Object Recognition: Expertise in using ProgLoss and STAL assignment for high precision in IoT and industrial settings.
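To make the NMS-free point concrete, here is a minimal sketch of what post-processing can reduce to when the detection head already emits one prediction per object. The row layout `[x1, y1, x2, y2, score, class_id]`, the function name `decode_end_to_end`, and the default thresholds are assumptions for illustration, not a confirmed YOLO26 output format.

```python
from typing import List, NamedTuple, Sequence


class Detection(NamedTuple):
    box: tuple  # (x1, y1, x2, y2) in pixels
    score: float
    class_id: int


def decode_end_to_end(
    rows: Sequence[Sequence[float]],
    conf_threshold: float = 0.25,
    max_detections: int = 300,
) -> List[Detection]:
    """Filter and rank rows from an (assumed) NMS-free detection head.

    Each row is assumed to be [x1, y1, x2, y2, score, class_id]. No IoU-based
    suppression is applied: the head is expected to avoid duplicate boxes.
    """
    kept = [
        Detection(tuple(r[:4]), float(r[4]), int(r[5]))
        for r in rows
        if r[4] >= conf_threshold
    ]
    # Highest-confidence detections first, capped at max_detections.
    kept.sort(key=lambda d: d.score, reverse=True)
    return kept[:max_detections]


# Hypothetical raw head output for one image.
raw_rows = [
    [10.0, 20.0, 110.0, 220.0, 0.91, 0],
    [12.0, 22.0, 108.0, 218.0, 0.04, 0],  # low-confidence row, dropped
    [300.0, 40.0, 380.0, 120.0, 0.67, 2],
]
detections = decode_end_to_end(raw_rows)
```

Compared with a classical pipeline, there is no pairwise IoU loop and no IoU threshold to tune, which is where the latency and export simplifications would come from.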

2. Promptable Segmentation (SAM 3)

- Text-to-Mask: Ability to segment objects using natural language descriptions (e.g., "the blue container on the right").
- SAM 3D: Reconstructing objects, scenes, and human bodies in 3D from single-view or multi-view images.
- Unified Logic: One model for detection, segmentation, and tracking, with a reported 2x accuracy improvement over SAM 2.

3. Vision Language Models (VLMs)

- Visual Grounding: Leveraging Florence-2, PaliGemma 2, or Qwen2-VL for semantic scene understanding.
- Visual Question Answering (VQA): Extracting structured data from visual inputs through conversational reasoning.
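Extracting structured data through VQA usually means asking the model to answer in a machine-readable form and then validating the reply. The sketch below assumes the VLM was prompted to reply with a flat JSON object containing `value` and `unit`. The gauge-reading scenario, the `GaugeReading` type, and `parse_vlm_reply` are all hypothetical and not tied to any specific model's API.

```python
import json
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class GaugeReading:
    value: float
    unit: str


def parse_vlm_reply(reply: str) -> Optional[GaugeReading]:
    """Pull the first flat JSON object out of a free-form VLM reply.

    Returns None when no object is found or a required key is missing or
    malformed, so the caller can retry or flag the frame for review.
    Nested objects are not handled by this simple pattern.
    """
    match = re.search(r"\{.*?\}", reply, flags=re.DOTALL)
    if match is None:
        return None
    try:
        data = json.loads(match.group(0))
        return GaugeReading(value=float(data["value"]), unit=str(data["unit"]))
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        return None


# Hypothetical model reply with surrounding chatter.
reply = 'The gauge shows: {"value": 42.5, "unit": "psi"} based on the needle.'
reading = parse_vlm_reply(reply)
```

Returning `None` instead of raising keeps the failure path explicit, which may be useful when replies are occasionally malformed.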

4. Geometry & Reconstruction

- Depth Anything V2: State-of-the-art monocular depth estimation for spatial awareness.
- Sub-pixel Calibration: Chessboard/ChArUco pipelines for high-precision stereo and multi-camera rigs.
- Visual SLAM: Real-time localization and mapping for autonomous systems.

Patterns

1. Text-Guided Vision Pipelines

- Use SAM 3's text-to-mask capability to isolate specific parts during inspection without needing custom detectors for every variation.
- Combine YOLO26 for fast "candidate proposal" and SAM 3 for "precise mask refinement".
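The "candidate proposal, then mask refinement" pattern can be sketched as a small orchestration function that is independent of any particular model. Here `propose` and `segment` are hypothetical callables standing in for a detector and a segmenter. The stub implementations exist only to make the sketch runnable and do not reflect the YOLO26 or SAM 3 APIs.

```python
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2), with x2/y2 exclusive
Mask = List[List[int]]           # binary mask, same size as the crop
Image = List[List[int]]          # grayscale image as nested lists


def refine_candidates(
    image: Image,
    propose: Callable[[Image], List[Box]],
    segment: Callable[[Image], Mask],
) -> List[Tuple[Box, Mask]]:
    """Run a fast proposer, then a segmenter on each proposed crop."""
    results = []
    for x1, y1, x2, y2 in propose(image):
        crop = [row[x1:x2] for row in image[y1:y2]]
        results.append(((x1, y1, x2, y2), segment(crop)))
    return results


# Stand-ins for the real models (assumptions, for illustration only).
def fake_detector(image: Image) -> List[Box]:
    return [(1, 1, 3, 3)]


def fake_segmenter(crop: Image) -> Mask:
    # Threshold at 128 to mimic a foreground mask.
    return [[1 if px > 128 else 0 for px in row] for row in crop]


image = [
    [0, 0, 0, 0],
    [0, 200, 50, 0],
    [0, 220, 210, 0],
    [0, 0, 0, 0],
]
pairs = refine_candidates(image, fake_detector, fake_segmenter)
```

Running the expensive segmenter only on proposed crops, rather than on the full frame, is the design choice this pattern relies on.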

2. Deployment-First Design

- Leverage YOLO26's simplified ONNX/TensorRT exports (NMS-free).
- Use MuSGD for significantly faster training convergence on custom datasets.

3. Progressive 3D Scene Reconstruction

- Integrate monocular depth maps with geometric homographies to build accurate 2.5D/3D representations of scenes.
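One common building block for this kind of reconstruction is back-projecting a depth map into camera-frame 3D points with the pinhole model. The sketch below assumes the depth values are already metric (monocular models often predict relative depth, which would need scaling first) and that the intrinsics `fx`, `fy`, `cx`, `cy` come from a prior calibration. The function name and sample values are illustrative.

```python
from typing import List, Tuple

Point3D = Tuple[float, float, float]


def backproject_depth(
    depth: List[List[float]],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
) -> List[Point3D]:
    """Lift a metric depth map to camera-frame 3D points (pinhole model).

    For pixel (u, v) with depth Z:
        X = (u - cx) * Z / fx,  Y = (v - cy) * Z / fy.
    Pixels with non-positive depth are treated as invalid and skipped.
    """
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0:
                continue
            points.append(((u - cx) * z / fx, (v - cy) * z / fy, z))
    return points


# Tiny 2x2 depth map with one invalid pixel (assumed values).
depth_map = [
    [2.0, 2.0],
    [0.0, 4.0],
]
cloud = backproject_depth(depth_map, fx=2.0, fy=2.0, cx=0.0, cy=0.0)
```

The resulting points live in the camera frame. Fusing several views would additionally require the relative camera poses, which this sketch does not cover.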

Anti-Patterns

- Manual NMS Post-processing: Prefer NMS-free architectures (YOLO26/v10+) for lower overhead.
- Click-Only Segmentation: Forgetting that SAM 3's text grounding removes the need for manual point prompts in many scenarios.
- Legacy DFL Exports: Using outdated export pipelines that don't take advantage of YOLO26's simplified module structure.

Sharp Edges (2026)

Issue	Severity	Solution
SAM 3 VRAM Usage	Medium	Use quantized/distilled versions for local GPU inference.
Text Ambiguity

Related Skills

ai-engineer, robotics-expert, research-engineer

Skill name: computer-vision-expert
