Perceptron — Vision SDK
Docs: https://docs.perceptron.inc/
Image and video analysis via the Perceptron Python SDK. Pass file paths or URLs directly — the SDK handles base64 conversion automatically.
Setup
CODEBLOCK0
Quick Reference
| Task | Function | Example |
|---|---|---|
| Describe / Q&A | `question()` | `question("photo.jpg", "What's in this image?")` |
| Grounded Q&A | `question()` | `question("photo.jpg", "Where is the cat?", expects="box")` |
| Object detection | `detect()` | `detect("photo.jpg", classes=["person", "car"])` |
| OCR | `ocr()` | `ocr("document.png")` |
| OCR (markdown) | `ocr_markdown()` | `ocr_markdown("document.png")` |
| Caption | `caption()` | `caption("photo.jpg", style="detailed")` |
| Counting | `question()` | `question("photo.jpg", "How many dogs?", expects="point")` |
| Custom workflow | `@perceive` | See DSL composition below |
Python SDK
CODEBLOCK1
DSL Composition (Advanced)
Build custom multimodal workflows:
CODEBLOCK2
Structured Outputs
Constrain responses to Pydantic models, JSON schemas, or regex:
CODEBLOCK3
Pixel Coordinate Conversion
All spatial outputs use normalized coordinates (0–1000). Convert to pixels:
CODEBLOCK4
CLI Script
Located at: `/scripts/perceptron_cli.py`
Requires the `PERCEPTRON_API_KEY` environment variable. The provider is always `perceptron`.
CODEBLOCK5
Models
| Model | Best for | Speed | Temperature |
|---|---|---|---|
| `isaac-0.2-2b-preview` (default) | General use, detection, OCR | Fast | 0.0 |
| `isaac-0.2-1b` | Quick/simple tasks | Fastest | 0.0 |
Override with `model="..."` in any SDK call or `--model ...` in the CLI.
Grounding (expects parameter)
| Value | Returns | Use case |
|---|---|---|
| `text` (default) | Plain text | Q&A, descriptions, OCR |
| `box` | Bounding boxes | Detection, localization |
| `point` | Point coordinates | Counting, pointing |
| `polygon` | Polygon vertices | Segmentation |
Video Analysis
Extract frames with ffmpeg, then analyze:
CODEBLOCK6
For continuous monitoring, extract multiple frames and batch process.
Reference Files
For deeper SDK usage, consult these when needed:
- references/capabilities.md: Focus mode, reasoning, streaming, ICL, structured outputs, annotation types
- references/prompting.md: Optimal prompts per task, vision hints (`<hint>BOX</hint>`), temperature guide
- references/api.md: SDK configuration, models, image formats, streaming, best practices
Perceptron — Vision SDK
Docs: https://docs.perceptron.inc/
Image and video analysis via the Perceptron Python SDK. Pass file paths or URLs directly; the SDK handles base64 conversion automatically.
Setup
```bash
pip install perceptron
export PERCEPTRON_API_KEY=ak_...
```
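Since every call depends on the key being present, a script can check for it up front and fail with a clear message rather than at the first request. This is a minimal sketch that assumes only that the key is read from `PERCEPTRON_API_KEY`; `require_api_key` is a hypothetical helper, not part of the SDK.

```python
import os


def require_api_key(env=None, name="PERCEPTRON_API_KEY"):
    """Return the API key from the environment, or raise if it is missing.

    `env` defaults to os.environ; passing a plain dict makes this easy to test.
    """
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; export it before using the SDK")
    return key


# Typical use at the top of a script (hypothetical):
# api_key = require_api_key()
```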
Quick Reference
| Task | Function | Example |
|---|---|---|
| Describe / Q&A | `question()` | `question("photo.jpg", "What's in this image?")` |
| Grounded Q&A | `question()` | `question("photo.jpg", "Where is the cat?", expects="box")` |
| Object detection | `detect()` | `detect("photo.jpg", classes=["person", "car"])` |
| OCR | `ocr()` | `ocr("document.png")` |
| OCR (markdown) | `ocr_markdown()` | `ocr_markdown("document.png")` |
| Caption | `caption()` | `caption("photo.jpg", style="detailed")` |
| Counting | `question()` | `question("photo.jpg", "How many dogs?", expects="point")` |
| Custom workflow | `@perceive` | See DSL composition below |
Python SDK
```python
from perceptron import configure, detect, caption, ocr, ocr_markdown, question

# Configure (or set the PERCEPTRON_API_KEY environment variable)
configure(provider="perceptron", api_key="ak_...")

# Visual Q&A: the most common operation
result = question("photo.jpg", "What's happening in this image?")
print(result.text)

# Grounded Q&A: get bounding boxes along with the answer
result = question("photo.jpg", "Where is the damage?", expects="box")
for box in result.points or []:
    print(f"{box.mention}: ({box.top_left.x},{box.top_left.y}) → ({box.bottom_right.x},{box.bottom_right.y})")

# Object detection
result = detect("warehouse.jpg", classes=["forklift", "person"])
for box in result.points or []:
    print(f"{box.mention}: ({box.top_left.x},{box.top_left.y}) → ({box.bottom_right.x},{box.bottom_right.y})")

# OCR
result = ocr("receipt.jpg", prompt="Extract the total amount")
print(result.text)
result = ocr_markdown("document.png")  # structured Markdown output
print(result.text)

# Caption
result = caption("scene.png", style="detailed")
print(result.text)
```
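DSL Composition (Advanced)
Build custom multimodal workflows:
```python
from perceptron import perceive, image, text, system

@perceive(expects="box", model="isaac-0.2-2b-preview")
def find_hazards(img_path):
    # The system message carries the BOX vision hint
    return [system("<hint>BOX</hint>"), image(img_path), text("Find all safety hazards")]

result = find_hazards("factory.jpg")
```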
Structured Outputs
Constrain responses to Pydantic models, JSON schemas, or regular expressions:
```python
from perceptron import perceive, image, text, pydantic_format
from pydantic import BaseModel

class Scene(BaseModel):
    objects: list[str]
    count: int

@perceive(response_format=pydantic_format(Scene))
def count_objects(path):
    return image(path) + text("List all objects and count them. Return JSON.")

result = count_objects("photo.jpg")
# Parse the constrained JSON text back into the model
scene = Scene.model_validate_json(result.text)
```
Pixel Coordinate Conversion
All spatial outputs use normalized coordinates (0–1000). Convert to pixels:
```python
pixel_boxes = result.points_to_pixels(width=1920, height=1080)

# Or standalone:
from perceptron import scale_points_to_pixels
pixel_pts = scale_points_to_pixels(result.points, width=1920, height=1080)
```
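To make the scaling explicit, here is a minimal pure-Python sketch of what such a conversion amounts to, assuming each normalized axis spans 0 to 1000 and maps linearly onto the image width and height. `to_pixels` is a hypothetical stand-in for illustration, not the SDK's implementation, and rounding to the nearest integer is also an assumption.

```python
def to_pixels(x_norm, y_norm, width, height, scale=1000):
    """Map a point from the normalized 0-1000 grid to integer pixel coordinates."""
    x_px = round(x_norm / scale * width)
    y_px = round(y_norm / scale * height)
    return x_px, y_px


# The center of the normalized grid on a 1920x1080 frame
print(to_pixels(500, 500, width=1920, height=1080))  # (960, 540)
```

In this sketch a normalized value of 1000 maps to the image width or height itself, so clamp the result if you need valid array indices.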
CLI Script
Located at: `/scripts/perceptron_cli.py`
Requires the `PERCEPTRON_API_KEY` environment variable. The provider is always `perceptron`.
```bash
P=/scripts/perceptron_cli.py

# Visual Q&A
python3 $P question photo.jpg "What do you see?"
python3 $P question photo.jpg "Where is the car?" --expects box

# Object detection
python3 $P detect photo.jpg --classes person,car
python3 $P detect photo.jpg --classes forklift --format json --pixels
python3 $P detect ./frames/ --classes defect  # batch-process a directory

# OCR
python3 $P ocr document.png
python3 $P ocr receipt.jpg --output markdown

# Caption
python3 $P caption scene.png --style detailed

# Custom perception
python3 $P perceive frame.png --prompt "Describe this scene" --expects box

# Batch processing
python3 $P batch --images img1.jpg img2.jpg --prompt "Describe" --output results.json

# Parse raw model output
python3 $P parse --mode points

# List models
python3 $P models
```
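Models
| Model | Best for | Speed | Temperature |
|---|---|---|---|
| `isaac-0.2-2b-preview` (default) | General use, detection, OCR | Fast | 0.0 |
| `isaac-0.2-1b` | Quick/simple tasks | Fastest | 0.0 |
Override with `model="..."` in any SDK call or `--model ...` in the CLI.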
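Grounding (expects parameter)
| Value | Returns | Use case |
|---|---|---|
| `text` (default) | Plain text | Q&A, descriptions, OCR |
| `box` | Bounding boxes | Detection, localization |
| `point` | Point coordinates | Counting, pointing |
| `polygon` | Polygon vertices | Segmentation |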
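For counting, the idea is that `point` mode locates each object, so the count can be read off the returned points. The sketch below tallies points per label. It uses a stand-in `Point` dataclass instead of real SDK results, and it assumes one point is returned per located object and that each point exposes a `mention` attribute like the boxes in the SDK examples.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Point:
    """Stand-in for an SDK point annotation (hypothetical shape)."""
    x: int
    y: int
    mention: str


def count_by_mention(points):
    """Tally points per label; tolerates None, like `result.points or []`."""
    return Counter(p.mention for p in (points or []))


points = [Point(120, 340, "dog"), Point(610, 355, "dog"), Point(800, 200, "cat")]
print(count_by_mention(points))  # Counter({'dog': 2, 'cat': 1})
```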
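Video Analysis
Extract frames with ffmpeg, then analyze:
```bash
# Extract a single frame at the 5-second mark
ffmpeg -ss 5 -i video.mp4 -frames:v 1 -q:v 2 /tmp/frame.jpg

# Then analyze it
python3 $P question /tmp/frame.jpg "What's happening?"
```
For continuous monitoring, extract multiple frames and batch process them.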
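One way to extract multiple frames is ffmpeg's `fps` filter, which samples the video at a fixed rate. The sketch below only builds the command line. The helper name `frame_extract_cmd`, the output file pattern, and the 5-second interval are illustrative assumptions. The resulting directory could then be passed to the `detect ./frames/` batch form shown in the CLI section.

```python
import shlex


def frame_extract_cmd(video, out_dir, every_seconds=5):
    """Build an ffmpeg command that saves one frame every `every_seconds` seconds."""
    if every_seconds <= 0:
        raise ValueError("every_seconds must be positive")
    return [
        "ffmpeg", "-i", video,
        "-vf", f"fps=1/{every_seconds}",  # sample one frame per interval
        "-q:v", "2",                       # same JPEG quality as the single-frame example
        f"{out_dir}/frame_%04d.jpg",       # numbered output files
    ]


cmd = frame_extract_cmd("video.mp4", "./frames")
print(shlex.join(cmd))
```

To run it, create the output directory first (ffmpeg does not create it) and pass the list to `subprocess.run(cmd, check=True)`.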
Reference Files
For deeper SDK usage, consult the following files: