qwencloud-vision
> **Agent setup**: If your agent doesn't auto-load skills (e.g. Claude Code),
> see [agent-compatibility.md](references/agent-compatibility.md) once per session.
# Qwen Vision (Image & Video Understanding)
Analyze images and videos using Qwen VL and QVQ models.
This skill is part of **qwencloud/qwencloud-ai**.
## Skill directory
Use this skill's internal files to execute and learn. Load reference files on demand when the default path fails or you need details.
| Location | Purpose |
|----------|---------|
| `scripts/analyze.py` | Image/video understanding, multi-image, thinking mode |
| `scripts/reason.py` | Visual reasoning (QVQ, chain-of-thought, streaming) |
| `scripts/ocr.py` | OCR text extraction |
| `scripts/vision_lib.py` | Shared helpers (base64, upload, streaming) |
| `references/execution-guide.md` | Fallback: curl, code generation |
| `references/curl-examples.md` | Curl for base64, multi-image, video, OCR |
| `references/visual-reasoning.md` | QVQ and thinking mode details |
| `references/prompt-guide.md` | Query prompt templates by task, thinking mode decision |
| `references/ocr.md` | OCR parameters and examples |
| `references/sources.md` | Official documentation URLs |
| `references/agent-compatibility.md` | Agent self-check: register skills in project config for agents that don't auto-load |
## Security
**NEVER output any API key or credential in plaintext.** Always use variable references (`$DASHSCOPE_API_KEY` in shell, `os.environ["DASHSCOPE_API_KEY"]` in Python). Any check or detection of credentials must be **non-plaintext**: report only status (e.g. "set" / "not set", "valid" / "invalid"), never the value. Never display contents of `.env` or config files that may contain secrets.
**When the API key is not configured, NEVER ask the user to provide it directly.** Instead, help create a `.env` file with a placeholder (`DASHSCOPE_API_KEY=sk-your-key-here`) and instruct the user to replace it with their actual key from the [QwenCloud Console](https://home.qwencloud.com/api-keys). Only write the actual key value if the user explicitly requests it.
## Key Compatibility
Scripts require a **standard QwenCloud API key** (`sk-...`). Coding Plan keys (`sk-sp-...`) cannot be used for direct API calls and do not support dedicated vision models (qwen3-vl-plus, qvq-max, etc.). The scripts detect `sk-sp-` keys at startup and print a warning. If qwencloud-ops-auth is installed, see its `references/codingplan.md` for full details.
## Model Selection
| Model | Use Case |
|-------|----------|
| **qwen3.5-plus** | **Preferred** — unified multimodal (text+image+video). Thinking on by default. |
| **qwen3.5-flash** | Fast multimodal — cheaper, faster. Thinking on by default. |
| **qwen3-vl-plus** | High-precision — object localization (2D/3D), document/webpage parsing. |
| **qwen3-vl-flash** | Fast vision — lower latency, 33 languages. |
| **qvq-max** | Visual reasoning — chain-of-thought for math, charts. **Streaming only.** |
| **qwen-vl-ocr** | OCR — text extraction, table parsing, document scanning. |
| **qwen-vl-max** | Qwen2.5-VL — best-performing in 2.5 series. |
| **qwen-vl-plus** | Qwen2.5-VL — faster, good balance of performance and cost, 11 languages. |
1. **User specified a model** → use directly.
2. **Consult the qwencloud-model-selector skill** when model choice depends on requirement, scenario, or pricing.
3. **No signal, clear task** → `qwen3.5-plus`. Use `qwen3-vl-plus` for precise localization or 3D detection.
> **⚠️ Important**: The model list above is a **point-in-time snapshot** and may be outdated. Model availability
> changes frequently. **Always check the [official model list](https://www.qwencloud.com/models)
> for the authoritative, up-to-date catalog before making model decisions.**
## Execution
### Prerequisites
- **API Key**: Check that `DASHSCOPE_API_KEY` (or `QWEN_API_KEY`) is set using a **non-plaintext** check only (e.g. in shell:
`[ -n "$DASHSCOPE_API_KEY" ]`; report only "set" or "not set", never the key value). If not set: run the *
*qwencloud-ops-auth** skill if available; otherwise guide the user to obtain a key from [QwenCloud Console](https://home.qwencloud.com/api-keys) and set it via `.env` file (
`echo 'DASHSCOPE_API_KEY=sk-your-key-here' >> .env` in project root or current directory) or environment variable. The
script searches for `.env` in the current working directory and the project root. Skills may be installed
independently — do not assume qwencloud-ops-auth is present.
- Python 3.9+ (stdlib only, **no pip install needed**)
### Environment Check
Before first execution, verify Python is available:
```bash
python3 --version # must be 3.9+
```
If `python3` is not found, try `python --version` or `py -3 --version`. If Python is unavailable or below 3.9, skip to **Path 2 (curl)** in [execution-guide.md](references/execution-guide.md).
### Default: Run Script
**Script path**: Scripts are in the `scripts/` subdirectory **of this skill's directory** (the directory containing this SKILL.md). **You MUST first locate this skill's installation directory, then ALWAYS use the full absolute path to execute scripts.** Do NOT assume scripts are in the current working directory. Do NOT use `cd` to switch directories before execution. Shared infrastructure lives in `scripts/vision_lib.py`.
**Execution note:** Run all scripts in the **foreground** — wait for stdout; do not background.
**Discovery:** Run `python3 <this-skill-dir>/scripts/analyze.py --help` (or `reason.py`, `ocr.py`) first to see all available arguments.
| Script | Purpose | Default Model |
|--------|---------|---------------|
| `scripts/analyze.py` | Image understanding, multi-image, video, thinking mode, high-res | `qwen3.5-plus` |
| `scripts/reason.py` | Visual reasoning with chain-of-thought, video reasoning (always streaming) | `qvq-max` |
| `scripts/ocr.py` | OCR text extraction from documents, receipts, tables | `qwen-vl-ocr` |
**Input type fields** (use exactly one in `--request` JSON):
| Field | Use for | Example |
|-------|---------|--------|
| `"image"` | Single image (URL or local path) | `"image": "photo.jpg"` |
| `"images"` | Multi-image comparison (array) | `"images": ["a.jpg", "b.jpg"]` |
| `"video"` | Video file (URL or local path) | `"video": "clip.mp4"` |
| `"video_frames"` | Video as frame array | `"video_frames": ["f1.jpg", "f2.jpg"]` |
> **⚠️ Common mistake**: Do NOT use `"image"` for video files — use `"video"` instead.
```bash
# Image analysis
python3 <this-skill-dir>/scripts/analyze.py \
--request '{"prompt":"What is in this image?","image":"https://example.com/photo.jpg"}' \
--output output/qwencloud-vision/result.json --print-response
# Video analysis (local file — add --upload-files for files >= 7 MB)
python3 <this-skill-dir>/scripts/analyze.py \
--request '{"prompt":"Describe what happens in this video","video":"clip.mp4"}' \
--upload-files --print-response
python3 <this-skill-dir>/scripts/reason.py \
--request '{"prompt":"Solve this math problem step by step","image":"problem.png"}' \
--print-response
python3 <this-skill-dir>/scripts/ocr.py \
--request '{"image":"invoice.jpg"}' \
--print-response
```
| Argument | Description |
|----------|-------------|
| `--request '{...}'` | JSON request body |
| `--file path.json` | Load request from file |
| `--output path` | Save response JSON to path |
| `--print-response` | Print response to stdout |
| `--stream` | Enable streaming (auto for thinking/QVQ) |
| `--upload-files` | Upload local files to temp storage (for files > 7 MB) |
| `--schema path.json` | JSON Schema for structured extraction |
### Verify Result
- Exit code `0` + output contains valid JSON with `choices` field → **success**
- Non-zero exit, HTTP error, empty response, or JSON with `"code"`/`"message"` error → **fail**
- **Post-execution check**: When `--output` is used, verify the response JSON file exists and contains expected content
- **MANDATORY — stderr signal check**: After confirming the result, scan the command's stderr output for `[ACTION_REQUIRED]` or `[UPDATE_AVAILABLE]`. If either signal is present, you **MUST** follow the instructions in [Update Check](#update-check-mandatory-post-execution) below before responding to the user.
### On Failure
If scripts fail, match the error output against the diagnostic table below to determine the resolution. If no match, read [execution-guide.md](references/execution-guide.md) for alternative paths: curl commands (Path 2), code generation (Path 3), and autonomous resolution (Path 5).
**If Python is not available at all** → skip directly to Path 2 (curl) in [execution-guide.md](references/execution-guide.md).
| Error Pattern | Diagnosis | Resolution |
|----------------------------------|----------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------|
| `command not found: python3` | Python not on PATH | Try `python` or `py -3`; install Python 3.9+ if missing |
| `Python 3.9+ required` | Script version check failed | Upgrade Python to 3.9+ |
| `SyntaxError` near type hints | Python < 3.9 | Upgrade Python to 3.9+ |
| `QWEN_API_KEY/DASHSCOPE_API_KEY not found` | Missing API key | Obtain key from [QwenCloud Console](https://home.qwencloud.com/api-keys); add to `.env`: `echo 'DASHSCOPE_API_KEY=sk-...' >> .env`; or run **qwencloud-ops-auth** if available |
| `HTTP 401` | Invalid or mismatched key | Run **qwencloud-ops-auth** (non-plaintext check only); verify key is valid |
| `SSL: CERTIFICATE_VERIFY_FAILED` | SSL cert issue (proxy/corporate) | macOS: run `Install Certificates.command`; else set `SSL_CERT_FILE` env var |
| `URLError` / `ConnectionError` | Network unreachable | Check internet; set `HTTPS_PROXY` if behind proxy |
| `HTTP 429` | Rate limited | Wait and retry with backoff |
| `HTTP 5xx` | Server error | Retry with backoff |
| `PermissionError` | Can't write output | Use `--output` to specify writable directory |
## File Input
The API accepts: **HTTP/HTTPS URL**, **Base64 data URI**, and **`oss://` URL**. Local file paths are NOT directly supported — scripts handle conversion automatically. **Pass local paths directly; no manual upload step needed.**
**Large file rule: If the local file is >= 7 MB, always add `--upload-files`.** Base64 encoding inflates size by ~33% and will exceed the 10 MB API limit. Small files (including short video clips < 7 MB) can use the default base64 path.
| Method | When to use | How |
|--------|-------------|-----|
| **Online URL** | File already hosted | Pass URL directly — **preferred for large files** |
| **Base64** (default) | Local files < 7 MB (images or short video clips) | Script auto-converts to `data:` URI |
| **Temp upload** | Local files >= 7 MB | Add `--upload-files` flag → uploads to DashScope temp storage (`oss://` URL, 48h TTL) |
> **Production**: Default temp storage has **48h TTL** and **100 QPS upload limit** — not suitable for production, high-concurrency, or load-testing. To use your own OSS bucket, set `QWEN_TMP_OSS_BUCKET` and `QWEN_TMP_OSS_REGION` in `.env`, install `pip install alibabacloud-oss-v2`, and provide credentials via `QWEN_TMP_OSS_AK_ID` / `QWEN_TMP_OSS_AK_SECRET` or the standard `OSS_ACCESS_KEY_ID` / `OSS_ACCESS_KEY_SECRET`. Use a RAM user with least-privilege (`oss:PutObject` + `oss:GetObject` on target bucket only). The `--upload-files` flag is still required for vision scripts to trigger upload. If qwencloud-ops-auth is installed, see its `references/custom-oss.md` for the full setup guide.
## Input from Other Skills
When the input file comes from another skill's output (e.g., image-gen, video-gen):
- **Pass the URL directly** (e.g., `"image": "<image_url from image-gen>"`) — do NOT download the URL first
- Downloading and re-passing as a local path wastes bandwidth and triggers unnecessary base64 encoding or OSS upload
- All URL types are supported: `https://`, `oss://`, `data:`
## Thinking Mode
| Model | Thinking Default | Notes |
|-------|-----------------|-------|
| `qwen3.5-plus` / `qwen3.5-flash` | **On** | Disable with `enable_thinking: false` for simple tasks. |
| `qwen3-vl-plus` / `qwen3-vl-flash` | Off | Enable with `enable_thinking: true`. |
| `qvq-max` | Always on | **Streaming output required.** |
See [visual-reasoning.md](references/visual-reasoning.md) for details.
## OCR (qwen-vl-ocr)
Optimized for text extraction. Supports multi-language, skewed images, tables, formulas. See [ocr.md](references/ocr.md) for parameters and examples.
## Input Limits
**Images**: BMP/JPEG/PNG/TIFF/WEBP/HEIC. Min 10px sides, aspect ratio <= 200:1. Max 20 MB (URL, Qwen3.5) / 10 MB (others).
**Videos**: MP4/AVI/MKV/MOV/FLV/WMV. Duration 2s–2h (Qwen3.5) / 2s–10min (others). Max 2 GB (URL) / 10 MB (base64). fps range [0.1, 10], default 2.0.
## Error Handling
| HTTP | Meaning | Action |
|------|---------|--------|
| 401 | Invalid or missing API key | Run **qwencloud-ops-auth** if available; else prompt user to set key (non-plaintext check only) |
| 400 | Bad request (invalid format) | Verify messages format and image URL/format |
| 429 | Rate limited | Retry with exponential backoff |
| 5xx | Server error | Retry with exponential backoff |
## Output Location
Prefer the **current working directory**. Default subdirectory: `./output/qwencloud-vision/`.
**Write prohibition**: NEVER write output files into this skill's installation directory or any `skills/` hierarchy. All generated content must go to `output/` under the current working directory or a user-specified path.
## Update Check (MANDATORY Post-Execution)
After script execution, check stderr for action signals. The script automatically detects update status and emits structured signals. You MUST act on them. This section also applies when the user explicitly asks to check for updates.
### Signal: `[ACTION_REQUIRED] qwencloud-update-check skill is not installed`
When stderr contains this signal, you MUST present the following options to the user and **wait for their response before proceeding**. Do NOT silently skip this.
> The update-check skill is not installed. It keeps you informed about new qwencloud/qwencloud-ai features and bug fixes.
>
> 1) **Install now** — I'll run the install command for you
> 2) **Skip for now** — do nothing this time
> 3) **Never remind me** — permanently suppress this prompt
After the user responds:
- User chooses **1** → run: `npx skills add QwenCloud/qwencloud-ai --skill qwencloud-update-check -y`, then run: `python3 <this-skill-scripts-dir>/gossamer.py --dismiss`
- User chooses **2** → run: `python3 <this-skill-scripts-dir>/gossamer.py --dismiss`, then continue normally
- User chooses **3** → run: `python3 <this-skill-scripts-dir>/gossamer.py --never-install`, then continue normally
### Signal: `[UPDATE_AVAILABLE]`
When stderr contains this signal, you MUST append a brief update notice to your response, including the version info and the update command shown in the stderr output.
### No signal in stderr
If stderr contains neither `[ACTION_REQUIRED]` nor `[UPDATE_AVAILABLE]`, no action is needed — the skill is installed and up to date (or cached within 24h).
### Explicit user request
When the user explicitly asks to check for updates (e.g. "check for updates", "check version"):
1. Look for `qwencloud-update-check/SKILL.md` in sibling skill directories.
2. If found — run: `python3 <qwencloud-update-check-dir>/scripts/check_update.py --print-response` and report the result.
3. If not found — present the install options above.
## References
- [execution-guide.md](references/execution-guide.md) — Fallback paths (curl, code generation, autonomous)
- [curl-examples.md](references/curl-examples.md) — Curl templates (base64, multi-image, video, OCR)
- [api-guide.md](references/api-guide.md) — API supplementary guide
- [visual-reasoning.md](references/visual-reasoning.md) — QVQ visual reasoning guide
- [ocr.md](references/ocr.md) — Qwen-VL-OCR text extraction guide
- [sources.md](references/sources.md) — Official documentation URLs
标签
skill
ai