Feishu Knowledge Ingest

Use this skill to turn a Feishu folder or a single shared attachment into structured, reviewable knowledge outputs.

What this skill does

- Accept a Feishu folder link/token or a single shared attachment.
- Classify files into direct-read, download-and-parse, manual-review, or permission-blocked.
- Parse .docx and .pdf in v0.1.
- Produce report-first outputs instead of writing MEMORY.md directly.
- Preserve failures and uncertainty instead of guessing content.

Supported v0.1 scope

Inputs

- Feishu folder link or folder_token
- Single shared attachment link or token

Parsing

- .docx
- .pdf

Outputs

- ingest-report.md
- kb-items.jsonl
- failed-items.jsonl
- MEMORY.candidate.md

Required behavior

1. Distinguish Feishu native docs from uploaded attachments.
   - Native docs: doc, sheet, wiki, bitable
   - Uploaded attachments: .docx, .pdf, .pptx, other files
2. Do not claim attachment content was learned unless text was actually extracted.
3. Default to report-first. Do not write MEMORY.md in v0.1.
4. Record every failed file with a concrete reason.
5. Prefer plain-text summaries over complex Feishu cards when reporting progress.

File routing rules

Direct-read

Treat these as direct-read only when the runtime has a reliable native-reader path:

- doc
- sheet
- wiki
- bitable

Download-and-parse

Treat these as download-and-parse:

- .docx
- .pdf

Manual-review

Route here when the file is out of scope or low-confidence in v0.1:

- .pptx
- images
- scans with no extractable text
- archives
- unusual file types

Permission-blocked

Route here when listing is possible but the file cannot be downloaded or read.
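
The routing rules above can be sketched as a single function. This is a minimal illustration, not the skill's actual implementation: the names `Route`, `route_file`, `can_access`, and `has_native_reader` are hypothetical, and two choices are assumptions the rules do not spell out, namely that the permission check runs first and that a native doc without a reliable reader falls back to manual-review.

```python
from enum import Enum
from pathlib import PurePosixPath


class Route(str, Enum):
    DIRECT_READ = "direct-read"
    DOWNLOAD_AND_PARSE = "download-and-parse"
    MANUAL_REVIEW = "manual-review"
    PERMISSION_BLOCKED = "permission-blocked"


NATIVE_TYPES = {"doc", "sheet", "wiki", "bitable"}
PARSEABLE_EXTENSIONS = {".docx", ".pdf"}


def route_file(name, native_type=None, *, can_access=True, has_native_reader=False):
    """Return the route for one listed file.

    native_type is set only for Feishu native docs; uploaded attachments
    are routed by file extension. The filename is a routing hint only.
    """
    # Assumption: a file that is listed but unreadable is blocked,
    # whatever its type.
    if not can_access:
        return Route.PERMISSION_BLOCKED
    if native_type in NATIVE_TYPES:
        # Direct-read only when the runtime has a reliable native reader.
        # Assumption: otherwise hand the doc to a human.
        return Route.DIRECT_READ if has_native_reader else Route.MANUAL_REVIEW
    extension = PurePosixPath(name).suffix.lower()
    if extension in PARSEABLE_EXTENSIONS:
        return Route.DOWNLOAD_AND_PARSE
    # .pptx, images, archives, and unusual types are out of scope in v0.1.
    return Route.MANUAL_REVIEW
```

Scans with no extractable text cannot be detected from the name alone, so in this sketch a scanned .pdf would still be routed to download-and-parse and would need to be rerouted after extraction returns nothing.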

Standard workflow

1. Resolve input type.
   - Folder link/token -> enumerate files.
   - Single file link/token -> build a one-file manifest.
2. Create a batch record.
   - Generate batch_id.
   - Record started_at.
3. Build a manifest.
   - File name
   - File token/link
   - File type
   - Route decision
4. Attempt extraction.
   - .docx -> use parsers/parse_docx.py
   - .pdf -> use parsers/parse_pdf.py
5. Produce structured outputs.
   - Success -> append to kb-items.jsonl
   - Failure -> append to failed-items.jsonl
6. Summarize the batch.
   - Write ingest-report.md
   - Write MEMORY.candidate.md
7. Finish the batch.
   - Record finished_at
   - Never auto-write MEMORY.md
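
Steps 2 through 5 and step 7 of the workflow above might look roughly like the loop below. It is a sketch under stated assumptions, not the contents of run.py: `run_batch`, `append_jsonl`, the manifest keys (`name`, `token`, `file_type`, `route`), and the `skipped` counter are all hypothetical. Parsers are passed in as plain callables because the real interfaces of parsers/parse_docx.py and parsers/parse_pdf.py are not shown here.

```python
import json
import uuid
from datetime import datetime, timezone
from pathlib import Path


def now_iso():
    return datetime.now(timezone.utc).isoformat()


def append_jsonl(path, record):
    # One JSON object per line; append so earlier records are preserved.
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, ensure_ascii=False) + "\n")


def run_batch(manifest, parsers, out_dir):
    """manifest: list of dicts; parsers: file_type -> callable(entry) -> str."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    batch = {"batch_id": uuid.uuid4().hex, "started_at": now_iso()}
    counts = {"success": 0, "failure": 0, "skipped": 0}

    for entry in manifest:
        base = {
            "batch_id": batch["batch_id"],
            "source_file": entry["name"],
            "source_token": entry["token"],
            "file_type": entry["file_type"],
        }
        if entry["route"] != "download-and-parse":
            # Assumption: other routes are handled elsewhere in a fuller runner.
            counts["skipped"] += 1
            continue
        try:
            text = parsers[entry["file_type"]](entry)
            if not text.strip():
                raise ValueError("no text extracted")
        except Exception as exc:
            # Preserve the failure with a concrete reason instead of guessing.
            append_jsonl(out / "failed-items.jsonl", {
                **base,
                "failure_reason": type(exc).__name__,
                "error_detail": str(exc),
                "failed_at": now_iso(),
            })
            counts["failure"] += 1
        else:
            append_jsonl(out / "kb-items.jsonl", {
                **base,
                # Placeholder: a leading excerpt of extracted text, not a real summary.
                "summary": text[:200],
                "extracted_at": now_iso(),
            })
            counts["success"] += 1

    batch["finished_at"] = now_iso()
    batch["counts"] = counts
    return batch  # MEMORY.md is never touched here
```

Catching a broad `Exception` is deliberate in this sketch: one unreadable file should end up in failed-items.jsonl rather than abort the whole batch.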

Output contracts

kb-items.jsonl

Write one JSON object per successfully extracted knowledge item with at least:

- batch_id
- source_file
- source_token
- file_type
- topic
- content_type
- summary
- extracted_at
- confidence
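
As an illustration of this contract, the sketch below builds one record and refuses to emit it if a required field is missing. The helper name `make_kb_item` is hypothetical, the field spellings (batch_id, source_file, source_token, file_type, topic, content_type, summary, extracted_at, confidence) should be checked against references/output_examples.md, and every example value, including the string-valued confidence, is made up for illustration.

```python
import json
from datetime import datetime, timezone

KB_REQUIRED_FIELDS = (
    "batch_id", "source_file", "source_token", "file_type", "topic",
    "content_type", "summary", "extracted_at", "confidence",
)


def make_kb_item(**fields):
    """Return a kb-items.jsonl record, or raise if a required field is absent."""
    # Stamp the extraction time if the caller did not supply one.
    fields.setdefault("extracted_at", datetime.now(timezone.utc).isoformat())
    missing = [k for k in KB_REQUIRED_FIELDS if fields.get(k) in (None, "")]
    if missing:
        raise ValueError(f"kb item missing required fields: {missing}")
    return fields


# Illustrative values only; none of these come from a real batch.
item = make_kb_item(
    batch_id="batch-example",
    source_file="onboarding.docx",
    source_token="tok-example",
    file_type=".docx",
    topic="onboarding",
    content_type="procedure",
    summary="Text extracted from the document would be summarized here.",
    confidence="medium",
)
jsonl_line = json.dumps(item, ensure_ascii=False)
```

Validating before writing keeps incomplete records out of kb-items.jsonl, so downstream review can rely on every line carrying the full field set.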

failed-items.jsonl

Write one JSON object per failed or blocked file with at least:

- batch_id
- source_file
- source_token
- file_type
- failure_reason
- error_detail
- suggested_action
- failed_at
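
One way to keep failure reasons concrete is to pick them from a small fixed vocabulary and attach a suggested action to each. The sketch below assumes exactly that. The reason codes, the action texts, and the helper `make_failed_item` are hypothetical, and the field spellings (failure_reason, error_detail, suggested_action, failed_at) are assumed rather than taken from the skill's code.

```python
from datetime import datetime, timezone

# Hypothetical reason codes mapped to a next step for the reviewer.
SUGGESTED_ACTIONS = {
    "permission_blocked": "request download access, then re-run the batch",
    "parse_error": "open the file manually and check whether it is corrupt",
    "no_text_extracted": "route to manual review; the file may be a scan",
}


def make_failed_item(batch_id, entry, reason, exc=None):
    """Build one failed-items.jsonl record for a failed or blocked file."""
    return {
        "batch_id": batch_id,
        "source_file": entry["source_file"],
        "source_token": entry["source_token"],
        "file_type": entry["file_type"],
        "failure_reason": reason,
        # Keep the raw exception text so the failure is not paraphrased away.
        "error_detail": f"{type(exc).__name__}: {exc}" if exc is not None else "",
        "suggested_action": SUGGESTED_ACTIONS.get(reason, "send to manual review"),
        "failed_at": datetime.now(timezone.utc).isoformat(),
    }
```

A call such as `make_failed_item(batch_id, entry, "parse_error", exc)` inside an `except` block would record both the category and the original error message.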

MEMORY.candidate.md

Include:

- batch header (batch_id, started_at, finished_at, source_directory or source_file)
- grouped knowledge summaries
- source references
- confidence notes
- items needing review

ingest-report.md

Include:

1. Batch summary
2. Input scope
3. File counts and routing counts
4. Successful extraction summary
5. Failures and risks
6. Recommended next actions
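
The six sections above could be rendered from the batch record, the manifest, and the two JSONL record lists, as in the sketch below. It is only a sketch: `render_report`, the `source` key on the batch record, the `route` key on manifest entries, and the plain-text layout are assumptions for illustration, not the skill's actual report format.

```python
from collections import Counter


def render_report(batch, manifest, kb_items, failures):
    """Return the text of ingest-report.md as one string."""
    routes = Counter(entry["route"] for entry in manifest)
    lines = [
        "Ingest report",
        "",
        "1. Batch summary",
        f"   batch_id: {batch['batch_id']}",
        f"   started_at: {batch['started_at']}",
        f"   finished_at: {batch['finished_at']}",
        "2. Input scope",
        f"   {batch['source']}",
        "3. File counts and routing counts",
        f"   total files: {len(manifest)}",
    ]
    lines += [f"   {route}: {count}" for route, count in sorted(routes.items())]
    lines += [
        "4. Successful extraction summary",
        f"   items extracted: {len(kb_items)}",
        "5. Failures and risks",
    ]
    # List each failure with its recorded reason; say so plainly if there are none.
    lines += [f"   {f['source_file']}: {f['failure_reason']}" for f in failures] or ["   none"]
    lines.append("6. Recommended next actions")
    actions = sorted({f["suggested_action"] for f in failures})
    lines += [f"   {action}" for action in actions] or ["   none"]
    return "\n".join(lines) + "\n"
```

Counts are derived from the records themselves rather than tracked separately, so the report cannot claim more successful extractions than there are lines in kb-items.jsonl.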

Safety rules

- Never invent text that was not extracted.
- If parsing fails, say so plainly and log it.
- Treat filenames as hints only, never as proof of document contents.
- Keep sensitive data out of MEMORY.candidate.md unless the workflow explicitly allows it.

Included files

- run.py: minimal batch runner for local testing
- parsers/parse_docx.py: docx text extraction helper
- parsers/parse_pdf.py: pdf text extraction helper
- references/output_examples.md: sample output shapes and field guidance
- README.md: setup and usage notes

Skill name: feishu-knowledge-ingest