Feishu Knowledge Ingest
Use this skill to turn a Feishu folder or a single shared attachment into structured, reviewable knowledge outputs.
What this skill does
- Accept a Feishu folder link/token or a single shared attachment.
- Classify files into direct-read, download-and-parse, manual-review, or permission-blocked.
- Parse .docx and .pdf in v0.1.
- Produce report-first outputs instead of writing MEMORY.md directly.
- Preserve failures and uncertainty instead of guessing content.
Supported v0.1 scope
Inputs
- Feishu folder link or folder_token
- Single shared attachment link or token
Parsing
- .docx
- .pdf
Outputs
- ingest-report.md
- kb-items.jsonl
- failed-items.jsonl
- MEMORY.candidate.md
Required behavior
1. Distinguish Feishu native docs from uploaded attachments.
   - Native docs: doc, sheet, wiki, bitable
   - Uploaded attachments: .docx, .pdf, .pptx, other files
2. Do not claim attachment content was learned unless text was actually extracted.
3. Default to report-first. Do not write MEMORY.md in v0.1.
4. Record every failed file with a concrete reason.
5. Prefer plain-text summaries over complex Feishu cards when reporting progress.
File routing rules
Direct-read
Treat these as direct-read only when the runtime has a reliable native-reader path:
- doc
- sheet
- wiki
- bitable
Download-and-parse
Treat these as download-and-parse:
- .docx
- .pdf
Manual-review
Route here when the file is out of scope or low-confidence in v0.1:
- .pptx
- images
- scans with no extractable text
- archives
- unusual file types
Permission-blocked
Route here when listing is possible but the file cannot be downloaded or read.
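The four routes above can be sketched as a single decision function. This is a minimal illustration, not the skill's actual implementation: the names Route, route_file, readable, and has_native_reader are hypothetical, and sending native docs to manual-review when no reliable native reader exists is an assumption, since the rules above do not say where they go in that case.

```python
from enum import Enum
from pathlib import PurePosixPath


class Route(str, Enum):
    DIRECT_READ = "direct-read"
    DOWNLOAD_AND_PARSE = "download-and-parse"
    MANUAL_REVIEW = "manual-review"
    PERMISSION_BLOCKED = "permission-blocked"


NATIVE_TYPES = {"doc", "sheet", "wiki", "bitable"}
PARSEABLE_EXTENSIONS = {".docx", ".pdf"}


def route_file(name, native_type=None, *, readable=True, has_native_reader=False):
    """Pick a route for one listed file (hypothetical helper)."""
    # Listed but not downloadable/readable: permission-blocked.
    if not readable:
        return Route.PERMISSION_BLOCKED
    # Native docs are direct-read only with a reliable native reader.
    if native_type in NATIVE_TYPES:
        # Assumption: without such a reader, fall back to manual review.
        return Route.DIRECT_READ if has_native_reader else Route.MANUAL_REVIEW
    # The extension is only a routing hint, never proof of content.
    if PurePosixPath(name).suffix.lower() in PARSEABLE_EXTENSIONS:
        return Route.DOWNLOAD_AND_PARSE
    # .pptx, images, archives, and unusual types are out of scope in v0.1.
    return Route.MANUAL_REVIEW
```

A scanned PDF with no extractable text still routes to download-and-parse here; it can only be moved to manual-review after extraction comes back empty.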
Standard workflow
1. Resolve input type.
   - Folder link/token -> enumerate files.
   - Single file link/token -> build a one-file manifest.
2. Create a batch record.
   - Generate batch_id.
   - Record started_at.
3. Build a manifest.
   - File name
   - File token/link
   - File type
   - Route decision
4. Attempt extraction.
   - .docx -> use parsers/parse_docx.py
   - .pdf -> use parsers/parse_pdf.py
5. Produce structured outputs.
   - Success -> append to kb-items.jsonl
   - Failure -> append to failed-items.jsonl
6. Summarize the batch.
   - Write ingest-report.md
   - Write MEMORY.candidate.md
7. Finish the batch.
   - Record finished_at.
   - Never auto-write MEMORY.md.
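The batch loop at the core of this workflow might look roughly like the sketch below. It is not the bundled run.py: run_batch, append_jsonl, and the extract callable are hypothetical names, the manifest entry keys are assumed, and only a subset of the output fields is recorded. Logging files on other routes as "not attempted" in failed-items.jsonl is also an assumption.

```python
import json
import uuid
from datetime import datetime, timezone
from pathlib import Path


def now_iso():
    return datetime.now(timezone.utc).isoformat()


def append_jsonl(path, record):
    # One JSON object per line; append so earlier records are preserved.
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, ensure_ascii=False) + "\n")


def run_batch(manifest, extract, out_dir):
    """manifest: list of dicts with source_file, source_token, file_type, route.
    extract: hypothetical callable(entry) -> str that raises on failure."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    batch = {"batch_id": uuid.uuid4().hex, "started_at": now_iso()}
    for entry in manifest:
        base = {"batch_id": batch["batch_id"], **entry}
        try:
            if entry["route"] != "download-and-parse":
                raise RuntimeError(f"not attempted, route is {entry['route']}")
            text = extract(entry)
            if not text.strip():
                raise ValueError("no extractable text")
        except Exception as exc:
            # Every failure is recorded with a concrete reason.
            append_jsonl(out / "failed-items.jsonl", {
                **base,
                "failure_reason": type(exc).__name__,
                "error_detail": str(exc),
                "failed_at": now_iso(),
            })
        else:
            # The summary is an excerpt of extracted text, never invented.
            append_jsonl(out / "kb-items.jsonl", {
                **base,
                "summary": text[:200],
                "extracted_at": now_iso(),
            })
    batch["finished_at"] = now_iso()
    return batch
```

The function only ever writes the two JSONL files inside out_dir, so MEMORY.md cannot be touched by this step.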
Output contracts
kb-items.jsonl
Write one JSON object per successfully extracted knowledge item with at least:
- batch_id
- source_file
- source_token
- file_type
- topic
- content_type
- summary
- extracted_at
- confidence
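As an illustration of this contract, one kb-items.jsonl record could be modeled as below. The snake_case field spellings (batch_id, source_file, source_token, file_type, topic, content_type, summary, extracted_at, confidence), the KbItem class, the float type for confidence, and every value in the usage example are assumptions made for the sketch, not output from a real run.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class KbItem:
    batch_id: str
    source_file: str
    source_token: str
    file_type: str
    topic: str
    content_type: str
    summary: str
    extracted_at: str  # assumed to be an ISO 8601 timestamp
    confidence: float  # assumed to be a score between 0 and 1


def to_jsonl_line(item):
    # ensure_ascii=False keeps non-ASCII summaries readable in the file.
    return json.dumps(asdict(item), ensure_ascii=False)


example = KbItem(
    batch_id="batch-demo",
    source_file="onboarding.docx",
    source_token="tok-demo",
    file_type=".docx",
    topic="onboarding",
    content_type="procedure",
    summary="Excerpt of text actually extracted from the file.",
    extracted_at="2024-01-01T00:00:00+00:00",
    confidence=0.8,
)
line = to_jsonl_line(example)
```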
failed-items.jsonl
Write one JSON object per failed or blocked file with at least:
- batch_id
- source_file
- source_token
- file_type
- failure_reason
- error_detail
- suggested_action
- failed_at
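A failed-items.jsonl record can be built straight from the exception that stopped the file. This is a sketch under assumptions: build_failed_item is a hypothetical helper, the snake_case field spellings are assumed, and the reason codes and suggested actions are illustrative rather than a defined vocabulary.

```python
from datetime import datetime, timezone


def build_failed_item(batch_id, entry, exc):
    """Turn an exception into a failed-items record (hypothetical helper)."""
    if isinstance(exc, PermissionError):
        reason, action = "permission_blocked", "request download access, then re-run"
    elif isinstance(exc, ValueError):
        reason, action = "no_extractable_text", "route to manual review"
    else:
        reason, action = "parse_error", "inspect the file manually"
    return {
        "batch_id": batch_id,
        "source_file": entry["source_file"],
        "source_token": entry["source_token"],
        "file_type": entry["file_type"],
        "failure_reason": reason,
        # Keep the raw error so the report states the concrete cause.
        "error_detail": f"{type(exc).__name__}: {exc}",
        "suggested_action": action,
        "failed_at": datetime.now(timezone.utc).isoformat(),
    }
```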
MEMORY.candidate.md
Include:
- batch header (batch_id, started_at, finished_at, source_directory or source_file)
- grouped knowledge summaries
- source references
- confidence notes
- items needing review
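One possible way to assemble this file from extracted items is sketched below. The render_candidate function, the plain-text layout, the item keys, and the 0.6 review threshold are all assumptions for illustration; the skill does not prescribe a layout.

```python
from collections import defaultdict


def render_candidate(batch, source, items, review_threshold=0.6):
    """Render MEMORY.candidate.md text from extracted items (hypothetical)."""
    by_topic = defaultdict(list)
    for item in items:
        by_topic[item["topic"]].append(item)

    lines = [
        "Batch",
        f"batch_id: {batch['batch_id']}",
        f"started_at: {batch['started_at']}",
        f"finished_at: {batch['finished_at']}",
        f"source: {source}",
        "",
    ]
    # Grouped summaries, each with its source reference and confidence note.
    for topic in sorted(by_topic):
        lines.append(f"Topic: {topic}")
        for item in by_topic[topic]:
            lines.append(
                f"- {item['summary']} "
                f"(source: {item['source_file']}, confidence: {item['confidence']:.2f})"
            )
        lines.append("")
    # Low-confidence items are listed again so a reviewer cannot miss them.
    lines.append("Needs review")
    review = [i for i in items if i["confidence"] < review_threshold]
    lines += [f"- {i['source_file']}: {i['summary']}" for i in review] or ["- none"]
    return "\n".join(lines) + "\n"
```

The function returns a string and leaves the write to the caller, which keeps the decision about where the candidate file goes in one place.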
ingest-report.md
Include:
1. Batch summary
2. Input scope
3. File counts and routing counts
4. Successful extraction summary
5. Failures and risks
6. Recommended next actions
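The six sections can be generated from data the batch already holds. The sketch below is illustrative only: render_report, its parameters, and the dictionary keys (route, source_file, failure_reason, suggested_action) are assumed names, and the plain-text layout is one choice among many.

```python
from collections import Counter


def render_report(batch, scope, manifest, kb_items, failed_items):
    """Render ingest-report.md text as plain lines (hypothetical helper)."""
    routes = Counter(entry["route"] for entry in manifest)
    # Deduplicate suggested actions so the report stays short.
    actions = sorted({f["suggested_action"] for f in failed_items})

    lines = [
        "1. Batch summary",
        f"batch_id: {batch['batch_id']}",
        f"started_at: {batch['started_at']}",
        f"finished_at: {batch['finished_at']}",
        "2. Input scope",
        scope,
        "3. File counts and routing counts",
        f"total files: {len(manifest)}",
    ]
    lines += [f"{route}: {count}" for route, count in sorted(routes.items())]
    lines += ["4. Successful extraction summary", f"extracted items: {len(kb_items)}"]
    lines += ["5. Failures and risks"]
    lines += [f"{f['source_file']}: {f['failure_reason']}" for f in failed_items] or ["none"]
    lines += ["6. Recommended next actions"]
    lines += actions or ["none"]
    return "\n".join(lines) + "\n"
```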
Safety rules
- Never invent text that was not extracted.
- If parsing fails, say so plainly and log it.
- Treat filenames as hints only, never as proof of document contents.
- Keep sensitive data out of MEMORY.candidate.md unless the workflow explicitly allows it.
Included files
- run.py: minimal batch runner for local testing
- parsers/parse_docx.py: docx text extraction helper
- parsers/parse_pdf.py: pdf text extraction helper
- references/output_examples.md: sample output shapes and field guidance
- README.md: setup and usage notes
Skill name: feishu-knowledge-ingest