agent-survey-corpus

# Agent Survey Corpus (arXiv PDFs → text extracts) Goal: create a small, local reference library so you can **learn from real agent surveys** when refining: - C2 outline structure (paper-like sectioning) - C4 tables/claims organization - C5 writing style and density This is intentionally *not* part of the pipeline; it is an optional, repo-level toolkit. ## Inputs - `ref/agent-surveys/arxiv_ids.txt` ## Outputs - `ref/agent-surveys/pdfs/` - `ref/agent-surveys/text/` - `ref/agent-surveys/STYLE_REPORT.md` (tracked; auto-generated summary) ## Workflow 1) Edit `ref/agent-surveys/arxiv_ids.txt` (one arXiv id per line). 2) Run the downloader to fetch PDFs and extract the first N pages to text. 3) Skim the extracted text under `ref/agent-surveys/text/`: - look at section counts (H2), subsection granularity (H3), and how they transition between chapters. - identify repeated rhetorical patterns you want the pipeline writer to imitate. ## Script ### Quick Start - `python scripts/run.py --help` - `python scripts/run.py --workspace . --max-pages 20` ### All Options - `--workspace <dir>` (use `.` to write into repo root) - `--inputs <semicolon-separated>` (default: `ref/agent-surveys/arxiv_ids.txt`) - `--max-pages <N>` (default: 20) - `--sleep <seconds>` (default: 1.0) - `--overwrite` (re-download + re-extract) ### Examples - Download/extract into repo root `ref/`: - `python scripts/run.py --workspace . --max-pages 20` - Download/extract into a specific folder (treated as workspace root): - `python scripts/run.py --workspace /tmp/surveys --max-pages 30` ## Troubleshooting - **Download fails / timeout**: rerun with a larger `--sleep`, or try fewer ids. - **Text extract is empty**: the PDF may be scanned; try another survey or increase `--max-pages`. - **Files showing up in git status**: PDFs/text are ignored via `.gitignore` (`ref/**/pdfs/`, `ref/**/text/`).

agent-survey-corpus

标签

通过对话安装

方式一：安装 SkillHub 和技能

方式二：设置 SkillHub 为优先技能安装源

通过命令行安装

下载 Zip 包

agent-survey-corpus