Data Source Verification
A systematic workflow for verifying that every data point in a research dataset can be traced back to its original source paper, figure, table, or text passage.
When to Use
- Building datasets from literature (CSV, JSON, tables)
- Populating tables or plots with values from multiple papers
- Reviewing existing datasets for data integrity
- Before submitting any paper that includes compiled data
Core Rule
Every numerical value must be traceable to a specific location in the original paper. If you cannot find the value in the cited source, it is unverified and must be flagged — never included as confirmed data.
Data Provenance Chain
    Source PDF → CITATION.md (extracted values) → CSV/data table → LaTeX manuscript
Every link in this chain must be auditable. If someone asks "where did this number come from?", the answer should be: paper X, Table Y, column Z — and we have the PDF to prove it.
Citation Source Management
Project Setup (init)
Create a Citation_Sources/ directory for the project:
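    Citation_Sources/
      AuthorLastName_Year_Journal_ShortTitle/
        Author_Year_Topic.pdf      ← original paper
        Author_Year_Topic_SI.pdf   ← supplementary information (if any)
        CITATION.md                ← structured metadata + data provenance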
CITATION.md Template
Every cited paper gets a CITATION.md file:
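    Author et al. Year: Short Description
    Title: Full title
    Authors: Author list
    Journal: Journal Volume, Pages (Year)
    DOI: 10.xxxx/xxxxx
    Data used: [exact values extracted, with table/figure references]
    PDF: ✅ Confirmed | ❌ Not downloaded - [reason]
    Status: Confirmed | ⚠️ NEEDS CONFIRM - [reason]
    Notes: [any caveats, discrepancies, proxy assumptions]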
Adding a Source (add)
When adding a new citation:
1. Create the folder: `Citation_Sources/AuthorLastName_Year_Journal_ShortTitle/`
2. Download the original PDF. Always try to get the actual paper, not just the abstract.
3. Download the supplementary information if it contains data.
4. Create CITATION.md from the template.
5. Extract the specific values you need, recording exact table/figure/page locations.
6. Mark the PDF status and verification status.
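The folder and file creation steps above could be scripted. The following is a minimal sketch, not part of the original workflow: the function name `add_source`, the `SKELETON` constant, and the default field values are assumptions made for illustration, and the field names follow the CITATION.md template. Downloading the PDF and extracting values remain manual steps.

```python
from pathlib import Path

# Hypothetical skeleton; field names follow the CITATION.md template.
SKELETON = """\
Title:
Authors:
Journal:
DOI:
Data used:
PDF: ❌ Not downloaded
Status: ⚠️ NEEDS CONFIRM
Notes:
"""


def add_source(root, folder_name):
    """Create the source folder and a CITATION.md skeleton inside it.

    An existing CITATION.md is left untouched so that recorded
    provenance is never overwritten. Returns the path to CITATION.md.
    """
    folder = Path(root) / folder_name
    folder.mkdir(parents=True, exist_ok=True)
    citation = folder / "CITATION.md"
    if not citation.exists():
        citation.write_text(SKELETON, encoding="utf-8")
    return citation
```

A new source starts as "Not downloaded" and "NEEDS CONFIRM" by default, so a value cannot appear confirmed before someone has checked it against the PDF.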
Verification Workflow
Step 1: Collect with Provenance
When extracting data from a paper, record ALL of the following for each value:
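    Value: 0.65 W/m·K
    Paper: Cheng et al. 2021
    DOI: 10.1002/smll.202101693
    Location: Table 2, row 3
    Method: TDTR (time-domain thermoreflectance)
    Data type: experimental
    Verified: YES - value confirmed in Table 2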
Never record a value without filling in the Location, Data type, and Verified fields.
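One way to enforce this rule in code is to make the record type itself reject incomplete entries. The sketch below is an assumption about how such a record might be modeled; the class name `ProvenanceRecord` and its attribute names are hypothetical, chosen to match the fields listed above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProvenanceRecord:
    """One extracted value plus where it came from (hypothetical model)."""
    value: float
    unit: str
    paper: str
    doi: str
    location: str    # e.g. "Table 2, row 3"
    method: str      # e.g. "TDTR"
    data_type: str   # e.g. "experimental", "DFT", "derived"
    verified: str    # e.g. "YES - value confirmed in Table 2"

    def __post_init__(self):
        # Refuse records that leave any mandatory provenance field blank.
        for name in ("location", "data_type", "verified"):
            if not getattr(self, name).strip():
                raise ValueError(f"{name} must be filled in before recording a value")
```

With this shape, a value without a location fails at construction time instead of slipping into the dataset.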
Step 2: Verify Against Original
For each data point:
1. Always download the original PDF. Don't trust web scraping, abstracts, or secondary sources.
2. Find the exact value in a table, figure, or text passage.
3. Record where you found it: table number, figure number, page, equation.
4. Note the measurement method: experimental technique, simulation, or estimate.
5. Check units: convert if needed, and note the original units.
6. Track the data type: DFT-calculated, experimentally measured, or derived (note assumptions).
If the paper is behind a paywall and you cannot verify:
- Mark as `⚠️ NEEDS CONFIRM - paywall`
- Note this limitation in CITATION.md
Step 3: Cross-Check the Full Chain
Verify consistency at every step:
    Value in PDF → Value in CITATION.md → Value in data table/CSV → Value in manuscript
Any mismatch at any step is a flag.
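The chain check can be expressed as a comparison of each stage with the next. This is a minimal sketch under stated assumptions: the function name `find_chain_mismatches` and the list-of-pairs input are hypothetical, and the values are assumed to be already converted to the same unit.

```python
import math


def find_chain_mismatches(stages, rel_tol=1e-9):
    """Compare each stage's value with the next stage's value.

    `stages` is an ordered list of (stage_name, value) pairs running
    from the PDF to the manuscript. Returns one tuple per broken link.
    """
    mismatches = []
    for (name_a, val_a), (name_b, val_b) in zip(stages, stages[1:]):
        if not math.isclose(val_a, val_b, rel_tol=rel_tol):
            mismatches.append((name_a, val_a, name_b, val_b))
    return mismatches


# Illustrative values only: a transposition introduced when filling the CSV.
chain = [("PDF", 0.69), ("CITATION.md", 0.69), ("CSV", 0.96), ("manuscript", 0.96)]
print(find_chain_mismatches(chain))
```

Comparing neighbouring stages points to the step where the error entered, here between CITATION.md and the CSV, which is more useful than only knowing that the manuscript disagrees with the PDF.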
Step 4: Flag Problems
Mark any value with one of these status levels:
| Status | Meaning | Action |
|---|---|---|
| VERIFIED | Found exact value in cited paper at stated location | Include in dataset |
| APPROXIMATE | Value is close but not exact (e.g., read from figure) | Include with note |
| UNVERIFIED | Cannot find value in cited paper | Flag; do not use without user approval |
| MISATTRIBUTED | Cited paper does not contain this data at all | Remove from dataset, alert user immediately |
| ESTIMATED | Value was calculated or estimated, not directly measured | Include with clear label |
| ⚠️ NEEDS CONFIRM | PDF not available (paywall) or value needs double-check | Flag for manual verification |
Step 5: Flag Discrepancies
When multiple sources report different values for the same quantity:
- Record both values with their sources
- Note the discrepancy explicitly (e.g., "B = 45 GPa (Author A, Table 2) vs B = 86 GPa (Author B, Fig. 3)")
- Check if the difference is due to measurement method, sample preparation, or temperature
- Let the user decide which value to use — do not silently pick one
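Detecting such disagreements can be automated even though resolving them cannot. The sketch below is illustrative only: the function name `find_discrepancies`, the dictionary keys, and the material name `MaterialX` are hypothetical, and all values for a given quantity are assumed to share one unit.

```python
from collections import defaultdict


def find_discrepancies(records):
    """Group records by (material, property) and report conflicting values.

    Returns a dict mapping each conflicting quantity to the list of
    (value, source) pairs, so the user can decide which one to use.
    """
    by_quantity = defaultdict(list)
    for rec in records:
        by_quantity[(rec["material"], rec["property"])].append(rec)

    conflicts = {}
    for key, recs in by_quantity.items():
        if len({r["value"] for r in recs}) > 1:
            conflicts[key] = [(r["value"], r["source"]) for r in recs]
    return conflicts


records = [
    {"material": "MaterialX", "property": "B", "value": 45.0, "source": "Author A, Table 2"},
    {"material": "MaterialX", "property": "B", "value": 86.0, "source": "Author B, Fig. 3"},
]
print(find_discrepancies(records))
```

The function only reports conflicts. It deliberately does not pick a value, in line with the rule that the user decides.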
Dataset Format
When building compiled datasets, always include provenance columns:
CSV format:
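    material,property,value,unit,source_paper,doi,source_location,method,data_type,verified,notes
    Li6PS5Cl,kappa,0.69,W/m·K,Cheng 2021,10.1002/smll.202101693,Table 2,TDTR,experimental,yes,
    Li3InCl6,v_longitudinal,2800,m/s,Asano 2018,10.1002/adma.201803075,NOT FOUND,unknown,unknown,MISATTRIBUTED,No Li3InCl6 sound velocity data in paper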
JSON format:
{
  "material": "Li6PS5Cl",
  "property": "thermal_conductivity",
  "value": 0.69,
  "unit": "W/m·K",
  "source": {
    "paper": "Cheng et al. 2021",
    "doi": "10.1002/smll.202101693",
    "location": "Table 2, row 5",
    "method": "TDTR",
    "dataType": "experimental",
    "verified": true
  }
}
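A compiled CSV can be checked for empty provenance cells before it is used. This is a rough sketch, not a prescribed tool: the function name `rows_missing_provenance` and the column names in `REQUIRED` are assumptions and should be adapted to the header your dataset actually uses.

```python
import csv
import io

# Hypothetical column names; adjust to match the dataset's real header.
REQUIRED = ("source_paper", "doi", "source_location", "data_type", "verified")


def rows_missing_provenance(csv_text):
    """Return (line_number, missing_columns) for each incomplete row.

    Line numbers assume one record per physical line, with the header
    on line 1.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    problems = []
    for line_no, row in enumerate(reader, start=2):
        missing = [col for col in REQUIRED if not (row.get(col) or "").strip()]
        if missing:
            problems.append((line_no, missing))
    return problems
```

A row that fails this check is not necessarily wrong, but it cannot yet be audited and should be flagged.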
Audit Workflow (audit)
Scan all CITATION.md files and generate a report:
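1. List all unique sources in Citation_Sources/.
2. For each source, check:
   - PDF downloaded? (✅ or ❌)
   - CITATION.md complete? (all fields filled)
   - Values confirmed against PDF?
3. Generate an audit summary:

    Audit Report - [Project Name]
    Date: [timestamp]
    Summary
    - Total sources: [N]
    - PDFs confirmed: [N] / [N]
    - Values verified: [N] / [N]
    - Needs confirmation: [N]
    - Missing PDFs: [N]
    Source Details
    | Paper | PDF | Values | Verified | Status |
    |---|---|---|---|---|
    | Cheng 2021 | ✅ | 3 | 3/3 | Confirmed |
    | Asano 2018 | ✅ | 2 | 1/2 | ⚠️ 1 misattributed |
    | Wang 2014 | ❌ | 4 | 0/4 | ⚠️ NEEDS CONFIRM |
    Flagged Values
    - Li3InCl6 v_longitudinal: MISATTRIBUTED to Asano 2018; no LIC data in paper
    - LGPS density: conflicting values between Wang 2014 and Kamaya 2011 (2.0 vs 1.9 g/cm³)

4. Report findings: list verified, flagged, and misattributed values.
5. Recommend an action for each flagged value.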
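The file-level part of the audit (is there a PDF, is there a CITATION.md, is anything still marked for confirmation) can be sketched as a directory scan. The function name `audit_sources` and the report keys are hypothetical, and the check for the text "NEEDS CONFIRM" assumes that status string is written literally in CITATION.md. Confirming values against the PDF still requires a person.

```python
from pathlib import Path


def audit_sources(root):
    """Scan each source folder under `root` and report basic file status."""
    report = []
    for folder in sorted(p for p in Path(root).iterdir() if p.is_dir()):
        citation = folder / "CITATION.md"
        has_citation = citation.is_file()
        # A source counts as flagged if its CITATION.md mentions NEEDS CONFIRM.
        needs_confirm = has_citation and (
            "NEEDS CONFIRM" in citation.read_text(encoding="utf-8")
        )
        report.append({
            "source": folder.name,
            "has_pdf": any(folder.glob("*.pdf")),
            "has_citation": has_citation,
            "needs_confirm": needs_confirm,
        })
    return report
```

Calling `audit_sources("Citation_Sources")` returns one dictionary per source folder, which can then be rendered into the audit summary.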
Export (export)
Generate a summary table of all data values and their provenance:
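    Data Provenance Summary - [Project Name]
    | Material | Property | Value | Unit | Source | Location | Data Type | Status |
    |---|---|---|---|---|---|---|---|
    | LLZTO | κ | 0.42 | W/m·K | Muy 2019 | Table 1 | experimental | VERIFIED |
    | LAGP | v_avg | 4700 | m/s | Rohde 2021 | Table S2 | experimental | VERIFIED |
    | Li3InCl6 | v_avg | 1849 | m/s | Qiu 2025 | Table 1 | DFT | VERIFIED |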
Red Flags
Watch for these indicators of unreliable data:
- Value attributed to a paper but no specific table/figure cited
- "Estimated from family properties" without a clear methodology
- Values that appear in reviews but cannot be traced to original measurements
- Round numbers that suggest estimation rather than measurement (e.g., 2800 m/s vs 2837 m/s)
- Same value appearing in multiple papers without independent measurement
- DFT values presented as experimental without noting the distinction
- Discrepancies between different sources for the same quantity left unaddressed
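The round-number flag is the easiest of these to check mechanically. The heuristic below is only a sketch: the function name `looks_rounded` and the threshold of two trailing zeros are assumptions, and a hit is a prompt to re-check the source, not proof of estimation.

```python
def looks_rounded(value, min_trailing_zeros=2):
    """Return True for whole numbers ending in several zeros, e.g. 2800.

    Non-integer and zero values are never flagged. Not intended for
    NaN or infinite inputs.
    """
    if value == 0 or value != int(value):
        return False
    digits = str(abs(int(value)))
    trailing_zeros = len(digits) - len(digits.rstrip("0"))
    return trailing_zeros >= min_trailing_zeros


for v in (2800, 2837, 0.69):
    print(v, looks_rounded(v))
```

Genuinely measured values can end in zeros too, so this should feed a review list and never an automatic rejection.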
Rules
1. Never assume a citation is correct: always verify against the original paper.
2. Always download the PDF: don't trust abstracts, web scraping, or secondary sources.
3. Secondary sources are not verification: a review paper citing a value does not confirm it.
4. Flag immediately when a value cannot be found in its cited source.
5. Track data type: distinguish DFT-calculated, experimentally measured, and derived values.
6. Flag discrepancies: when two sources disagree, note both values and let the user decide.
7. Prefer measured over estimated, and clearly label the difference.
8. Document everything: future researchers need the audit trail.
9. When in doubt, exclude: a smaller verified dataset beats a larger unverified one.
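Data Source Verification
A systematic workflow for verifying that every data point in a research dataset can be traced back to its original source paper, figure, table, or text passage.
When to Use
- Building datasets from literature (CSV, JSON, tables)
- Populating tables or plots with values from multiple papers
- Reviewing existing datasets for data integrity
- Before submitting any paper that includes compiled data
Core Rule
Every numerical value must be traceable to a specific location in the original paper. If you cannot find the value in the cited source, it is unverified and must be flagged; never include it as confirmed data.
Data Provenance Chain
    Source PDF → CITATION.md (extracted values) → CSV/data table → LaTeX manuscript
Every link in this chain must be auditable. If someone asks "where did this number come from?", the answer should be: paper X, Table Y, column Z, and we have the PDF to prove it.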
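Citation Source Management
Project Setup (init)
Create a Citation_Sources/ directory for the project:
    Citation_Sources/
      AuthorLastName_Year_Journal_ShortTitle/
        Author_Year_Topic.pdf      ← original paper
        Author_Year_Topic_SI.pdf   ← supplementary information (if any)
        CITATION.md                ← structured metadata + data provenance
CITATION.md Template
Every cited paper gets a CITATION.md file:
    Author et al. Year: Short Description
    Title: Full title
    Authors: Author list
    Journal: Journal Volume, Pages (Year)
    DOI: 10.xxxx/xxxxx
    Data used: [exact values extracted, with table/figure references]
    PDF: ✅ Confirmed | ❌ Not downloaded - [reason]
    Status: Confirmed | ⚠️ NEEDS CONFIRM - [reason]
    Notes: [any caveats, discrepancies, proxy assumptions]
Adding a Source (add)
When adding a new citation:
1. Create the folder: `Citation_Sources/AuthorLastName_Year_Journal_ShortTitle/`
2. Download the original PDF. Always try to get the actual paper, not just the abstract.
3. Download the supplementary information if it contains data.
4. Create CITATION.md from the template.
5. Extract the specific values you need, recording exact table/figure/page locations.
6. Mark the PDF status and verification status.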
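Verification Workflow
Step 1: Collect with Provenance
When extracting data from a paper, record ALL of the following for each value:
    Value: 0.65 W/m·K
    Paper: Cheng et al. 2021
    DOI: 10.1002/smll.202101693
    Location: Table 2, row 3
    Method: TDTR (time-domain thermoreflectance)
    Data type: experimental
    Verified: YES - value confirmed in Table 2
Never record a value without filling in the Location, Data type, and Verified fields.
Step 2: Verify Against Original
For each data point:
1. Always download the original PDF. Don't trust web scraping, abstracts, or secondary sources.
2. Find the exact value in a table, figure, or text passage.
3. Record where you found it: table number, figure number, page, equation.
4. Note the measurement method: experimental technique, simulation, or estimate.
5. Check units: convert if needed, and note the original units.
6. Track the data type: DFT-calculated, experimentally measured, or derived (note assumptions).
If the paper is behind a paywall and you cannot verify:
- Mark as `⚠️ NEEDS CONFIRM - paywall`
- Note this limitation in CITATION.md
Step 3: Cross-Check the Full Chain
Verify consistency at every step:
    Value in PDF → Value in CITATION.md → Value in data table/CSV → Value in manuscript
Any mismatch at any step is a flag.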
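Step 4: Flag Problems
Mark any value with one of these status levels:
| Status | Meaning | Action |
|---|---|---|
| VERIFIED | Found exact value in cited paper at stated location | Include in dataset |
| APPROXIMATE | Value is close but not exact (e.g., read from figure) | Include with note |
| UNVERIFIED | Cannot find value in cited paper | Flag; do not use without user approval |
| MISATTRIBUTED | Cited paper does not contain this data at all | Remove from dataset, alert user immediately |
| ESTIMATED | Value was calculated or estimated, not directly measured | Include with clear label |
| ⚠️ NEEDS CONFIRM | PDF not available (paywall) or value needs double-check | Flag for manual verification |
Step 5: Flag Discrepancies
When multiple sources report different values for the same quantity:
- Record both values with their sources
- Note the discrepancy explicitly (e.g., "B = 45 GPa (Author A, Table 2) vs B = 86 GPa (Author B, Fig. 3)")
- Check if the difference is due to measurement method, sample preparation, or temperature
- Let the user decide which value to use; do not silently pick one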
Dataset Format
When building compiled datasets, always include provenance columns:
CSV format:
    material,property,value,unit,source_paper,doi,source_location,method,data_type,verified,notes
    Li6PS5Cl,kappa,0.69,W/m·K,Cheng 2021,10.1002/smll.202101693,Table 2,TDTR,experimental,yes,
    Li3InCl6,v_longitudinal,2800,m/s,Asano 2018,10.1002/adma.201803075,NOT FOUND,unknown,unknown,MISATTRIBUTED,No Li3InCl6 sound velocity data in paper
JSON format:
{
  "material": "Li6PS5Cl",
  "property": "thermal_conductivity",
  "value": 0.69,
  "unit": "W/m·K",
  "source": {
    "paper": "Cheng et al. 2021",
    "doi": "10.1002/smll.202101693",
    "location": "Table 2, row 5",
    "method": "TDTR",
    "dataType": "experimental",
    "verified": true
  }
}
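Audit Workflow (audit)
Scan all CITATION.md files and generate a report:
1. List all unique sources in Citation_Sources/.
2. For each source, check:
   - PDF downloaded? (✅ or ❌)
   - CITATION.md complete? (all fields filled)
   - Values confirmed against PDF?
3. Generate an audit summary:

    Audit Report - [Project Name]
    Date: [timestamp]
    Summary
    - Total sources: [N]
    - PDFs confirmed: [N] / [N]
    - Values verified: [N] / [N]
    - Needs confirmation: [N]
    - Missing PDFs: [N]
    Source Details
    | Paper | PDF | Values | Verified | Status |
    |---|---|---|---|---|
    | Cheng 2021 | ✅ | 3 | 3/3 | Confirmed |
    | Asano 2018 | ✅ | 2 | 1/2 | ⚠️ 1 misattributed |
    | Wang 2014 | ❌ | 4 | 0/4 | ⚠️ NEEDS CONFIRM |
    Flagged Values
    - Li3InCl6 v_longitudinal: MISATTRIBUTED to Asano 2018; no LIC data in paper
    - LGPS density: conflicting values between Wang 2014 and Kamaya 2011 (2.0 vs 1.9 g/cm³)

4. Report findings: list verified, flagged, and misattributed values.
5. Recommend an action for each flagged value.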
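Export (export)
Generate a summary table of all data values and their provenance:
    Data Provenance Summary - [Project Name]
    | Material | Property | Value | Unit | Source | Location | Data Type | Status |
    |---|---|---|---|---|---|---|---|
    | LLZTO | κ | 0.42 | W/m·K | Muy 2019 | Table 1 | experimental | VERIFIED |
    | LAGP | v_avg | 4700 | m/s | Rohde 2021 | Table S2 | experimental | VERIFIED |
    | Li3InCl6 | v_avg | 1849 | m/s | Qiu 2025 | Table 1 | DFT | VERIFIED |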
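Red Flags
Watch for these indicators of unreliable data:
- Value attributed to a paper but no specific table/figure cited
- "Estimated from family properties" without a clear methodology
- Values that appear in reviews but cannot be traced to original measurements
- Round numbers that suggest estimation rather than measurement (e.g., 2800 m/s vs 2837 m/s)
- Same value appearing in multiple papers without independent measurement
- DFT values presented as experimental without noting the distinction
- Different sources for the same quantity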