Data Source Verification

A systematic workflow for verifying that every data point in a research dataset can be traced back to its original source paper, figure, table, or text passage.

When to Use

- Building datasets from literature (CSV, JSON, tables)
- Populating tables or plots with values from multiple papers
- Reviewing existing datasets for data integrity
- Before submitting any paper that includes compiled data

Core Rule

Every numerical value must be traceable to a specific location in the original paper. If you cannot find the value in the cited source, it is unverified and must be flagged — never included as confirmed data.

Data Provenance Chain

    Source PDF → CITATION.md (extracted values) → CSV/data table → LaTeX manuscript

Every link in this chain must be auditable. If someone asks "where did this number come from?", the answer should be: paper X, Table Y, column Z — and we have the PDF to prove it.

Citation Source Management

Project Setup (`init`)

Create a Citation_Sources/ directory for the project:

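    Citation_Sources/
      AuthorLastNameYearJournal_ShortTitle/
        AuthorYearTopic.pdf       ← original paper
        AuthorYearTopic_SI.pdf    ← supplementary information (if any)
        CITATION.md               ← structured metadata + data provenance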

CITATION.md Template

Every cited paper gets a CITATION.md file:

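    Author et al. Year - Short Description

    Title: Full title
    Authors: Author list
    Journal: Journal volume, pages (year)
    DOI: 10.xxxx/xxxxx
    Data used: [exact values extracted, with table/figure references]
    PDF: ✅ Confirmed | ❌ Not downloaded - [reason]
    Status: Confirmed | ⚠️ Needs confirmation - [reason]
    Notes: [any caveats, discrepancies, proxy assumptions]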

Adding a Source (`add`)

When adding a new citation:

1. Create the folder: `Citation_Sources/AuthorLastNameYearJournal_ShortTitle/`
2. Download the original PDF: always try to get the actual paper, not just the abstract
3. Download supplementary information if it contains data
4. Create CITATION.md from the template
5. Extract the specific values you need, recording exact table/figure/page locations
6. Mark the PDF status and verification status
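
The folder and CITATION.md steps above could be scripted along the following lines. This is a minimal sketch, not part of the workflow's own tooling: the function name `add_source`, its keyword arguments, and the stub text are assumptions for illustration, and downloading the PDF remains a manual step.

```python
from pathlib import Path

# Hypothetical CITATION.md stub; field names follow the CITATION.md template.
CITATION_TEMPLATE = """\
{short_name}

Title: {title}
Authors: {authors}
Journal: {journal}
DOI: {doi}
Data used: [exact values extracted, with table/figure references]
PDF: ❌ Not downloaded - [reason]
Status: ⚠️ Needs confirmation - [reason]
Notes: [any caveats, discrepancies, proxy assumptions]
"""


def add_source(root: Path, folder_name: str, **metadata: str) -> Path:
    """Create Citation_Sources/<folder_name>/CITATION.md, never overwriting an existing file."""
    source_dir = root / "Citation_Sources" / folder_name
    source_dir.mkdir(parents=True, exist_ok=True)
    citation = source_dir / "CITATION.md"
    if not citation.exists():
        # Unknown metadata stays blank so that gaps are visible during an audit.
        fields = {"short_name": folder_name, "title": "", "authors": "", "journal": "", "doi": ""}
        fields.update(metadata)
        citation.write_text(CITATION_TEMPLATE.format(**fields), encoding="utf-8")
    return citation
```

A new source starts as "Needs confirmation" by default, so a value only becomes confirmed after someone has checked it against the PDF.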

Verification Workflow

Step 1: Collect with Provenance

When extracting data from a paper, record ALL of the following for each value:

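    Value: 0.65 W/m·K
    Paper: Cheng et al. 2021
    DOI: 10.1002/smll.202101693
    Location: Table 2, row 3
    Method: TDTR (time-domain thermoreflectance)
    Data type: experimental
    Verified: yes, value confirmed in Table 2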

Never record a value without filling in the Location, Data type, and Verified fields.
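
One way to enforce this rule in code is to keep each value in a small record type that reports which required fields are still empty. The sketch below is illustrative only; the class name `ProvenanceRecord` and its attribute names are assumptions, not a format prescribed by this workflow.

```python
from dataclasses import dataclass

# The three fields that must never be left blank.
REQUIRED_FIELDS = ("location", "data_type", "verified")


@dataclass
class ProvenanceRecord:
    value: float
    unit: str
    paper: str
    doi: str
    location: str = ""
    method: str = ""
    data_type: str = ""  # e.g. "experimental", "DFT", "derived"
    verified: str = ""   # e.g. "yes, value confirmed in Table 2"

    def missing_fields(self) -> list[str]:
        """Return the required provenance fields that are still empty."""
        return [name for name in REQUIRED_FIELDS if not getattr(self, name).strip()]


record = ProvenanceRecord(
    value=0.65, unit="W/m·K", paper="Cheng et al. 2021",
    doi="10.1002/smll.202101693", location="Table 2, row 3",
    method="TDTR", data_type="experimental", verified="yes",
)
assert record.missing_fields() == []
```

A record with a non-empty `missing_fields()` result would be held back from the dataset until the gaps are filled.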

Step 2: Verify Against Original

For each data point:

1. Always download the original PDF: don't trust web scraping, abstracts, or secondary sources
2. Find the exact value in a table, figure, or text passage
3. Record where you found it: table number, figure number, page, equation
4. Note the measurement method: experimental technique, simulation, estimate
5. Check units: convert if needed, and note the original units
6. Track the data type: DFT-calculated, experimentally measured, or derived (note assumptions)

If the paper is behind a paywall and you cannot verify:

- Mark as `⚠️ Needs confirmation - paywall`
- Note this limitation in CITATION.md

Step 3: Cross-Check the Full Chain

Verify consistency at every step:

    Value in PDF → value in CITATION.md → value in data table/CSV → value in manuscript

Any mismatch at any step must be flagged.
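
The chain comparison can be sketched as a pairwise check between neighbouring stages. The function name `check_chain`, the stage labels, and the transposed value 0.96 are hypothetical; the point is only that each stage is compared with the one before it, so the report shows where a value first changed.

```python
import math


def check_chain(stages: dict[str, float], rel_tol: float = 1e-9) -> list[str]:
    """Compare each stage with the one before it and describe every mismatch."""
    names = list(stages)  # dicts preserve insertion order, so this is the chain order
    problems = []
    for prev, curr in zip(names, names[1:]):
        if not math.isclose(stages[prev], stages[curr], rel_tol=rel_tol):
            problems.append(f"{prev} = {stages[prev]} but {curr} = {stages[curr]}")
    return problems


# Hypothetical example: a digit transposition introduced when copying into the CSV.
chain = {"pdf": 0.69, "citation_md": 0.69, "csv": 0.96, "manuscript": 0.96}
print(check_chain(chain))  # one mismatch: between citation_md and csv
```

Because only adjacent stages are compared, the manuscript is not reported a second time here; it agrees with the CSV, and fixing the CSV is what resolves the chain.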

Step 4: Flag Problems

Mark any value with one of these status levels:

| Status | Meaning | Action |
|---|---|---|
| `Verified` | Found exact value in cited paper at stated location | Include in dataset |
| `Approximate` | | |

Step 5: Flag Discrepancies

When multiple sources report different values for the same quantity:

- Record both values with their sources
- Note the discrepancy explicitly (e.g., "B = 45 GPa (Author A, Table 2) vs B = 86 GPa (Author B, Fig. 3)")
- Check if the difference is due to measurement method, sample preparation, or temperature
- Let the user decide which value to use; do not silently pick one
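
Disagreements between sources can be surfaced automatically before a human decides. The following is a rough sketch: the function name `find_discrepancies`, the row keys, the 5% tolerance, and the placeholder material "X" are all assumptions. It deliberately returns every conflicting value with its source rather than choosing one.

```python
from collections import defaultdict


def find_discrepancies(rows: list[dict], rel_tol: float = 0.05) -> dict:
    """Group rows by (material, property) and return groups whose values differ by more than rel_tol."""
    groups = defaultdict(list)
    for row in rows:
        groups[(row["material"], row["property"])].append(row)

    report = {}
    for key, members in groups.items():
        values = [m["value"] for m in members]
        lo, hi = min(values), max(values)
        # Relative spread, measured against the larger magnitude in the group.
        if hi - lo > rel_tol * max(abs(lo), abs(hi)):
            report[key] = [(m["value"], m["source"]) for m in members]
    return report


rows = [
    {"material": "X", "property": "B_GPa", "value": 45.0, "source": "Author A, Table 2"},
    {"material": "X", "property": "B_GPa", "value": 86.0, "source": "Author B, Fig. 3"},
]
print(find_discrepancies(rows))
```

The tolerance is a judgment call that depends on the quantity; a spread that is alarming for a density may be routine for a transport property.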

Dataset Format

When building compiled datasets, always include provenance columns:

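CSV format:

    material,property,value,unit,source_paper,doi,source_location,method,data_type,verified,notes
    Li6PS5Cl,kappa,0.69,W/m·K,Cheng 2021,10.1002/smll.202101693,Table 2,TDTR,experimental,yes,
    Li3InCl6,v_longitudinal,2800,m/s,Asano 2018,10.1002/adma.201803075,NOT FOUND,unknown,unknown,MISATTRIBUTED,No Li3InCl6 sound velocity data in paper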

JSON format:

{
  "material": "Li6PS5Cl",
  "property": "thermal_conductivity",
  "value": 0.69,
  "unit": "W/m·K",
  "source": {
    "paper": "Cheng et al. 2021",
    "doi": "10.1002/smll.202101693",
    "location": "Table 2, row 5",
    "method": "TDTR",
    "dataType": "experimental",
    "verified": true
  }
}
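
A compiled CSV can be checked for empty provenance columns before it is used. This is a minimal sketch using only the standard library; the column names in `PROVENANCE_COLUMNS` are assumed English header names, and the second data row is a deliberately incomplete, made-up entry.

```python
import csv
import io

# Assumed header names for the provenance columns.
PROVENANCE_COLUMNS = ["source_paper", "doi", "source_location", "method", "data_type", "verified"]


def rows_missing_provenance(csv_text: str) -> list[tuple[int, list[str]]]:
    """Return (row_number, empty_columns) for each data row with an empty or absent provenance column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    problems = []
    for number, row in enumerate(reader, start=1):
        empty = [col for col in PROVENANCE_COLUMNS if not (row.get(col) or "").strip()]
        if empty:
            problems.append((number, empty))
    return problems


sample = (
    "material,property,value,unit,source_paper,doi,source_location,method,data_type,verified,notes\n"
    "Li6PS5Cl,kappa,0.69,W/m·K,Cheng 2021,10.1002/smll.202101693,Table 2,TDTR,experimental,yes,\n"
    # Hypothetical incomplete row: no DOI, location, method, or verification status.
    "ExampleMaterial,kappa,1.0,W/m·K,Example 2020,,,,experimental,,\n"
)
print(rows_missing_provenance(sample))
```

Row numbers count data rows only, so the header is not included in the numbering.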

Audit Workflow (`audit`)

Scan all CITATION.md files and generate a report:

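1. List all unique sources in Citation_Sources/
2. For each source, check:
   - PDF downloaded? (✅ or ❌)
   - CITATION.md complete? (all fields filled)
   - Values confirmed against PDF?
3. Generate an audit summary:

        Audit Report - [Project Name]

        Date: [timestamp]

        Summary

        - Total sources: [N]
        - PDFs confirmed: [N] / [N]
        - Values verified: [N] / [N]
        - Needs confirmation: [N]
        - Missing PDFs: [N]

        Source Details

        | Paper | PDF | Values | Verified | Status |
        |---|---|---|---|---|
        | Cheng 2021 | ✅ | 3 | 3/3 | Confirmed |
        | Asano 2018 | ✅ | 2 | 1/2 | ⚠️ 1 misattributed |
        | Wang 2014 | ❌ | 4 | 0/4 | ⚠️ Needs confirmation |

        Flagged Values

        - Li3InCl6 v_longitudinal: misattributed to Asano 2018; no LIC data in paper
        - LGPS density: conflicting values between Wang 2014 and Kamaya 2011 (2.0 vs 1.9 g/cm³)

4. Report findings: list verified, flagged, and misattributed values
5. Recommend an action for each flagged value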
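
The file-level part of the audit can be sketched as a directory scan. The function names `audit_sources` and `summarize` are hypothetical, and the check is intentionally shallow: it looks only for the presence of a PDF, the presence of CITATION.md, and a warning sign (U+26A0) in that file. Confirming values against the PDF still has to be done by a person.

```python
from pathlib import Path


def audit_sources(root: Path) -> list[dict]:
    """Report, per source folder, whether a PDF and a CITATION.md are present."""
    base = root / "Citation_Sources"
    results = []
    for folder in sorted(p for p in base.iterdir() if p.is_dir()):
        citation = folder / "CITATION.md"
        text = citation.read_text(encoding="utf-8") if citation.exists() else ""
        results.append({
            "source": folder.name,
            "has_pdf": any(folder.glob("*.pdf")),
            "has_citation": citation.exists(),
            # Assumption: a warning sign anywhere in CITATION.md marks an open issue.
            "needs_confirmation": "\u26a0" in text,
        })
    return results


def summarize(results: list[dict]) -> str:
    """Format the counts used in the summary section of the audit report."""
    total = len(results)
    pdfs = sum(r["has_pdf"] for r in results)
    flagged = sum(r["needs_confirmation"] for r in results)
    return (
        f"Total sources: {total}\n"
        f"PDFs confirmed: {pdfs} / {total}\n"
        f"Needs confirmation: {flagged}\n"
        f"Missing PDFs: {total - pdfs}"
    )
```

Calling `summarize(audit_sources(Path(".")))` from the project root would print the counts for the current project, assuming a Citation_Sources/ directory exists there.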

Export (`export`)

Generate a summary table of all data values and their provenance:

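    Data Provenance Summary - [Project Name]

    | Material | Property | Value | Unit | Source | Location | Data type | Status |
    |---|---|---|---|---|---|---|---|
    | LLZTO | κ | 0.42 | W/m·K | Muy 2019 | Table 1 | Experimental | Verified |
    | LAGP | v_avg | 4700 | m/s | Rohde 2021 | Table S2 | Experimental | Verified |
    | Li3InCl6 | v_avg | 1849 | m/s | Qiu 2025 | Table 1 | DFT | Verified |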
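
A summary table of this kind can be generated from a list of records. The helper below is a small sketch; the name `to_markdown_table` and the dictionary keys are assumptions, and missing keys are rendered as empty cells so that gaps in provenance stay visible.

```python
def to_markdown_table(rows: list[dict], columns: list[str]) -> str:
    """Render rows as a pipe table using the given column order."""
    header = "| " + " | ".join(columns) + " |"
    divider = "|" + "|".join(["---"] * len(columns)) + "|"
    body = [
        "| " + " | ".join(str(row.get(col, "")) for col in columns) + " |"
        for row in rows
    ]
    return "\n".join([header, divider, *body])


columns = ["Material", "Property", "Value", "Unit", "Source", "Location", "Data type", "Status"]
rows = [
    {"Material": "LLZTO", "Property": "κ", "Value": 0.42, "Unit": "W/m·K",
     "Source": "Muy 2019", "Location": "Table 1", "Data type": "Experimental", "Status": "Verified"},
]
print(to_markdown_table(rows, columns))
```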

Red Flags

Watch for these indicators of unreliable data:

- Value attributed to a paper but no specific table/figure cited
- "Estimated from family properties" without a clear methodology
- Values that appear in reviews but cannot be traced to original measurements
- Round numbers that suggest estimation rather than measurement (e.g., 2800 m/s vs 2837 m/s)
- Same value appearing in multiple papers without independent measurement
- DFT values presented as experimental without noting the distinction
- Discrepancies between different sources for the same quantity left unaddressed
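
The round-number flag is easy to screen for mechanically. The heuristic below is only a sketch with an assumed threshold of two trailing zeros and an assumed name `looks_rounded`; it expects finite numbers. A hit is a prompt to re-check the source, not evidence of an error, since measured values can legitimately be reported as round numbers.

```python
def looks_rounded(value: float, min_trailing_zeros: int = 2) -> bool:
    """Heuristic: True when an integer-valued number ends in at least min_trailing_zeros zeros."""
    if value == 0 or value != int(value):
        return False  # zero and non-integer values are not treated as suspicious
    return int(value) % (10 ** min_trailing_zeros) == 0


for v in (2800, 2837):
    print(v, looks_rounded(v))
```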

Rules

1. Never assume a citation is correct: always verify against the original paper
2. Always download the PDF: don't trust abstracts, web scraping, or secondary sources
3. Secondary sources are not verification: a review paper citing a value does not confirm it
4. Flag immediately when a value cannot be found in its cited source
5. Track data type: distinguish DFT-calculated, experimentally measured, and derived values
6. Flag discrepancies: when two sources disagree, note both values and let the user decide
7. Prefer measured over estimated, and clearly label the difference
8. Document everything: future researchers need the audit trail
9. When in doubt, exclude: a smaller verified dataset beats a larger unverified one
