Simple CSC
A training-free approach to Chinese Spelling Correction using LLMs as pure language models with beam search and distortion modeling.
Prerequisites
This skill is a usage guide for the simple-csc repository. Before using any commands or APIs described here, clone the repository and work from its root:
CODEBLOCK0
All paths referenced below (e.g., configs/, scripts/, data/, eval/, datasets/) are relative to this repository root. The repository contains the actual code, config files, data dictionaries, and scripts — this skill provides the knowledge of how to use them.
Quick Reference
Environment Setup
CODEBLOCK1
Qwen2/Qwen2.5 warning: Without flash-attn, set torch_dtype=torch.bfloat16 to avoid unexpected behavior.
Python API
CODEBLOCK2
Config Selection
| Config | Use Case |
|---|---|
| configs/default_config.yaml | Substitution-only CSC (v1.0.0 style) |
| configs/c2ec_config.yaml | Full C2EC with insert/delete support (v2.0.0) |
| configs/demo_config.yaml | Same as c2ec_config, used by the demo app |
Key difference: c2ec_config.yaml includes ROR (reorder), MIS (missing char), RED (redundant char) distortion types and length_immutable_chars data file.
Recommended Models
- v2.0.0 (C2EC): Qwen/Qwen2.5-7B or Qwen/Qwen2.5-14B, best performance/speed balance
- v1.0.0 (CSC): baichuan-inc/Baichuan2-13B-Base, best performance
- Always prefer Base models over Instruct/Chat variants
RESTful API Server
CODEBLOCK3
Endpoints:
- GET /health: health check
- POST /correction: {"input": "...", "stream": false, "contexts": null}
CODEBLOCK4
For detailed API parameters, config options, evaluation pipeline, and dataset formats, see references/details.md.
Key Architecture Concepts
The approach works by:
1. Using an LLM as a pure language model (left-to-right generation)
2. At each step, computing a distortion probability for each candidate token based on how similar it is to the observed (possibly erroneous) character
3. Combining LM probability with distortion probability via beam search
4. Encoding the relationship between observed and candidate characters as distortion types (identical, same pinyin, similar shape, etc.)
The prompted_model parameter adds a second probability source: a prompt-based LLM that scores candidates given the full input sentence as context, improving correction quality.
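Simple Chinese Spelling Correction
A training-free method for Chinese spelling correction that uses a large language model as a pure language model, combined with beam search and distortion modeling.
Prerequisites
This skill is a usage guide for the simple-csc repository. Before using any command or API described here, clone the repository and work from its root:
```bash
git clone https://github.com/Jacob-Zhou/simple-csc.git
cd simple-csc
```
All paths referenced below (e.g., configs/, scripts/, data/, eval/, datasets/) are relative to this repository root. The repository contains the actual code, config files, data dictionaries, and scripts; this skill explains how to use them.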
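Quick Reference
Environment Setup
```bash
# Standard setup (creates a virtual environment and installs dependencies)
bash scripts/set_environment.sh

# For Qwen3 models
bash scripts/set_environment_qwen3.sh

# Recommended: install flash-attn for better performance and lower VRAM usage
pip install flash-attn --no-build-isolation
```
Qwen2/Qwen2.5 warning: Without flash-attn, set torch_dtype=torch.bfloat16 to avoid unexpected behavior.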
Python API
```python
import torch
from lmcsc import LMCorrector

corrector = LMCorrector(
    model="Qwen/Qwen2.5-7B",
    prompted_model="Qwen/Qwen2.5-7B",  # use the same model to save VRAM
    config_path="configs/c2ec_config.yaml",  # or configs/default_config.yaml for substitution-only
    torch_dtype=torch.bfloat16,  # recommended for Qwen2/2.5 without flash-attn
)

# Single sentence
outputs = corrector("完善农产品上行发展机智。")
# => [("完善农产品上行发展机制。",)]

# Batch processing
outputs = corrector(["句子一", "句子二"])

# With contexts (a list of the same length)
outputs = corrector(["未挨前兆"], contexts=["患者提问:"])

# Streaming (batch_size=1 only)
for output in corrector("完善农产品上行发展机智。", stream=True):
    print(output[0][0], end="\r", flush=True)
```
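To see what a correction actually changed, the input and the corrected string can be compared character by character. The sketch below is an illustration only: `list_edits` is a hypothetical helper, not part of lmcsc, and the two strings are hard-coded from the single-sentence example above instead of coming from a live model call.

```python
from difflib import SequenceMatcher


def list_edits(source: str, corrected: str):
    """Return (operation, source_index, source_text, corrected_text) for each changed span."""
    matcher = SequenceMatcher(a=source, b=corrected, autojunk=False)
    edits = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":  # keep only replace / insert / delete spans
            edits.append((tag, i1, source[i1:i2], corrected[j1:j2]))
    return edits


source = "完善农产品上行发展机智。"
corrected = "完善农产品上行发展机制。"  # corrected sentence quoted in the example above
print(list_edits(source, corrected))
# → [('replace', 10, '智', '制')]
```

With a C2EC config, which also allows insertions and deletions, the same helper would report `insert` and `delete` spans as well.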
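Config Selection
| Config | Use Case |
|---|---|
| configs/default_config.yaml | Substitution-only CSC (v1.0.0 style) |
| configs/c2ec_config.yaml | Full C2EC with insert/delete support (v2.0.0) |
| configs/demo_config.yaml | Same as c2ec_config, used by the demo app |

Key difference: c2ec_config.yaml includes the ROR (reorder), MIS (missing char), and RED (redundant char) distortion types, as well as the length_immutable_chars data file.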
Recommended Models
- v2.0.0 (C2EC): Qwen/Qwen2.5-7B or Qwen/Qwen2.5-14B, best performance/speed balance
- v1.0.0 (CSC): baichuan-inc/Baichuan2-13B-Base, best performance
- Always prefer Base models over Instruct/Chat variants
RESTful API Server
```bash
python api_server.py \
    --model Qwen/Qwen2.5-7B \
    --prompted_model Qwen/Qwen2.5-7B \
    --config_path configs/c2ec_config.yaml \
    --host 127.0.0.1 --port 8000 --workers 1 --bf16
```
Endpoints:
- GET /health: health check
- POST /correction: {"input": "...", "stream": false, "contexts": null}
```bash
# Non-streaming
curl -X POST http://127.0.0.1:8000/correction \
    -H "Content-Type: application/json" \
    -d '{"input": "完善农产品上行发展机智。"}'

# With contexts
curl -X POST http://127.0.0.1:8000/correction \
    -H "Content-Type: application/json" \
    -d '{"input": "未挨前兆", "contexts": "患者提问:"}'
```
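The same requests can be issued from Python. The following is a minimal sketch using only the standard library; it assumes the server is running at the host and port shown above and that the response body is JSON, which this guide does not confirm. `build_correction_request` and `correct` are hypothetical helper names, not part of the repository.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8000"  # host and port from the server command above


def build_correction_request(text, contexts=None, stream=False):
    """Build (but do not send) a POST /correction request."""
    payload = {"input": text, "stream": stream, "contexts": contexts}
    return urllib.request.Request(
        url=f"{BASE_URL}/correction",
        data=json.dumps(payload, ensure_ascii=False).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def correct(text, contexts=None, timeout=60):
    """Send the request and return the parsed body (assumed to be JSON)."""
    request = build_correction_request(text, contexts=contexts)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Usage, with the server running:
#   result = correct("完善农产品上行发展机智。")
#   result = correct("未挨前兆", contexts="患者提问:")
```

Building the request separately from sending it makes the payload easy to inspect or log before any network call happens.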
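For detailed API parameters, config options, the evaluation pipeline, and dataset formats, see references/details.md.
Key Architecture Concepts
The approach works by:
1. Using an LLM as a pure language model (left-to-right generation)
2. At each step, computing a distortion probability for each candidate token based on how similar it is to the observed (possibly erroneous) character
3. Combining LM probability with distortion probability via beam search
4. Encoding the relationship between observed and candidate characters as distortion types (identical, same pinyin, similar shape, etc.)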
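The steps above can be illustrated with a toy, substitution-only beam search. This is a sketch under stated assumptions, not the repository's implementation: the "language model" is a hand-written lookup table, the distortion probabilities are made-up numbers, and every name here (`TOY_LM`, `SAME_PINYIN`, `beam_search_correct`) is hypothetical.

```python
import math

# Hypothetical toy LM: P(next char | prefix ending). A real setup would query
# an LLM; these numbers are invented and not normalized.
TOY_LM = {"发展机": {"制": 0.6, "智": 0.01}}
DEFAULT_LM_PROB = 0.1

# Hypothetical distortion data: character pairs that share a pronunciation.
SAME_PINYIN = {frozenset("智制")}  # both are read "zhi4"

VOCAB = ["发", "展", "机", "智", "制"]


def lm_logprob(prefix: str, char: str) -> float:
    """Log-probability of `char` following `prefix` under the toy LM."""
    for context, table in TOY_LM.items():
        if prefix.endswith(context):
            return math.log(table.get(char, DEFAULT_LM_PROB))
    return math.log(DEFAULT_LM_PROB)


def distortion_logprob(observed: str, candidate: str) -> float:
    """Log-probability of seeing `observed` when `candidate` was intended."""
    if observed == candidate:  # identical
        return math.log(0.9)
    if frozenset((observed, candidate)) in SAME_PINYIN:  # same pinyin
        return math.log(0.09)
    return math.log(0.001)  # unrelated characters


def beam_search_correct(observed: str, vocab=VOCAB, beam_size: int = 3) -> str:
    """Left-to-right beam search over one candidate character per observed character."""
    beams = [("", 0.0)]  # (hypothesis, cumulative log score)
    for obs_char in observed:
        expanded = []
        for prefix, score in beams:
            for cand in vocab:
                step = lm_logprob(prefix, cand) + distortion_logprob(obs_char, cand)
                expanded.append((prefix + cand, score + step))
        expanded.sort(key=lambda item: item[1], reverse=True)
        beams = expanded[:beam_size]
    return beams[0][0]


print(beam_search_correct("发展机智"))
# → 发展机制
```

In this toy setup the identical-character distortion keeps "发展机" unchanged, while the LM's strong preference for "制" after "发展机" outweighs the penalty for replacing "智" with a same-pinyin character.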
The prompted_model parameter adds a second probability source: a prompt-based LLM that scores candidates given the full input sentence as context, improving correction quality.
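How the extra probability source enters the score is not spelled out here. One plausible reading, shown purely as an assumption, is that its log-probability is added to the LM and distortion terms. All numbers and names below are hypothetical and chosen only to show how a second source can change the ranking.

```python
import math


def combined_logscore(lm_prob, distortion_prob, prompted_prob=None):
    """Sum of log-probabilities; the prompted term is optional (assumed additive)."""
    score = math.log(lm_prob) + math.log(distortion_prob)
    if prompted_prob is not None:
        score += math.log(prompted_prob)
    return score


# Made-up probabilities for two candidates at one decoding step.
CANDIDATES = [
    {"char": "智", "lm": 0.05, "distortion": 0.9, "prompted": 0.2},
    {"char": "制", "lm": 0.4, "distortion": 0.09, "prompted": 0.7},
]


def pick(candidates, use_prompted: bool) -> str:
    """Return the candidate character with the highest combined score."""
    best = max(
        candidates,
        key=lambda c: combined_logscore(
            c["lm"], c["distortion"], c["prompted"] if use_prompted else None
        ),
    )
    return best["char"]


print(pick(CANDIDATES, use_prompted=False))  # → 智
print(pick(CANDIDATES, use_prompted=True))   # → 制
```

With these invented numbers, the LM and distortion terms alone slightly favor keeping the observed character, and the sentence-aware score tips the decision toward the correction.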