Simple CSC

A training-free approach to Chinese Spelling Correction (CSC) that uses LLMs as pure language models, combined with beam search and distortion modeling.

Prerequisites

This skill is a usage guide for the simple-csc repository. Before using any commands or APIs described here, clone the repository and work from its root:

    git clone https://github.com/Jacob-Zhou/simple-csc.git
    cd simple-csc

All paths referenced below (e.g., configs/, scripts/, data/, eval/, datasets/) are relative to this repository root. The repository contains the actual code, config files, data dictionaries, and scripts; this skill explains how to use them.

Quick Reference

Environment Setup

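    # Standard setup (creates a virtual environment and installs dependencies)
    bash scripts/set_environment.sh

    # For Qwen3 models
    bash scripts/set_environment_qwen3.sh

    # Recommended: install flash-attn for better performance and lower GPU memory usage
    pip install flash-attn --no-build-isolation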

Qwen2/Qwen2.5 warning: Without flash-attn, set torch_dtype=torch.bfloat16 to avoid unexpected behavior.
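The check behind this warning can be automated. The following is a minimal sketch, assuming flash-attn installs as the importable module `flash_attn`; the helper name `recommended_dtype_name` is hypothetical and not part of the repository.

```python
import importlib.util
from typing import Optional


def recommended_dtype_name(module_name: str = "flash_attn") -> Optional[str]:
    """Return "bfloat16" when the given module is missing, otherwise None.

    Hypothetical helper: None means "keep the library's default dtype".
    """
    has_module = importlib.util.find_spec(module_name) is not None
    return None if has_module else "bfloat16"


dtype_name = recommended_dtype_name()
print(dtype_name)
# With torch installed, a non-None name maps to a dtype via getattr(torch, dtype_name).
```

The helper returns a name rather than a dtype object so that the sketch runs even where torch is not installed.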

Python API

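    import torch
    from lmcsc import LMCorrector

    corrector = LMCorrector(
        model="Qwen/Qwen2.5-7B",
        prompted_model="Qwen/Qwen2.5-7B",  # use the same model to save GPU memory
        config_path="configs/c2ec_config.yaml",  # or configs/default_config.yaml for substitution only
        torch_dtype=torch.bfloat16,  # recommended for Qwen2/2.5 without flash-attn
    )

    # Single sentence
    outputs = corrector("完善农产品上行发展机智。")
    # => [('完善农产品上行发展机制。',)]

    # Batch
    outputs = corrector(["句子一", "句子二"])

    # With contexts (a list of the same length)
    outputs = corrector(["未挨前兆"], contexts=["患者提问："])

    # Streaming (batch_size=1 only)
    for output in corrector("完善农产品上行发展机智。", stream=True):
        print(output[0][0], end="\r", flush=True)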
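To post-process results, it can help to pull out the top candidate and see which characters changed. The sketch below assumes the corrector returns a list with one tuple of candidate strings per input sentence, for example `[('完善农产品上行发展机制。',)]`. The helpers `top_corrections` and `changed_chars` are hypothetical, and a hard-coded result stands in for a real model call.

```python
from typing import List, Sequence, Tuple


def top_corrections(outputs: Sequence[Tuple[str, ...]]) -> List[str]:
    """Take the first candidate from each per-sentence tuple."""
    return [candidates[0] for candidates in outputs]


def changed_chars(source: str, corrected: str) -> List[Tuple[int, str, str]]:
    """List (index, original, replacement) for equal-length substitutions."""
    if len(source) != len(corrected):
        raise ValueError("only equal-length (substitution-only) pairs are supported")
    return [(i, s, c) for i, (s, c) in enumerate(zip(source, corrected)) if s != c]


# Stand-in for a real corrector(...) result, so the sketch runs without a model.
outputs = [("完善农产品上行发展机制。",)]
best = top_corrections(outputs)[0]
print(changed_chars("完善农产品上行发展机智。", best))  # [(10, '智', '制')]
```

The equal-length restriction fits substitution-only output. Results with insertions or deletions would need a proper alignment instead.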

Config Selection

| Config | Use Case |
| --- | --- |
| configs/default_config.yaml | Substitution-only CSC (v1.0.0 style) |
| configs/c2ec_config.yaml | Full C2EC with insert/delete support (v2.0.0) |
| configs/demo_config.yaml | Same as c2ec_config, used by demo app |

Key difference: c2ec_config.yaml additionally includes the ROR (reorder), MIS (missing character), and RED (redundant character) distortion types and the length_immutable_chars data file.
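The choice between the two main configs reduces to whether insertions and deletions are needed. This is a minimal sketch; `select_config` is a hypothetical helper, and the paths are the ones named in this guide.

```python
# Config paths relative to the repository root.
SUBSTITUTION_ONLY = "configs/default_config.yaml"
FULL_C2EC = "configs/c2ec_config.yaml"


def select_config(allow_insert_delete: bool) -> str:
    """Pick the C2EC config when insert/delete corrections are wanted."""
    return FULL_C2EC if allow_insert_delete else SUBSTITUTION_ONLY


print(select_config(allow_insert_delete=True))   # configs/c2ec_config.yaml
print(select_config(allow_insert_delete=False))  # configs/default_config.yaml
```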

Recommended Models

- v2.0.0 (C2EC): Qwen/Qwen2.5-7B or Qwen/Qwen2.5-14B, the best performance/speed balance
- v1.0.0 (CSC): baichuan-inc/Baichuan2-13B-Base, the best performance
- Always prefer Base models over Instruct/Chat variants

RESTful API Server

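    python api_server.py \
        --model "Qwen/Qwen2.5-7B" \
        --prompted_model "Qwen/Qwen2.5-7B" \
        --config_path configs/c2ec_config.yaml \
        --host 127.0.0.1 --port 8000 --workers 1 --bf16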

Endpoints:

- GET /health: health check
- POST /correction: accepts a JSON body of the form {"input": "...", "stream": false, "contexts": null}

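    # Non-streaming
    curl -X POST http://127.0.0.1:8000/correction \
        -H "Content-Type: application/json" \
        -d '{"input": "完善农产品上行发展机智。"}'

    # With contexts
    curl -X POST http://127.0.0.1:8000/correction \
        -H "Content-Type: application/json" \
        -d '{"input": "未挨前兆", "contexts": "患者提问："}'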
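The same request can be issued from Python with only the standard library. This is a sketch under stated assumptions: the server is running locally on port 8000, the body fields are input, stream, and contexts as listed under Endpoints, and the response is JSON whose exact shape is not assumed here. The function names `build_payload` and `correct` are hypothetical.

```python
import json
import urllib.request
from typing import Any, Dict, Optional


def build_payload(text: str, contexts: Optional[str] = None, stream: bool = False) -> bytes:
    """Encode the JSON body for POST /correction as UTF-8 bytes."""
    body: Dict[str, Any] = {"input": text, "stream": stream, "contexts": contexts}
    return json.dumps(body, ensure_ascii=False).encode("utf-8")


def correct(text: str, contexts: Optional[str] = None,
            base_url: str = "http://127.0.0.1:8000") -> Any:
    """Send a non-streaming request and return the decoded JSON response."""
    request = urllib.request.Request(
        f"{base_url}/correction",
        data=build_payload(text, contexts),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.loads(response.read().decode("utf-8"))


# Example call (needs a running api_server.py):
#     print(correct("完善农产品上行发展机智。"))
```

Keeping payload construction separate from the network call makes the request body easy to inspect without a live server.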

For detailed API parameters, config options, evaluation pipeline, and dataset formats, see references/details.md.

Key Architecture Concepts

The approach works by:

1. Using an LLM as a pure language model (left-to-right generation)
2. At each step, computing a distortion probability for each candidate token based on how similar it is to the observed (possibly erroneous) character
3. Combining the LM probability with the distortion probability via beam search
4. Encoding the relationship between observed and candidate characters as distortion types (identical, same pinyin, similar shape, etc.)
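The steps above can be illustrated with a toy, substitution-only beam search. This is a sketch, not the repository's implementation: the prefix-conditioned lookup table stands in for an LLM, and the confusion set, distortion-type names, and all probabilities are hypothetical values chosen for illustration.

```python
import math
from typing import Dict, List, Tuple

# Hypothetical log-probabilities per distortion type (not the repository's values).
DISTORTION_LOGP: Dict[str, float] = {
    "identical": math.log(0.90),
    "same_pinyin": math.log(0.08),
    "similar_shape": math.log(0.02),
}

# Toy confusion sets: observed character -> {candidate: distortion type}.
CONFUSION: Dict[str, Dict[str, str]] = {
    "智": {"制": "same_pinyin", "知": "same_pinyin"},
}

# Toy stand-in for an LLM: log p(next character | prefix).
LM_LOGP: Dict[Tuple[str, str], float] = {
    ("", "发"): math.log(0.1),
    ("发", "展"): math.log(0.8),
    ("发展", "机"): math.log(0.2),
    ("发展机", "制"): math.log(0.6),
    ("发展机", "智"): math.log(0.01),
}
FLOOR = math.log(1e-4)  # score for continuations the toy table has not seen


def candidates(observed: str) -> Dict[str, str]:
    """The observed character itself plus its confusable alternatives."""
    return {observed: "identical", **CONFUSION.get(observed, {})}


def beam_search(sentence: str, beam_size: int = 2) -> List[Tuple[float, str]]:
    """Return the best (score, text) hypotheses, highest score first."""
    beams: List[Tuple[float, str]] = [(0.0, "")]
    for observed in sentence:
        expanded = []
        for score, prefix in beams:
            for cand, kind in candidates(observed).items():
                lm = LM_LOGP.get((prefix, cand), FLOOR)
                expanded.append((score + lm + DISTORTION_LOGP[kind], prefix + cand))
        beams = sorted(expanded, reverse=True)[:beam_size]
    return beams


best_score, best_text = beam_search("发展机智")[0]
print(best_text)  # 发展机制
```

Keeping the observed character costs little distortion but a lot of LM probability in this context, so the same-pinyin candidate 制 wins despite its distortion penalty.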

The prompted_model parameter adds a second probability source: a prompt-based LLM that scores candidates given the full input sentence as context, improving correction quality.
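A second probability source can change which candidate wins at a given step. The sketch below adds log-probabilities from the three sources. The additive rule, the `prompted_weight` parameter, and every number are assumptions made for illustration, and the repository's actual combination rule may differ.

```python
import math
from typing import Dict

# Hypothetical per-candidate probabilities for a single decoding step.
STEP: Dict[str, Dict[str, float]] = {
    "制": {"lm": 0.30, "prompted": 0.70, "distortion": 0.08},
    "智": {"lm": 0.20, "prompted": 0.05, "distortion": 0.90},
}


def score(probs: Dict[str, float], use_prompted: bool, prompted_weight: float = 1.0) -> float:
    """Sum log-probabilities; the weighting scheme is an assumption of this sketch."""
    total = math.log(probs["lm"]) + math.log(probs["distortion"])
    if use_prompted:
        total += prompted_weight * math.log(probs["prompted"])
    return total


def best(use_prompted: bool) -> str:
    """Return the highest-scoring candidate for this step."""
    return max(STEP, key=lambda cand: score(STEP[cand], use_prompted))


print(best(use_prompted=False))  # 智
print(best(use_prompted=True))   # 制
```

With these toy numbers, the language model and distortion terms alone keep the observed character. The prompted model, which sees the full sentence, tips the decision toward the correction.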
