NLP — Natural Language Processing Toolbox

A pure-Bash NLP toolkit for text analysis. Tokenize text, analyze sentiment, extract named entities, summarize documents, compute text similarity, and classify text into categories, all from the command line. It needs no Python or NLP libraries, only Bash 4+, grep, and awk.

Commands

tokenize

Split text into words and sentences. Returns word count, sentence count, individual tokens, and the top 10 most frequent words.

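```bash
bash scripts/script.sh tokenize --input "The quick brown fox jumps over the lazy dog."
bash scripts/script.sh tokenize --file document.txt
bash scripts/script.sh tokenize --file document.txt --json
cat essay.txt | bash scripts/script.sh tokenize
```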
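To make the tokenize output concrete, here is a minimal sketch of how word counts, sentence counts, and a top-N frequency list can be produced with standard tools. The helper names (`words`, `word_count`, `sentence_count`, `top_words`) are hypothetical and are not taken from scripts/script.sh. The sentence count is only an approximation that counts runs of `.`, `!`, or `?`.

```bash
#!/usr/bin/env bash
# Hypothetical helpers for illustration; they are not taken from scripts/script.sh.

# Emit one lowercase word per line (letters, digits, and apostrophes only).
words() {
  tr '[:upper:]' '[:lower:]' | tr -cs "[:alnum:]'" '\n' | sed '/^$/d'
}

# Count the words on stdin.
word_count() {
  words | wc -l | tr -d ' '
}

# Rough sentence count: the number of runs of ., ! or ? on stdin.
sentence_count() {
  grep -oE '[.!?]+' | wc -l | tr -d ' '
}

# Print the N most frequent words (default 10) as "word count".
top_words() {
  words | sort | uniq -c | sort -k1,1nr -k2 | head -n "${1:-10}" | awk '{ print $2, $1 }'
}

text="The quick brown fox jumps over the lazy dog. The dog sleeps."
printf '%s\n' "$text" | word_count
printf '%s\n' "$text" | sentence_count
printf '%s\n' "$text" | top_words 3
```

For the sample text this prints a word count of 12, a sentence count of 2, and `the 3`, `dog 2`, `brown 1` as the three most frequent words. Ties are broken alphabetically.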

sentiment

Analyze text sentiment using built-in positive/negative word lists. Returns polarity (positive/negative/neutral), a score from -1.0 to 1.0, confidence level, and matched word counts. Handles negators (e.g., "not good" flips sentiment) and intensifiers.

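```bash
bash scripts/script.sh sentiment --input "I absolutely love this product! It's amazing."
bash scripts/script.sh sentiment --file reviews.txt
bash scripts/script.sh sentiment --input "This was not good at all" --json
```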
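The lexicon approach described above can be sketched in a few lines of shell and awk. This is an illustration under stated assumptions, not the tool's implementation: the function name `sentiment_sketch`, the tiny word lists, and the scoring formula `(pos - neg) / (pos + neg)` are all made up for the example. Intensifiers are left out, and a negator flips only the word immediately after it.

```bash
#!/usr/bin/env bash
# Illustrative lexicon scorer; names, word lists, and formula are assumptions.

# Usage: sentiment_sketch "some text"  ->  prints "<polarity> <score>"
sentiment_sketch() {
  printf '%s\n' "$1" | tr '[:upper:]' '[:lower:]' | tr -cs '[:alnum:]' '\n' | awk '
    BEGIN {
      # Tiny illustrative lexicon; the real word lists are not shown in the docs.
      n = split("good great love amazing", P, " ");  for (i = 1; i <= n; i++) pol[P[i]] = 1
      n = split("bad terrible boring hate", N, " "); for (i = 1; i <= n; i++) pol[N[i]] = -1
      n = split("not never no", G, " ");             for (i = 1; i <= n; i++) negator[G[i]] = 1
    }
    $0 == ""      { next }              # skip empty tokens
    $0 in negator { flip = 1; next }    # remember to flip the next word
    {
      v = ($0 in pol) ? pol[$0] : 0
      if (flip) v = -v
      flip = 0
      if (v > 0) pos++
      if (v < 0) neg++
    }
    END {
      s = (pos + neg) ? (pos - neg) / (pos + neg) : 0
      label = (s > 0) ? "positive" : ((s < 0) ? "negative" : "neutral")
      printf "%s %.2f\n", label, s
    }'
}

sentiment_sketch "I absolutely love this product! It's amazing."
sentiment_sketch "This was not good at all"
```

With these word lists the first call prints `positive 1.00` and the second prints `negative -1.00`, because "not" flips the polarity of "good".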

extract

Extract named entities from text: names/people (consecutive capitalized words), organizations (with suffixes like Inc, Corp, Ltd, LLC), dates (multiple formats), numbers with units, email addresses, and URLs.

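```bash
bash scripts/script.sh extract --input "John Smith works at Google Inc in Mountain View since 2020-01-15. Contact john@google.com"
bash scripts/script.sh extract --file article.txt --json
```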
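Pattern-based extraction of this kind can be approximated with `grep -o`. The sketch below uses simplified extended-regex (`-E`) patterns for portability, whereas the tool itself is documented as relying on `grep -P`. The function names and the patterns are assumptions, and only ISO-style dates are covered.

```bash
#!/usr/bin/env bash
# Simplified, assumed patterns; the real expressions are not shown in the docs.

# Each helper reads text on stdin and prints one match per line.
extract_emails() { grep -oE '[[:alnum:]._%+-]+@[[:alnum:].-]+\.[[:alpha:]]{2,}'; }
extract_urls()   { grep -oE 'https?://[^[:space:]]+'; }
extract_dates()  { grep -oE '[0-9]{4}-[0-9]{2}-[0-9]{2}'; }   # ISO dates only
# One or more capitalized words followed by a company suffix.
extract_orgs()   { grep -oE '([[:upper:]][[:alnum:]]+ )+(Inc|Corp|Ltd|LLC)'; }

text="John Smith works at Google Inc in Mountain View since 2020-01-15. Contact john@google.com"
printf '%s\n' "$text" | extract_orgs
printf '%s\n' "$text" | extract_dates
printf '%s\n' "$text" | extract_emails
```

For the sample sentence the three calls print `Google Inc`, `2020-01-15`, and `john@google.com`. Real-world text needs stricter patterns, for example word boundaries after the company suffix.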

summarize

Generate a summary by extracting the most important sentences. Scores sentences by word frequency with position bonuses (first/last sentences weighted higher). Control output length with --sentences N or --ratio 0.3.

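```bash
bash scripts/script.sh summarize --file long_article.txt --sentences 3
bash scripts/script.sh summarize --input "Long text here..." --ratio 0.3
cat report.txt | bash scripts/script.sh summarize --sentences 5
bash scripts/script.sh summarize --file paper.txt --json
```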
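The scoring idea (word frequency plus a position bonus) can be illustrated with a short awk program. This is a sketch under assumptions, not the tool's algorithm: the function name `summarize_sketch`, the 1.5x bonus for the first and last sentence, the per-word averaging, and the naive sentence splitter are all choices made for the example. It has no stop-word list and no `--ratio` handling.

```bash
#!/usr/bin/env bash
# Illustrative extractive summarizer; weights and splitting rules are assumptions.

# Usage: summarize_sketch N < text   (prints the N best sentences in original order)
summarize_sketch() {
  awk -v keep="${1:-3}" '
    # Split sentence s into lowercase words in out[]; return the word count.
    function bag(s, out,    t) {
      t = tolower(s)
      gsub(/[^a-z0-9 ]/, " ", t)
      return split(t, out, " ")
    }
    { text = text " " $0 }
    END {
      # 1. Cut the text into sentences at runs of . ! ?
      while (match(text, /[.!?]+/)) {
        s = substr(text, 1, RSTART + RLENGTH - 1)
        text = substr(text, RSTART + RLENGTH)
        sub(/^ +/, "", s)
        if (s != "") sent[++n] = s
      }
      sub(/^ +/, "", text)
      if (text != "") sent[++n] = text
      # 2. Count how often each word occurs in the whole document.
      for (i = 1; i <= n; i++) {
        m = bag(sent[i], w)
        for (j = 1; j <= m; j++) freq[w[j]]++
      }
      # 3. Score = average word frequency, times 1.5 for first and last sentence.
      for (i = 1; i <= n; i++) {
        m = bag(sent[i], w); total = 0
        for (j = 1; j <= m; j++) total += freq[w[j]]
        score[i] = (m ? total / m : 0) * ((i == 1 || i == n) ? 1.5 : 1)
      }
      # 4. Pick the top sentences, then print them in document order.
      for (k = 1; k <= keep && k <= n; k++) {
        best = 0
        for (i = 1; i <= n; i++)
          if (!(i in picked) && (best == 0 || score[i] > score[best])) best = i
        picked[best] = 1
      }
      for (i = 1; i <= n; i++) if (i in picked) print sent[i]
    }'
}

doc="Bash is a shell. Bash scripts automate shell tasks. The weather is nice. Use bash for shell scripts."
printf '%s\n' "$doc" | summarize_sketch 2
```

For the sample text the two selected sentences are "Bash is a shell." and "Use bash for shell scripts.": both contain the most frequent words and both receive the position bonus.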

similarity

Compute similarity between two texts using Jaccard index (word set overlap) and cosine similarity (word frequency vectors). Returns an overall score (average of both), shared word count, and unique word count. Scale: 0.0 = completely different, 1.0 = identical.

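```bash
bash scripts/script.sh similarity --text1 "The cat sat on the mat" --text2 "A cat was sitting on a mat"
bash scripts/script.sh similarity --file1 doc1.txt --file2 doc2.txt
bash scripts/script.sh similarity --text1 "hello world" --text2 "hello world" --json
```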
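Both metrics are easy to compute by hand, and the sketch below shows one way to do it in awk. The function name `similarity_sketch` and the output format are hypothetical. Jaccard is the number of shared words divided by the size of the union of both word sets, and cosine is the dot product of the word-count vectors divided by the product of their lengths.

```bash
#!/usr/bin/env bash
# Illustrative Jaccard and cosine computation; name and output format are assumptions.

# Usage: similarity_sketch "first text" "second text"
similarity_sketch() {
  awk -v a="$1" -v b="$2" '
    # Count lowercase words of s into the array c.
    function tally(s, c,    i, m, w) {
      s = tolower(s)
      gsub(/[^a-z0-9 ]/, " ", s)
      m = split(s, w, " ")
      for (i = 1; i <= m; i++) c[w[i]]++
    }
    BEGIN {
      tally(a, A); tally(b, B)
      for (k in A) {
        uni[k] = 1
        na += A[k] * A[k]
        if (k in B) { shared++; dot += A[k] * B[k] }
      }
      for (k in B) { uni[k] = 1; nb += B[k] * B[k] }
      for (k in uni) total++
      jaccard = total ? shared / total : 0
      cosine  = (na && nb) ? dot / (sqrt(na) * sqrt(nb)) : 0
      printf "jaccard=%.3f cosine=%.3f overall=%.3f\n", jaccard, cosine, (jaccard + cosine) / 2
    }'
}

similarity_sketch "The cat sat on the mat" "A cat was sitting on a mat"
```

For this pair the texts share 3 of 8 distinct words (cat, on, mat), so the sketch prints `jaccard=0.375 cosine=0.354 overall=0.364`. Identical texts score 1.000 on all three values.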

classify

Classify text into user-provided categories using keyword matching. Has built-in keyword dictionaries for common categories: finance, sports, tech, politics, science, health, positive, negative, neutral. Returns the predicted category with confidence scores and hit counts for each category.

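```bash
bash scripts/script.sh classify --input "The stock market rallied today on strong earnings" --categories finance,sports,tech,politics
bash scripts/script.sh classify --file article.txt --categories positive,negative,neutral
bash scripts/script.sh classify --input "New treatment shows promise in clinical trials" --categories health,science,tech --json
```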
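Keyword-based classification can be sketched as counting dictionary hits per category and picking the category with the most hits. Everything specific in the example is an assumption: the function name `classify_sketch`, the short keyword lists, and the confidence formula (hits for the winner divided by hits across all requested categories). The tool's built-in dictionaries are not documented here.

```bash
#!/usr/bin/env bash
# Illustrative keyword classifier; dictionaries and confidence formula are assumptions.

# Usage: classify_sketch "text" "category1,category2,..."
classify_sketch() {
  awk -v text="$1" -v cats="$2" '
    BEGIN {
      # Tiny assumed keyword lists; the built-in dictionaries are not documented.
      kw["finance"] = "stock market earnings bank investor"
      kw["sports"]  = "game team score player season"
      kw["tech"]    = "software computer app startup chip"
      kw["science"] = "scientists particle experiment research physics"

      # Count each lowercase word of the input text.
      text = tolower(text)
      gsub(/[^a-z0-9 ]/, " ", text)
      n = split(text, w, " ")
      for (i = 1; i <= n; i++) seen[w[i]]++

      # Sum keyword hits per requested category and track the best one.
      best = "unknown"; best_hits = 0; total = 0
      m = split(cats, c, ",")
      for (i = 1; i <= m; i++) {
        hits = 0
        k = split(kw[c[i]], list, " ")
        for (j = 1; j <= k; j++) if (list[j] in seen) hits += seen[list[j]]
        printf "%s hits=%d\n", c[i], hits
        total += hits
        if (hits > best_hits) { best = c[i]; best_hits = hits }
      }
      printf "predicted=%s confidence=%.2f\n", best, (total ? best_hits / total : 0)
    }'
}

classify_sketch "Scientists discovered a new particle at CERN" "science,tech,politics,sports"
```

With these lists the sample text hits two science keywords and nothing else, so the last line of output is `predicted=science confidence=1.00`. A category without a dictionary, such as politics here, simply scores zero hits.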

Global Flags

Flag	Description
--json	Output results in JSON format instead of plain text

Input Methods

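Commands accept input in three ways (similarity, which compares two texts, is shown in the examples with --text1/--text2 or --file1/--file2 instead):

1. --input "text": an inline text string
2. --file path.txt: read from a file
3. Pipe via stdin: cat file.txt | bash scripts/script.sh <command>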

Data Storage

This tool is stateless — it does not write to disk. All processing happens in memory and output goes to stdout/stderr.

Requirements

- Bash 4+ (uses associative arrays)
- grep with -P (Perl regex) for entity extraction
- awk for floating-point calculations
- No Python and no external NLP libraries: pure shell

When to Use

1. Quick text analysis: tokenize a document to get word counts and frequency distributions without leaving the terminal
2. Sentiment checking: analyze customer reviews, social media posts, or feedback files for positive/negative polarity
3. Entity extraction: pull out names, organizations, dates, emails, and URLs from unstructured text
4. Document summarization: distill long articles or reports into key sentences at a chosen ratio
5. Text comparison: measure how similar two documents are using Jaccard and cosine metrics for deduplication or plagiarism detection

Examples

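```bash
# Tokenize a file and get word frequencies
bash scripts/script.sh tokenize --file essay.txt

# Analyze sentiment with JSON output
bash scripts/script.sh sentiment --input "The movie was terrible and boring" --json

# Extract entities from an article
bash scripts/script.sh extract --file news_article.txt

# Summarize a long document into 5 key sentences
bash scripts/script.sh summarize --file report.txt --sentences 5

# Compare the similarity of two documents
bash scripts/script.sh similarity --file1 original.txt --file2 revised.txt --json

# Classify text into the given categories
bash scripts/script.sh classify --input "Scientists discovered a new particle at CERN" --categories science,tech,politics,sports
```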

Output

Output is plain text by default, with clear section headers. Use the --json flag for machine-readable JSON suitable for piping into jq or other tools. sentiment returns polarity and score, extract returns categorized entity lists, and similarity returns a score from 0.0 to 1.0.

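NLP: Natural Language Processing Toolbox

A pure-Bash NLP toolkit for text analysis. It supports tokenization, sentiment analysis, named entity recognition, document summarization, text similarity, and text classification, all from the command line with no external dependencies.

Commands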

tokenize

Split text into words and sentences. Returns the word count, sentence count, individual tokens, and the 10 most frequent words.

```bash
bash scripts/script.sh tokenize --input "The quick brown fox jumps over the lazy dog."
bash scripts/script.sh tokenize --file document.txt
bash scripts/script.sh tokenize --file document.txt --json
cat essay.txt | bash scripts/script.sh tokenize
```

sentiment

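Analyze text sentiment using built-in positive/negative word lists. Returns polarity (positive/negative/neutral), a score between -1.0 and 1.0, a confidence level, and the number of matched words. Supports negators (for example, "not good" flips the sentiment) and intensifiers.

```bash
bash scripts/script.sh sentiment --input "I absolutely love this product! It's amazing."
bash scripts/script.sh sentiment --file reviews.txt
bash scripts/script.sh sentiment --input "This was not good at all" --json
```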

extract

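Extract named entities from text: person names (consecutive capitalized words), organizations (with suffixes such as Inc, Corp, Ltd, LLC), dates (multiple formats), numbers with units, email addresses, and URLs.

```bash
bash scripts/script.sh extract --input "John Smith works at Google Inc in Mountain View since 2020-01-15. Contact john@google.com"
bash scripts/script.sh extract --file article.txt --json
```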

summarize

Generate a summary by extracting the most important sentences. Sentences are scored by word frequency, with a position weight added (first and last sentences weigh more). Control the output length with --sentences N or --ratio 0.3.

```bash
bash scripts/script.sh summarize --file long_article.txt --sentences 3
bash scripts/script.sh summarize --input "Long text here..." --ratio 0.3
cat report.txt | bash scripts/script.sh summarize --sentences 5
bash scripts/script.sh summarize --file paper.txt --json
```

similarity

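Compute the similarity of two texts using the Jaccard index (word set overlap) and cosine similarity (word frequency vectors). Returns an overall score (the average of the two), the shared word count, and the unique word count. Range: 0.0 = completely different, 1.0 = identical.

```bash
bash scripts/script.sh similarity --text1 "The cat sat on the mat" --text2 "A cat was sitting on a mat"
bash scripts/script.sh similarity --file1 doc1.txt --file2 doc2.txt
bash scripts/script.sh similarity --text1 "hello world" --text2 "hello world" --json
```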

classify

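Classify text into user-provided categories using keyword matching. Keyword dictionaries are built in for common categories: finance, sports, tech, politics, science, health, positive, negative, neutral. Returns the predicted category along with its confidence score and the hit count for each category.

```bash
bash scripts/script.sh classify --input "The stock market rallied today on strong earnings" --categories finance,sports,tech,politics
bash scripts/script.sh classify --file article.txt --categories positive,negative,neutral
bash scripts/script.sh classify --input "New treatment shows promise in clinical trials" --categories health,science,tech --json
```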

Global Flags

Flag	Description
--json	Output results in JSON format instead of plain text

Input Methods

All commands support three input methods:

1. --input "text": an inline text string
2. --file path.txt: read from a file
3. Pipe via stdin: cat file.txt | bash scripts/script.sh <command>

Data Storage

This tool is stateless: it does not write to disk. All processing happens in memory, and output goes to stdout/stderr.

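Requirements

- Bash 4+ (uses associative arrays)
- grep with -P (Perl regex) support, for entity extraction
- awk, for floating-point calculations
- No Python and no external NLP libraries: pure shell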

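When to Use

1. Quick text analysis: tokenize a document to get word counts and frequency distributions without leaving the terminal
2. Sentiment checking: analyze the positive/negative polarity of customer reviews, social media posts, or feedback files
3. Entity extraction: pull names, organizations, dates, emails, and URLs out of unstructured text
4. Document summarization: distill long articles or reports into key sentences at a chosen ratio
5. Text comparison: measure the similarity of two documents with Jaccard and cosine metrics, for deduplication or plagiarism checking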

Examples

```bash
# Tokenize a file and get word frequencies
bash scripts/script.sh tokenize --file essay.txt

# Analyze sentiment with JSON output
bash scripts/script.sh sentiment --input "The movie was terrible and boring" --json

# Extract entities from an article
bash scripts/script.sh extract --file news_article.txt

# Summarize a long document into 5 key sentences
bash scripts/script.sh summarize --file report.txt --sentences 5

# Compare the similarity of two documents
bash scripts/script.sh similarity --file1 original.txt --file2 revised.txt --json

# Classify text into the given categories
bash scripts/script.sh classify --input "Scientists discovered a new particle at CERN" --categories science,tech,politics,sports
```

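Output

Output is plain text by default, with clear section headers. Use the --json flag for machine-readable JSON that is easy to pipe into jq or other tools. Sentiment analysis returns polarity and score. Entity extraction returns categorized entity lists. Similarity returns a score from 0.0 to 1.0.

Provided by BytesAgain | bytesagain.com | hello@bytesagain.com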
