NLP — Natural Language Processing Toolbox
A pure-bash NLP toolkit for text analysis. Tokenize text, analyze sentiment, extract named entities, summarize documents, compute text similarity, and classify text into categories, all from the command line. It needs no Python or external NLP libraries, only Bash plus standard grep and awk (see Requirements).
Commands
tokenize
Split text into words and sentences. Returns word count, sentence count, individual tokens, and the top 10 most frequent words.
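```bash
bash scripts/script.sh tokenize --input "The quick brown fox jumps over the lazy dog."
bash scripts/script.sh tokenize --file document.txt
bash scripts/script.sh tokenize --file document.txt --json
cat essay.txt | bash scripts/script.sh tokenize
```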
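To make the frequency count concrete, here is a minimal sketch of how a top-N word list can be built from standard tools. The function name `top_words` is hypothetical and the pipeline is an assumption for illustration, not the code in `scripts/script.sh`. It splits on every non-alphanumeric character, so "it's" becomes two tokens.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not taken from scripts/script.sh): reads text on
# stdin and prints "count word" lines for the N most frequent words.
top_words() {
  tr '[:upper:]' '[:lower:]' |   # case-fold so "The" and "the" merge
    tr -cs '[:alnum:]' '\n' |    # each run of non-alphanumerics becomes one newline
    grep -v '^$' |               # drop the empty token a leading separator can produce
    sort | uniq -c |             # count identical tokens
    sort -k1,1nr -k2 |           # highest count first, ties in alphabetical order
    head -n "${1:-10}" |         # keep the top N (default 10)
    awk '{ print $1, $2 }'       # strip the padding uniq adds
}

printf '%s\n' "The cat saw the dog. The dog ran." | top_words 2
# prints:
# 3 the
# 2 dog
```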
sentiment
Analyze text sentiment using built-in positive/negative word lists. Returns polarity (positive/negative/neutral), a score from -1.0 to 1.0, confidence level, and matched word counts. Handles negators (e.g., "not good" flips sentiment) and intensifiers.
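```bash
bash scripts/script.sh sentiment --input "I absolutely love this product! It's amazing."
bash scripts/script.sh sentiment --file reviews.txt
bash scripts/script.sh sentiment --input "This was not good at all" --json
```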
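The lexicon-plus-negator idea can be sketched in a few lines. Everything below is an assumption made for illustration: the three word lists are tiny and invented, the score formula `(pos - neg) / (pos + neg)` is one common choice, and a negator flips only the word that immediately follows it. The tool's real lists, intensifier handling, and confidence calculation are not shown in this document.

```bash
#!/usr/bin/env bash
# Illustrative only: these tiny lexicons are invented for the example and
# are not the built-in word lists of the tool.
POSITIVE="good great love amazing"
NEGATIVE="bad terrible boring hate"
NEGATORS="not never no"

# in_list WORD LIST: succeeds if WORD is one of the space-separated LIST items.
in_list() { case " $2 " in *" $1 "*) return 0 ;; esac; return 1; }

# Prints "<polarity> <score>" with score = (pos - neg) / (pos + neg).
sentiment_score() {
  local words word pos=0 neg=0 flip=0 hit
  words=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]' | tr -cs 'a-z' ' ')
  for word in $words; do
    if in_list "$word" "$NEGATORS"; then flip=1; continue; fi
    hit=0
    if in_list "$word" "$POSITIVE"; then hit=1; fi
    if in_list "$word" "$NEGATIVE"; then hit=-1; fi
    if [ "$flip" -eq 1 ]; then hit=$(( -hit )); fi   # "not good" counts as negative
    flip=0                                           # a negator affects only the next word
    if [ "$hit" -gt 0 ]; then pos=$((pos + 1)); fi
    if [ "$hit" -lt 0 ]; then neg=$((neg + 1)); fi
  done
  awk -v p="$pos" -v n="$neg" 'BEGIN {
    total = p + n
    score = (total > 0) ? (p - n) / total : 0
    label = (score > 0) ? "positive" : ((score < 0) ? "negative" : "neutral")
    printf "%s %.2f\n", label, score
  }'
}

sentiment_score "This was not good at all"   # prints: negative -1.00
```

Because the flip applies only to the next word, a phrase like "not very good" would still count as positive in this sketch; a real implementation would need a wider negation window.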
extract
Extract named entities from text: names/people (consecutive capitalized words), organizations (with suffixes like Inc, Corp, Ltd, LLC), dates (multiple formats), numbers with units, email addresses, and URLs.
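```bash
bash scripts/script.sh extract --input "John Smith works at Google Inc in Mountain View since 2020-01-15. Contact john@google.com"
bash scripts/script.sh extract --file article.txt --json
```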
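Pattern-based extraction of this kind maps naturally onto `grep -oP`. The sketch below uses deliberately simplified expressions that are assumptions for illustration (only ISO-style dates, only four organization suffixes); the function names are hypothetical and the tool's actual patterns may differ.

```bash
#!/usr/bin/env bash
# Simplified patterns for illustration; the real expressions used by the
# tool are not shown in this document. Each function filters stdin and
# prints one match per line. "|| true" keeps "no match" from being an error.
extract_emails()    { grep -oP '[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}' || true; }
extract_iso_dates() { grep -oP '\b\d{4}-\d{2}-\d{2}\b' || true; }
extract_orgs()      { grep -oP '\b(?:[A-Z][A-Za-z]+ )+(?:Inc|Corp|Ltd|LLC)\b' || true; }

text="John Smith works at Google Inc in Mountain View since 2020-01-15. Contact john@google.com"
printf '%s\n' "$text" | extract_emails      # john@google.com
printf '%s\n' "$text" | extract_iso_dates   # 2020-01-15
printf '%s\n' "$text" | extract_orgs        # Google Inc
```

This requires a grep built with PCRE support (GNU grep's `-P`), which matches the requirement listed for entity extraction.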
summarize
Generate a summary by extracting the most important sentences. Scores sentences by word frequency with position bonuses (first/last sentences weighted higher). Control output length with --sentences N or --ratio 0.3.
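```bash
bash scripts/script.sh summarize --file long_article.txt --sentences 3
bash scripts/script.sh summarize --input "Long text here..." --ratio 0.3
cat report.txt | bash scripts/script.sh summarize --sentences 5
bash scripts/script.sh summarize --file paper.txt --json
```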
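A frequency-based extractive summarizer can be sketched in awk. This is an illustration under stated assumptions, not the tool's code: sentences are split after `.`, `!`, or `?` followed by a space, a sentence's score is the average corpus frequency of its words, and the first and last sentences get a 1.5x bonus (the weight is invented here). The function name `summarize` is hypothetical.

```bash
#!/usr/bin/env bash
# Sketch only: reads text on stdin and prints the N best-scoring sentences
# (default 3) in their original order.
summarize() {
  awk -v n="${1:-3}" '
    { text = text $0 " " }                  # join all input lines into one string
    END {
      gsub(/[.!?] +/, "&\n", text)          # mark a boundary after each sentence end
      total = split(text, raw, "\n")
      for (r = 1; r <= total; r++) {
        sub(/ +$/, "", raw[r])
        if (raw[r] == "") continue
        sent[++count] = raw[r]
        norm[count] = tolower(raw[r])
        gsub(/[^a-z0-9 ]/, " ", norm[count])
        m = split(norm[count], w, " ")
        for (i = 1; i <= m; i++) freq[w[i]]++
      }
      for (s = 1; s <= count; s++) {
        m = split(norm[s], w, " "); sum = 0
        for (i = 1; i <= m; i++) sum += freq[w[i]]
        score[s] = (m > 0) ? sum / m : 0
        if (s == 1 || s == count) score[s] *= 1.5   # position bonus (assumed weight)
      }
      for (k = 1; k <= n && k <= count; k++) {      # pick the n best-scoring sentences
        best = 0
        for (s = 1; s <= count; s++)
          if (!(s in picked) && (best == 0 || score[s] > score[best])) best = s
        picked[best] = 1
      }
      for (s = 1; s <= count; s++) if (s in picked) print sent[s]
    }'
}

printf '%s\n' "Bash is a shell. The weather was nice. Many people like tea. Bash scripts automate shell tasks." | summarize 2
# prints:
# Bash is a shell.
# Bash scripts automate shell tasks.
```

No stop-word filtering is applied, so on longer texts common words such as "the" would dominate the scores; a practical version would exclude them.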
similarity
Compute similarity between two texts using Jaccard index (word set overlap) and cosine similarity (word frequency vectors). Returns an overall score (average of both), shared word count, and unique word count. Scale: 0.0 = completely different, 1.0 = identical.
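```bash
bash scripts/script.sh similarity --text1 "The cat sat on the mat" --text2 "A cat was sitting on a mat"
bash scripts/script.sh similarity --file1 doc1.txt --file2 doc2.txt
bash scripts/script.sh similarity --text1 "hello world" --text2 "hello world" --json
```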
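Both metrics are short enough to compute in a single awk program. The sketch below is an assumed implementation for illustration (the function name `similarity` and the output format are hypothetical): Jaccard is shared distinct words divided by all distinct words, cosine is the dot product of the word-count vectors divided by the product of their lengths, and the overall score is the average of the two, as described above.

```bash
#!/usr/bin/env bash
# Sketch only: prints Jaccard, cosine, and their average for two strings.
# Note: awk -v interprets backslash escapes in the values it receives.
similarity() {
  awk -v a="$1" -v b="$2" '
    # Fill counts[word] with term frequencies for the given text.
    function count_words(text, counts,    n, i, w) {
      text = tolower(text)
      gsub(/[^a-z0-9 ]/, " ", text)
      n = split(text, w, " ")
      for (i = 1; i <= n; i++) counts[w[i]]++
    }
    BEGIN {
      count_words(a, ca); count_words(b, cb)
      for (t in ca) {
        union++; norm_a += ca[t] ^ 2
        if (t in cb) { shared++; dot += ca[t] * cb[t] }
      }
      for (t in cb) {
        norm_b += cb[t] ^ 2
        if (!(t in ca)) union++
      }
      jaccard = (union > 0) ? shared / union : 0
      cosine = (norm_a > 0 && norm_b > 0) ? dot / sqrt(norm_a * norm_b) : 0
      printf "jaccard=%.3f cosine=%.3f overall=%.3f\n", jaccard, cosine, (jaccard + cosine) / 2
    }'
}

similarity "The cat sat on the mat" "A cat was sitting on a mat"
# prints: jaccard=0.375 cosine=0.354 overall=0.364
```

In the example, the two texts share 3 of 8 distinct words (cat, on, mat), which gives the Jaccard value of 0.375. No stemming is applied, so "sat" and "sitting" count as different words.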
classify
Classify text into user-provided categories using keyword matching. Has built-in keyword dictionaries for common categories: finance, sports, tech, politics, science, health, positive, negative, neutral. Returns the predicted category with confidence scores and hit counts for each category.
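```bash
bash scripts/script.sh classify --input "The stock market rallied today on strong earnings" --categories finance,sports,tech,politics
bash scripts/script.sh classify --file article.txt --categories positive,negative,neutral
bash scripts/script.sh classify --input "New treatment shows promise in clinical trials" --categories health,science,tech --json
```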
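Keyword classification amounts to counting dictionary hits per category and picking the largest count. The sketch below assumes small invented keyword lists and hypothetical function names (`keywords_for`, `classify`); it reports hit counts only and leaves out the confidence scores that the real command returns.

```bash
#!/usr/bin/env bash
# Invented keyword lists for illustration; the dictionaries shipped with
# the tool are not reproduced in this document.
keywords_for() {
  case "$1" in
    finance) echo "stock market earnings bank investor" ;;
    sports)  echo "game team score season coach" ;;
    science) echo "scientists particle experiment research" ;;
  esac
}

# classify TEXT CATEGORIES: prints "<category> hits=<n>" per category, then
# "predicted=<category>" (or "predicted=none" when nothing matches).
classify() {
  local text category kw hits best="none" best_hits=0
  # Lowercase, turn punctuation into spaces, and pad with spaces so that
  # every word is space-delimited.
  text=" $(printf '%s' "$1" | tr '[:upper:]' '[:lower:]' | tr -cs 'a-z0-9' ' ') "
  for category in $(printf '%s' "$2" | tr ',' ' '); do
    hits=0
    for kw in $(keywords_for "$category"); do
      case "$text" in *" $kw "*) hits=$((hits + 1)) ;; esac
    done
    echo "$category hits=$hits"
    if [ "$hits" -gt "$best_hits" ]; then best="$category"; best_hits=$hits; fi
  done
  echo "predicted=$best"
}

classify "The stock market rallied today on strong earnings" "finance,sports"
# prints:
# finance hits=3
# sports hits=0
# predicted=finance
```

Ties go to the category listed first, and a category without a keyword list always scores zero.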
Global Flags
| Flag | Description |
|---|---|
| `--json` | Output results in JSON format instead of plain text |
Input Methods
All commands accept input via three methods:
- `--input "text"`: inline text string
- `--file path.txt`: read from a file
- Pipe via stdin: `cat file.txt | bash scripts/script.sh <command>`
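One way a script can support all three input methods is sketched below. The function name `resolve_input` and the precedence order (inline text, then file, then stdin) are assumptions made for illustration; the document does not say which source wins when more than one is given.

```bash
#!/usr/bin/env bash
# Hypothetical sketch: print the text to analyze, taken from --input,
# --file, or stdin, in that assumed order of precedence.
resolve_input() {
  local text="" file=""
  while [ $# -gt 0 ]; do
    case "$1" in
      --input|--file)
        [ $# -ge 2 ] || { echo "missing value for $1" >&2; return 2; }
        if [ "$1" = "--input" ]; then text="$2"; else file="$2"; fi
        shift 2 ;;
      *) shift ;;                        # ignore options this sketch does not handle
    esac
  done
  if [ -n "$text" ]; then printf '%s\n' "$text"
  elif [ -n "$file" ]; then cat -- "$file"
  elif [ ! -t 0 ]; then cat              # stdin is a pipe or a redirect
  else echo "no input: use --input, --file, or a pipe" >&2; return 1
  fi
}

resolve_input --input "inline text"      # prints: inline text
printf 'piped text\n' | resolve_input    # prints: piped text
```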
Data Storage
This tool is stateless — it does not write to disk. All processing happens in memory and output goes to stdout/stderr.
Requirements
- Bash 4+ (uses associative arrays)
- `grep` with `-P` (Perl regex) for entity extraction
- `awk` for floating-point calculations
- No Python, no external NLP libraries; pure shell
When to Use
- Quick text analysis: tokenize a document to get word counts and frequency distributions without leaving the terminal
- Sentiment checking: analyze customer reviews, social media posts, or feedback files for positive/negative polarity
- Entity extraction: pull out names, organizations, dates, emails, and URLs from unstructured text
- Document summarization: distill long articles or reports into key sentences at a chosen ratio
- Text comparison: measure how similar two documents are using Jaccard and cosine metrics for deduplication or plagiarism detection
Examples
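```bash
# Tokenize a file and get word frequencies
bash scripts/script.sh tokenize --file essay.txt

# Sentiment analysis with JSON output
bash scripts/script.sh sentiment --input "The movie was terrible and boring" --json

# Extract entities from an article
bash scripts/script.sh extract --file news_article.txt

# Summarize a long document into 5 key sentences
bash scripts/script.sh summarize --file report.txt --sentences 5

# Compare the similarity of two documents
bash scripts/script.sh similarity --file1 original.txt --file2 revised.txt --json

# Classify text into the given categories
bash scripts/script.sh classify --input "Scientists discovered a new particle at CERN" --categories science,tech,politics,sports
```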
Output
Plain text by default with clear section headers. Use --json flag for machine-readable JSON output suitable for piping into jq or other tools. Sentiment returns polarity and score. Extract returns categorized entity lists. Similarity returns a 0.0–1.0 score.
Powered by BytesAgain | bytesagain.com | hello@bytesagain.com
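NLP: Natural Language Processing Toolbox
A pure-Bash NLP toolkit for text analysis. It supports tokenization, sentiment analysis, named entity recognition, document summarization, text similarity, and text classification, all from the command line with no external dependencies.
Commands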
tokenize
Split text into words and sentences. Returns the word count, sentence count, individual tokens, and the 10 most frequent words.
```bash
bash scripts/script.sh tokenize --input "The quick brown fox jumps over the lazy dog."
bash scripts/script.sh tokenize --file document.txt
bash scripts/script.sh tokenize --file document.txt --json
cat essay.txt | bash scripts/script.sh tokenize
```
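sentiment
Analyze text sentiment using built-in positive/negative word lists. Returns polarity (positive/negative/neutral), a score from -1.0 to 1.0, a confidence level, and matched word counts. Handles negators (e.g., "not good" flips the sentiment) and intensifiers.
```bash
bash scripts/script.sh sentiment --input "I absolutely love this product! It's amazing."
bash scripts/script.sh sentiment --file reviews.txt
bash scripts/script.sh sentiment --input "This was not good at all" --json
```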
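extract
Extract named entities from text: people's names (consecutive capitalized words), organizations (with suffixes such as Inc, Corp, Ltd, LLC), dates (multiple formats), numbers with units, email addresses, and URLs.
```bash
bash scripts/script.sh extract --input "John Smith works at Google Inc in Mountain View since 2020-01-15. Contact john@google.com"
bash scripts/script.sh extract --file article.txt --json
```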
summarize
Generate a summary by extracting the most important sentences. Sentences are scored by word frequency, with a position weight (first and last sentences are weighted higher). Control the output length with --sentences N or --ratio 0.3.
```bash
bash scripts/script.sh summarize --file long_article.txt --sentences 3
bash scripts/script.sh summarize --input "Long text here..." --ratio 0.3
cat report.txt | bash scripts/script.sh summarize --sentences 5
bash scripts/script.sh summarize --file paper.txt --json
```
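similarity
Compute the similarity of two texts using the Jaccard index (word set overlap) and cosine similarity (word frequency vectors). Returns an overall score (the average of both), the shared word count, and the unique word count. Scale: 0.0 = completely different, 1.0 = identical.
```bash
bash scripts/script.sh similarity --text1 "The cat sat on the mat" --text2 "A cat was sitting on a mat"
bash scripts/script.sh similarity --file1 doc1.txt --file2 doc2.txt
bash scripts/script.sh similarity --text1 "hello world" --text2 "hello world" --json
```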
classify
Classify text into user-provided categories using keyword matching. Built-in keyword dictionaries cover common categories: finance, sports, tech, politics, science, health, positive, negative, neutral. Returns the predicted category with confidence scores and hit counts for each category.
```bash
bash scripts/script.sh classify --input "The stock market rallied today on strong earnings" --categories finance,sports,tech,politics
bash scripts/script.sh classify --file article.txt --categories positive,negative,neutral
bash scripts/script.sh classify --input "New treatment shows promise in clinical trials" --categories health,science,tech --json
```
Global Flags
| Flag | Description |
|---|---|
| `--json` | Output results in JSON format instead of plain text |
Input Methods
All commands accept input via three methods:
- `--input "text"`: inline text string
- `--file path.txt`: read from a file
- Pipe via stdin: `cat file.txt | bash scripts/script.sh <command>`
Data Storage
This tool is stateless: it does not write to disk. All processing happens in memory, and output goes to stdout/stderr.
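Requirements
- Bash 4+ (uses associative arrays)
- `grep` with `-P` (Perl regex) support, for entity extraction
- `awk`, for floating-point calculations
- No Python, no external NLP libraries; pure shell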
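When to Use
- Quick text analysis: tokenize a document to get word counts and frequency distributions without leaving the terminal
- Sentiment checking: analyze customer reviews, social media posts, or feedback files for positive/negative polarity
- Entity extraction: pull names, organizations, dates, emails, and URLs out of unstructured text
- Document summarization: distill long articles or reports into key sentences at a chosen ratio
- Text comparison: measure how similar two documents are using Jaccard and cosine metrics, for deduplication or plagiarism checks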
Examples
```bash
# Tokenize a file and get word frequencies
bash scripts/script.sh tokenize --file essay.txt

# Sentiment analysis with JSON output
bash scripts/script.sh sentiment --input "The movie was terrible and boring" --json

# Extract entities from an article
bash scripts/script.sh extract --file news_article.txt

# Summarize a long document into 5 key sentences
bash scripts/script.sh summarize --file report.txt --sentences 5

# Compare the similarity of two documents
bash scripts/script.sh similarity --file1 original.txt --file2 revised.txt --json

# Classify text into the given categories
bash scripts/script.sh classify --input "Scientists discovered a new particle at CERN" --categories science,tech,politics,sports
```
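Output
Plain text by default, with clear section headers. Use the --json flag for machine-readable JSON output suitable for piping into jq or other tools. Sentiment returns polarity and score. Extract returns categorized entity lists. Similarity returns a score between 0.0 and 1.0.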