NLTK Reading Notes: Language Processing and Python
From the Natural Language Processing Encyclopedia
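0. Questions this chapter addresses:
- What can we achieve by combining simple programming techniques with large quantities of text?
- How can we extract the key words and phrases that make up the style and content of a text?
- What tools and techniques does Python provide for this kind of work?
- What are some of the interesting challenges of NLP?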
1. Computing with language (texts and words)
To an NLP program, a text is simply raw data that serves as input: a sequence of words and punctuation.
In Python, texts are lists of words: a Python list is used to represent a text.
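As a minimal sketch of this representation, the snippet below turns a short sample sentence into a list of tokens and then applies ordinary list operations to it. The `tokenize` helper and the sample sentence are illustrative assumptions, not part of NLTK; a real tokenizer handles many more cases.

```python
import re

def tokenize(raw):
    """Split raw text into word and punctuation tokens (a crude stand-in for a real tokenizer)."""
    return re.findall(r"\w+|[^\w\s]", raw)

raw = "Call me Ishmael. Some years ago, never mind how long precisely."
text = tokenize(raw)

print(text[:4])          # first four tokens
print(len(text))         # total number of tokens
print(len(set(text)))    # number of distinct tokens (vocabulary size)
print(text.count("."))   # how often a given token occurs
```

Because the text is a plain list, slicing, `len`, `set`, and `count` all work without any special text type.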
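2. Simple statistics
(1) Frequency Distributions
FreqDist(text) counts how often each word occurs in text, with the word as key and its count as value. In older NLTK releases (which these notes appear to follow) the results are sorted by decreasing count; in NLTK 3, FreqDist is a collections.Counter subclass, so use fdist.most_common() to get that ordering.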
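The counting behaviour can be sketched with the standard library alone. `collections.Counter` is used here as a stand-in so the example runs without NLTK installed (in NLTK 3, `FreqDist` is a subclass of it); the token list is made up for illustration.

```python
from collections import Counter

text = ["the", "whale", "and", "the", "sea", "and", "the", "ship"]

fdist = Counter(text)            # maps each word to its count

print(fdist["the"])              # count of one word
print(fdist.most_common(2))      # the two most frequent words with their counts
```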
(2) Fine-grained Selection of Words
Use Python's list comprehensions. The set notation
{w | w ∈ V & P(w)}
is equivalent to
[w for w in V if P(w)]
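A small worked instance of this pattern, assuming V is the vocabulary of a made-up token list and P is the property "longer than five characters" (both chosen only for illustration):

```python
text = ["The", "whale", "is", "a", "monstrous", "creature", ",", "the", "whale", "!"]

V = set(text)                       # vocabulary: the distinct tokens

def P(w):
    """Example property: the token is longer than five characters."""
    return len(w) > 5

# Select the words of V that satisfy P, sorted for a stable display order.
long_words = sorted([w for w in V if P(w)])
print(long_words)
```

Any boolean test can take the place of P, for example `w.endswith("ness")` or a frequency threshold.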
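(3) Collocations and Bigrams
bigrams(list_of_str) pairs each string in the list with its successor, yielding the bigrams of the list.
text.collocations() prints the collocations in text, i.e. word pairs that occur together unusually often (not every phrase in the text).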
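The following sketch shows the idea behind both functions in plain Python. The `bigrams` helper here is a local re-implementation for illustration, and "occurs more than once" is a deliberately crude assumption standing in for the scoring and filtering that NLTK's `text.collocations()` applies.

```python
from collections import Counter

def bigrams(words):
    """Pair each word with its successor (the same idea as nltk.bigrams)."""
    return list(zip(words, words[1:]))

words = ["red", "wine", "is", "not", "white", "wine", "but", "red", "wine"]

pairs = bigrams(words)
print(pairs[:3])

# Crude collocation candidates: bigrams that occur more than once.
pair_counts = Counter(pairs)
frequent = [pair for pair, n in pair_counts.items() if n > 1]
print(frequent)
```

A list of n words always yields n - 1 bigrams, which is a quick sanity check on the output.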
(4) Summary of frequency distribution functions
Example                        Description
fdist = FreqDist(samples)      create a frequency distribution containing the given samples
fdist.inc(sample)              increment the count for this sample (older NLTK; in NLTK 3 write fdist[sample] += 1)
fdist['monstrous']             count of the number of times a given sample occurred
fdist.freq('monstrous')        frequency of a given sample
fdist.N()                      total number of samples
fdist.keys()                   the samples sorted in order of decreasing frequency (older NLTK; in NLTK 3 use fdist.most_common())
for sample in fdist:           iterate over the samples, in order of decreasing frequency (older NLTK behavior)
fdist.max()                    sample with the greatest count
fdist.tabulate()               tabulate the frequency distribution
fdist.plot()                   graphical plot of the frequency distribution
fdist.plot(cumulative=True)    cumulative plot of the frequency distribution
fdist1 < fdist2                test if samples in fdist1 occur less frequently than in fdist2
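To make the meaning of a few of these methods concrete, here is a minimal sketch of a frequency distribution built on `collections.Counter`. `MiniFreqDist` is a hypothetical teaching class, not NLTK's implementation; it only mirrors the names `N`, `freq`, and `max` from the table.

```python
from collections import Counter

class MiniFreqDist(Counter):
    """Hypothetical teaching class mirroring a few FreqDist methods."""

    def N(self):
        """Total number of sample outcomes counted."""
        return sum(self.values())

    def freq(self, sample):
        """Relative frequency of a sample (0.0 if nothing was counted)."""
        total = self.N()
        return self[sample] / total if total else 0.0

    def max(self):
        """Sample with the greatest count."""
        return self.most_common(1)[0][0]

fdist = MiniFreqDist(["a", "b", "a", "c", "a", "b"])
fdist["c"] += 1                 # increment the count for one sample

print(fdist["a"])               # count of a given sample
print(fdist.N())                # total number of samples
print(fdist.freq("a"))          # count divided by the total
print(fdist.max())              # most frequent sample
```

Note the distinction the table draws between a count (`fdist['a']`) and a frequency (`fdist.freq('a')`): the latter is the count divided by `N()`.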
3. Automatic Natural Language Understanding
(1) Word Sense Disambiguation
(2) Pronoun Resolution
(3) Generating Language Output
(4) Machine Translation
NLTK provides babelizer_shell(), which is worth trying out.
(5) Spoken Dialog Systems
(6) Textual Entailment
RTE (Recognizing Textual Entailment): given a passage of text, decide whether it supports a given hypothesis.
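To give a feel for the shape of the RTE task, the sketch below scores a text/hypothesis pair by simple word overlap. This is only a naive baseline assumed here for illustration: high overlap does not guarantee entailment, and real systems need much more than this. The function names and example sentences are made up.

```python
import re

def word_set(sentence):
    """Lowercased word tokens of a sentence, as a set."""
    return set(re.findall(r"[a-z0-9]+", sentence.lower()))

def overlap_score(text, hypothesis):
    """Fraction of hypothesis words that also appear in the text."""
    hyp = word_set(hypothesis)
    if not hyp:
        return 0.0
    return len(hyp & word_set(text)) / len(hyp)

text = "The cat sat on the mat near the door."
supported = "The cat sat on the mat."
unsupported = "The dog slept on the sofa."

print(overlap_score(text, supported))
print(overlap_score(text, unsupported))
```

A hypothesis such as "The mat sat on the cat." would score just as high as the supported one, which shows why word overlap alone cannot settle entailment.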

