Bibliography of Chinese Word Segmentation

From the Natural Language Processing Encyclopedia (自然语言处理百科)

Maintained by Zhang Kaixu (张开旭)

Comments and suggestions are welcome; please contact the author :)

Page generated: November 21, 2009


2009

Automatic Adaptation of Annotation Standards: Chinese Word Segmentation and POS Tagging – A Case Study

  • Liu,Qun;Huang,Liang;Jiang,Wenbin
  • Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP
  • Perceptron; joint word segmentation and POS tagging. Parameters learned under one annotation standard are transferred for use under another.

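Integrated Chinese Lexical and Syntactic Analysis Based on Character Dependency Trees (基于字依存树的中文词法-句法一体化分析)

  • Kit,Chunyu;Song,Yan;Zhao,Hai
  • Frontiers of Chinese Computational Linguistics Research (2007-2009) (中国计算机语言学研究前沿进展)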

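CRF-Based Chinese Word Segmentation and Short Text Classification Techniques (基于 CRFs 的中文分词和短文本分类技术)

  • Teng,Shaohua
  • For segmentation, with chi-square feature selection, half of the features are enough to maintain performance. The presence or absence of words such as "的", "和", and "了" can both help and interfere with segmenting the whole sentence correctly. On the CRF's confidence output: low confidence goes with a high error rate. Describes rule-based and document-context-statistics-based post-processing for low-confidence outputs.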

A Simple and Efficient Model Pruning Method for Conditional Random Fields

  • Kit,C.;Zhao,H.
  • Proceedings of the 22nd International Conference on Computer Processing of Oriental Languages. Language Technology for the Knowledge-based Economy
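  • After CRF training, most features can be removed according to their parameter values without any loss in performance, which shows empirically that CRFs carry a great deal of redundancy.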
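
A minimal sketch of pruning by parameter value, assuming a trained model is available as a plain mapping from feature name to weight. The feature names, the weights, and the `keep_ratio` parameter are hypothetical and are not taken from the paper, whose exact pruning criterion may differ.

```python
def prune_by_magnitude(weights, keep_ratio):
    """Keep only the features whose weights have the largest absolute values."""
    keep = max(1, int(len(weights) * keep_ratio))
    ranked = sorted(weights.items(), key=lambda kv: abs(kv[1]), reverse=True)
    return dict(ranked[:keep])


# Hypothetical feature weights, for illustration only.
weights = {
    "C0=的": 1.8,
    "C-1C0=中文": -2.4,
    "C1=了": 0.03,
    "C0C1=分词": 0.9,
    "C-1=和": -0.01,
}
pruned = prune_by_magnitude(weights, keep_ratio=0.4)
print(pruned)
```

Whether performance really survives such pruning has to be checked by re-evaluating the pruned model on held-out data.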

Character-Level Dependencies in Chinese: Usefulness and Learning

  • Zhao,H.
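  • Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009)
  • Segments words using character-level dependency trees. In the final system, characters within a word are linked by lexical character dependencies, and words are linked by linear dependencies. The final results do not match the best existing systems.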

Bayesian Unsupervised Word Segmentation with Nested Pitman-Yor Language Modeling

  • Yamada,T.;Mochihashi,D.;Ueda,N.
  • Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP
  • Uses the Pitman-Yor process to build a two-level language model, one level for words and one for sentences.

An Error-Driven Word-Character Hybrid Model for Joint Chinese Word Segmentation and POS Tagging

  • Torisawa,K.;Kruengkrai,C.;Kazama,J.;Uchimoto,K.;Isahara,H.;Wang,Y.
  • Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP
  • Dictionary words and unknown words are handled separately.


2008

Word Lattice Reranking for Chinese Word Segmentation and Part-of-Speech Tagging

  • Liu,Qun;Mi,Haitao;Jiang,Wenbin
  • Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008)
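  • Uses reranking. Unlike top-n reranking, it reranks over a word lattice that encodes exponentially many candidates. Judging by oracle scores at least, the lattice is better than the n-best list. Issues addressed: how to construct the lattice, how to compute the oracle, which features to use, and cube pruning during reranking.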

Joint Word Segmentation and POS Tagging Using a Single Perceptron

  • Zhang,Yue;Clark,Stephen
  • Proceedings of ACL-08: HLT
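  • Uses a perceptron. The two baseline systems, one for segmentation and one for POS tagging, both use binary features built from combinations of character features, word features, length features, and so on. Doing the two tasks jointly is better than doing them separately, but only slightly.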

A Cascaded Linear Model for Joint Chinese Word Segmentation and Part-of-Speech Tagging

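  • Liu,Qun;Huang,Liang;Lü,Yajuan;Jiang,Wenbin
  • Proceedings of ACL-08: HLT
  • Tagging scheme: the Cartesian product of a 4-tag segmentation tag set and the POS tags. The core is a perceptron; because there are too many word-based binary features, the perceptron uses only character-based features. A second linear model follows, adding many non-binary features based on words and tags.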

Unsupervised segmentation helps supervised learning of character tagging for word segmentation and named entity recognition

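  • Kit,C.;Zhao,H.
  • The Sixth SIGHAN Workshop on Chinese Language Processing
  • Discretizes accessor variety (AV) scores, distributes them to individual characters, and feeds them to the CRF as input features, which improves segmentation.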

An empirical comparison of goodness measures for unsupervised Chinese word segmentation with a unified framework

  • Kit,C.;Zhao,H.
  • The Third International Joint Conference on Natural Language Processing (IJCNLP-2008), Hyderabad, India
  • Describes four goodness measures for unsupervised Chinese word segmentation: Frequency of Substring with Reduction; Description Length Gain (DLG); Accessor Variety (AV); Boundary Entropy (Branching Entropy, BE).
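
As an illustration of one of these measures, the sketch below computes accessor variety as the smaller of the number of distinct left neighbours and distinct right neighbours of a candidate string. The toy corpus is invented, and treating every sentence boundary as one shared marker (`<s>` or `</s>`) is a simplifying assumption that may differ from the papers' exact definition.

```python
def accessor_variety(corpus, candidate):
    """AV = min(number of distinct left neighbours, number of distinct right neighbours)."""
    left, right = set(), set()
    for sentence in corpus:
        start = sentence.find(candidate)
        while start != -1:
            end = start + len(candidate)
            # Simplification: all sentence starts share one marker, as do all ends.
            left.add(sentence[start - 1] if start > 0 else "<s>")
            right.add(sentence[end] if end < len(sentence) else "</s>")
            start = sentence.find(candidate, start + 1)
    return min(len(left), len(right))


# Invented toy corpus, for illustration only.
corpus = ["门把手弄坏了", "小明修好了门把手", "这个门把手很漂亮", "门把手坏了", "小明弄坏了门"]
print(accessor_variety(corpus, "门把手"))
print(accessor_variety(corpus, "把手"))
```

In this toy corpus "把手" is always preceded by "门", so its AV is low, while "门把手" appears in varied contexts and scores higher.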

Bayesian Semi-Supervised Chinese Word Segmentation for Statistical Machine Translation

  • Ney,H.;Xu,J.;Toutanova,K.;Gao,J.
  • Proceedings of the 22nd International Conference on Computational Linguistics, Manchester, UK

Statistical Properties of Overlapping Ambiguities in Chinese Word Segmentation and a Strategy for Their Disambiguation

  • Sun,M.;Qiao,W.;Menzel,W.
  • Proceedings of the 11th international conference on Text, Speech and Dialogue


2007

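Chinese Word Segmentation Based on Effective Substring Tagging (基于有效子串标注的中文分词)

  • Kit,Chunyu;Zhao,Hai
  • Journal of Chinese Information Processing (中文信息学报)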

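Chinese Word Segmentation: A Decade Review (中文分词十年回顾)

  • Huang,Changning;Zhao,Hai
  • Journal of Chinese Information Processing (中文信息学报)
  • Agreement among people on what counts as a Chinese word. Evaluations from the 863 and 973 programs to the SIGHAN bakeoffs. Corpus quality control (including rule-making for "psychological words"). Grammar-based and rule-based methods fall short of word-based ones, which were in turn superseded by character-based ones. In large-scale real text, the accuracy loss caused by out-of-vocabulary words is at least five times the loss caused by segmentation ambiguity. Character-based methods: maximum entropy, SVM, CRF, and so on. Transitions between word-position tags; 2-tag, 4-tag, and Microsoft's 6-tag sets. A 5-character window is sufficient.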
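
To make the word-position tagging idea concrete, here is a small sketch of the 4-tag scheme. The tag names B, M, E, S (begin, middle, end, single) are a common convention assumed here, not something the note specifies, and the example sentence is invented.

```python
def words_to_tags(words):
    """Convert a segmented sentence into one position tag per character."""
    tags = []
    for word in words:
        if len(word) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags


def tags_to_words(chars, tags):
    """Rebuild the segmentation: a word ends wherever the tag is E or S."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):
            words.append(current)
            current = ""
    if current:  # tolerate a tag sequence that does not close the last word
        words.append(current)
    return words


words = ["自然语言", "处理", "很", "有趣"]
tags = words_to_tags(words)
print(tags)
print(tags_to_words("".join(words), tags))
```

A character tagger (maximum entropy, SVM, CRF) then only has to predict one such tag per character, and the segmentation follows deterministically.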

Chinese segmentation with a word-based perceptron algorithm

  • Clark,S.;Zhang,Y.
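  • Annual Meeting of the Association for Computational Linguistics
  • Uses the averaged perceptron with a lazy-update method. Because the features are word-based, decoding uses beam search rather than greedy search or dynamic programming.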
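
The decoding side of this idea can be sketched as follows. This is only an illustration of beam search over word-based scores: the `score_word` function and its hand-set weights are hypothetical stand-ins for a trained averaged perceptron, and the paper's feature templates and training procedure are not reproduced.

```python
def beam_search_segment(sentence, score_word, beam_size=4):
    """Segment by beam search; each item is (score of finished words, finished words, open word)."""
    if not sentence:
        return []
    beam = [(0.0, [], sentence[0])]
    for ch in sentence[1:]:
        candidates = []
        for score, words, last in beam:
            # Action 1: append the character to the open word.
            candidates.append((score, words, last + ch))
            # Action 2: close the open word and start a new one.
            candidates.append((score + score_word(last), words + [last], ch))
        # Rank by the score the item would have if its open word were closed now.
        candidates.sort(key=lambda c: c[0] + score_word(c[2]), reverse=True)
        beam = candidates[:beam_size]
    _, words, last = max(beam, key=lambda c: c[0] + score_word(c[2]))
    return words + [last]


# Hypothetical hand-set weights; a real system would learn these.
WEIGHTS = {"中文": 2.0, "分词": 2.0, "文分": 0.5}


def score_word(word):
    # Unknown strings are penalized in proportion to their length (an assumption).
    return WEIGHTS.get(word, -2.0 * len(word))


print(beam_search_segment("中文分词", score_word))
```

Because a word's score depends on the whole word rather than on a fixed-size tag history, the search keeps several partial segmentations alive instead of committing greedily.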

A dual-layer CRFs based joint decoding method for cascaded segmentation and labeling tasks

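  • Wang,M.;Shi,Y.
  • International Joint Conferences on Artificial Intelligence (IJCAI)
  • Dual-layer CRFs for segmentation and POS tagging; a solid, conventional approach. The first layer segments using character information; the second layer tags POS using word and character information. The CRFs are trained separately and decoded jointly: the first layer produces an N-best list, which is then reranked using the combined results of both layers.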

A hybrid approach to word segmentation and pos tagging

  • Nakagawa,T.;Uchimoto,K.
  • Annual Meeting of the Association for Computational Linguistics
  • A lattice combining characters and words, with segmentation and tagging done jointly. Still uses a Markov model.

2006

An improved Chinese word segmentation system with conditional random field

  • Huang,C. N.;Li,M.;Zhao,H.
  • Proceedings of the Fifth SIGHAN Workshop on Chinese Language Processing
  • 6-tag set;tone feature;assistant segmenters

Subword-based tagging by conditional random fields for chinese word segmentation

  • Zhang,R.;Kikui,G.;Sumita,E.
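  • Proc. of HLT-NAACL
  • Subword-based tagging; for example, 北京市 is tagged as 北京/l 市/r, though a three-tag scheme is still used. Uses the CRF's confidence scores to combine it with a dictionary-based method. The CRF tends toward a higher F1 on OOV words and a lower F1 on IV words.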

Unsupervised segmentation of Chinese text by use of branching entropy

  • Tanaka-Ishii,K.;Jin,Z.
  • Proceedings of the COLING/ACL on Main conference poster sessions
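  • Take "nature" as an example: as letters are read in, the uncertainty about the letter that follows "nature" is much greater than the uncertainty after "natur", so the end of "nature" is taken as a likely word boundary. On this basis, the boundary entropy of every subsequence of the sentence is computed (in both the forward and backward directions) and used as the segmentation criterion.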
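
The "nature" example can be reproduced with a small sketch of forward branching entropy. The three-string corpus is invented, only the forward direction is shown, and the `max_len` parameter is an assumption of this sketch rather than a setting from the paper.

```python
import math
from collections import Counter, defaultdict


def branching_entropy(corpus, max_len=6):
    """Map each context (up to max_len characters) to the entropy of the next character."""
    successors = defaultdict(Counter)
    for text in corpus:
        for i in range(len(text)):
            for n in range(1, max_len + 1):
                if i + n < len(text):
                    successors[text[i:i + n]][text[i + n]] += 1
    entropy = {}
    for context, counts in successors.items():
        total = sum(counts.values())
        entropy[context] = -sum(c / total * math.log2(c / total) for c in counts.values())
    return entropy


# Invented toy corpus: "natur" is always followed by "e", "nature" by varied letters.
corpus = ["natureis", "naturehas", "natureof"]
h = branching_entropy(corpus)
print(h["natur"], h["nature"])
```

The entropy stays at zero while the word is still being spelled out and jumps once the word is complete, which is the signal used to propose a boundary.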

Contextual dependencies in unsupervised word segmentation

  • Griffiths,T. L.;Johnson,M.;Goldwater,S.
  • Annual Meeting of the Association for Computational Linguistics
  • Language model and lexical model based on the Dirichlet process
  • Gibbs sampling over two words at a time


2005

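Chinese Word Segmentation and Named Entity Recognition: A Pragmatic Approach

  • Wu,Andi;Huang,Chang-Ning;Li,Mu;Gao,Jianfeng
  • Computational Linguistics
  • Learns a linear model with a perceptron. Unlike character tagging, a word lattice is built before decoding, which in effect shrinks the set of possible character-tagging outputs in advance. Words are divided into several classes, and each class yields probability values that are used as inputs to the perceptron; none of them are binary. Only word-class trigram probabilities are used, without reference to any specific character.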

A Conditional Random Field Word Segmenter for Sighan Bakeoff 2005

  • Manning,C.;Andrew,G.;Tseng,H.;Jurafsky,D.;Chang,P.
  • Proceedings of the Fourth SIGHAN Workshop on Chinese Language Processing
  • One of the stronger systems in the SIGHAN Bakeoff 2005; uses simple affix and character-reduplication features in the CRF.

Perceptron Learning for Chinese Word Segmentation

  • Miao,C.;Li,Y.;Cunningham,H.;Bontcheva,K.
  • Proceedings of Fourth SIGHAN Workshop on Chinese Language processing (Sighan-05)


2004

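Automatic Chinese Word Segmentation Without a Lexicon Based on an Unsupervised Learning Strategy (基于无指导学习策略的无词表条件下的汉语自动分词)

  • Xiao,Ming;Tsou,B. K.;Sun,Maosong
  • Chinese Journal of Computers (计算机学报)
  • Uses mutual information and the difference of t-test as two criteria for unsupervised, character-level segmentation. Tagging accuracy measured per character reaches about 85%.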
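
A rough sketch of the mutual-information half of this idea: adjacent characters with high pointwise mutual information are joined, and a boundary is placed where it is low. The difference-of-t-test criterion is omitted, and the toy corpus and the threshold of 2.5 are invented for illustration rather than taken from the paper.

```python
import math
from collections import Counter


def adjacent_pmi(corpus):
    """Pointwise mutual information (base 2) for every adjacent character pair."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in corpus:
        unigrams.update(sentence)
        bigrams.update(zip(sentence, sentence[1:]))
    n_uni, n_bi = sum(unigrams.values()), sum(bigrams.values())
    return {
        (a, b): math.log2((c / n_bi) / ((unigrams[a] / n_uni) * (unigrams[b] / n_uni)))
        for (a, b), c in bigrams.items()
    }


def segment(sentence, pmi, threshold):
    """Join adjacent characters whose PMI reaches the threshold; split otherwise."""
    if not sentence:
        return []
    words, current = [], sentence[0]
    for a, b in zip(sentence, sentence[1:]):
        if pmi.get((a, b), float("-inf")) >= threshold:
            current += b
        else:
            words.append(current)
            current = b
    words.append(current)
    return words


# Invented toy corpus and threshold, for illustration only.
corpus = ["我喜欢北京", "北京很大", "我在北京", "他喜欢我"]
pmi = adjacent_pmi(corpus)
print(segment("我喜欢北京", pmi, threshold=2.5))
```

On real data the threshold would have to be tuned, and rare characters inflate PMI, which is one reason a second criterion is useful.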

Chinese segmentation and new word detection using conditional random fields

  • Peng,F.;McCallum,A.;Feng,F.
  • COLING 2004
  • Introduces CRFs to Chinese word segmentation.

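Chinese part-of-speech tagging: One-at-a-time or all-at-once? Word-based or character-based?

  • Low,J. K.;Ng,H. T.
  • Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing (EMNLP)
  • Tries three configurations with a maximum entropy model, varying whether segmentation and tagging are done separately or jointly and whether POS tagging uses character-based or word-based features. Joint and character-based is best but much slower. Separate and character-based is slightly worse but much faster. Separate and word-based gives the same segmentation performance as character-based, as expected, but much worse POS tagging, with a slightly shorter total time. POS tagging suffers because the characters inside a word matter a great deal for determining its part of speech. There is no joint, word-based configuration, presumably because it was computationally infeasible. Nor is there an experiment using word-based features at the segmentation stage.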

Chinese and Japanese word segmentation using word-level and character-level information

  • Nakagawa,T.
  • Proceedings of the 20th international conference on Computational Linguistics
  • Segmentation over a lattice that combines characters and words, using a Markov model.

Applying conditional random fields to Japanese morphological analysis

  • Yamamoto,K.;Kudo,T.;Matsumoto,Y.
  • Proc. of EMNLP
  • Japanese word segmentation with a modified CRF model. It operates on word units, so the length of y need not equal the length of x.

Adaptive Chinese word segmentation

  • Huang,C. N.;Qin,H.;Li,M.;Li,H.;Wu,A.;Gao,J.;Xia,X.
  • Proceedings of ACL-2004

Accessor variety criteria for Chinese word extraction

  • Chen,K.;Zheng,W.;Deng,X.;Feng,H.
  • Computational Linguistics


2003

Chinese lexical analysis using hierarchical hidden Markov model

  • Zhang,H. P.;Cheng,X. Q.;Zhang,H.;Liu,Q.;Yu,H. K.
  • Proceedings of the second SIGHAN workshop on Chinese language processing-Volume 17

HHMM-based Chinese lexical analyzer ICTCLAS

  • Zhang,H. P.;Xiong,D. Y.;Liu,Q.;Yu,H. K.
  • Proceedings of Second SIGHAN Workshop on Chinese Language Processing
  • Introductory paper for ICTCLAS, described here as the most practical word segmentation toolkit.

Chinese Word Segmentation as LMR Tagging

  • Xue,N.;Shen,L.
  • Proceedings of the Second SIGHAN Workshop on Chinese Language Processing

The first international Chinese word segmentation bakeoff

  • Sproat,R.;Emerson,T.
  • Proceedings of the second SIGHAN workshop on Chinese language processing

A maximum entropy Chinese character-based parser

  • Luo,X.
  • Proceedings of the 2003 conference on Empirical methods in natural language processing-Volume 10

Chinese Word Segmentation Using Minimal Linguistic Knowledge

  • Chen,A.
  • Proceedings of the Second SIGHAN Workshop on Chinese Language Processing

Combining segmenter and chunker for Chinese word segmentation

  • Goh,C. L.;Wang,X.;Asahara,M.;Matsumoto,Y.
  • Proceedings of the 2nd SIGHAN Workshop on Chinese Language Processing


2002

Combining classifiers for Chinese word segmentation

  • Xue,Nianwen;Converse,S. P.
  • Proceedings of the first SIGHAN workshop on Chinese language processing-Volume 18
  • A milestone: the first to propose a character-tagging model for word segmentation.

2001

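A Critical Review of Research on Automatic Chinese Word Segmentation (汉语自动分词研究评述)

  • Tsou,B. K.;Sun,Maosong
  • Contemporary Linguistics (当代语言学)
  • A good review and critique of Chinese word segmentation research in the last century. Ambiguity: overlapping ambiguity and covering ambiguity; OOV.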

Defining and automatically identifying words in Chinese

  • Xue,Nianwen

Self-supervised Chinese word segmentation

  • Schuurmans,D.;Peng,F.
  • Purely unsupervised segmentation with the EM algorithm; self-supervised, with the lexicon split into two for pruning.

2000

A compression-based algorithm for Chinese word segmentation

  • Teahan,W. J.;Witten,Ian H.;Wen,Yingying;McNab,Rodger
  • Computational Linguistics


1999

Discovering Chinese words from unsegmented text (poster abstract)

  • Smyth,P.;Pratt,W.;Ge,X.
  • Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval
  • Purely unsupervised segmentation; EM algorithm; zeroth-order hidden Markov chain.


1998

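An Automatic Chinese Word Segmentation System Combining String-Frequency Statistics and Word-Form Matching (串频统计和词形匹配相结合的汉语自动分词系统)

  • Wu,Yan;Liu,Ting
  • Journal of Chinese Information Processing (中文信息学报)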

Chinese word segmentation without using lexicon and hand-crafted training data

  • Shen,D.;Tsou,B. K.;Sun,M.
  • Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics-Volume 2

A hybrid approach to word segmentation

  • Manandhar,S.;Kazakov,D.
  • Lecture Notes in Computer Science


1997

The Word Segmentation Problem in Chinese Information Processing (中文信息处理中的分词问题)

  • Huang,Changning
  • Applied Linguistics


1996

A stochastic finite-state word-segmentation algorithm for Chinese

  • Sproat,R.;Chang,N.;Shih,C.;Gale,W.
  • Computational Linguistics

Useg: A retargetable word segmentation procedure for information retrieval

  • Ponte,J.;Croft,W. B.
  • Symposium on Document Analysis and Information Retrieval


1991

Using Statistics in Lexical Analysis

  • Hindle,D.;Gale,W.;Church,K.;Hanks,P.


Reposted from: http://nlp.csai.tsinghua.edu.cn/~zkx/cws/bib.html
