Distinguishing the n-gram bin count B from the vocabulary size V

This book still contains many errors, and the translation compounds them. In Chapter 6, on language models, the authors define the various concepts in detail, but the translation handles B poorly: B is the number of bins into which training instances are classified, which is in effect the number of model parameters, i.e., the number of possible n-grams. A whole series of errors revolves around this concept.
The first error appears in a translator's note on page 127. The translator, citing the errata published by Manning, remarks: "the training corpus has 273,266 word types, so B should be 273,266 ... (translator's note)".
That figure, however, is V, not B: for a bigram model, B = V^2.
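For readers without the book at hand, the two estimators at issue are Laplace's law and expected likelihood estimation (ELE, i.e., Lidstone's law with lambda = 1/2); the whole dispute is over what to plug in for the number of bins B. A sketch in LaTeX, in what I recall as the book's notation (C is the count of an n-gram, N the number of training instances):

P_{\mathrm{Lap}}(w_1 \cdots w_n) = \frac{C(w_1 \cdots w_n) + 1}{N + B}
\qquad
P_{\mathrm{ELE}}(w_1 \cdots w_n) = \frac{C(w_1 \cdots w_n) + \tfrac{1}{2}}{N + \tfrac{1}{2}B}

For predicting the next word given a history, B = V (the vocabulary size); for estimating the joint distribution over bigrams, B = V^2. That distinction is exactly what the errata entries quoted below try to pin down.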
The second error is on page 128: "This training corpus has 14,585 word types in total. So for the new conditional probability p(not|was), the new estimate is (608 + 0.5)/(9404 + 14589*0.5)". The same mistake carries over here; the estimate should be:
(608 + 0.5)/(9404 + 14589^2*0.5)
Admittedly this error originates with the authors, and the translators simply failed to notice it. Accordingly, the ELE estimates in Table 6.5 are all wrong, and the original text's conclusion that roughly half the probability mass is discounted is wrong as well. In short, the authors bear much of the blame for the series of errors in Chapter 6, and the translators failed to point them out.
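The size of the disagreement is easy to see by computing the ELE estimate of p(not|was) under both choices of B. A minimal sketch in Python (the function name ele is mine, not the book's; the counts are the ones quoted above, including the 14589 that the errata below corrects to 14585):

def ele(count, total, bins, lam=0.5):
    """Lidstone-smoothed estimate (C + lam) / (N + lam * B); lam = 0.5 gives ELE."""
    return (count + lam) / (total + lam * bins)

V = 14589                      # word types, as quoted on page 128
print(ele(608, 9404, V))       # book/errata reading: B = V, gives ~0.0364
print(ele(608, 9404, V ** 2))  # reviewer's reading: B = V^2, gives ~5.7e-06

The two estimates differ by roughly four orders of magnitude, so which reading of B is correct matters a great deal for the conclusions drawn from Table 6.5.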
http://nlp.stanford.edu/fsnlp/errata.html
page 196, line -13: Change "This will be V^{n-1}" to "This will be V", given the following major clarification: In Section 6.1, the number of 'bins' is used to refer to the number of possible values of the classificatory feature vectors, while (unfortunately) from Section 6.2 on, with this change, the term 'bins' and the letter B is used to refer to the number of values of the target feature. This is V for prediction of the next word, but V^n for predicting the frequency of n-grams. (Thanks to Tibor Kiss <tibor .... linguistics.ruhr-uni-bochum.de>)
page 202-203: While the whole corpus had 400,653 word types, the training corpus had only 273,266 word types. This smaller number should have been used as B in the calculation of a Laplace's law estimate of table 6.4 (whereas actually 400,653 was used). The result of this change is that f_{Lap}(0) = 0.000295, and then 99.96% of the probability mass is given to previously unseen bigrams (!). In such a model, note that we have used a (demonstrably wrong) closed vocabulary assumption, so despite this huge mass being given to unseen bigrams, none is being given to potential bigrams using vocabulary items outside the training set vocabulary (OOV = out of vocabulary items). (Thanks to Steve Renals <s.renals .... dcs.shef.ac.uk> and Gary Cottrell <gary .... cs.ucsd.edu>)
page 205, line 2-3: Correction: here it is said that there are 14589 word types, but the number given elsewhere in the chapter (and the actual number found on rechecking the data file) is 14585. Clarification: Here we directly smooth the conditional distributions, so there are only |V| = 14585 values for the bigram conditional distribution added into the denominator during smoothing, whereas on pp. 202-203, we were estimating bigram probabilities, and there are |V|^2 different bigrams. (Thanks to Hidetosi Sirai <sirai .... sccs.chukyo-u.ac.jp>, Mark Lewellen <lewellen .... erols.com>, and Gary Cottrell <gary .... cs.ucsd.edu>)
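To see how "99.96% of the probability mass is given to previously unseen bigrams" can happen, note that under Laplace's law every unseen bin receives probability 1/(N + B). A back-of-the-envelope derivation (B_0, the number of unseen bins, is my notation, not the book's):

P_{\mathrm{unseen}} = B_0 \cdot \frac{1}{N + B} \approx \frac{B}{N + B} \quad \text{when } B_0 \approx B \gg N

With B on the order of V^2 bins, B dwarfs N for any realistically sized corpus, so nearly all of the smoothed mass goes to unseen events; this is the behavior the erratum describes.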
Popular reviews of 《统计自然语言处理基础》
-
Translation problems
4 useful / 0 not useful · 终南长安 · 2010-06-10
P. 17 (Chinese edition): the English "The significance of power laws" is rendered as 强法则的重要性; "power law" means 指数法則, i.e. 幂律...
-
Distinguishing the n-gram bin count B from the vocabulary size V
3 useful / 0 not useful · 黠之大者 · 2015-05-04
This book still contains many errors, and the translation compounds them. In Chapter 6, on language models, the authors define the various concepts in detail, but the translation handles B poorly: the number of bins for training instances is really the number of model parameters, i.e., the number of possible n-grams. A whole series of errors revolves around this concept. The first error appears in a translator's note on page 127, which cites the errata published by Manning, noting: "The tr...
-
An introductory tutorial for SNLP
1 useful / 0 not useful · 盐汤儿 · 2008-08-21
This book is not very thick, and it is not as comprehensive as Speech and Language Processing (《自然语言处理综论》). But for anyone who wants to learn SNLP it is quite good. Besides traditional NLP topics such as word segmentation and tagging, the final chapters also touch on some newer, more interdisciplinary areas. A fine exposition of the field of SNLP!...
-
Poor translation, but the original is good
1 useful / 0 not useful · realplayer-z · 2012-03-04
"power law" is translated as 强法则 and "perplexity" as 混乱度, and some of the harder sentences are simply skipped rather than translated; astonishing. I have not read much yet, but the original content seems good: the treatment is fairly complete, though the English is written at a somewhat difficult level rather than in an especially plain, easy style...
-
The book is somewhat disappointing
0 useful / 0 not useful · math007_地球物 · 2012-06-25
It is OK, but worse than I expected. Drawbacks: the translation is clumsy, and some of the writing is clumsy too. The book is packed with concepts. One pattern: the text-heavy passages tend to be hard to read, and after a while you no longer know what is being said, while the formula-heavy parts are actually easier to follow. Quite a few things that should not have been omitted are omitted. For example, section 2.1.10 on Bayes' theorem: I learned Bayes in college, but what on earth is page 33 talking about? And then there is the noisy channel model, ...
Title: 统计自然语言处理基础
Author:
Publisher: 电子工业出版社
Original title: Foundations of Statistical Natural Language Processing
Translators: 苑春法 | 李庆中 | 李伟
Publication year: 2005-1
Pages: 418
Price: CNY 55.00
Binding: Paperback (no CD)
Series: 国外计算机科学教材系列
ISBN: 9787505399211