Srilm 阅读文档15-Version2
取自 自然语言处理百科
Discount.cc Discount.h
文档作者:jianzhu
修改时间: 08.12.15-08.12.17
注:本文档改写自rickjin书写的Discount文档修正了原文档中存在的一些错误,并扩充了部分内容;阅读该文档前,建议先阅读srilm的ngram-discount.7.html手册;或者查看jianzhu的翻译文档Ngram折扣平滑算法
1、基本类
Discount.h Discount.cc 这两个文件主要实现了最重要的几个折扣算法, 包括
a. Katz Discount (基于 Good-Turing Discounting) b. Absolute Discounting c. Natural law of succession [Eric Sven Ristad, 1995] d. Additive Discounting [Lidstone-Johnson-Jeffrey] e Witten-Bell Discounting f. Kneser-Ney Discounting g. Modified Kneser-Ney Discounting, [Chen, Goodman, 1998]
2、类接口说明
2.1) Discount 类主要接口
<src>
class Discount
{
public:
/**
* @brief 构造函数
* 通过成员初始化列表的方式将interpolate初始化为false,表示不支持插值平滑
*/
Discount() : interpolate(false) {};
virtual ~Discount() {};
/* 使用折扣算法, 返回频率 count 对应的折扣系数
* @param count 当前 ngram 频率 ( C(a_z) )
* @param totalCount 历史 n-1gram 频率 ( C(a_) )
* @param oberservedVocab 语料中统计到的历史ngram a_ 后接的 z 的类型数
* */
virtual double discount(Count count, Count totalCount, Count observedVocab);
virtual double discount(FloatCount count, FloatCount totalCount,
Count observedVocab);
/* 在插值模型中, 低阶模型的权值大小
*
* @param min2Vocab 频率>=2 的ngram type 数目
* @param min3Vocab 频率>=3 的ngram type 数目
* */
virtual double lowerOrderWeight(Count totalCount, Count observedVocab,
Count min2Vocab, Count min3Vocab);
virtual double lowerOrderWeight(FloatCount totalCount, Count observedVocab,
Count min2Vocab, Count min3Vocab);
/* 是否支持该折扣操作 */
virtual Boolean nodiscount();
/* 保存折扣系数 */
virtual void write(File &file);
/* 读入折扣系数 */
virtual Boolean read(File &file);
/* 计算折扣系数
*
* @param counts ngram 计数器
* @param order 对 order 阶ngram 进行平滑计算
* */
virtual Boolean estimate(NgramStats &counts, unsigned order);
virtual Boolean estimate(NgramCounts<FloatCount> &counts, unsigned order);
/* 在平滑之前, 对 ngram 的频率进行调整,在 KneserNey 平滑算法中使用 */
virtual void prepareCounts(NgramCounts<NgramCount> &counts,
unsigned order, unsigned maxOrder);
virtual void prepareCounts(NgramCounts<FloatCount> &counts,
unsigned order, unsigned maxOrder);
/* 是否支持插值,这里将interpolate置为public区域,是为子类的继承 */
Boolean interpolate;
protected:
/* Vocab 中有效条目的 ngram 数目 */
static unsigned vocabSize(Vocab &vocab);
};
<src>
2.2) GoodTuring 类主要接口
<src>
class GoodTuring: public Discount
{
public:
/* 构造函数,该构造函数中并未修改从父类继承的interploate值,
* 因此interpolate为false,表示不支持插值平滑
*/
GoodTuring(unsigned mincount = GT_defaultMinCount,
unsigned maxcount = GT_defaultMaxCount);
/* 折扣系数估算函数
* jianzhu added 2008-12-15
*/
Boolean estimate(NgramStats &counts, unsigned order);
protected:
/* 小于该值的频率将被设置为 0 */
Count minCount;
/* 大于该值的频率将不平滑, 保持不变 */
Count maxCount;
/* 平滑系数 */
Array<double> discountCoeffs;
};
</src>
说明:
a. 该函数实现标准的基于图灵平滑的 Katz 平滑算法
b. 图灵平滑算法为 r* = (r + 1) * n_(r+1) / n_r, Katz 平滑算法如下
|-- d_r * r (0 < r < k)
r_katz = |-- r (r > k)
|-- alpha (r = 0)
d_r = (r*/r - commonTerm ) / (1 - commonTerm)
commonTerm = (k+1) * n_(k+1) / n_1
在此处 k 默认值为 GT_defaultMaxCount = 5;
折扣系数估算函数
<src>
Boolean
GoodTuring::estimate(NgramStats &counts, unsigned order)
{
Array<Count> countOfCounts;
/*
* First tabulate count-of-counts for the given order of ngrams
* Note we need GT count for up to maxCount + 1 inclusive, to apply
* the GT formula for counts up to maxCount.
*/
makeArray(VocabIndex, wids, order + 1);
NgramsIter iter(counts, wids, order);
NgramCount *count;
Count i;
for (i = 0; i <= maxCount + 1; i++) {
countOfCounts[i] = 0;
}
while (count = iter.next()) {
if (counts.vocab.isNonEvent(wids[order - 1])) {
continue;
} else if (counts.vocab.isMetaTag(wids[order - 1])) {
unsigned type = counts.vocab.typeOfMetaTag(wids[order - 1]);
/*
* process count-of-count
*/
if (type > 0 && type <= maxCount + 1) {
countOfCounts[type] += *count;
}
} else if (*count <= maxCount + 1) {
countOfCounts[*count] ++;
}
}
if (debug(DEBUG_ESTIMATE_DISCOUNT)) {
dout() << "Good-Turing discounting " << order << "-grams\n";
for (i = 0; i <= maxCount + 1; i++) {
dout() << "GT-count [" << i << "] = " << countOfCounts[i] << endl;
}
}
if (countOfCounts[1] == 0) {
cerr << "warning: no singleton counts\n";
maxCount = 0;
}
while (maxCount > 0 && countOfCounts[maxCount + 1] == 0) {
cerr << "warning: count of count " << maxCount + 1 << " is zero "
<< "-- lowering maxcount\n";
maxCount --;
}
if (maxCount <= 0) {
cerr << "GT discounting disabled\n";
} else {
double commonTerm = (maxCount + 1) *
(double)countOfCounts[maxCount + 1] /
(double)countOfCounts[1];
for (i = 1; i <= maxCount; i++) {
double coeff;
if (countOfCounts[i] == 0) {
cerr << "warning: count of count " << i << " is zero\n";
coeff = 1.0;
} else {
double coeff0 = (i + 1) * (double)countOfCounts[i+1] /
(i * (double)countOfCounts[i]);
coeff = (coeff0 - commonTerm) / (1.0 - commonTerm);
if (!finite(coeff) || coeff <= Prob_Epsilon || coeff0 > 1.0) {
cerr << "warning: discount coeff " << i
<< " is out of range: " << coeff << "\n";
coeff = 1.0;
}
}
discountCoeffs[i] = coeff;
}
}
return true;
}
</src>
功能:估算NgramStats中特定元数(order)的,从1到maxCount的所有折扣系数,并
将该折扣系数保存到discountCoeffs数组中
细解:
注:阅读以下部分,如有疑问可以参考srilm ngram-discount.7.html文档,或jianzhu的
Ngram折扣平滑算法翻译文档。另外关于katz的折扣计算公式可以参考speech and language
processing一书的第四章 N-GRAMS
commonTerm = (maxCount + 1) * (double)countOfCounts[maxCount+1]/(double)countOfCounts[1];
这里:
commonTerm -- A
maxCount -- gtmax
countOfCounts[maxCount+1] -- countOfCounts[gtmax+1] -- n[gtmax+1]
countOfCounts[1] -- n[1]
因此上式即转化为
A = (gtmax+1)*(double)n[gtmax+1]/(double)n[1]
coeff0 = (i+1) * (double)countOfCounts[i+1]/(i * (double)countOfCounts[i]);
这里:
i -- c(a_z)
i+1 -- c(a_z)+1
countOfCounts[i+1] -- n[c(a_z)+1]
countOfCounts[i] -- n[c(a_z)]
因此
coeff0 = (c(a_z)+1) * (double)n[c(a_z)+1]/(c(a_z) * (double)n[c(a_z)])
这里coeff0 相当远 r*/r
coeff = (coeff0 - commnTerm)/(1.0 - commonTerm)
= [ (c(a_z)+1) * (double)n[c(a_z)+1]/(c(a_z) * (double)n[c(a_z)])
- (gtmax+1)*(double)n[gtmax+1]/(double)n[1] ] / [ 1.0 - (gtmax+1)* (double)n[gtmax+1]/(double)n[1] ]
这里 coeff 相当于 d_r = (r*/r - commonTerm ) / (1 - commonTerm);
注上述公式根据katz折扣算法而来,该算法的公式如下
C' = [ (c(a_z)+1) * (double)n[c(a_z)+1]/(double)n[c(a_z)]
- c(a_z)*(gtmax+1)*(double)n[gtmax+1]/(double)n[1] ] / [ 1.0 - (gtmax+1)* (double)n[gtmax+1]/(double)n[1] ]
又因为有coeff = C'/C,因为目标为 C * coeff -> C'
2.3) ConstDisCount 类主要接口
<src>
class ConstDiscount: public Discount
{
public:
double lowerOrderWeight(Count totalCount, Count observedVocab,
Count min2Vocab, Count min3Vocab)
{ return _discount * observedVocab / totalCount; }
protected:
/* 打折的常数项 */
double _discount;
/* 低于此值的频率被设置为 0 */
double _mincount;
};
说明: a. 此折扣算法即为 Absolute Discounting b. 算法思想就是对每个频数count, 将其减去一个常数 _discount, _discount介于 0和1之间,所以变为 (count - _discount), 则折扣系数为 (count-_discount) / count c. 常数_discount一般设为 _discount = n1 / (n1 + 2*n2) 上式中 n1代表出现次数为1次的ngram数目 n2代表出现次数为2次的ngram数目
该类中用于估算折扣系数的函数
<src>
double discount(Count count, Count totalCount, Count observedVocab)
{
return (count <= 0) ? 1.0 :
(count < _mincount) ? 0.0 :
(count - _discount) / count; };
</src>
功能:上述函数用于返回频数为count的ngram的折扣系数
1) 若count <= 0 则返回折扣系数1.0,表示不折扣
2) 若count < _mincount 则返回折扣系数0.0,表示将该频数折扣为0
3) 否则根据公式 (count - _discount) / count 计算折扣系数并返回。
该类由于支持插值平滑,因此有用于估算插值平滑时,低阶ngram权重的函数
<src>
double lowerOrderWeight(Count totalCount, Count observedVocab,
Count min2Vocab, Count min3Vocab)
{
return _discount * observedVocab / totalCount;
}
</src>
功能:计算用于估算插值平滑时,低阶ngram权重的函数
为了理解该函数,可以参考Ngram折扣平滑算法文档中Ney的绝对折扣算法这一部分
bow(a_) = 1 - Sum_Z1 g(a_z) ; Eqn.5
= D n(a_*) / c(a_)
因此 _discount -- D
n(a_*) -- observedVocab
c(a_) -- totalCount
这里有一个疑问就是Absolute Discount是支持插值平滑算法的,但该类中并未看到将
interpolate设为true的函数,后面的其他折扣算法类也有同样的问题,我猜测可能是
因为父类存在lowerOrderWeight函数,而且返回为0.0,若子类未重新定义该函数,则
沿用父类中的函数,直接返回0.0,表示不支持插值平滑。
2.4) NaturalDiscount 类主要接口
<src>
class NaturalDiscount: public Discount
{
public:
...
protected:
/* ngram 的条目数 */
unsigned _vocabSize; /* vocabulary size */
/* 低于此值的频率被设置为 0 */
double _mincount; /* minimum count to retain */
};
</src>
说明:
a. 此平滑算法为 Natural law of succession [Eric Sven Ristad, 1995]
b. 算法思想就是对每个频率count, 使用如下固定的平滑系数
d = ( n(n+1) + q(1-q) ) / ( n^2 + n + 2q )
其中 n 为 totalCount(总频率), q 为 observedVocab(ngram type 数)
该类中用于估算折扣系数的函数
<src>
double
NaturalDiscount::discount(Count count, Count totalCount, Count observedVocab)
{
double n = totalCount;
double q = observedVocab;
if (count < _mincount) {
return 0.0;
} else if (q == _vocabSize) {
return 1.0;
} else {
return (n * (n+1) + q * (1 - q)) / (n * (n + 1) + 2 * q);
}
}
</src>
功能:上述函数用于返回频数为count的ngram的折扣系数
1) 若count 小于_mincount值,则返回折扣系数0.0,表示将该频数折扣为0
2) n 和 q 介绍
要理解这里的 n 和 q 首先需要了解一下Natural Discount的公式
该公式如下表示
c(a_)(c(a_) + 1) + n(a_*)(1 - n(a_*))
---------------------------------------
c(a_)^2 + c(a_) + 2 n(a_*)
因此 n 就相当于上述公式中的c(a_),q 相当于上述公式中的 n(a_*)
因此 n 就相当于所有出现的n元 c(a_*) token的频数和
q 就相当于所有出现的n元 c(a_*) 中 * 这个词的种类数,即不同
的n元 c(a_*)的个数
2.5) AddSmooth 类主要接口
class AddSmooth: public Discount
{
public:
AddSmooth(double delta = 1.0, unsigned mincount = 0)
: _delta(delta < 0.0 ? 0.0 : delta),
_mincount(mincount) {};
protected:
double _delta; /* the additive constant */
/* 低于此值的频率被设置为 0 */
double _mincount; /* minimum count to retain */
/* ngram 的条目数 */
unsigned _vocabSize; /* vocabulary size */
};
说明:
a. 此平滑算法为 Additive smoothing, 可以视为 Laplace add 1 smoothing 的扩展
b. 算法思想就是对每个频率c, 使用如下方法计算概率
p = (c + delta) / (T + N * delta)
其中 T 为 totalCount(总频率), N 为 _vocabSize(ngram 中包含的不同词条数)
由此, 折扣系数为
d = (1 + delta/c) / (1 + N * delta / T)
c. 该折扣算法可以看作是最大似然估计ML 和 均匀分布之间的插值
p = lambda * p_ML + (1 - lambda) * 1/N
计算折扣系数的函数
<src>
double discount(Count count, Count totalCount, Count observedVocab)
{
return (count <= 0) ?
1.0 : (count < _mincount) ?
0.0 : (1.0 + _delta/count) / (1.0 + _vocabSize*_delta/totalCount);
}
</src>
功能:上述函数用于返回频数为count的ngram的折扣系数
1) 若count <= 0,则返回1.0,表示不做任何折扣
2) 若count < _mincount,这返回0.0,表示将频数折扣为0
3) 否则返回(1.0 + _delta/count) / (1.0 + _vocabSize*_delta/totalCount)
上述公式即为中count 即为c(a_z),totalCount即为c(a_)。c(a_) 即为所有
c(a_*)的叠加,_vocabSize即为所有n元包含的不同词条数。
2.6) Witten-Bell 折扣算法
class WittenBell: public Discount
{
public:
WittenBell(unsigned mincount = 0) : _mincount(mincount) {};
/** 用于估算回退平滑时的折扣系数 */
double discount(Count count, Count totalCount, Count observedVocab);
/** 用于估算插值平滑时,计算低阶ngram权重的函数 */
double lowerOrderWeight(Count totalCount, Count observedVocab,
Count min2Vocab, Count min3Vocab);
protected:
/* 低于此频数的频数被置为0 */
double _mincount; /* minimum count to retain */
};
用于估算回退平滑时的折扣系数
<src>
double discount(Count count, Count totalCount, Count observedVocab)
{
return (count <= 0) ? 1.0 : (count < _mincount) ? 0.0 :
((double)totalCount / (totalCount + observedVocab));
}
</src>
功能:估算回退平滑时的折扣系数
1) 当count <= 0 时,直接返回1.0表示不做折扣计算
2) 当count < _mincount时,直接返回0.0,表示折扣为0
3) 否则witten-bell回退平滑下有如下概率折扣计算公式
f(a_z) = c(a_z) / (n(a_*) + c(a_)) (参考Ngram折扣平滑算法文档)
上式中c(a_z)相当于count,n(a_*)相当于observedVocab,c(a_)相当于totalCount
又因为p(a_z) = c(a_z)/c(a_),因此有f(a_z)*c(a_) = c'(a_z),这里的c'(a_z)
为折扣以后的频数值,因此有:
c'(a_z) = c(a_z)*c(a_) / (n(a_*) + c(a_))
又因为折扣系数d
d = c'(a_z)/c(a_z) = c(a_) / (n(a_*) + c(a_))
= ((double)totalCount / (totalCount + observedVocab))
用于插值平滑时,计算低阶ngram权重的函数
<src>
double lowerOrderWeight(Count totalCount, Count observedVocab,
Count min2Vocab, Count min3Vocab)
{
return (double)observedVocab / (totalCount + observedVocab);
}
</src>
功能:用于插值平滑时,计算低阶ngram权重的函数
witten-bell插值平滑下,低阶权重有如下估算公式:
bow(a_) = n(a_*) / (n(a_*) + c(a_)) (参考Ngram折扣平滑算法文档)
上式中n(a_*)相当于observedVocab,c(a_)相当于totalCount,因此有:
(double)observedVocab / (totalCount + observedVocab)
2.7) Kneser-Ney 折扣算法
class KneserNey: public Discount
{
public:
/** 用于估算回退平滑时的折扣系数 */
virtual double discount(Count count, Count totalCount, Count observedVocab);
/** 用于估算插值平滑时,计算低阶ngram权重的函数 */
virtual double lowerOrderWeight(Count totalCount, Count observedVocab,
Count min2Vocab, Count min3Vocab);
/* 用于计算Kneser-Ney折扣算法中折扣常数discount1 */
virtual Boolean estimate(NgramStats &counts, unsigned order);
protected:
/* 低于此数值的频数将被设为0 */
Count minCount;
/* 折扣常数 */
double discount1;
};
折扣常数计算函数
<src>
Boolean
KneserNey::estimate(NgramStats &counts, unsigned order)
{
if (!prepareCountsAtEnd) {
prepareCounts(counts, order, counts.getorder());
}
/*
* First tabulate count-of-counts
*/
Count n1 = 0;
Count n2 = 0;
makeArray(VocabIndex, wids, order + 1);
NgramsIter iter(counts, wids, order);
NgramCount *count;
while (count = iter.next()) {
if (counts.vocab.isNonEvent(wids[order - 1])) {
continue;
} else if (counts.vocab.isMetaTag(wids[order - 1])) {
unsigned type = counts.vocab.typeOfMetaTag(wids[order - 1]);
/*
* process count-of-count
*/
if (type == 1) {
n1 ++;
} else if (type == 2) {
n2 ++;
}
} else if (*count == 1) {
n1 ++;
} else if (*count == 2) {
n2 ++;
}
}
if (debug(DEBUG_ESTIMATE_DISCOUNT)) {
dout() << "Kneser-Ney smoothing " << order << "-grams\n"
<< "n1 = " << n1 << endl
<< "n2 = " << n2 << endl;
}
if (n1 == 0 || n2 == 0) {
cerr << "one of required KneserNey count-of-counts is zero\n";
return false;
}
discount1 = n1/((double)n1 + 2*n2);
if (debug(DEBUG_ESTIMATE_DISCOUNT)) {
dout() << "D = " << discount1 << endl;
}
if (prepareCountsAtEnd) {
prepareCounts(counts, order, counts.getorder());
}
return true;
}
</src>
功能:根据Kneser-Ney折扣常数计算公式有:
D = n1 / (n1 + 2*n2) (参考Ngram折扣平滑算法文档)
上式中:
D为discount1
n1代表出现次数为1次的ngram数目
n2代表出现次数为2次的ngram数目
估算回退平滑时的折扣系数的函数
<src>
double
KneserNey::discount(Count count, Count totalCount, Count observedVocab)
{
if (count <= 0) {
return 1.0;
} else if (count < minCount) {
return 0.0;
} else {
return (count - discount1) / count;
}
}
</src>
功能:估算回退平滑时的折扣系数
1) 当count <= 0 时,直接返回1.0表示不做折扣计算
2) 当count < _mincount时,直接返回0.0,表示折扣为0
3) 否则根据Knesery-Ney的回退平滑时对应的折扣概率计算公式
f(a_z) = (c(a_z) - D0) / c(a_) (参考Ngram折扣平滑算法文档)
上式中c(a_z)相当于count, D0相当于discount1,c(a_)相当于totalCount
又因为p(a_z) = c(a_z)/c(a_),因此有c(a_) * f(a_z)= c'(a_z),这里的c'(a_z)
为折扣以后的频数值,因此有:
c'(a_z) = c(a_) * (c(a_z) - D0) / c(a_) = (c(a_z) - D0)
又因为折扣系数d
d = c'(a_z)/c(a_z) = (c(a_z) - D0)/ c(a_z)
= (count - discount1) / count;
估算插值平滑时,计算低阶ngram权重的函数
<src>
double
KneserNey::lowerOrderWeight(Count totalCount, Count observedVocab,
Count min2Vocab, Count min3Vocab)
{
return (discount1 * observedVocab / totalCount);
}
</src>
功能:估算插值平滑时,计算低阶ngram权重
根据Kneser-Ney插值平滑时对应的低阶ngram权重计算公式:
bow(a_) = D n(a_*) / c(a_)
上式中 D 相当于discount1,n(a_*)相当于observedVocab
c(a_)相当于totalCount。因此有
bow(a_) = (discount1 * observedVocab / totalCount)
2.8) Modified Kneser-Ney 折扣算法
class ModKneserNey: public KneserNey
{
public:
/** 估算回退平滑时的折扣系数的函数 */
double discount(Count count, Count totalCount, Count observedVocab);
/** 估算插值平滑时,计算低阶ngram权重的函数 */
double lowerOrderWeight(Count totalCount, Count observedVocab,
Count min2Vocab, Count min3Vocab);
/** 用于估算Modified Kneser-Ney平滑算法中的三个常数 */
Boolean estimate(NgramStats &counts, unsigned order);
protected:
/** 除discount1之外的其他两个discount参数 */
double discount2;
double discount3plus;
};
估算Modified Kneser-Ney平滑算法中的三个常数
<src>
Boolean
ModKneserNey::estimate(NgramStats &counts, unsigned order)
{
if (!prepareCountsAtEnd) {
prepareCounts(counts, order, counts.getorder());
}
/*
* First tabulate count-of-counts
*/
Count n1 = 0;
Count n2 = 0;
Count n3 = 0;
Count n4 = 0;
makeArray(VocabIndex, wids, order + 1);
NgramsIter iter(counts, wids, order);
NgramCount *count;
while (count = iter.next()) {
if (counts.vocab.isNonEvent(wids[order - 1])) {
continue;
} else if (counts.vocab.isMetaTag(wids[order - 1])) {
unsigned type = counts.vocab.typeOfMetaTag(wids[order - 1]);
/*
* process count-of-count
*/
if (type == 1) {
n1 ++;
} else if (type == 2) {
n2 ++;
} else if (type == 3) {
n3 ++;
} else if (type == 4) {
n4 ++;
}
} else if (*count == 1) {
n1 ++;
} else if (*count == 2) {
n2 ++;
} else if (*count == 3) {
n3 ++;
} else if (*count == 4) {
n4 ++;
}
}
if (debug(DEBUG_ESTIMATE_DISCOUNT)) {
dout() << "Kneser-Ney smoothing " << order << "-grams\n"
<< "n1 = " << n1 << endl
<< "n2 = " << n2 << endl
<< "n3 = " << n3 << endl
<< "n4 = " << n4 << endl;
}
if (n1 == 0 || n2 == 0 || n3 == 0 || n4 == 0) {
cerr << "one of required modified KneserNey count-of-counts is zero\n";
return false;
}
/*
* Compute discount constants (Ries 1997, Chen & Goodman 1998)
*/
double Y = (double)n1/(n1 + 2 * n2);
discount1 = 1 - 2 * Y * n2 / n1;
discount2 = 2 - 3 * Y * n3 / n2;
discount3plus = 3 - 4 * Y * n4 / n3;
if (debug(DEBUG_ESTIMATE_DISCOUNT)) {
dout() << "D1 = " << discount1 << endl
<< "D2 = " << discount2 << endl
<< "D3+ = " << discount3plus << endl;
}
if (discount1 < 0.0 || discount2 < 0.0 || discount3plus < 0.0) {
cerr << "one of modified KneserNey discounts is negative\n";
return false;
}
if (prepareCountsAtEnd) {
prepareCounts(counts, order, counts.getorder());
}
return true;
}
</src>
功能:估算Modified Kneser-Ney平滑算法中的三个常数 discount1 discount2 discount3plus
根据Modified Kneser-Ney折扣平滑算法这个三个常数的计算公式,有:
Y = n1/(n1+2*n2)
D1 = 1 - 2Y(n2/n1)
D2 = 2 - 3Y(n3/n2)
D3+ = 3 - 4Y(n4/n3)
上述等式中D1相当于discount1,D2相当于discount2,D3相当于discount3plus,因此
该函数的主要用于统计n1、n2、n3、n4这个几个常数,并根据其计算获得discount1,discount2
和discount3plus。
估算回退平滑时的折扣系数
<src>
double
ModKneserNey::discount(Count count, Count totalCount, Count observedVocab)
{
if (count <= 0) {
return 1.0;
} else if (count < minCount) {
return 0.0;
} else if (count == 1) {
return (count - discount1) / count;
} else if (count == 2) {
return (count - discount2) / count;
} else {
return (count - discount3plus) / count;
}
}
</src>
功能:估算回退平滑时的折扣系数
1) 当count <= 0 时,直接返回1.0表示不做折扣计算
2) 当count < _mincount时,直接返回0.0,表示折扣为0
3) 由于当前为Modified Kneser-Ney算法,所以对于
count为1的ngram其概率值存在如下计算公式
f(a_z) = (c(a_z) - D1) / c(a_) (参考Ngram折扣平滑计算文档)
上式中c(a_z)相当于count,D1相当于discount1,c(a_)相当于totalCount
又因为p(a_z) = c(a_z)/c(a_),因此有c(a_) * f(a_z)= c'(a_z),这里的c'(a_z)
为折扣以后的频数值,因此有:
c'(a_z) = c(a_) * (c(a_z) - D1) / c(a_) = (c(a_z) - D1)
又因为折扣系数d
d = c'(a_z)/c(a_z) = (c(a_z) - D1)/ c(a_z)
= (count - discount1) / count;
对于count为2的ngram其概率值存在如下计算公式
f(a_z) = (c(a_z) - D2) / c(a_) (参考Ngram折扣平滑计算文档)
因此通过和上面类似的推导,可得:
d = (count - discount2) / count;
对于count为3和大于3的ngram其概率值存在如下计算公式
f(a_z) = (c(a_z) - D3+) / c(a_) (参考Ngram折扣平滑计算文档)
因此通过和上面类似的推导,可得:
d = (count - discount3plus) / count;
估算插值平滑时,计算低阶ngram权重的函数
<src>
double
ModKneserNey::lowerOrderWeight(Count totalCount, Count observedVocab,
Count min2Vocab, Count min3Vocab)
{
return (discount1 * (observedVocab - min2Vocab) +
discount2 * (min2Vocab - min3Vocab) +
discount3plus * min3Vocab) / totalCount;
}
</src>
功能:估算插值平滑时,计算低阶ngram权重
根据Modified Kneser-Ney插值平滑对应的低阶ngram权重计算公式,有:
(discount1 * (observedVocab - min2Vocab) +
discount2 * (min2Vocab - min3Vocab) +
discount3plus * min3Vocab) / totalCount;
关于该计算公式Ngram折扣平滑算法中并无详细说明,因此目前还无法
得知该公式的正确性。

