Srilm 阅读文档15-Version2

取自 自然语言处理百科

跳转到: 导航, 搜索

Discount.cc Discount.h

文档作者:jianzhu

修改时间: 08.12.15-08.12.17

注:本文档改写自rickjin书写的Discount文档修正了原文档中存在的一些错误,并扩充了部分内容;阅读该文档前,建议先阅读srilm的ngram-discount.7.html手册;或者查看jianzhu的翻译文档Ngram折扣平滑算法


1、基本类


Discount.h Discount.cc 这两个文件主要实现了最重要的几个折扣算法, 包括

   a. Katz Discount (基于 Good-Turing Discounting)
   b. Absolute Discounting
   c. Natural law of succession [Eric Sven Ristad, 1995]
   d. Additive Discounting [Lidstone-Johnson-Jeffrey]
   e  Witten-Bell Discounting
   f. Kneser-Ney  Discounting
   g. Modified Kneser-Ney Discounting, [Chen, Goodman, 1998]

2、类接口说明


2.1) Discount 类主要接口

<src>
class Discount
{
public:
 /**
  * @brief 构造函数
  * 通过成员初始化列表的方式将interpolate初始化为false,表示不支持插值平滑
  */
 Discount() : interpolate(false) {};
   virtual ~Discount() {};
  
   /* 使用折扣算法, 返回频率 count 对应的折扣系数
    * @param count             当前 ngram 频率   ( C(a_z) )
    * @param totalCount        历史 n-1gram 频率 ( C(a_) )
    * @param oberservedVocab   语料中统计到的历史ngram a_ 后接的 z 的类型数
    * */
   virtual double discount(Count count, Count totalCount, Count observedVocab);
   virtual double discount(FloatCount count, FloatCount totalCount,
                           Count observedVocab);
   /* 在插值模型中, 低阶模型的权值大小
    *
    * @param min2Vocab  频率>=2 的ngram type 数目
    * @param min3Vocab  频率>=3 的ngram type 数目
    * */
   virtual double lowerOrderWeight(Count totalCount, Count observedVocab,
        Count min2Vocab, Count min3Vocab);
   virtual double lowerOrderWeight(FloatCount totalCount, Count observedVocab,
        Count min2Vocab, Count min3Vocab);
   /* 是否支持该折扣操作 */
   virtual Boolean nodiscount();
   /* 保存折扣系数 */
   virtual void write(File &file);
   /* 读入折扣系数 */
   virtual Boolean read(File &file);
   /* 计算折扣系数
    *
    * @param counts   ngram 计数器
    * @param order    对 order 阶ngram 进行平滑计算
    * */
   virtual Boolean estimate(NgramStats &counts, unsigned order);
   virtual Boolean estimate(NgramCounts<FloatCount> &counts, unsigned order);
   /* 在平滑之前, 对 ngram 的频率进行调整,在 KneserNey 平滑算法中使用 */
   virtual void prepareCounts(NgramCounts<NgramCount> &counts,
   unsigned order, unsigned maxOrder);
   virtual void prepareCounts(NgramCounts<FloatCount> &counts,
   unsigned order, unsigned maxOrder);
   /* 是否支持插值,这里将interpolate置为public区域,是为子类的继承 */
   Boolean interpolate;
protected:
   /* Vocab 中有效条目的 ngram 数目 */
   static unsigned vocabSize(Vocab &vocab);
};
<src>

2.2) GoodTuring 类主要接口

<src>
class GoodTuring: public Discount
{
public:
   /* 构造函数,该构造函数中并未修改从父类继承的interploate值,
    * 因此interpolate为false,表示不支持插值平滑
    */
   GoodTuring(unsigned mincount = GT_defaultMinCount,
           unsigned maxcount = GT_defaultMaxCount);
          
  /* 折扣系数估算函数
   * jianzhu added 2008-12-15
   */
  Boolean estimate(NgramStats &counts, unsigned order);
protected:
   /* 小于该值的频率将被设置为 0 */
   Count minCount;
   /* 大于该值的频率将不平滑, 保持不变 */
   Count maxCount;
   /* 平滑系数 */
   Array<double> discountCoeffs;
};
</src>
说明:
a. 该函数实现标准的基于图灵平滑的 Katz 平滑算法
b. 图灵平滑算法为 r* = (r + 1) * n_(r+1) / n_r, Katz 平滑算法如下
            |-- d_r * r  (0 < r < k)
   r_katz = |-- r        (r > k)
            |-- alpha    (r = 0)
   d_r = (r*/r - commonTerm ) / (1 -  commonTerm)
   commonTerm = (k+1) * n_(k+1) / n_1
   在此处 k 默认值为 GT_defaultMaxCount = 5;

折扣系数估算函数

<src>
Boolean
GoodTuring::estimate(NgramStats &counts, unsigned order)
{
   Array<Count> countOfCounts;
   /*
   * First tabulate count-of-counts for the given order of ngrams
   * Note we need GT count for up to maxCount + 1 inclusive, to apply
   * the GT formula for counts up to maxCount.
   */
   makeArray(VocabIndex, wids, order + 1);
   NgramsIter iter(counts, wids, order);
   NgramCount *count;
   Count i;
   for (i = 0; i <= maxCount + 1; i++) {
       countOfCounts[i]  = 0;
   }
   while (count = iter.next()) {
       if (counts.vocab.isNonEvent(wids[order - 1])) {
           continue;
       } else if (counts.vocab.isMetaTag(wids[order - 1])) {
           unsigned type = counts.vocab.typeOfMetaTag(wids[order - 1]);
           /*
           * process count-of-count
           */
           if (type > 0 && type <= maxCount + 1) {
               countOfCounts[type] += *count;
           }
       } else if (*count <= maxCount + 1) {
           countOfCounts[*count] ++;
       }
   }
   if (debug(DEBUG_ESTIMATE_DISCOUNT)) {
       dout() << "Good-Turing discounting " << order << "-grams\n";
       for (i = 0; i <= maxCount + 1; i++) {
           dout() << "GT-count [" << i << "] = " << countOfCounts[i] << endl;
       }
   }
   if (countOfCounts[1] == 0) {
       cerr << "warning: no singleton counts\n";
       maxCount = 0;
   }
   while (maxCount > 0 && countOfCounts[maxCount + 1] == 0) {
       cerr << "warning: count of count " << maxCount + 1 << " is zero "
           << "-- lowering maxcount\n";
       maxCount --;
   }
   if (maxCount <= 0) {
       cerr << "GT discounting disabled\n";
   } else {
       double commonTerm = (maxCount + 1) *
           (double)countOfCounts[maxCount + 1] /
           (double)countOfCounts[1];
       for (i = 1; i <= maxCount; i++) {
           double coeff;
           if (countOfCounts[i] == 0) {
               cerr << "warning: count of count " << i << " is zero\n";
               coeff = 1.0;
           } else {
               double coeff0 = (i + 1) * (double)countOfCounts[i+1] /
                   (i * (double)countOfCounts[i]);
               coeff = (coeff0 - commonTerm) / (1.0 - commonTerm);
               if (!finite(coeff) || coeff <= Prob_Epsilon || coeff0 > 1.0) {
                   cerr << "warning: discount coeff " << i
                       << " is out of range: " << coeff << "\n";
                   coeff = 1.0;
               }
           }
           discountCoeffs[i] = coeff;
       }
   }
   return true;
}
</src>
 功能:估算NgramStats中特定元数(order)的,从1到maxCount的所有折扣系数,并
       将该折扣系数保存到discountCoeffs数组中
 
 细解:
 注:阅读以下部分,如有疑问可以参考srilm ngram-discount.7.html文档,或jianzhu的
 Ngram折扣平滑算法翻译文档。另外关于katz的折扣计算公式可以参考speech and language
 processing一书的第四章 N-GRAMS
 
 commonTerm = (maxCount + 1) * (double)countOfCounts[maxCount+1]/(double)countOfCounts[1];
  这里:
 commonTerm   -- A
 maxCount    -- gtmax
 countOfCounts[maxCount+1] -- countOfCounts[gtmax+1] -- n[gtmax+1]
 countOfCounts[1] -- n[1]
 因此上式即转化为
 A = (gtmax+1)*(double)n[gtmax+1]/(double)n[1]
 coeff0 = (i+1) * (double)countOfCounts[i+1]/(i * (double)countOfCounts[i]);
 这里:
 i  -- c(a_z)
 i+1 -- c(a_z)+1
 countOfCounts[i+1] -- n[c(a_z)+1]
 countOfCounts[i]   -- n[c(a_z)]
 因此
 coeff0 = (c(a_z)+1) * (double)n[c(a_z)+1]/(c(a_z) * (double)n[c(a_z)])
 这里coeff0 相当远 r*/r
 coeff = (coeff0 - commnTerm)/(1.0 - commonTerm)
       = [ (c(a_z)+1) * (double)n[c(a_z)+1]/(c(a_z) * (double)n[c(a_z)])
          - (gtmax+1)*(double)n[gtmax+1]/(double)n[1] ] / [ 1.0 - (gtmax+1)* (double)n[gtmax+1]/(double)n[1] ]
 这里 coeff 相当于 d_r = (r*/r - commonTerm ) / (1 -  commonTerm);
 注上述公式根据katz折扣算法而来,该算法的公式如下
 C' = [ (c(a_z)+1) * (double)n[c(a_z)+1]/(double)n[c(a_z)]
          - c(a_z)*(gtmax+1)*(double)n[gtmax+1]/(double)n[1] ] / [ 1.0 - (gtmax+1)* (double)n[gtmax+1]/(double)n[1] ]
又因为有coeff = C'/C,因为目标为 C * coeff -> C'
 

2.3) ConstDisCount 类主要接口

<src>
class ConstDiscount: public Discount
{
public:
   double lowerOrderWeight(Count totalCount, Count observedVocab,
        Count min2Vocab, Count min3Vocab)
     { return _discount * observedVocab / totalCount; }
protected:
   /* 打折的常数项 */
   double _discount;
   /* 低于此值的频率被设置为 0 */
   double _mincount;
};
说明:
a. 此折扣算法即为 Absolute Discounting
b. 算法思想就是对每个频数count, 将其减去一个常数 _discount, _discount介于
  0和1之间,所以变为 (count - _discount), 则折扣系数为  (count-_discount) / count
c. 常数_discount一般设为
  _discount = n1 / (n1 + 2*n2)
  上式中
 n1代表出现次数为1次的ngram数目
  n2代表出现次数为2次的ngram数目

该类中用于估算折扣系数的函数

<src>
double discount(Count count, Count totalCount, Count observedVocab)
{
 return (count <= 0) ? 1.0 :
                       (count < _mincount) ? 0.0 :
                                          (count - _discount) / count; };
</src>
   功能:上述函数用于返回频数为count的ngram的折扣系数
   1) 若count <= 0 则返回折扣系数1.0,表示不折扣
   2) 若count < _mincount 则返回折扣系数0.0,表示将该频数折扣为0
   3) 否则根据公式 (count - _discount) / count 计算折扣系数并返回。

该类由于支持插值平滑,因此有用于估算插值平滑时,低阶ngram权重的函数

<src>
double lowerOrderWeight(Count totalCount, Count observedVocab,
        Count min2Vocab, Count min3Vocab)
{
  return _discount * observedVocab / totalCount;
}
</src>
   功能:计算用于估算插值平滑时,低阶ngram权重的函数
   为了理解该函数,可以参考Ngram折扣平滑算法文档中Ney的绝对折扣算法这一部分
   bow(a_) = 1 - Sum_Z1 g(a_z) ; Eqn.5
           = D n(a_*) / c(a_)
   因此 _discount -- D
        n(a_*)    -- observedVocab
        c(a_)     -- totalCount
这里有一个疑问就是Absolute Discount是支持插值平滑算法的,但该类中并未看到将
interpolate设为true的函数,后面的其他折扣算法类也有同样的问题,我猜测可能是
因为父类存在lowerOrderWeight函数,而且返回为0.0,若子类未重新定义该函数,则 
沿用父类中的函数,直接返回0.0,表示不支持插值平滑。

2.4) NaturalDiscount 类主要接口

<src>
class NaturalDiscount: public Discount
{
public:
    ...
protected:
   /* ngram 的条目数 */
   unsigned _vocabSize;     /* vocabulary size */
   /* 低于此值的频率被设置为 0 */
   double _mincount;      /* minimum count to retain */
};
</src>
说明:
a. 此平滑算法为 Natural law of succession [Eric Sven Ristad, 1995]
b. 算法思想就是对每个频率count, 使用如下固定的平滑系数
   d = ( n(n+1) + q(1-q) ) / ( n^2 + n + 2q )
   其中  n 为 totalCount(总频率), q 为 observedVocab(ngram type 数)
  

该类中用于估算折扣系数的函数

<src>
double
NaturalDiscount::discount(Count count, Count totalCount, Count observedVocab)
{
   double n = totalCount;
   double q = observedVocab;
   if (count < _mincount) {
   return 0.0;
   } else if (q == _vocabSize) {
   return 1.0;
   } else {
   return (n * (n+1) + q * (1 - q)) / (n * (n + 1) + 2 * q);
   }
}
</src>
   功能:上述函数用于返回频数为count的ngram的折扣系数
   1) 若count 小于_mincount值,则返回折扣系数0.0,表示将该频数折扣为0
   2) n 和 q 介绍
      要理解这里的 n 和 q 首先需要了解一下Natural Discount的公式
      该公式如下表示
      c(a_)(c(a_) + 1) + n(a_*)(1 - n(a_*))
      ---------------------------------------
      c(a_)^2 + c(a_) + 2 n(a_*)
     
      因此 n 就相当于上述公式中的c(a_),q 相当于上述公式中的 n(a_*)
      因此 n 就相当于所有出现的n元 c(a_*) token的频数和
           q 就相当于所有出现的n元 c(a_*) 中 * 这个词的种类数,即不同
           的n元 c(a_*)的个数
     

2.5) AddSmooth 类主要接口

class AddSmooth: public Discount
{
public:
   AddSmooth(double delta = 1.0, unsigned mincount = 0)
: _delta(delta < 0.0 ? 0.0 : delta),
  _mincount(mincount) {};
protected:
   double _delta;      /* the additive constant */
   /* 低于此值的频率被设置为 0 */
   double _mincount;      /* minimum count to retain */
   /* ngram 的条目数 */
   unsigned _vocabSize;     /* vocabulary size */
}; 
说明:
a. 此平滑算法为 Additive smoothing, 可以视为 Laplace add 1 smoothing 的扩展
b. 算法思想就是对每个频率c, 使用如下方法计算概率  
   p = (c + delta) / (T + N * delta)
   其中  T 为 totalCount(总频率), N 为 _vocabSize(ngram 中包含的不同词条数)
   由此, 折扣系数为
   d = (1 + delta/c) / (1 + N * delta / T)
c. 该折扣算法可以看作是最大似然估计ML 和 均匀分布之间的插值
   p = lambda * p_ML  + (1 - lambda) * 1/N
  

计算折扣系数的函数

<src>
double discount(Count count, Count totalCount, Count observedVocab)
{
 return (count <= 0) ?
        1.0 : (count < _mincount) ?
        0.0 : (1.0 + _delta/count) / (1.0 + _vocabSize*_delta/totalCount);
}
</src>
   功能:上述函数用于返回频数为count的ngram的折扣系数
   1) 若count <= 0,则返回1.0,表示不做任何折扣
   2) 若count < _mincount,这返回0.0,表示将频数折扣为0
   3) 否则返回(1.0 + _delta/count) / (1.0 + _vocabSize*_delta/totalCount)
      上述公式即为中count 即为c(a_z),totalCount即为c(a_)。c(a_) 即为所有
      c(a_*)的叠加,_vocabSize即为所有n元包含的不同词条数。
  
     

2.6) Witten-Bell 折扣算法

class WittenBell: public Discount
{
public:
   WittenBell(unsigned mincount = 0) : _mincount(mincount) {};
 
 /** 用于估算回退平滑时的折扣系数 */
   double discount(Count count, Count totalCount, Count observedVocab);
  
   /** 用于估算插值平滑时,计算低阶ngram权重的函数 */
   double lowerOrderWeight(Count totalCount, Count observedVocab,
        Count min2Vocab, Count min3Vocab);

protected:
 /* 低于此频数的频数被置为0 */
   double _mincount;      /* minimum count to retain */
};

用于估算回退平滑时的折扣系数

<src>
double discount(Count count, Count totalCount, Count observedVocab)
{
  return (count <= 0) ? 1.0 : (count < _mincount) ? 0.0 :
        ((double)totalCount / (totalCount + observedVocab));
}
</src>
 功能:估算回退平滑时的折扣系数
 1) 当count <= 0 时,直接返回1.0表示不做折扣计算
 2) 当count < _mincount时,直接返回0.0,表示折扣为0
 3) 否则witten-bell回退平滑下有如下概率折扣计算公式
     f(a_z) = c(a_z) / (n(a_*) + c(a_)) (参考Ngram折扣平滑算法文档)
    上式中c(a_z)相当于count,n(a_*)相当于observedVocab,c(a_)相当于totalCount
    又因为p(a_z) = c(a_z)/c(a_),因此有f(a_z)*c(a_) = c'(a_z),这里的c'(a_z)
    为折扣以后的频数值,因此有:
    c'(a_z) = c(a_z)*c(a_) / (n(a_*) + c(a_))
    又因为折扣系数d
    d = c'(a_z)/c(a_z) = c(a_) / (n(a_*) + c(a_))
      = ((double)totalCount / (totalCount + observedVocab))

用于插值平滑时,计算低阶ngram权重的函数

<src>
double lowerOrderWeight(Count totalCount, Count observedVocab,
        Count min2Vocab, Count min3Vocab)
{
  return (double)observedVocab / (totalCount + observedVocab);
}
</src>
   功能:用于插值平滑时,计算低阶ngram权重的函数
   witten-bell插值平滑下,低阶权重有如下估算公式:
   bow(a_) = n(a_*) / (n(a_*) + c(a_)) (参考Ngram折扣平滑算法文档)
   上式中n(a_*)相当于observedVocab,c(a_)相当于totalCount,因此有:
   (double)observedVocab / (totalCount + observedVocab)
  

2.7) Kneser-Ney 折扣算法

class KneserNey: public Discount
{
public:
  /** 用于估算回退平滑时的折扣系数 */
   virtual double discount(Count count, Count totalCount, Count observedVocab);
  
   /** 用于估算插值平滑时,计算低阶ngram权重的函数 */
   virtual double lowerOrderWeight(Count totalCount, Count observedVocab,
        Count min2Vocab, Count min3Vocab);
 /* 用于计算Kneser-Ney折扣算法中折扣常数discount1 */
   virtual Boolean estimate(NgramStats &counts, unsigned order);
protected:
 /* 低于此数值的频数将被设为0 */
   Count minCount;
 
 /* 折扣常数 */
   double discount1; 
};

折扣常数计算函数

<src>
Boolean
KneserNey::estimate(NgramStats &counts, unsigned order)
{
   if (!prepareCountsAtEnd) {
       prepareCounts(counts, order, counts.getorder());
   }
   /*
   * First tabulate count-of-counts
   */
   Count n1 = 0;
   Count n2 = 0;
   makeArray(VocabIndex, wids, order + 1);
   NgramsIter iter(counts, wids, order);
   NgramCount *count;
   while (count = iter.next()) {
       if (counts.vocab.isNonEvent(wids[order - 1])) {
           continue;
       } else if (counts.vocab.isMetaTag(wids[order - 1])) {
           unsigned type = counts.vocab.typeOfMetaTag(wids[order - 1]);
           /*
           * process count-of-count
           */
           if (type == 1) {
               n1 ++;
           } else if (type == 2) {
               n2 ++;
           }
       } else if (*count == 1) {
           n1 ++;
       } else if (*count == 2) {
           n2 ++;
       }
   }
   if (debug(DEBUG_ESTIMATE_DISCOUNT)) {
       dout() << "Kneser-Ney smoothing " << order << "-grams\n"
           << "n1 = " << n1 << endl
           << "n2 = " << n2 << endl;
   }
   if (n1 == 0 || n2 == 0) {
       cerr << "one of required KneserNey count-of-counts is zero\n";
       return false;
   }
   discount1 = n1/((double)n1 + 2*n2);
   if (debug(DEBUG_ESTIMATE_DISCOUNT)) {
       dout() << "D = " << discount1 << endl;
   }
   if (prepareCountsAtEnd) {
       prepareCounts(counts, order, counts.getorder());
   }
   return true;
}
</src>
   功能:根据Kneser-Ney折扣常数计算公式有:
         D = n1 / (n1 + 2*n2)    (参考Ngram折扣平滑算法文档)
   上式中:
   D为discount1
   n1代表出现次数为1次的ngram数目
 n2代表出现次数为2次的ngram数目

估算回退平滑时的折扣系数的函数

<src>
double
KneserNey::discount(Count count, Count totalCount, Count observedVocab)
{
   if (count <= 0) {
       return 1.0;
   } else if (count < minCount) {
       return 0.0;
   } else {
       return (count - discount1) / count;
   }
}
</src>
  功能:估算回退平滑时的折扣系数
  1) 当count <= 0 时,直接返回1.0表示不做折扣计算
  2) 当count < _mincount时,直接返回0.0,表示折扣为0
  3) 否则根据Knesery-Ney的回退平滑时对应的折扣概率计算公式
    f(a_z) = (c(a_z) - D0) / c(a_)   (参考Ngram折扣平滑算法文档)
    上式中c(a_z)相当于count, D0相当于discount1,c(a_)相当于totalCount
    又因为p(a_z) = c(a_z)/c(a_),因此有c(a_) * f(a_z)= c'(a_z),这里的c'(a_z)
   为折扣以后的频数值,因此有:
   c'(a_z) = c(a_) * (c(a_z) - D0) / c(a_) = (c(a_z) - D0)
   又因为折扣系数d
   d = c'(a_z)/c(a_z) = (c(a_z) - D0)/ c(a_z)
     = (count - discount1) / count;
   

估算插值平滑时,计算低阶ngram权重的函数

<src>
double
KneserNey::lowerOrderWeight(Count totalCount, Count observedVocab,
        Count min2Vocab, Count min3Vocab)
{
   return (discount1 * observedVocab / totalCount);
}
</src>
 功能:估算插值平滑时,计算低阶ngram权重
 根据Kneser-Ney插值平滑时对应的低阶ngram权重计算公式:
 bow(a_) = D n(a_*) / c(a_)
 上式中 D 相当于discount1,n(a_*)相当于observedVocab
 c(a_)相当于totalCount。因此有
 bow(a_) = (discount1 * observedVocab / totalCount)

2.8) Modified Kneser-Ney 折扣算法

class ModKneserNey: public KneserNey
{
public:
  /** 估算回退平滑时的折扣系数的函数 */
   double discount(Count count, Count totalCount, Count observedVocab);

   /** 估算插值平滑时,计算低阶ngram权重的函数 */
   double lowerOrderWeight(Count totalCount, Count observedVocab,
        Count min2Vocab, Count min3Vocab);
       
 /** 用于估算Modified Kneser-Ney平滑算法中的三个常数 */
   Boolean estimate(NgramStats &counts, unsigned order);
 
protected:
 /** 除discount1之外的其他两个discount参数 */
   double discount2; 
   double discount3plus;
}; 

估算Modified Kneser-Ney平滑算法中的三个常数

<src>
Boolean
ModKneserNey::estimate(NgramStats &counts, unsigned order)
{
   if (!prepareCountsAtEnd) {
       prepareCounts(counts, order, counts.getorder());
   }
   /*
   * First tabulate count-of-counts
   */
   Count n1 = 0;
   Count n2 = 0;
   Count n3 = 0;
   Count n4 = 0;
   makeArray(VocabIndex, wids, order + 1);
   NgramsIter iter(counts, wids, order);
   NgramCount *count;
   while (count = iter.next()) {
       if (counts.vocab.isNonEvent(wids[order - 1])) {
           continue;
       } else if (counts.vocab.isMetaTag(wids[order - 1])) {
           unsigned type = counts.vocab.typeOfMetaTag(wids[order - 1]);
           /*
           * process count-of-count
           */
           if (type == 1) {
               n1 ++;
           } else if (type == 2) {
               n2 ++;
           } else if (type == 3) {
               n3 ++;
           } else if (type == 4) {
               n4 ++;
           }
       } else if (*count == 1) {
           n1 ++;
       } else if (*count == 2) {
           n2 ++;
       } else if (*count == 3) {
           n3 ++;
       } else if (*count == 4) {
           n4 ++;
       }
   }
   if (debug(DEBUG_ESTIMATE_DISCOUNT)) {
       dout() << "Kneser-Ney smoothing " << order << "-grams\n"
           << "n1 = " << n1 << endl
           << "n2 = " << n2 << endl
           << "n3 = " << n3 << endl
           << "n4 = " << n4 << endl;
   }
   if (n1 == 0 || n2 == 0 || n3 == 0 || n4 == 0) {
       cerr << "one of required modified KneserNey count-of-counts is zero\n";
       return false;
   }
   /*
   * Compute discount constants (Ries 1997, Chen & Goodman 1998)
   */
   double Y = (double)n1/(n1 + 2 * n2);
   discount1 = 1 - 2 * Y * n2 / n1;
   discount2 = 2 - 3 * Y * n3 / n2;
   discount3plus = 3 - 4 * Y * n4 / n3;
   if (debug(DEBUG_ESTIMATE_DISCOUNT)) {
       dout() << "D1 = " << discount1 << endl
           << "D2 = " << discount2 << endl
           << "D3+ = " << discount3plus << endl;
   }
   if (discount1 < 0.0 || discount2 < 0.0 || discount3plus < 0.0) {
       cerr << "one of modified KneserNey discounts is negative\n";
       return false;
   }
   if (prepareCountsAtEnd) {
       prepareCounts(counts, order, counts.getorder());
   }
   return true;
}
</src>
   功能:估算Modified Kneser-Ney平滑算法中的三个常数 discount1 discount2 discount3plus
   根据Modified Kneser-Ney折扣平滑算法这个三个常数的计算公式,有:
   Y = n1/(n1+2*n2)
 D1 = 1 - 2Y(n2/n1)
 D2 = 2 - 3Y(n3/n2)
 D3+ = 3 - 4Y(n4/n3)
 上述等式中D1相当于discount1,D2相当于discount2,D3相当于discount3plus,因此
 该函数的主要用于统计n1、n2、n3、n4这个几个常数,并根据其计算获得discount1,discount2
 和discount3plus。

估算回退平滑时的折扣系数

<src>
double
ModKneserNey::discount(Count count, Count totalCount, Count observedVocab)
{
   if (count <= 0) {
       return 1.0;
   } else if (count < minCount) {
       return 0.0;
   } else if (count == 1) {
       return (count - discount1) / count;
   } else if (count == 2) {
       return (count - discount2) / count;
   } else {
       return (count - discount3plus) / count;
   }
}
</src>
 功能:估算回退平滑时的折扣系数
 1) 当count <= 0 时,直接返回1.0表示不做折扣计算
  2) 当count < _mincount时,直接返回0.0,表示折扣为0
  3) 由于当前为Modified Kneser-Ney算法,所以对于
  count为1的ngram其概率值存在如下计算公式
  f(a_z) = (c(a_z) - D1) / c(a_)  (参考Ngram折扣平滑计算文档)
  上式中c(a_z)相当于count,D1相当于discount1,c(a_)相当于totalCount
  又因为p(a_z) = c(a_z)/c(a_),因此有c(a_) * f(a_z)= c'(a_z),这里的c'(a_z)
 为折扣以后的频数值,因此有:
 c'(a_z) = c(a_) * (c(a_z) - D1) / c(a_) = (c(a_z) - D1)
 又因为折扣系数d
 d = c'(a_z)/c(a_z) = (c(a_z) - D1)/ c(a_z)
   = (count - discount1) / count;
 
 对于count为2的ngram其概率值存在如下计算公式
 f(a_z) = (c(a_z) - D2) / c(a_)  (参考Ngram折扣平滑计算文档)
 因此通过和上面类似的推导,可得:
 d = (count - discount2) / count;
 
 对于count为3和大于3的ngram其概率值存在如下计算公式
 f(a_z) = (c(a_z) - D3+) / c(a_)  (参考Ngram折扣平滑计算文档)
 因此通过和上面类似的推导,可得:
 d = (count - discount3plus) / count;
 

估算插值平滑时,计算低阶ngram权重的函数

<src>
double
ModKneserNey::lowerOrderWeight(Count totalCount, Count observedVocab,
                              Count min2Vocab, Count min3Vocab)
{
   return (discount1 * (observedVocab - min2Vocab) +
       discount2 * (min2Vocab - min3Vocab) +
       discount3plus * min3Vocab) / totalCount;
}
</src>
 功能:估算插值平滑时,计算低阶ngram权重
 根据Modified Kneser-Ney插值平滑对应的低阶ngram权重计算公式,有:
 (discount1 * (observedVocab - min2Vocab) +
       discount2 * (min2Vocab - min3Vocab) +
       discount3plus * min3Vocab) / totalCount;
   关于该计算公式Ngram折扣平滑算法中并无详细说明,因此目前还无法
   得知该公式的正确性。


转自:http://blog.chinaunix.net/u1/58264/showart_1731994.html

个人工具
工具箱