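The previous article introduced the log-likelihood similarity function; this post adds the code for it. The code is modeled on the Mahout implementation. If you want to see how Mahout itself implements it, the official Apache source is available and is also fairly easy to follow.
Without further ado, here is the code. It is quite simple. Note that the entropy is unnormalized, which differs from the usual entropy calculation: it works on raw counts rather than probabilities.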
import math


def xLogX(x) -> float:
    # x * ln(x), with the 0 * ln(0) case defined as 0
    return 0.0 if x == 0 else x * math.log(x)


def entropy(*elements) -> float:
    # Unnormalized entropy over raw counts: xLogX(total) - sum of xLogX(count)
    total = 0
    result = 0.0
    for element in elements:
        result += xLogX(element)
        total += element
    return xLogX(total) - result


def checkargs(*args):
    # Counts must not be negative
    for x in args:
        if x < 0:
            raise ValueError("counts must be non-negative")


def logLikelihoodRatio(k11, k12, k21, k22) -> float:
    checkargs(k11, k12, k21, k22)
    # Note that we have counts here, not probabilities,
    # and that the entropy is not normalized.
    rowEntropy = entropy(k11 + k12, k21 + k22)
    columnEntropy = entropy(k11 + k21, k12 + k22)
    matrixEntropy = entropy(k11, k12, k21, k22)
    if rowEntropy + columnEntropy < matrixEntropy:
        # round-off error
        return 0.0
    return 2.0 * (rowEntropy + columnEntropy - matrixEntropy)
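To make "unnormalized" concrete, here is a small self-contained sketch. It restates the functions above in compact form, then checks that the unnormalized entropy of a set of counts equals the ordinary Shannon entropy (natural log) multiplied by the total count N. The helper `shannon_entropy` and the counts in both example tables are assumptions made up for illustration; they do not come from Mahout.

```python
import math


def xLogX(x):
    return 0.0 if x == 0 else x * math.log(x)


def entropy(*elements):
    # Unnormalized entropy: N*ln(N) - sum(k*ln(k)) over raw counts
    return xLogX(sum(elements)) - sum(xLogX(e) for e in elements)


def shannon_entropy(*elements):
    # Hypothetical helper: the usual normalized entropy over probabilities k/N
    total = sum(elements)
    return -sum((e / total) * math.log(e / total) for e in elements if e > 0)


def logLikelihoodRatio(k11, k12, k21, k22):
    rowEntropy = entropy(k11 + k12, k21 + k22)
    columnEntropy = entropy(k11 + k21, k12 + k22)
    matrixEntropy = entropy(k11, k12, k21, k22)
    # Clamp small negative values caused by round-off to 0
    return max(0.0, 2.0 * (rowEntropy + columnEntropy - matrixEntropy))


# Made-up 2x2 contingency table: k11 = both events occur together, and so on
counts = (10, 5, 3, 82)
n = sum(counts)

# The two values should agree up to floating-point error
print(entropy(*counts), n * shannon_entropy(*counts))

# Strongly associated events give a clearly positive score
print(logLikelihoodRatio(*counts))

# Rows and columns are exactly proportional here (independent events),
# so the score should be approximately 0
print(logLikelihoodRatio(10, 20, 30, 60))
```

Because the factor N is retained, the score grows with the amount of evidence: the same proportions observed over more data yield a larger log-likelihood ratio.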