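Q: Upper bound of predictability

Stack Overflow user

Asked 2017-05-19 18:35:27

Answers: 1 · Views: 384 · Followers: 0 · Votes: 0

I am trying to compute the upper bound of predictability for my occupancy dataset, as in Song's paper "Limits of Predictability in Human Mobility". Basically, "home" (=1) and "not at home" (=0) stand in for the visited locations (towers) in Song's paper.

I tested my code (which I derived from https://github.com/gavin-s-smith/MobilityPredictabilityUpperBounds and https://github.com/gavin-s-smith/EntropyRateEst) on a random binary sequence, which should return an entropy of 1 and a predictability of 0.5. Instead, the returned entropy is 0.87 and the predictability is 0.71.

Here is my code: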

import numpy as np
from scipy.optimize import fsolve
from cmath import log 
import math

def matchfinder(data):
    data_len = len(data)    
    output = np.zeros(len(data))
    output[0] = 1

    # Using L_{n} definition from
    #"Nonparametric Entropy Estimation for Stationary Process and Random Fields, with Applications to English Text"
    # by Kontoyiannis et. al.
    # $L_{n} = 1 + max \{l :0 \leq l \leq n, X^{l-1}_{0} = X^{-j+l-1}_{-j} \text{ for some } l \leq j \leq n \}$

    # for each position, i, in the sub-sequence that occurs before the current position, start_idx
    # check to see the maximum continuously equal string we can make by simultaneously extending from i and start_idx

    for start_idx in range(1,data_len):
        max_subsequence_matched = 0
        for i in range(0,start_idx):
            #    for( int i = 0; i < start_idx; i++ )
            #    {
            j = 0

            #increase the length of the substring starting at j and start_idx
            #while they are the same keeping track of the length
            while( (start_idx+j < data_len) and (i+j < start_idx) and (data[i+j] == data[start_idx+j]) ):
                j = j + 1

            if j > max_subsequence_matched:     
                max_subsequence_matched = j;

        #L_{n} is obtained by adding 1 to the longest match-length
        output[start_idx] = max_subsequence_matched + 1;    

    return output

if __name__ == '__main__':
    #Read dataset            
    data = np.random.randint(2,size=2000)

    #Number of distinct locations
    N = len(np.unique(data))

    #True entropy
    lambdai = matchfinder(data)
    Etrue = math.pow(sum( [ lambdai[i] / math.log(i+1,2) for i in range(1,len(data))] ) * (1.0/len(data)),-1)

    S = Etrue
    #use Fano's inequality to compute the predictability
    func = lambda x: (-(x*log(x,2).real+(1-x)*log(1-x,2).real)+(1-x)*log(N-1,2).real ) - S 
    ub = fsolve(func, 0.9)[0]
    print(ub)

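The matchfinder function estimates the entropy by finding the longest match and adding 1 to it (= the shortest substring not seen before). The predictability is then computed using Fano's inequality.

What could be the problem?

Thanks!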

python

entropy

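1 Answer

Stack Overflow user

Accepted answer

Answered 2018-01-13 22:11:15

The entropy function appears to be wrong. According to the paper you mentioned, Song, C., Qu, Z., Blumm, N., & Barabási, A. L. (2010). Limits of predictability in human mobility. Science, 327(5968), 1018-1021, the real entropy is estimated by an algorithm based on Lempel-Ziv data compression: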

$$S^{est} = \left( \frac{1}{n} \sum_{i=1}^{n} \Lambda_i \right)^{-1} \ln(n)$$

In code, it would look like this:

Etrue = math.pow((np.sum(lambdai)/ n),-1)*log(n,2).real

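where n is the length of the time series.

Note that the logarithm here uses a different base than the one in the given formula. However, since the logarithms in Fano's inequality are base 2, it is logical to compute the entropy in the same base. Also, I am not sure why you start the sum from index one rather than index zero.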
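To make the change of base concrete, here is a minimal pure-Python sketch of the estimator. The function name `lz_entropy_bits` is hypothetical (it is not part of the original code), and the input is assumed to be a sequence of match lengths such as the one returned by matchfinder.

```python
import math

def lz_entropy_bits(lambdas):
    """Lempel-Ziv based entropy estimate in bits (base-2 logarithm).

    `lambdas` is assumed to hold the match lengths Lambda_i, one per position.
    """
    n = len(lambdas)
    mean_lambda = sum(lambdas) / float(n)   # (1/n) * sum of Lambda_i
    return math.log2(n) / mean_lambda       # inverse of the mean, times log2(n)

# The natural-log version of the formula differs only by a constant factor:
# estimate_in_nats = estimate_in_bits * ln(2)
```

Dividing the natural-log estimate by ln(2) gives the same number, so the choice of base only has to be consistent with the base used in Fano's inequality.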

So now, wrapping it into a function, for example:

def solve(locations, size):
    # Random test sequence with `locations` equally likely symbols
    data = np.random.randint(locations,size=size)
    N = len(np.unique(data))  # number of distinct locations
    n = float(len(data))      # length of the time series
    print("Distinct locations: %i" % N)
    print("Time series length: %i" % n)

    #True entropy, estimated with the Lempel-Ziv based estimator (base 2)
    lambdai = matchfinder(data)
    #Estimator from the question, kept for comparison:
    #S = math.pow(sum([lambdai[i] / math.log(i + 1, 2) for i in range(1, len(data))]) * (1.0 / len(data)), -1)
    Etrue = math.pow((np.sum(lambdai)/ n),-1)*log(n,2).real
    S = Etrue
    print("Maximum entropy: %2.5f" % log(locations,2).real)
    print("Real entropy: %2.5f" % S)

    #use Fano's inequality to compute the predictability
    func = lambda x: (-(x * log(x, 2).real + (1 - x) * log(1 - x, 2).real) + (1 - x) * log(N - 1, 2).real) - S
    ub = fsolve(func, 0.9)[0]
    print("Upper bound of predictability: %2.5f" % ub)
    return ub

Output for 2 locations

Distinct locations: 2
Time series length: 10000
Maximum entropy: 1.00000
Real entropy: 1.01441
Upper bound of predictability: 0.50013

Output for 3 locations

Distinct locations: 3
Time series length: 10000
Maximum entropy: 1.58496
Real entropy: 1.56567
Upper bound of predictability: 0.41172

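The Lempel-Ziv estimate converges to the real entropy only as n approaches infinity, which is why in the 2-location case it is slightly above the maximum.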
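One consequence of an estimate above the maximum: for N = 2 the left-hand side of Fano's equation is just the binary entropy, which never exceeds 1, so with S > 1 the equation has no exact root and a numerical solver can only stop near x = 0.5. The following is a sketch of a bisection-based alternative, not the method used in the code above. The function name `fano_upper_bound` and the choice to clamp S to log2(N) are assumptions made for illustration.

```python
import math

def fano_upper_bound(S, N, tol=1e-12):
    """Solve H(x) + (1 - x) * log2(N - 1) = S for x in [1/N, 1) by bisection.

    S is the entropy estimate in bits, N the number of distinct locations.
    Assumption: S is clamped to [0, log2(N)] so that a root always exists.
    """
    if N < 2:
        return 1.0  # a single location is trivially predictable
    S = min(max(S, 0.0), math.log2(N))

    def f(x):
        # Binary entropy of x, skipping zero-probability terms
        h = 0.0
        for p in (x, 1.0 - x):
            if p > 0.0:
                h -= p * math.log2(p)
        return h + (1.0 - x) * math.log2(N - 1) - S

    # f is non-increasing on [1/N, 1], with f(1/N) >= 0 and f(1) <= 0
    lo, hi = 1.0 / N, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if f(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

With S = 1.0 and N = 2 this returns approximately 0.5, the value expected for a fair random binary sequence. Restricting the search to [1/N, 1) also avoids depending on a starting guess.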

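I am also not sure that you interpreted the definition of lambda correctly. It is defined as "the length of the shortest substring starting at position i which doesn't previously appear from position 1 to i-1". So once we reach a point where no further substring is unique, your matching algorithm will still return a length that is one greater than the length of the matched substring, whereas it should instead be 0, since no unique substring exists.

To make this clearer, let's take a simple example. If the array of locations looks like this:

[1 0 0 1 0 0]

then we can see that after the first three positions the pattern repeats. This means that from the fourth position onward no shortest unique substring exists, so lambda equals 0 there. The output (lambda) should therefore look like this: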

[1 1 2 0 0 0]

However, for this case your function returns:

[1 1 2 4 3 2]

I rewrote the match function to handle this:

def matchfinder2(data):
    data_len = len(data)
    output = np.zeros(len(data))
    output[0] = 1
    for start_idx in range(1,data_len):
        max_subsequence_matched = 0
        for i in range(0,start_idx):
            j = 0
            end_distance = data_len - start_idx #length left to the end of sequence (including current index)
            while( (start_idx+j < data_len) and (i+j < start_idx) and (data[i+j] == data[start_idx+j]) ):
                j = j + 1
            if j == end_distance: #check if j has reached the end of sequence
                output[start_idx::] = np.zeros(end_distance) #if yes fill the rest of output with zeros
                return output #end function
            elif j > max_subsequence_matched:
                max_subsequence_matched = j;
        output[start_idx] = max_subsequence_matched + 1;
    return output

Of course, the difference is small, because the result changes for only a small part of the sequence.
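The worked example can be checked with a compact pure-Python sketch that covers both behaviours. The name `match_lengths` and the `zero_tail` flag are hypothetical and introduced here only for illustration. The matching rule is the same as in the two functions above.

```python
def match_lengths(seq, zero_tail=True):
    """Lambda_i for each position of seq.

    With zero_tail=True, positions whose whole remaining suffix has already
    been seen get 0, since no unique substring starts there. With
    zero_tail=False, every position gets longest match + 1.
    """
    n = len(seq)
    out = [0] * n
    for start in range(n):
        longest = 0
        for i in range(start):
            j = 0
            # Extend the match while the earlier copy stays inside seq[:start]
            while start + j < n and i + j < start and seq[i + j] == seq[start + j]:
                j += 1
            longest = max(longest, j)
        if zero_tail and longest == n - start:
            # The match ran to the end of the sequence: leave the rest at 0
            break
        out[start] = longest + 1
    return out

print(match_lengths([1, 0, 0, 1, 0, 0]))                   # [1, 1, 2, 0, 0, 0]
print(match_lengths([1, 0, 0, 1, 0, 0], zero_tail=False))  # [1, 1, 2, 4, 3, 2]
```

The two printed lists match the expected output and the output of the original function for the example array.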

Votes: 1

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/44076849
