我正在用python实现PC算法。该算法构造了n元高斯分布的图形模型。这个图形模型基本上是一个有向无环图的骨架,这意味着如果一个结构如下:
(x1)---(x2)---(x3)
在图中,则x1是由给定x2的x3独立的。更一般地,如果A是图的邻接矩阵,并且A(i,j)=A( j,i) =0(i和j之间有一条缺失的边),那么i和j通过从i到j的任何路径中出现的所有变量在条件上是独立的。为了统计和机器学习的目的,可以“学习”底层的图形模型。如果我们对联合高斯n变量随机变量有足够的观测值,我们可以使用PC算法,其工作原理如下:
given n as the number of variables observed, initialize the graph as G=K(n)
for each pair i,j of nodes:
if exists an edge e from i to j:
look for the neighbours of i
if j is in neighbours of i then remove j from the set of neighbours
call the set of neighbours k
TEST if i and j are independent given the set k, if TRUE:
remove the edge e from i to j
该算法还计算图的分离集,由另一种算法使用,该算法从骨架和pc算法返回的分离集开始构造dag。这就是我到目前为止所做的:
def _core_pc_algorithm(a,sigma_inverse):
l = 0
N = len(sigma_inverse[0])
n = range(N)
sep_set = [ [set() for i in n] for j in n]
act_g = complete(N)
z = lambda m,i,j : -m[i][j]/((m[i][i]*m[j][j])**0.5)
while l<N:
for (i,j) in itertools.permutations(n,2):
adjacents_of_i = adj(i,act_g)
if j not in adjacents_of_i:
continue
else:
adjacents_of_i.remove(j)
if len(adjacents_of_i) >=l:
for k in itertools.combinations(adjacents_of_i,l):
if N-len(k)-3 < 0:
return (act_g,sep_set)
if test(sigma_inverse,z,i,j,l,a,k):
act_g[i][j] = 0
act_g[j][i] = 0
sep_set[i][j] |= set(k)
sep_set[j][i] |= set(k)
l = l + 1
return (act_g,sep_set)
A是调整参数α,我将用它来测试条件独立性,sigma_inverse是采样观测值的协方差矩阵的逆矩阵。此外,我的测试是:
def test(sigma_inverse,z,i,j,l,a,k):
def erfinv(x): #used to approximate the inverse of a gaussian cumulative density function
sgn = 1
a = 0.147
PI = numpy.pi
if x<0:
sgn = -1
temp = 2/(PI*a) + numpy.log(1-x**2)/2
add_1 = temp**2
add_2 = numpy.log(1-x**2)/a
add_3 = temp
rt1 = (add_1-add_2)**0.5
rtarg = rt1 - add_3
return sgn*(rtarg**0.5)
def indep_test_ijK(K): #compute partial correlation of i and j given ONE conditioning variable K
part_corr_coeff_ij = z(sigma_inverse,i,j) #this gives the partial correlation coefficient of i and j
part_corr_coeff_iK = z(sigma_inverse,i,K) #this gives the partial correlation coefficient of i and k
part_corr_coeff_jK = z(sigma_inverse,j,K) #this gives the partial correlation coefficient of j and k
part_corr_coeff_ijK = (part_corr_coeff_ij - part_corr_coeff_iK*part_corr_coeff_jK)/((((1-part_corr_coeff_iK**2))**0.5) * (((1-part_corr_coeff_jK**2))**0.5)) #this gives the partial correlation coefficient of i and j given K
return part_corr_coeff_ijK == 0 #i independent from j given K if partial_correlation(i,k)|K == 0 (under jointly gaussian assumption) [could check if abs is < alpha?]
def indep_test():
n = len(sigma_inverse[0])
phi = lambda p : (2**0.5)*erfinv(2*p-1)
root = (n-len(k)-3)**0.5
return root*abs(z(sigma_inverse,i,j)) <= phi(1-a/2)
if l == 0:
return z(sigma_inverse,i,j) == 0 #i independent from j <=> partial_correlation(i,j) == 0 (under jointly gaussian assumption) [could check if abs is < alpha?]
elif l == 1:
return indep_test_ijK(k[0])
elif l == 2:
return indep_test_ijK(k[0]) and indep_test_ijK(k[1]) #ASSUMING THAT IJ ARE INDEPENDENT GIVEN Y,Z <=> IJ INDEPENDENT GIVEN Y AND IJ INDEPENDENT GIVEN Z
else: #i have to use the independent test with the z-fisher function
return indep_test()
其中z是一个lambda,它接收一个矩阵(协方差矩阵的逆矩阵),一个整数i,一个整数j,它使用以下规则(我在我老师的幻灯片中读到的)计算i和j的部分相关性,给定所有其余变量:
corr(i,j)|REST = -var^-1(i,j)/sqrt(var^-1(i,i)*var^-1(j,j))
此应用程序的主要核心是indep_test()函数:
def indep_test():
n = len(sigma_inverse[0])
phi = lambda p : (2**0.5)*erfinv(2*p-1)
root = (n-len(k)-3)**0.5
return root*abs(z(sigma_inverse,i,j)) <= phi(1-a/2)
此函数实现一个统计测试,该测试使用估计的部分相关性的fisher z变换。我以两种方式使用这个算法:
在这两种情况下,我并不总是得到正确的结果,要么是因为我知道某个数据集背后的DAG,要么是因为我知道生成模型,但它与我的算法学习的模型不一致。我完全知道这不是一项微不足道的任务,我可能误解了理论上的概念,甚至在我这里省略的部分代码中也犯了错误;但首先我想知道(从比我更有经验的人那里),我写的测试是否正确,以及是否有执行这种测试的库函数,我尝试过搜索,但找不到任何合适的函数。
发布于 2018-12-14 08:00:13
我开门见山了。上述代码中最关键的问题涉及以下错误:
sqrt(n-len(k)-3)*abs(z(sigma_inverse[i][j])) <= phi(1-alpha/2)
我弄错了n的平均值,它不是精度矩阵的大小,而是总多变量观察值的数量(在我的例子中,是10000而不是5)。另一个错误的假设是z(sigma_inversei)必须提供i和j的部分相关性。这是不正确的,z是精度矩阵的适当子集上的费舍尔变换,它估计给定K的I和j的偏相关。正确的测试如下:
if len(K) == 0: #CM is the correlation matrix, we have no variables conditioning (K has 0 length)
r = CM[i, j] #r is the partial correlation of i and j
elif len(K) == 1: #we have one variable conditioning, not very different from the previous version except for the fact that i have not to compute the correlations matrix since i start from it, and pandas provide such a feature on a DataFrame
r = (CM[i, j] - CM[i, K] * CM[j, K]) / math.sqrt((1 - math.pow(CM[j, K], 2)) * (1 - math.pow(CM[i, K], 2))) #r is the partial correlation of i and j given K
else: #more than one conditioning variable
CM_SUBSET = CM[np.ix_([i]+[j]+K, [i]+[j]+K)] #subset of the correlation matrix i'm looking for
PM_SUBSET = np.linalg.pinv(CM_SUBSET) #constructing the precision matrix of the given subset
r = -1 * PM_SUBSET[0, 1] / math.sqrt(abs(PM_SUBSET[0, 0] * PM_SUBSET[1, 1]))
r = min(0.999999, max(-0.999999,r))
res = math.sqrt(n - len(K) - 3) * 0.5 * math.log1p((2*r)/(1-r)) #estimating partial correlation with fisher's transofrmation
return 2 * (1 - norm.cdf(abs(res))) #obtaining p-value
我希望有人能发现这对我有帮助
https://stackoverflow.com/questions/53679288
复制相似问题