Q: Generating a numpy array with a given duplicate rate

Stack Overflow user

Asked 2018-12-08 16:02:50

2 answers, 202 views, 0 followers, 0 votes

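Here is my problem: I have to generate some synthetic data (7 or 8 columns) whose columns are correlated with each other (measured with the Pearson coefficient). I can do that easily, but I then have to introduce a percentage of duplicates into each column (yes, the Pearson coefficient will be lower), and that percentage is different for each column. The problem is that I don't want to insert the duplicates by hand, because in my case that would be cheating.

Does anyone know how to generate correlated data that already contains duplicates? I have searched, but the questions I find are usually about removing or avoiding duplicates.

Language: python3. To generate the correlated data I use this simple code: Generating correlated data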
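The linked snippet is not reproduced on this page. As a rough sketch of one common way to get correlated columns, and not necessarily what the link does, independent normal draws can be multiplied by a Cholesky factor of a target correlation matrix. The 0.9 target, the sizes, and all variable names below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows, n_cols = 10_000, 4  # illustrative sizes, not from the question

# Assumed target correlation matrix: 1 on the diagonal, 0.9 elsewhere.
target = np.full((n_cols, n_cols), 0.9)
np.fill_diagonal(target, 1.0)

# np.linalg.cholesky returns the lower factor L with L @ L.T == target,
# so its transpose is the upper factor.
upper = np.linalg.cholesky(target).T

# Independent standard normal columns become correlated after the product.
data = rng.standard_normal((n_rows, n_cols)) @ upper

# Sample Pearson correlations between columns (rowvar=False: columns are variables).
corr = np.corrcoef(data, rowvar=False)
```

With enough rows, the off-diagonal entries of `corr` should land close to the 0.9 target.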

python

python-3.x

numpy

pearson

2 Answers

Stack Overflow user

Accepted answer

Posted 2018-12-09 16:47:24

I found a solution. I'm posting the code in case it helps someone.

import numpy as np
import pandas as pd
from scipy.linalg import cholesky
from statsmodels.stats.correlation_tools import cov_nearest

# This is the data, generated randomly with a given shape
rnd = np.random.random(size=(10**7, 8))
# This array represents one column of the covariance matrix (I want correlated
# data, so I randomly choose numbers between 0.8 and 0.95).
# I added 7 more columns with varying ranges of values (all above 0.7).
attr1 = np.random.uniform(0.8, .95, size=(8, 1))
# attr2, ..., attr8 are built like attr1

# corr_mat is the matrix obtained by stacking the columns
corr_mat = np.column_stack((attr1, attr2, attr3, attr4, attr5, attr6, attr7, attr8))

# Find the nearest covariance matrix to corr_mat,
# to be sure that it is positive definite
a = cov_nearest(corr_mat)

# scipy's cholesky returns the upper-triangular factor by default
upper_chol = cholesky(a)

# Finally, compute the matrix product of rnd and upper_chol
ans = rnd @ upper_chol
# ans now holds randomly correlated data (high correlation, but customizable)

# Next, create a pandas DataFrame from the values of ans
df = pd.DataFrame(ans, columns=['att1', 'att2', 'att3', 'att4',
                                'att5', 'att6', 'att7', 'att8'])

# The last step is to round the float values of ans to a varying number of
# decimals, so that each column gets a different percentage of duplicates.
# The floats have about 16 decimals, so for each column I randomly choose an
# int from 5 up to (but not including) 12 and round to that many decimals.
a = df.values
for i in range(8):
    trunc = np.random.randint(5, 12)
    print(trunc)
    a[:, i] = a[:, i].round(decimals=trunc)

Finally, these are the duplicate percentages I got for each column:

duplicate rate attribute: att1 = 5.159390000000002

duplicate rate attribute: att2 = 11.852260000000001

duplicate rate attribute: att3 = 12.036079999999998

duplicate rate attribute: att4 = 35.10611

duplicate rate attribute: att5 = 4.6471599999999995

duplicate rate attribute: att6 = 35.46553

duplicate rate attribute: att7 = 0.49115000000000464

duplicate rate attribute: att8 = 37.33252
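The answer does not show how these rates were computed. One plausible definition, assumed here, is the percentage of entries in a column that repeat a value already seen earlier in that column. The sketch below uses a hypothetical helper `duplicate_rate` and made-up demo columns to show how rounding to fewer decimals pushes the rate up.

```python
import numpy as np
import pandas as pd

def duplicate_rate(column: pd.Series) -> float:
    """Percentage of entries that repeat a value seen earlier in the column.

    This definition is an assumption; the answer does not state its own.
    """
    return float(100.0 * column.duplicated().mean())

rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "coarse": rng.random(10_000).round(2),  # few distinct values, many repeats
    "fine": rng.random(10_000).round(8),    # almost every value is distinct
})

rates = {name: duplicate_rate(demo[name]) for name in demo.columns}

# Hand-checkable case: only the second 0.1 repeats an earlier value, so 1 of 4.
tiny = duplicate_rate(pd.Series([0.1, 0.1, 0.2, 0.3]))
```

Applied to the answer's `df`, the same helper could be called once per column, for example `duplicate_rate(df['att1'])`.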

Votes 0

Stack Overflow user

Posted 2018-12-08 16:09:53

Try something like this:

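# Pick random row indices; percentage is a fraction between 0 and 1
indices = np.random.randint(0, array.shape[0], size=int(np.ceil(percentage * array.shape[0])))

# An ndarray has no append method, so stack the selected rows onto the array in one step
array = np.concatenate([array, array[indices]])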

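Here I assume your data is stored in array, an ndarray in which each row holds the 7 or 8 columns of data. The code above should create an array of random indices, and the entries (rows) they select are added to the array again.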
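Since the question asks for a different duplicate percentage per column, a column-wise variant of the same idea might look like the sketch below. It is an assumption layered on this answer, not part of it: `inject_duplicates` is a hypothetical helper that overwrites a fraction of a column's entries with values copied from elsewhere in that column, so the array keeps its shape.

```python
import numpy as np

def inject_duplicates(column, fraction, rng):
    """Overwrite about `fraction` of the entries with values copied from the column.

    Hypothetical helper. The resulting duplicate rate is approximate, because
    a copied value may come from the very position it overwrites.
    """
    out = column.copy()
    n = out.shape[0]
    k = int(np.ceil(fraction * n))
    targets = rng.choice(n, size=k, replace=False)  # positions to overwrite
    sources = rng.choice(n, size=k, replace=True)   # positions to copy from
    out[targets] = column[sources]
    return out

rng = np.random.default_rng(0)
col = rng.random(1000)
out = inject_duplicates(col, 0.2, rng)

# Different fractions per column (illustrative values).
array = rng.random((1000, 3))
fractions = [0.05, 0.2, 0.35]
result = np.column_stack(
    [inject_duplicates(array[:, j], f, rng) for j, f in enumerate(fractions)]
)
```

Overwriting values independently in each column changes the row-wise pairing, so the correlation between columns drops as the fractions grow.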

Votes 0

Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/53684305
