I am trying to run a bigram analysis over the Enron email corpus:
import nltk
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures

words = []
for message in messages.find():
    for sentence in nltk.sent_tokenize(message["body"]):
        words += nltk.word_tokenize(sentence)  # PunktWordTokenizer was removed in NLTK 3

finder = BigramCollocationFinder.from_words(words)
print(finder.nbest(BigramAssocMeasures.pmi, 10))  # original line was cut off at "finder.nb"; nbest is assumed
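To make the counting step concrete, here is a minimal pure-Python sketch of what bigram collection amounts to (a simplified illustration, not NLTK's actual implementation):

```python
from collections import Counter

def bigram_counts(words):
    # Count adjacent word pairs, as BigramCollocationFinder.from_words does internally
    return Counter(zip(words, words[1:]))

words = "the enron corpus is a corpus of enron mail".split()
counts = bigram_counts(words)
print(counts.most_common(3))
```

Scoring measures such as PMI then rank these pair counts against the individual word frequencies.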
I am looking for a parallel version of the aggregate() function, which looks like exactly what I need.
As a test, I created a data set with 10M records:
blockSize <- 5000
records <- blockSize * 2000
df <- data.frame(id=1:records, value=rnorm(records))
df$period <- round(df$id/blockSize)
# now I want to aggregate by period and return the mean of every block:
x <- aggregate(value ~ period, data = df, FUN = mean)
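The parallel version boils down to split-apply-combine: group rows by period, then compute each block's mean concurrently. A minimal Python sketch of that pattern (hypothetical helper names; threads are used to keep the example simple, and ProcessPoolExecutor would be the drop-in choice for CPU-bound blocks):

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import mean

def block_mean(block):
    period, values = block
    return period, mean(values)

def parallel_aggregate(pairs):
    # Split: group (period, value) pairs into blocks
    groups = {}
    for period, value in pairs:
        groups.setdefault(period, []).append(value)
    # Apply in parallel, then combine the results into a dict
    with ThreadPoolExecutor() as pool:
        return dict(pool.map(block_mean, groups.items()))

data = [(i // 3, float(i)) for i in range(9)]  # 3 periods, 3 values each
print(parallel_aggregate(data))  # {0: 1.0, 1: 4.0, 2: 7.0}
```

In R itself, the same split-apply step is what parallel back-ends distribute across workers.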