This is the actual pipeline. I load the text into an RDD and then clean it up.
import re
import string

rdd1 = sc.textFile("sometext.txt")

def Func(lines):
    lines = lines.lower()  # make all text lowercase
    lines = re.sub('[%s]' % re.escape(string.punctuation), '', lines)  # remove punctuation
    lines = re.sub(r'\w*\d\w*', '', lines)  # remove tokens that contain digits
    lines = lines.split()  # split the line into words
    return lines

rdd2 = rdd1.flatMap(Func)
stopwords = ['list of stopwords goes here']
rdd3 = rdd2.filter(lambda x: x not in stopwords)  # filter out stopwords
rdd3.take(5)  # resulting RDD
Out:['a',
'b',
'c',
'd',
'e']
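What I want to do now is the first step of a Markov chain function. I want to pair each element with the element that follows it, for example:
('a','b'), ('b','c'), ('c','d'), ('d','e'), and so on...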
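
To make the desired pairing concrete, here is a minimal sketch of one possible approach, not necessarily the best one. It assumes the order of elements in rdd3 is meaningful. The idea is to index every element with zipWithIndex, shift the index by one, and join the two keyed RDDs on that index. The names consecutive_pairs and consecutive_pairs_local are hypothetical helpers introduced only for this illustration. The local version shows the expected result on a plain Python list.

```python
def consecutive_pairs(rdd):
    """Pair each RDD element with its successor (sketch, assumes a PySpark RDD)."""
    # (index, word) for every element, in the RDD's current order
    indexed = rdd.zipWithIndex().map(lambda wi: (wi[1], wi[0]))
    # Re-key each word by (index - 1), so key i now holds the word at position i + 1
    shifted = indexed.map(lambda iw: (iw[0] - 1, iw[1]))
    # The inner join keeps only indices that have a successor: (i, (word_i, word_i+1))
    return indexed.join(shifted).sortByKey().values()


def consecutive_pairs_local(words):
    """Same pairing on a plain Python list, for comparison."""
    return list(zip(words, words[1:]))


if __name__ == "__main__":
    sample = ['a', 'b', 'c', 'd', 'e']
    print(consecutive_pairs_local(sample))
```

With the pipeline above, the call would look like consecutive_pairs(rdd3).take(4). Note that the join triggers a shuffle, which is why the sketch sorts by key afterwards to restore the original order. Because flatMap flattens the words of all lines into one sequence, pairs will also span line boundaries.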
https://stackoverflow.com/questions/55877730