
A pretty decent approach, though it could be optimized much further.

<pre><code>&gt;&gt;&gt; rdd = sc.parallelize(['a', 'b', 'c', 'd', 'e', 'f'])
&gt;&gt;&gt; # Zip with the index to split odd- and even-indexed elements, so consecutive elements can be paired later.
&gt;&gt;&gt; # Indexing the tuple (instead of "lambda (x, y)") keeps the lambdas valid on Python 3.
&gt;&gt;&gt; rdd_odd = rdd.zipWithIndex().filter(lambda xy: xy[1] % 2 != 0).map(lambda xy: xy[0]).coalesce(1)
&gt;&gt;&gt; rdd_even = rdd.zipWithIndex().filter(lambda xy: xy[1] % 2 == 0).map(lambda xy: xy[0]).coalesce(1)
&gt;&gt;&gt; rdd_2 = rdd_even.zip(rdd_odd)
&gt;&gt;&gt; rdd_2.collect()
[('a', 'b'), ('c', 'd'), ('e', 'f')]
</code></pre>

Make sure <code>rdd</code> has an even number of elements, so that the even-indexed and odd-indexed halves have the same length and can be zipped. That forms the basis for pairing consecutive elements.
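As a minimal sketch of why the element count matters, the plain-Python function below models the even/odd split, and `pair_even_odd_rdd` applies the same idea to an RDD. Both function names are hypothetical. The RDD variant assumes a running SparkContext, so it is only defined here and not executed. The explicit count check is an addition, based on the documented requirement that `zip` needs the same number of elements in each partition of both RDDs.

```python
def pair_even_odd(items):
    """Plain-Python model: pair element 2k with element 2k+1."""
    items = list(items)
    if len(items) % 2 != 0:
        raise ValueError("need an even number of elements")
    return list(zip(items[0::2], items[1::2]))


def pair_even_odd_rdd(rdd):
    """Hypothetical RDD version of the same split (needs a running SparkContext)."""
    if rdd.count() % 2 != 0:
        raise ValueError("need an even number of elements")
    indexed = rdd.zipWithIndex()  # yields (element, index) tuples
    evens = indexed.filter(lambda xi: xi[1] % 2 == 0).keys().coalesce(1)
    odds = indexed.filter(lambda xi: xi[1] % 2 != 0).keys().coalesce(1)
    return evens.zip(odds)


print(pair_even_odd(['a', 'b', 'c', 'd', 'e', 'f']))
# [('a', 'b'), ('c', 'd'), ('e', 'f')]
```

Note that this produces non-overlapping pairs, so `('b', 'c')` never appears in the result.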


I think you need to specify the order of the elements in your RDD to determine which two elements count as "consecutive". Because an RDD can consist of multiple partitions, Spark won't know whether an element in partition_1 is consecutive to an element in partition_2.
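To illustrate the partition problem without a cluster, here is a small plain-Python model. The two-partition layout and the helper name `pairs_within` are assumptions made for the example. It shows that pairing elements only inside each partition loses the pair that straddles the boundary.

```python
def pairs_within(partition):
    """Pair each element with its successor inside one partition only."""
    return list(zip(partition, partition[1:]))


# Assumed layout: six elements split across two partitions.
partitions = [['a', 'b', 'c'], ['d', 'e', 'f']]

# Pairs found when every partition is processed in isolation.
per_partition = [pair for part in partitions for pair in pairs_within(part)]

# Pairs found when the data is treated as one ordered sequence.
flat = [x for part in partitions for x in part]
global_pairs = pairs_within(flat)

missing = [pair for pair in global_pairs if pair not in per_partition]
print(per_partition)  # [('a', 'b'), ('b', 'c'), ('d', 'e'), ('e', 'f')]
print(missing)        # [('c', 'd')]
```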

If you know your data well in advance, you can define the key and also how two elements are "consecutive". In your example, where the RDD is created from a list, you could use the index as the key and do a join.

<pre><code>"""Shift arr by 1 to the left, then join it back to arr. The calculation is based on the index."""

arr = ['a', 'b', 'c', 'd', 'e', 'f']
rdd = sc.parallelize(arr, 2).zipWithIndex().cache()  # cache if the RDD is small

# key = index, value = item in the list
original_rdd = rdd.map(lambda x: (x[1], x[0]))

# key = index - 1, so each item lines up with its predecessor's key
shifted_rdd = rdd.map(lambda x: (x[1] - 1, x[0]))

# each joined value is (item, next item); join does not guarantee output order
results = original_rdd.join(shifted_rdd)
print(results.values().collect())
</code></pre>

To achieve better performance in the <code>join</code>, you can use range partitioning for <code>original_rdd</code> and <code>shifted_rdd</code>.
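The index-shift join can be modelled in plain Python to check which pairs it produces. In the sketch below, `consecutive_pairs` is that model and `consecutive_pairs_rdd` is a hypothetical RDD counterpart. The RDD version assumes a running SparkContext, so it is only defined here and not executed. It adds a `sortByKey` step, which is not in the original answer, because `join` on its own does not return the pairs in index order.

```python
def consecutive_pairs(items):
    """Plain-Python model of the index-shift join."""
    original = {i: x for i, x in enumerate(items)}      # key = index
    shifted = {i - 1: x for i, x in enumerate(items)}   # key = index - 1
    common = sorted(original.keys() & shifted.keys())   # inner join on the key
    return [(original[k], shifted[k]) for k in common]


def consecutive_pairs_rdd(rdd):
    """Hypothetical RDD version (needs a running SparkContext)."""
    indexed = rdd.zipWithIndex()                        # (element, index)
    original = indexed.map(lambda x: (x[1], x[0]))
    shifted = indexed.map(lambda x: (x[1] - 1, x[0]))
    # sortByKey restores index order, which join alone does not guarantee.
    return original.join(shifted).sortByKey().values()


print(consecutive_pairs(['a', 'b', 'c', 'd', 'e', 'f']))
# [('a', 'b'), ('b', 'c'), ('c', 'd'), ('d', 'e'), ('e', 'f')]
```

The last element has no successor, so it only appears as the second half of the final pair.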

Here is the actual pipeline. I load text into an RDD and then clean it up.

<pre><code>import re
import string

rdd1 = sc.textFile("sometext.txt")

def Func(lines):
    lines = lines.lower()  # make all text lowercase
    lines = re.sub('[%s]' % re.escape(string.punctuation), '', lines)  # remove punctuation
    lines = re.sub(r'\w*\d\w*', '', lines)  # remove tokens that contain digits
    lines = lines.split()  # split the line into words
    return lines

rdd2 = rdd1.flatMap(Func)

stopwords = ['list of stopwords goes here'] 
rdd3 = rdd2.filter(lambda x: x not in stopwords) # filter out stopwords
rdd3.take(5) #resulting RDD

Out:['a',
 'b',
 'c',
 'd',
 'e']
</code></pre>

What I need to do now is start a Markov chain function. I want to pair each element with the element that follows it, such as:

[('a', 'b'), ('b', 'c'), ('c', 'd'), ('d', 'e'), etc...]

How can I create a pair RDD from an RDD of single elements in PySpark?
