This can be done as follows:
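from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("DataFrameDiff").getOrCreate()
# Two sample DataFrames; Alice's age differs between them
data1 = [("John", 25, "USA"), ("Alice", 30, "Canada"), ("Bob", 35, "UK")]
data2 = [("John", 25, "USA"), ("Alice", 28, "Canada"), ("Bob", 35, "UK")]
df1 = spark.createDataFrame(data1, ["Name", "Age", "Country"])
df2 = spark.createDataFrame(data2, ["Name", "Age", "Country"])
# Join on the key column only, and alias each side so same-named columns can be
# told apart (PySpark does not add pandas-style "_x"/"_y" suffixes)
joined_df = df1.alias("a").join(df2.alias("b"), on="Name", how="inner")
# Each "...Diff" column is True where the two sides differ, False where they match
diff_df = joined_df.select(
    "Name",
    (col("a.Age") != col("b.Age")).alias("AgeDiff"),
    (col("a.Country") != col("b.Country")).alias("CountryDiff"),
)
diff_df.show()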
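The code above prints one row per matched record. Each "...Diff" column, such as "CountryDiff", holds True when the two DataFrames have different values in the corresponding column and False when the values are the same.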
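When there are many columns, writing one comparison per column gets tedious. The sketch below is one possible way to generate them; it is not part of the original answer. The helper names `diff_exprs` and `column_diff` and the aliases "a" and "b" are made up for illustration, and the code assumes both DataFrames share the same plain column names (no spaces or dots). It uses Spark SQL's null-safe equality operator `<=>`, so two NULLs count as equal, whereas `!=` yields NULL when either side is NULL.

```python
def diff_exprs(columns, keys, left="a", right="b"):
    """Build Spark SQL expressions flagging per-column differences.

    Hypothetical helper: key columns are passed through unchanged, and every
    other column gets a boolean "<column>Diff" expression.
    """
    exprs = list(keys)
    for c in columns:
        if c not in keys:
            # <=> is null-safe equality; NOT(...) is True when the values differ
            exprs.append(f"NOT ({left}.{c} <=> {right}.{c}) AS {c}Diff")
    return exprs


def column_diff(left_df, right_df, keys):
    """Join two DataFrames on `keys` and flag differences in all other columns."""
    joined = left_df.alias("a").join(right_df.alias("b"), on=list(keys), how="inner")
    return joined.selectExpr(*diff_exprs(left_df.columns, list(keys)))
```

With the DataFrames from the answer, `column_diff(df1, df2, ["Name"]).show()` would produce the same kind of output with "AgeDiff" and "CountryDiff" columns. An inner join only reports rows whose key exists on both sides; rows present in just one DataFrame would need a separate check, for example a "left_anti" join.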
Note: this answer is for reference only; the exact implementation may vary with your environment and requirements.