apache spark sql - Compare two PySpark DataFrames and append the results side by side - Stack Overflow


I have two PySpark DataFrames and need to compare them column-wise, appending the result of each comparison next to the columns.

DF1:

Claim_number Claim_Status
1001 Closed
1002 In Progress
1003 open

DF2:

Claim_number Claim_Status
1001 Closed
1002 open
1004 In Progress

Expected Result in pySpark:

DF3:

Claim_number_DF1 | Claim_number_DF2 | Comparison_of_Claim_number | Claim_status_DF1 | Claim_status_DF2 | Comparison_of_Claim_Status
1001             | 1001             | TRUE                       | Closed           | Closed           | TRUE
1002             | 1002             | TRUE                       | In Progress      | open             | FALSE
1003             | 1004             | FALSE                      | open             | In Progress      | FALSE
Asked Nov 20, 2024 at 15:13 by Srinivasan
  • What is your actual question? What have you tried? – Andrew Commented Nov 20, 2024 at 15:57
  • I want to compare two dataframes: if the column values match it should populate True, and if they do not match it should populate False next to the column. – Srinivasan Commented Nov 20, 2024 at 15:59
  • That's not a question, that's asking SO to write your code for you. – Andrew Commented Nov 20, 2024 at 16:06
  • Sorry, I don't understand your question... – Srinivasan Commented Nov 20, 2024 at 16:08
  • Unlike Pandas dataframes, PySpark dataframes are not ordered. So the task is not doable unless a criterion is provided for which rows of each dataframe should be compared. Simply saying "take the third row from df1 and compare it with the third row from df2" does not work, unfortunately. There is no "third row", at least not when using large datasets with multiple partitions. – werner Commented Nov 20, 2024 at 18:24

1 Answer


DataFrames are not ordered; their rows are distributed across partitions, so a positional row-by-row comparison is an invalid ask as stated.

However, what you can do instead is the following:

  • Treat DF1 as the master DataFrame and join it with DF2 on Claim_number. For claim numbers missing from DF2, the join type decides the outcome: an inner join drops those rows, while a left outer join keeps them with nulls on the DF2 side.

If that is what your ask is, then here is the solution.

final_df = df1.join(df2, on="Claim_number", how="inner").distinct()
