PySpark: match columns from two different dataframes and add value

By Ritesh Sahu - November 25, 2021

I am trying to compare the values of two columns that exist in different dataframes, and to create a new dataframe based on whether the values match:

df1 =

| id |
| -- |
| 1  |
| 2  |
| 3  |
| 4  | 
| 5  |

df2 =

| id |
| -- |
| 2  |
| 5  |
| 1  |

I want to set is_used to 'X' for every id in df1 that also appears in df2, and to 'NA' otherwise, producing a result dataframe like this:

df3 =

| id | is_used |
| -- | ------- |
| 1  |    X    |
| 2  |    X    |
| 3  |    NA   |
| 4  |    NA   |
| 5  |    X    |

I have tried the following, but the condition puts an "X" in every row:

df3 = df3.withColumn('is_used', F.when(
    condition = (F.arrays_overlap(F.array(df1.id), F.array(df2.id))) == False,
    value = 'NA'
).otherwise('X'))

I would appreciate any help.

from Recent Questions - Stack Overflow https://ift.tt/3nti7R3
