PySpark: match columns from two different DataFrames and add a value

I am trying to compare the values of two columns that live in different DataFrames and build a new DataFrame based on whether the values match:

df1 =

| id |
| -- |
| 1  |
| 2  |
| 3  |
| 4  | 
| 5  |

df2 =

| id |
| -- |
| 2  |
| 5  |
| 1  |

So, I want to put an 'X' in the is_used field when an id from df1 also exists in df2, and 'NA' otherwise, to produce a result DataFrame like this:

df3 =

| id | is_used |
| -- | ------- |
| 1  |    X    |
| 2  |    X    |
| 3  |    NA   |
| 4  |    NA   |
| 5  |    X    |

I tried the following, but the condition places an "X" in every row:

from pyspark.sql import functions as F

df3 = df3.withColumn('is_used', F.when(
    condition = (F.arrays_overlap(F.array(df1.id), F.array(df2.id))) == False,
    value = 'NA'
).otherwise('X'))

I would appreciate any help.



from Recent Questions - Stack Overflow https://ift.tt/3nti7R3
https://ift.tt/eA8V8J
