2022-04-17

Need to extract data based on a delimiter and map it to a data frame in PySpark

I need to extract the data using the ~~ delimiter and map it to the required columns.

But the output assigns values incorrectly and produces wrong mappings. How can this be achieved using PySpark?

Sample data from the MESSAGE column of the DataFrame:

{5:~~:2016:ABCDEF123~~:2323:002~~:2016:567~~::555:~~XXABC~~:2016:123~~:555:~~YYYYY~~-}
{5:~~:2016:DEF~~:2323:009~~:2016:666~~::555:~~ZZZZ~~:2016:788~~:555:~~DDDDD~~:2016:5013~~:555:~~TTTTTTTT~~-}

Expected Data Frame Output:

PARENT_REF|PARENT_TXN|CHILD_REF|CHILD_ORG
ABCDEF123|002|567|XXABC
ABCDEF123|002|123|YYYYY
DEF|009|666|ZZZZ
DEF|009|788|DDDDD
DEF|009|5013|TTTTTTTT

The first 2016 field is PARENT_REF, the first 2323 field is PARENT_TXN, each subsequent 2016 field is CHILD_REF, and each subsequent 555 field is CHILD_ORG.

Note: the number of child records can vary per parent record.
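As a plain-Python illustration (not from the original post), splitting the first sample message on the ~~ delimiter shows how the fields line up; the mapping in the comments reflects the rules above:

msg = "{5:~~:2016:ABCDEF123~~:2323:002~~:2016:567~~::555:~~XXABC~~:2016:123~~:555:~~YYYYY~~-}"
for token in msg.split("~~"):
    print(token)
# Tokens, in order:
#   {5:
#   :2016:ABCDEF123    -> first 2016 value  -> PARENT_REF
#   :2323:002          -> first 2323 value  -> PARENT_TXN
#   :2016:567          -> later 2016 value  -> CHILD_REF
#   ::555:             -> 555 tag; its value follows in the next token
#   XXABC              ->                   -> CHILD_ORG
#   :2016:123          -> CHILD_REF
#   :555:
#   YYYYY              -> CHILD_ORG
#   -}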

Code Snippet:

from pyspark.sql import functions as F

df2 = df1.select("MESSAGE")

# regexp_extract returns only the first match of each pattern, so
# CHILD_REF gets the same value as PARENT_REF on every row.
df3 = (
    df2
    .withColumn("PARENT_REF", F.regexp_extract(F.col("MESSAGE"), r"\{5:.*?:2016:(.*?)~~:", 1))
    .withColumn("PARENT_TXN", F.regexp_extract(F.col("MESSAGE"), r"\{5:.*?:2323:(.*?)~~:", 1))
    .withColumn("CHILD_REF", F.regexp_extract(F.col("MESSAGE"), r"\{5:.*?:2016:(.*?)~~:", 1))
    .withColumn("CHILD_ORG", F.regexp_extract(F.col("MESSAGE"), r"\{5:.*?:555:(.*?)~~:", 1))
)

df3.show()
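Since regexp_extract only ever returns the first match, the pattern for CHILD_REF keeps hitting the parent's 2016 value, which would explain the wrong mappings. One possible way to get one output row per child is sketched below. It is a minimal sketch, assuming Spark 3.1+ (so regexp_extract_all can be called through F.expr) and assuming the field layout shown in the two sample messages; the intermediate column name child_pair and the regex patterns are illustrative, not from the original post.

from pyspark.sql import functions as F

# Parent fields: the first :2016: and :2323: values in the message.
parent_ref = F.regexp_extract("MESSAGE", r":2016:([^~]+)", 1)
parent_txn = F.regexp_extract("MESSAGE", r":2323:([^~]+)", 1)

# Every "<child_ref>~~::555:~~<child_org>" fragment. The parent's :2016:
# value is not matched because it is followed by :2323:, not :555:.
child_pairs = F.expr(
    "regexp_extract_all(MESSAGE, ':2016:([^~]+~~::?555:~~[^~]+)', 1)"
)

result = (
    df1
    .withColumn("PARENT_REF", parent_ref)
    .withColumn("PARENT_TXN", parent_txn)
    .withColumn("child_pair", F.explode(child_pairs))   # one row per child
    .withColumn("CHILD_REF", F.regexp_extract("child_pair", r"^([^~]+)", 1))
    .withColumn("CHILD_ORG", F.regexp_extract("child_pair", r":555:~~([^~]+)", 1))
    .select("PARENT_REF", "PARENT_TXN", "CHILD_REF", "CHILD_ORG")
)

result.show(truncate=False)

Because explode produces one row per extracted child pair, a parent with three children yields three rows, which matches the note above that the number of child records can vary per parent.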


