2023-09-10

Filtering data based on a variable number of arguments

I need to filter data based on a variable number of arguments. I am reading a table and want to filter it on multiple regions that are passed as parameters to a function. The number of regions passed to the function can vary. The region column contains a string, and each search string has to be found within that string.

Something like:

def read_regions(*regions):
    df = spark.read.table("my_input_table").filter(col("region").contains(*regions))
    return df

The function could be called as:

data = read_regions('US')

OR

data = read_regions('CHN', 'NL', 'ES')

Or with any number of regions. The data should be filtered accordingly and returned.

Can someone please help? I want the data to be filtered based on the arguments passed to the function.

UPDATE:

Search strings -> 'USA', 'CHN'

Input:

orderid | campaign        | custid
1234    | Gen_X_USA_offr1 | c2234
5678    | Gen_Z_CHN_offr2 | c1345
7893    | Gen_X_EU_Tru2   | c4563

Output:

orderid | campaign        | custid
1234    | Gen_X_USA_offr1 | c2234
5678    | Gen_Z_CHN_offr2 | c1345

In the example above, the first 2 records are selected in the output since the "campaign" column contains one of our search regions, 'USA' or 'CHN'. The third row is not selected as it does not contain any of the search regions.

Happy to provide more clarification if required.

Thanks

Please advise.
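
One possible approach, as a minimal sketch: as far as I can tell, `contains` accepts a single value, so unpacking several regions into it will not work. Instead, the regions could be joined into one regular expression and matched with `rlike`. The helper name `build_region_pattern`, the explicit `spark` parameter, and the `table` and `column` keyword arguments are assumptions made for illustration; the column defaults to "campaign" to match the sample data in the update. Note that `re.escape` follows Python's escaping rules while `rlike` uses Java regular expressions, which should agree for plain region codes like these.

```python
import re


def build_region_pattern(*regions):
    """Build a regex alternation matching any region as a literal substring."""
    if not regions:
        raise ValueError("at least one region is required")
    # re.escape keeps characters such as '.' or '+' from acting as regex operators
    return "|".join(re.escape(region) for region in regions)


def read_regions(spark, *regions, table="my_input_table", column="campaign"):
    """Return rows of `table` whose `column` contains at least one region."""
    # Imported here so the pattern helper above can be tried without Spark installed
    from pyspark.sql.functions import col

    pattern = build_region_pattern(*regions)
    return spark.read.table(table).filter(col(column).rlike(pattern))


# Checking the pattern against the sample campaigns, no Spark session needed
campaigns = ["Gen_X_USA_offr1", "Gen_Z_CHN_offr2", "Gen_X_EU_Tru2"]
pattern = build_region_pattern("USA", "CHN")
matched = [c for c in campaigns if re.search(pattern, c)]
print(matched)  # ['Gen_X_USA_offr1', 'Gen_Z_CHN_offr2']
```

With a Spark session available, this would be called as read_regions(spark, 'USA', 'CHN'). An equivalent alternative would be to combine one `col(column).contains(region)` condition per region with the `|` operator, for example via `functools.reduce`.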


