2020-11-28

Best way to perform a large number of Pandas joins

I am trying to use two data frames for a simple lookup in Pandas. I have a main master data frame (left) and a lookup data frame (right). I want to left join them on the matching integer code and return the item title from item_df.

I see a partial solution using a key-value pair idea, but it seems cumbersome. My idea is to merge the data frames using col3 and name as the key columns and keep the value I want from the right frame, which is title. I then drop the key column that I joined on, so all I have left is the value. Now let's say I want to do this several times with my own manual naming conventions. For this I use rename to rename the value that I merged in. I would then repeat this merge operation and rename my next join to something like second_title (see the example below).

Is there a less cumbersome way to perform this repeated operation without constantly dropping the extra columns that are merged in and renaming the new column between each merge step?

Example code below:

import pandas as pd

master_dict: dict = {'col1': [3,4,8,10], 'col2': [5,6,9,10], 'col3': [50,55,59,60]}
master_df: pd.DataFrame = pd.DataFrame(master_dict)
item_dict: dict = {'name': [55,59,50,5,6,7], 'title': ['p1','p2','p3','p4','p5','p6']}
item_df: pd.DataFrame = pd.DataFrame(item_dict)
    
print(master_df.head())
   col1  col2  col3
0     3     5    50
1     4     6    55
2     8     9    59
3    10    10    60
print(item_df.head())
   name title
0    55    p1
1    59    p2
2    50    p3
3     5    p4
4     6    p5

# merge on col3 and name
combined_df = pd.merge(master_df, item_df, how = 'left', left_on = 'col3', right_on = 'name')
# rename title to "first_title"
combined_df.rename(columns = {'title':'first_title'}, inplace = True)
combined_df.drop(columns = ['name'], inplace = True) # remove 'name' column that was joined in from right frame
# repeat operation for "second_title"
combined_df = pd.merge(combined_df, item_df, how = 'left', left_on = 'col2', right_on = 'name')
combined_df.rename(columns = {'title': 'second_title'}, inplace = True)
combined_df.drop(columns = ['name'], inplace = True)
print(combined_df.head())
   col1  col2  col3 first_title second_title
0     3     5    50          p3           p4
1     4     6    55          p1           p5
2     8     9    59          p2          NaN
3    10    10    60         NaN          NaN
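
For comparison, here is a minimal sketch of one possible way to avoid the repeated merge, rename, and drop steps. It is not a definitive answer. It assumes the codes in name are unique, and the lookups dict and the title_by_code name are hypothetical helpers introduced only for illustration. The idea is to turn item_df into a Series indexed by code once, then use Series.map for each column that needs a title.

```python
import pandas as pd

master_df = pd.DataFrame({'col1': [3, 4, 8, 10], 'col2': [5, 6, 9, 10], 'col3': [50, 55, 59, 60]})
item_df = pd.DataFrame({'name': [55, 59, 50, 5, 6, 7], 'title': ['p1', 'p2', 'p3', 'p4', 'p5', 'p6']})

# Build the lookup once: a Series indexed by the integer code, holding the title.
# Assumption: 'name' values are unique. Duplicate codes would need to be
# resolved first (for example with drop_duplicates) before using map().
title_by_code = item_df.set_index('name')['title']

# Hypothetical mapping: new column name -> master column that holds the code.
lookups = {'first_title': 'col3', 'second_title': 'col2'}

# map() looks up each code in the Series index; codes with no match become NaN,
# which mirrors the behaviour of the left joins in the example above.
combined_df = master_df.assign(
    **{new_col: master_df[src_col].map(title_by_code) for new_col, src_col in lookups.items()}
)
print(combined_df)
```

Adding another lookup then only needs one more entry in lookups, and no key column from the right frame is ever joined in, so there is nothing to drop or rename afterwards.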


from Recent Questions - Stack Overflow https://ift.tt/2JiIW8L
