2023-02-14

How to drop, from each pair of highly correlated features, the one with the lower correlation with the target

I am working with the breast cancer dataset included in scikit-learn, loaded like so:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target
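# Keep df['target'] numeric (0 = malignant, 1 = benign) so it can appear in the
# correlation matrix; store the readable labels in a separate column instead.
df['diagnosis'] = df['target'].map({0: 'malignant', 1: 'benign'})
# df.head()

I can calculate and plot the feature correlations like so:

# numeric_only=True skips the string column 'diagnosis'
corr_mat = df.corr(numeric_only=True)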
mask = np.triu(np.ones_like(corr_mat, dtype=bool))
heatmap = sns.heatmap(corr_mat, vmin=-1, vmax=1, mask=mask, cmap='BrBG')

[Image: lower-triangle heatmap of the correlation matrix]

That said, suppose I use an absolute correlation of 0.7 as the threshold for deciding that features i and j are highly correlated, so that I can drop one of them. Instead of dropping either one arbitrarily, I want to make sure I drop the one with the lower correlation with the target variable df['target'].
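
To make the threshold concrete, here is a minimal sketch of how the pairs above it could be listed. The helper name correlated_pairs and the tiny demo frame are made up for illustration; on the real data the input would be the feature columns only.

```python
import numpy as np
import pandas as pd


def correlated_pairs(features, threshold=0.7):
    """List feature pairs whose absolute Pearson correlation exceeds threshold."""
    corr = features.corr().abs()
    # Keep only the strict lower triangle so each pair appears once
    # and the diagonal (always 1.0) is excluded.
    lower = corr.where(np.tril(np.ones(corr.shape, dtype=bool), k=-1))
    pairs = lower.stack()
    # Masked (NaN) entries fail the comparison, so they are filtered out here.
    return pairs[pairs > threshold].sort_values(ascending=False)


# Hypothetical demo data: x2 is a shuffled near-copy of x1, x3 is unrelated.
demo = pd.DataFrame({
    "x1": [1, 2, 3, 4, 5, 6, 7, 8],
    "x2": [2, 1, 4, 3, 6, 5, 8, 7],
    "x3": [3, 1, 4, 1, 5, 9, 2, 6],
})
pairs = correlated_pairs(demo)
print(pairs)  # a single pair, (x2, x1), with |r| of about 0.905
```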

Something like:

for i in range(len(corr_mat.columns)):
    for j in range(i):
        if abs(corr_mat.iloc[i, j]) > 0.7:
            # if |corr(corr_mat.columns[i], target)| < |corr(corr_mat.columns[j], target)|:
            #     drop corr_mat.columns[i]
            # else:
            #     drop corr_mat.columns[j]
            pass  # placeholder: this is the part I don't know how to write

Can someone help with this?
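
One way the loop above could be completed is sketched below. It is a greedy pass rather than a definitive method. The function name drop_correlated_features, its parameters, and the small demo frame are made up for illustration. It assumes the target column is numeric (for example the original 0/1 codes), since Pearson correlation is not defined for string labels.

```python
import pandas as pd


def drop_correlated_features(df, target, threshold=0.7):
    """Greedy pruning: from each highly correlated pair, drop the feature
    with the lower absolute correlation with the (numeric) target."""
    # Only numeric feature columns take part in the correlation step.
    features = df.drop(columns=[target]).select_dtypes(include="number")
    corr = features.corr().abs()                        # feature-feature |r|
    target_corr = features.corrwith(df[target]).abs()   # feature-target |r|

    cols = list(corr.columns)
    dropped = set()
    for i in range(len(cols)):
        for j in range(i):
            a, b = cols[i], cols[j]
            # A feature that is already dropped no longer eliminates others.
            if a in dropped or b in dropped:
                continue
            if corr.iloc[i, j] > threshold:
                # On a tie, the earlier column (b) is the one dropped.
                weaker = a if target_corr[a] < target_corr[b] else b
                dropped.add(weaker)
    return df.drop(columns=list(dropped)), sorted(dropped)


# Hypothetical demo data: x1 and x2 are highly correlated (|r| about 0.905),
# and x1 tracks the target more closely than x2 does.
demo = pd.DataFrame({
    "x1": [1, 2, 3, 4, 5, 6, 7, 8],
    "x2": [2, 1, 4, 3, 6, 5, 8, 7],
    "x3": [3, 1, 4, 1, 5, 9, 2, 6],
    "target": [0, 0, 0, 1, 0, 1, 1, 1],
})
reduced, dropped = drop_correlated_features(demo, "target")
print(dropped)                 # ['x2']
print(list(reduced.columns))   # ['x1', 'x3', 'target']
```

On the breast cancer frame this would be called as drop_correlated_features(df, 'target') while df['target'] still holds 0/1. Note that the result depends on column order, because a feature dropped early in the pass can no longer cause another feature to be dropped.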


