How to drop one of any two highly correlated features having low correlation with target
I am working with the breast cancer dataset included in scikit-learn, loaded like so:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_breast_cancer
data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target
df['target'] = df['target'].map({0:'malignant', 1:'benign'})
# df.head()
I can calculate and plot the feature correlations like so:
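# 'target' now holds strings, so exclude it from the feature correlation matrix
corr_mat = df.drop(columns='target').corr()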
mask = np.triu(np.ones_like(corr_mat, dtype=bool))
heatmap = sns.heatmap(corr_mat, vmin=-1, vmax=1, mask=mask, cmap='BrBG')
That said, suppose I set an absolute correlation of 0.7 as the threshold for deciding whether features i and j are highly correlated, so that I can drop one of them. Rather than dropping in an arbitrary order, I want to make sure I drop the one with the lower correlation to the target variable df['target'].
Something like:
for i in range(len(corr_mat.columns)):
    for j in range(i):
        if abs(corr_mat.iloc[i, j]) > 0.7:
            # if correlation of corr_mat.columns[i] to df['target'] < corr_mat.columns[j] to df['target']
            #     drop corr_mat.columns[i]
            # else drop corr_mat.columns[j]
Can someone help with this?
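One possible way to fill in the pseudocode above is sketched below. It is a minimal greedy sketch, not a definitive method: the function name drop_correlated_features and its threshold parameter are hypothetical names chosen for illustration, and it assumes the target is numeric (for example the original 0/1 labels), since a correlation cannot be computed against the string labels 'malignant' and 'benign'. It also assumes that a feature already marked for dropping should not cause another feature to be dropped.

```python
import pandas as pd


def drop_correlated_features(X, y, threshold=0.7):
    """Greedy sketch: for each pair of features whose absolute correlation
    exceeds `threshold`, drop the one less correlated with the target `y`.

    X is a DataFrame of numeric features, y a numeric Series with the same index.
    Returns the reduced DataFrame and the sorted list of dropped column names.
    """
    corr = X.corr().abs()                # absolute feature-to-feature correlations
    target_corr = X.corrwith(y).abs()    # absolute feature-to-target correlations
    cols = list(corr.columns)
    to_drop = set()

    for i in range(len(cols)):
        for j in range(i):
            f_i, f_j = cols[i], cols[j]
            # Assumption: pairs involving an already-dropped feature are skipped
            if f_i in to_drop or f_j in to_drop:
                continue
            if corr.iloc[i, j] > threshold:
                # Drop whichever feature of the pair is weaker against the target
                weaker = f_i if target_corr[f_i] < target_corr[f_j] else f_j
                to_drop.add(weaker)

    dropped = sorted(to_drop)
    return X.drop(columns=dropped), dropped
```

With the data from the question, this could be called as drop_correlated_features(df.drop(columns='target'), pd.Series(data.target), threshold=0.7). Because the procedure is greedy, the set of dropped features can depend on the column order.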