How to create dummy columns only for variables with an appropriate number of categories and a sufficient category share in the column?
I have a pandas DataFrame in Python like the one below, with both numeric and object columns.
Data types:
- COL1 - numeric
- COL2 - object
- COL3 - object
| COL1 | COL2 | COL3 | ... | COLn |
|---|---|---|---|---|
| 111 | A | Y | ... | ... |
| 222 | A | Y | ... | ... |
| 333 | B | Z | ... | ... |
| 444 | C | Z | ... | ... |
| 555 | D | P | ... | ... |
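For reproducibility, the sample above could be built as follows. This is only a sketch: it covers COL1 to COL3, and the elided columns ("..." and COLn) are left out because their contents are not given.

```python
import pandas as pd

# Sample data from the table above; the elided columns are not reproduced.
df = pd.DataFrame({
    "COL1": [111, 222, 333, 444, 555],       # numeric
    "COL2": ["A", "A", "B", "C", "D"],       # categorical (strings)
    "COL3": ["Y", "Y", "Z", "Z", "P"],       # categorical (strings)
})

print(df.dtypes)
```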
I need to apply dummy coding (pandas.get_dummies()) only to categorical variables that have:
- at most 3 categories in the variable
- a minimum share of 0.4 for each category within the variable (categories below that share are not encoded)
So, for example:
- COL2 does not meet requirement no. 1 (it has 4 different categories: A, B, C, D), so remove it
- In COL3, category "P" does not meet requirement no. 2 (its share is 1/5 = 0.2), so use only categories "Y" and "Z" for dummy coding
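The two checks in this example can be reproduced per column with `Series.nunique()` and `Series.value_counts(normalize=True)`. This is a sketch under two assumptions: the constant names `MAX_CATEGORIES` and `MIN_SHARE` are made up for illustration, and the category count is taken before low-share categories are dropped (which is what the COL3 example implies).

```python
import pandas as pd

col2 = pd.Series(["A", "A", "B", "C", "D"], name="COL2")
col3 = pd.Series(["Y", "Y", "Z", "Z", "P"], name="COL3")

MAX_CATEGORIES = 3   # requirement no. 1 (hypothetical constant name)
MIN_SHARE = 0.4      # requirement no. 2 (hypothetical constant name)

# Requirement no. 1: number of distinct categories per column.
col2_ok = col2.nunique() <= MAX_CATEGORIES   # COL2 has 4 categories
col3_ok = col3.nunique() <= MAX_CATEGORIES   # COL3 has 3 categories

# Requirement no. 2: share of each category within the column.
shares = col3.value_counts(normalize=True)   # Y: 0.4, Z: 0.4, P: 0.2
kept = sorted(shares[shares >= MIN_SHARE].index)

print(col2_ok, col3_ok, kept)
```

Note that `value_counts` ignores missing values by default, so the shares here are relative to the non-null rows.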
As a result, I need something like this:
| COL1 | COL3_Y | COL3_Z | ... | COLn |
|---|---|---|---|---|
| 111 | 1 | 0 | ... | ... |
| 222 | 1 | 0 | ... | ... |
| 333 | 0 | 1 | ... | ... |
| 444 | 0 | 1 | ... | ... |
| 555 | 0 | 0 | ... | ... |
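A minimal sketch of one possible way to get this output, not a definitive solution. The helper name `selective_dummies` and its parameters are hypothetical. It assumes that every non-numeric column is categorical and that the category count is checked before low-share categories are dropped. Low-share categories are masked to NaN, which `pandas.get_dummies()` encodes as all zeros by default.

```python
import pandas as pd


def selective_dummies(df, max_categories=3, min_share=0.4):
    """Hypothetical helper: dummy-encode only the qualifying categories."""
    pieces = []
    for col in df.columns:
        # Numeric columns are passed through unchanged.
        if pd.api.types.is_numeric_dtype(df[col]):
            pieces.append(df[[col]])
            continue

        shares = df[col].value_counts(normalize=True)

        # Requirement no. 1: drop columns with too many categories.
        if len(shares) > max_categories:
            continue

        # Requirement no. 2: keep only categories with a sufficient share.
        keep = shares[shares >= min_share].index
        masked = df[col].where(df[col].isin(keep))  # other values become NaN

        # NaN rows get 0 in every dummy column (dummy_na defaults to False).
        pieces.append(pd.get_dummies(masked, prefix=col, dtype=int))

    return pd.concat(pieces, axis=1)


df = pd.DataFrame({
    "COL1": [111, 222, 333, 444, 555],
    "COL2": ["A", "A", "B", "C", "D"],
    "COL3": ["Y", "Y", "Z", "Z", "P"],
})

result = selective_dummies(df)
print(result)
```

On the sample data this keeps COL1, drops COL2, and yields COL3_Y and COL3_Z, with the "P" row set to 0 in both dummy columns.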