How to create dummy columns only for variables with an appropriate number of categories and a sufficient category share in the column?

I have a DataFrame in Python Pandas like the one below (with both numeric and object columns):

data types:

  • COL1 - numeric
  • COL2 - object
  • COL3 - object

COL1 | COL2 | COL3 | ...  | COLn
-----|------|------|------|------
111  | A    | Y    | ...  | ...
222  | A    | Y    | ...  | ...
333  | B    | Z    | ...  | ...
444  | C    | Z    | ...  | ...
555  | D    | P    | ...  | ...

I need to apply dummy coding (pandas.get_dummies()) only to categorical variables that meet both of these requirements:

  1. The variable has at most 3 categories
  2. A category's share of the values in the variable is at least 0.4

So, for example:

  1. COL2 does not meet requirement no. 1 (it has 4 different categories: A, B, C, D), so remove it
  2. In COL3, category "P" does not meet requirement no. 2 (its share is 1/5 = 0.2), so use only categories "Y" and "Z" for dummy coding
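
The two checks in this example can be reproduced with a short sketch. It assumes the sample data consists of only the three columns shown, and it treats every non-numeric column as categorical (an assumption, chosen so the selection does not depend on the exact string dtype):

```python
import pandas as pd

# Sample data copied from the question (only the three columns shown).
df = pd.DataFrame({
    "COL1": [111, 222, 333, 444, 555],
    "COL2": ["A", "A", "B", "C", "D"],
    "COL3": ["Y", "Y", "Z", "Z", "P"],
})

# Requirement 1: number of distinct categories in each non-numeric column.
n_categories = df.select_dtypes(exclude="number").nunique()
print(n_categories)

# Requirement 2: share of each category within COL3.
col3_shares = df["COL3"].value_counts(normalize=True)
print(col3_shares)
```

Here COL2 has 4 distinct categories, while in COL3 the categories "Y" and "Z" each have a share of 2/5 and "P" has a share of 1/5.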

So, as a result, I need something like the table below:

COL1 | COL3_Y | COL3_Z | ...  | COLn
-----|--------|--------|------|------
111  | 1      | 0      | ...  | ...
222  | 1      | 0      | ...  | ...
333  | 0      | 1      | ...  | ...
444  | 0      | 1      | ...  | ...
555  | 0      | 0      | ...  | ...
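
One possible way to get this result is sketched below. The function name `filtered_dummies` and its parameters `max_categories` and `min_share` are hypothetical names introduced for illustration, and the sketch assumes that every non-numeric column should be treated as categorical and that the 0.4 threshold is inclusive (as the example with "Y" and "Z" suggests):

```python
import pandas as pd

def filtered_dummies(df, max_categories=3, min_share=0.4):
    """Dummy-encode non-numeric columns that pass both requirements."""
    parts = []
    for col in df.columns:
        if pd.api.types.is_numeric_dtype(df[col]):
            parts.append(df[[col]])  # numeric columns pass through unchanged
            continue
        if df[col].nunique() > max_categories:
            continue  # requirement 1 failed: drop the whole column
        # Requirement 2: keep only categories with a large enough share.
        shares = df[col].value_counts(normalize=True)
        keep = shares[shares >= min_share].index
        # Rare categories are replaced by NaN, which get_dummies
        # encodes as a row of zeros (dummy_na is False by default).
        kept_values = df[col].where(df[col].isin(keep))
        parts.append(pd.get_dummies(kept_values, prefix=col, dtype=int))
    return pd.concat(parts, axis=1)

# Sample data copied from the question (only the three columns shown).
df = pd.DataFrame({
    "COL1": [111, 222, 333, 444, 555],
    "COL2": ["A", "A", "B", "C", "D"],
    "COL3": ["Y", "Y", "Z", "Z", "P"],
})

result = filtered_dummies(df)
print(result)
```

On the sample data this keeps COL1, drops COL2, and produces COL3_Y and COL3_Z, with the row for "P" encoded as zeros in both dummy columns. Because shares are floating-point values, a category sitting exactly on the threshold may need a small tolerance in the comparison on other data.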

