2021-03-30

The weighted mean of group means is not equal to the total mean in pandas groupby

I have a strange problem with calculating the weighted mean of a pandas dataframe. I want to do the following steps:

(1) calculate the weighted mean of all the data
(2) calculate the weighted mean of each group of data

The issue is that after step 2, the mean of the group means (weighted by the number of members in each group) is not the same as the weighted mean of all the data (step 1). Mathematically, it should be. I even thought the dtype might be the issue, so I set everything to float64, but the problem persists. Below is a simple example that illustrates the problem.

My dataframe has data, weights, and groups columns:

import numpy as np
import pandas as pd

data = np.array([
    0.20651903, 0.52607571, 0.60558061, 0.97468593, 0.10253621, 0.23869854,
    0.82134792, 0.47035085, 0.19131938, 0.92288234
])
weights = np.array([
    4.06071562, 8.82792146, 1.14019687, 2.7500913, 0.70261312, 6.27280216,
    1.27908358, 7.80508994, 0.69771745, 4.15550846
])
groups = np.array([1, 1, 2, 2, 2, 2, 3, 3, 4, 4])
df = pd.DataFrame({"data": data, "weights": weights, "groups": groups})
print(df)
       data   weights  groups
0  0.206519  4.060716       1
1  0.526076  8.827921       1
2  0.605581  1.140197       2
3  0.974686  2.750091       2
4  0.102536  0.702613       2
5  0.238699  6.272802       2
6  0.821348  1.279084       3
7  0.470351  7.805090       3
8  0.191319  0.697717       4
9  0.922882  4.155508       4

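# Define a weighted mean function to apply to each group.
# groupby(...).apply passes each group's DataFrame as the first argument,
# followed by the extra positional arguments (here, the two column names).
def my_fun(group, data_col, weight_col):
    return np.average(group[data_col], weights=group[weight_col])

# Mean of the population
total_mean = np.average(np.array(df["data"], dtype="float64"),
                        weights=np.array(df["weights"], dtype="float64"))

# Weighted mean of each group
group_means = df.groupby("groups").apply(my_fun, "data", "weights")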

# Number of members in each group
counts = np.array([2, 4, 2, 2], dtype="float64")

# Total mean computed from the group means, weighted by the count of each group
total_mean_from_group_means = np.average(
    np.array(group_means, dtype="float64"), weights=counts
)

print(total_mean)
0.5070955626929458

print(total_mean_from_group_means)
0.5344436242465216

As you can see, the total mean calculated from the group means is not equal to the total mean. What am I doing wrong here?
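One thing worth checking, offered as a sketch rather than a settled answer: for a weighted mean, the group means recombine into the overall mean only when each group mean is weighted by the sum of the weights in that group, not by the number of rows. The two weightings coincide only when every row has the same weight. The snippet below reuses the data from the example; the names `weight_sums`, `recombined`, and `by_counts` are illustrative and not part of the original code.

```python
import numpy as np
import pandas as pd

data = np.array([
    0.20651903, 0.52607571, 0.60558061, 0.97468593, 0.10253621, 0.23869854,
    0.82134792, 0.47035085, 0.19131938, 0.92288234
])
weights = np.array([
    4.06071562, 8.82792146, 1.14019687, 2.7500913, 0.70261312, 6.27280216,
    1.27908358, 7.80508994, 0.69771745, 4.15550846
])
groups = np.array([1, 1, 2, 2, 2, 2, 3, 3, 4, 4])
df = pd.DataFrame({"data": data, "weights": weights, "groups": groups})

# Weighted mean of all the data
total_mean = np.average(df["data"], weights=df["weights"])

# Per-group weighted means, built from per-group sums
weighted_sums = (df["data"] * df["weights"]).groupby(df["groups"]).sum()
weight_sums = df["weights"].groupby(df["groups"]).sum()
group_means = weighted_sums / weight_sums

# Recombine the group means using the total weight of each group
recombined = np.average(group_means, weights=weight_sums)

# Recombine using row counts instead, as in the question
by_counts = np.average(group_means, weights=df.groupby("groups").size())

print(total_mean, recombined, by_counts)
print(np.isclose(total_mean, recombined))
```

Here `recombined` should agree with `total_mean` up to floating-point error, while `by_counts` reproduces the mismatching value from the question.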



from Recent Questions - Stack Overflow https://ift.tt/2PhrxAK