When working with data, it is very useful to be able to group and aggregate data by multiple columns to understand the various segments of our data.
Using pandas, we can easily group data using the pandas groupby function. However, when grouping by multiple columns and looking to compute summary statistics, we need to do more work to get code that is easy to use.
If we are looking to group the data by one column and then aggregate and summarize, we can use the pandas describe() function and pandas sum() function and obtain a very easy to use dataframe.
However, when we group by multiple columns and use the pandas describe() function and pandas sum() function, then the return dataframe is a dataframe of dataframes.
With a dataframe of dataframes, you have to do a little more work to get something that is easy to work with.
The rest of the article is code which will show you how to use pandas to group and aggregate data by multiple columns.
Grouping and Summarizing Numeric Data by Multiple Columns
Below is a function which will group and aggregate multiple columns using pandas if you are only working with numerical variables. In the following code, we will be grouping the data by multiple columns and computing the mean, standard deviation, sum, min, max and various percentiles for the various gorupings.
import pandas as pd
#ds is the dataframe we want to summarize
#group_vars is a string of the column names delimited by spaces that we want to group the data by
#cont_vars is a string of the column names of the numeric data delimited by spaces that we want to summarize
def summarize_ds(ds, group_vars, cont_vars):
#cont summary
cont_des = ds.groupby(group_vars.split(" "))[cont_vars.split(" ")].describe()
sum_des = ds.groupby(group_vars.split(" "))[cont_vars.split(" ")].sum()
df_cont = cont_des[cont_vars.split(" ")[0]].reset_index()
df_sum = sum_des[cont_vars.split(" ")[0]].rename('sum').reset_index()
df = df_cont.merge(df_sum, on=group_vars.split(" "))
df["variable"] = cont_vars.split(" ")[0]
for x in range(1,len(cont_vars.split(" "))):
df_cont = cont_des[cont_vars.split(" ")[x]].reset_index()
df_sum = sum_des[cont_vars.split(" ")[x]].rename('sum').reset_index()
df2 = df_cont.merge(df_sum, on=group_vars.split(" "))
df2["variable"] = cont_vars.split(" ")[x]
df = df.append(df2, ignore_index=True)
#clean up
cols = group_vars.split(" ")
cols_add = ["variable","sum","mean","std","min","25%","50%","75%","max"]
for col in cols_add:
cols.append(col)
df = df[cols]
return df
If you are working with categorical variables, then we won’t have the ability to use the describe() function, but instead, we will be using the count() function to get the distribution.
Let’s say I have the following dataframe:
animal_type gender weight age state trained
cat male 10 1 CA no
dog male 20 4 FL no
dog male 30 5 NY no
cat female 40 3 FL yes
cat female 10 2 NY yes
dog female 20 4 TX yes
cat female 50 6 TX yes
dog male 60 1 CA no
dog male 70 5 NY no
cat female 80 4 FL yes
cat female 90 3 TX yes
cat male 100 2 TX no
dog female 80 4 FL no
If I want to group the dataframe by animal_type and gender, and summarize the columns age and weight, then could call our function as so and get the following output:
group_vars = "animal_type gender"
cont_vars = "age weight"
summarize_ds(df, group_vars, cont_vars)
#output:
animal_type gender variable sum mean std min 25% 50% 75% max
0 cat female age 18 3.60 1.516575 2.0 3.00 3.0 4.00 6.0
1 cat male age 3 1.50 0.707107 1.0 1.25 1.5 1.75 2.0
2 dog female age 8 4.00 0.000000 4.0 4.00 4.0 4.00 4.0
3 dog male age 15 3.75 1.892969 1.0 3.25 4.5 5.00 5.0
4 cat female weight 270 54.00 32.093613 10.0 40.00 50.0 80.00 90.0
5 cat male weight 110 55.00 63.639610 10.0 32.50 55.0 77.50 100.0
6 dog female weight 100 50.00 42.426407 20.0 35.00 50.0 65.00 80.0
7 dog male weight 180 45.00 23.804761 20.0 27.50 45.0 62.50 70.0
Grouping and Aggregating Categorical Data by Multiple Columns
Below is a function which will group and aggregate multiple columns using pandas if you are only working with categorical variables.
Here, instead of the summary statistics, we are just calculating the counts for each of the levels within each categorical variable.
import pandas as pd
#ds is the dataframe we want to summarize
#group_vars is a string of the column names delimited by spaces that we want to group the data by
#cat_vars is a string of the column names of the categorical data delimited by spaces that we want to summarize
def summarize_ds(ds, group_vars, cat_vars):
y = group_vars.split(" ")
y.append(cat_vars.split(" ")[0])
df = ds.groupby(y)[cat_vars.split(" ")[0]].count().rename('count').reset_index()
df["variable"] = cat_vars.split(" ")[0]
df["level"] = df[cat_vars.split(" ")[0]]
df.drop(columns=[cat_vars.split(" ")[0]])
for x in range(1, len(cat_vars.split(" "))):
y = group_vars.split(" ")
y.append(cat_vars.split(" ")[x])
df2 = ds.groupby(y)[cat_vars.split(" ")[x]].count().rename('count').reset_index()
df2["variable"] = cat_vars.split(" ")[x]
df2["level"] = df2[cat_vars.split(" ")[x]]
df2.drop(columns=[cat_vars.split(" ")[x]])
df = df.append(df2, ignore_index=True)
#clean up
cols = group_vars.split(" ")
cols_add = ["variable","level","count"]
for col in cols_add:
cols.append(col)
df = df[cols]
return df
If I want to group the dataframe from above by animal_type and gender, and summarize the columns state and trained, then can call our function as so and get the following output:
group_vars = "animal_type gender"
cat_vars = "state trained"
summarize_ds(df, group_vars, cat_vars)
#output:
animal_type gender variable level count
0 cat female state FL 2
1 cat female state NY 1
2 cat female state TX 2
3 cat male state CA 1
4 cat male state TX 1
5 dog female state FL 1
6 dog female state TX 1
7 dog male state CA 1
8 dog male state FL 1
9 dog male state NY 2
10 cat female trained yes 5
11 cat male trained no 2
12 dog female trained no 1
13 dog female trained yes 1
14 dog male trained no 4
Grouping and Aggregating a Dataframe by Multiple Columns
Below is the function if you have both categorical and numeric variables and want to have all summarizations in the same dataframe.
import pandas as pd
#ds is the dataframe we want to summarize
#group_vars is a string of the column names delimited by spaces that we want to group the data by
#cat_vars is a string of the column names of the categorical data delimited by spaces that we want to summarize
#cont_vars is a string of the column names of the numerical data delimited by spaces that we want to summarize
def summarize_ds(ds, group_vars, cat_vars, cont_vars):
#cont summary
cont_des = ds.groupby(group_vars.split(" "))[cont_vars.split(" ")].describe()
sum_des = ds.groupby(group_vars.split(" "))[cont_vars.split(" ")].sum()
df_cont = cont_des[cont_vars.split(" ")[0]].reset_index()
df_sum = sum_des[cont_vars.split(" ")[0]].rename('sum').reset_index()
df = df_cont.merge(df_sum, on=group_vars.split(" "))
df["variable"] = cont_vars.split(" ")[0]
for x in range(1,len(cont_vars.split(" "))):
df_cont = cont_des[cont_vars.split(" ")[x]].reset_index()
df_sum = sum_des[cont_vars.split(" ")[x]].rename('sum').reset_index()
df2 = df_cont.merge(df_sum, on=group_vars.split(" "))
df2["variable"] = cont_vars.split(" ")[x]
df = df.append(df2, ignore_index=True)
df["type"] = "numeric"
df["level"] = "N/A"
#cat_summary
for x in range(0, len(cat_vars.split(" "))-1):
y = group_vars.split(" ")
y.append(cat_vars.split(" ")[x])
df2 = ds.groupby(y)[cat_vars.split(" ")[x]].count().rename('count').reset_index()
df2["variable"] = cat_vars.split(" ")[x]
df2["type"] = "categorical"
df2["level"] = df2[cat_vars.split(" ")[x]]
df2.drop(columns=[cat_vars.split(" ")[x]])
df = df.append(df2, ignore_index=True)
#clean up
cols = group_vars.split(" ")
cols_add = ["type","variable","level","count","sum","mean","std","min","25%","50%","75%","max"]
for col in cols_add:
cols.append(col)
df = df[cols]
return df
If I want to group the dataframe from above by animal_type and gender, and summarize all of the columns (age, weight, state, and trained), then can call our function as so and get the following output:
group_vars = "animal_type gender"
cont_vars = "age weight"
cat_vars = "state trained"
summarize_ds(df, group_vars, cat_vars, cont_vars)
#output:
animal_type gender type variable level count sum mean std min 25% 50% 75% max
0 cat female numeric age N/A 5.0 18.0 3.60 1.516575 2.0 3.00 3.0 4.00 6.0
1 cat male numeric age N/A 2.0 3.0 1.50 0.707107 1.0 1.25 1.5 1.75 2.0
2 dog female numeric age N/A 2.0 8.0 4.00 0.000000 4.0 4.00 4.0 4.00 4.0
3 dog male numeric age N/A 4.0 15.0 3.75 1.892969 1.0 3.25 4.5 5.00 5.0
4 cat female numeric weight N/A 5.0 270.0 54.00 32.093613 10.0 40.00 50.0 80.00 90.0
5 cat male numeric weight N/A 2.0 110.0 55.00 63.639610 10.0 32.50 55.0 77.50 100.0
6 dog female numeric weight N/A 2.0 100.0 50.00 42.426407 20.0 35.00 50.0 65.00 80.0
7 dog male numeric weight N/A 4.0 180.0 45.00 23.804761 20.0 27.50 45.0 62.50 70.0
8 cat female categorical state FL 2.0 NaN NaN NaN NaN NaN NaN NaN NaN
9 cat female categorical state NY 1.0 NaN NaN NaN NaN NaN NaN NaN NaN
10 cat female categorical state TX 2.0 NaN NaN NaN NaN NaN NaN NaN NaN
11 cat male categorical state CA 1.0 NaN NaN NaN NaN NaN NaN NaN NaN
12 cat male categorical state TX 1.0 NaN NaN NaN NaN NaN NaN NaN NaN
13 dog female categorical state FL 1.0 NaN NaN NaN NaN NaN NaN NaN NaN
14 dog female categorical state TX 1.0 NaN NaN NaN NaN NaN NaN NaN NaN
15 dog male categorical state CA 1.0 NaN NaN NaN NaN NaN NaN NaN NaN
16 dog male categorical state FL 1.0 NaN NaN NaN NaN NaN NaN NaN NaN
17 dog male categorical state NY 2.0 NaN NaN NaN NaN NaN NaN NaN NaN
18 cat female categorical trained yes 5.0 NaN NaN NaN NaN NaN NaN NaN NaN
19 cat male categorical trained no 2.0 NaN NaN NaN NaN NaN NaN NaN NaN
20 dog female categorical trained no 1.0 NaN NaN NaN NaN NaN NaN NaN NaN
21 dog female categorical trained yes 1.0 NaN NaN NaN NaN NaN NaN NaN NaN
22 dog male categorical trained no 4.0 NaN NaN NaN NaN NaN NaN NaN NaN
Hopefully this article has been beneficial to be able to use pandas to group and aggregate by multiple columns and summarize both numerical and categorical data with pandas.