When working with data as a data scientist or data analyst, it’s important to be able to find the basic descriptive statistics of a set of data.
There are many major companies and industries which use SAS (banking, insurance, etc.), but with the rise of open source and the popularity of languages such as Python and R, these companies are exploring converting their code to Python.
One of the most commonly used procedures in SAS is the PROC MEANS procedure. In this article, you’ll learn the Python equivalent of PROC MEANS (and note, getting a Python equivalent of PROC SUMMARY and PROC HPSUMMARY will be similar).
This article contains the following examples:
- PROC MEANS Equivalent in Python
- PROC MEANS with OUTPUT Statement Equivalent in Python
- PROC MEANS with Multiple Variables and OUTPUT Statement Equivalent in Python
- PROC MEANS with NMISS Equivalent in Python
- PROC MEANS with CLASS Statement Equivalent in Python
- PROC MEANS with CLASS Statement, Multiple Variables and OUTPUT Statement Equivalent in Python
When using PROC MEANS, we need to provide a dataset, class and analysis variables, statistical options, and output datasets.
Below is an example of PROC MEANS which we will replicate in Python.
Let’s say we have data such as the following:
In SAS, we can read this in using a XLSX libname statement or PROC IMPORT.
Next, we want to get some descriptive statistics using PROC MEANS.
PROC MEANS Equivalent in Python
In SAS, when we want to find the descriptive statistics of a variable in a dataset, we use the PROC MEANS procedure.
Below is the PROC MEANS I’m going to replicate in Python:
The output from this PROC MEANS is below:
To get the Python equivalent of PROC MEANS, we will use the pandas library of Python, and utilize the describe() function:
import pandas as pd
# path is assumed to be a string holding the folder that contains example_data.xlsx
df = pd.read_excel(path + "example_data.xlsx")
df["height"].describe()
#output:
#count 8.00000
#mean 26.25000
#std 8.34523
#min 15.00000
#25% 20.00000
#50% 25.00000
#75% 31.25000
#max 40.00000
#Name: height, dtype: float64
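As you can see, we get the same results, except for the percentiles. That is because the pandas describe() function computes percentiles by linear interpolation between data points, while PROC MEANS by default uses a different percentile definition (QNTLDEF=5, the empirical distribution function with averaging). To compute a single percentile in pandas, use the quantile() function; note that it also interpolates linearly by default, and its interpolation parameter lets you choose another method.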
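To make the difference concrete, here is a minimal sketch. The sample heights are reconstructed from the summary statistics printed in this article, and `sas_default_percentile` is a hypothetical helper (not part of pandas or SAS) that follows the percentile definition SAS documents as the PROC MEANS default (QNTLDEF=5).

```python
import math
import pandas as pd

# Heights reconstructed from the summary statistics shown in this article (an assumption).
height = pd.Series([20, 40, 15, 20, 25, 25, 30, 35], name="height")

def sas_default_percentile(values, p):
    """Hypothetical helper: percentile from the empirical distribution
    function with averaging (the definition SAS calls QNTLDEF=5)."""
    x = sorted(v for v in values if not pd.isna(v))
    n = len(x)
    j = math.floor(n * p)   # integer part of n*p
    g = n * p - j           # fractional part of n*p
    if j == 0:
        return float(x[0])
    if j >= n:
        return float(x[-1])
    if g == 0:
        # n*p is a whole number: average the two neighbouring order statistics
        return (x[j - 1] + x[j]) / 2
    # otherwise take the next order statistic
    return float(x[j])

pandas_p75 = height.quantile(0.75)                    # linear interpolation (pandas default)
sas_like_p75 = sas_default_percentile(height, 0.75)   # averaging definition
print(pandas_p75, sas_like_p75)
# 31.25 32.5
```

On this sample the 25th and 50th percentiles agree between the two definitions, and only the 75th differs.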
PROC MEANS with OUTPUT Statement Equivalent in Python
Many times, we want to take the descriptive statistics from a dataset and create a new dataset with these statistics.
With PROC MEANS, we can add an output statement and get the following outputted dataset.
Doing this in Python is straightforward. All you need to do is store the result of describe() in a variable:
example_out1 = df["height"].describe()
For a single column, describe() returns a pandas Series indexed by statistic name. You can use it like any other Series, or convert it to a DataFrame with to_frame(), much like the outputted dataset in SAS.
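As a short illustration of working with the stored result, the sketch below reads individual statistics by label and reshapes the result into a one-row DataFrame, which is closer in shape to a SAS output dataset. The sample data is reconstructed from the statistics printed in this article, and the name `out_row` is only illustrative.

```python
import pandas as pd

# Sample data reconstructed from the statistics printed in this article (an assumption).
df = pd.DataFrame({"height": [20, 40, 15, 20, 25, 25, 30, 35]})

example_out1 = df["height"].describe()

# Individual statistics are looked up by their label.
height_mean = example_out1["mean"]
height_max = example_out1["max"]

# One row (the variable), one column per statistic.
out_row = example_out1.to_frame().T
print(out_row[["mean", "std"]])
```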
PROC MEANS with Multiple Variables and OUTPUT Statement Equivalent in Python
Of course, when doing data analysis, usually we want to look at multiple variables and multiple groups.
In SAS, adding another analysis variable is very easy. Below is the PROC MEANS from above with the “weight” variable now added.
Here’s the output and output dataset.
To replicate this PROC MEANS in Python, all you need to do is add another variable when subsetting the DataFrame.
example_out2 = df[["height","weight"]].describe()
print(example_out2)
#output:
# height weight
#count 8.00000 8.000000
#mean 26.25000 48.125000
#std 8.34523 22.350695
#min 15.00000 20.000000
#25% 20.00000 28.750000
#50% 25.00000 50.000000
#75% 31.25000 62.500000
#max 40.00000 80.000000
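If only some statistics are needed, as with `mean=` and `std=` on an OUTPUT statement, one option is the `agg()` method, which accepts a list of statistic names. A minimal sketch, using sample data reconstructed from the numbers above (the pairing of heights and weights within a row is an assumption):

```python
import pandas as pd

# Reconstructed sample data; the row-level pairing of values is assumed.
df = pd.DataFrame({
    "type":   ["Dog", "Dog", "Cat", "Cat", "Cat", "Pig", "Pig", "Pig"],
    "height": [20, 40, 15, 20, 25, 25, 30, 35],
    "weight": [40, 60, 20, 25, 30, 60, 70, 80],
})

# Rows are the requested statistics, columns are the analysis variables.
selected_stats = df[["height", "weight"]].agg(["mean", "std"])
print(selected_stats)
```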
PROC MEANS with NMISS Equivalent in Python
One thing that the describe() function does not do is calculate the number of missing values.
Calculating the number of missing values in SAS with PROC MEANS is easily done with the NMISS option.
The output of the above PROC MEANS shows no missing values for the “height” variable:
To get the number of missing values of a series in Python, we use the isnull() and sum() functions.
nmiss = df["height"].isnull().sum()
print(nmiss)
#output:
#0
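The same idea extends to several columns at once. In the sketch below, one weight is deliberately set to missing (a made-up modification for illustration, not part of the article's data) so that the counts are not all zero.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "height": [20, 40, 15, 20, 25, 25, 30, 35],
    "weight": [40, 60, 20, np.nan, 30, 60, 70, 80],  # one value blanked out on purpose
})

# isnull() marks missing cells as True; sum() counts them per column.
nmiss_by_var = df[["height", "weight"]].isnull().sum()
print(nmiss_by_var)

# count() returns the non-missing count, so n + nmiss equals the number of rows.
n_by_var = df[["height", "weight"]].count()
```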
PROC MEANS with CLASS Statement Equivalent in Python
Next, when doing data analysis, usually, we want to find descriptive statistics by different groups.
For our data, for example, we have the “type” variable and this variable has different types of animal.
When presenting our data, we know that dogs are different than cats, and cats are different than pigs.
When creating the PROC MEANS to get the descriptive statistics by group, all we need to do is add CLASS to the PROC MEANS.
The output from this PROC MEANS is shown below:
Here is the outputted dataset from the above PROC MEANS:
To get the Python equivalent of PROC MEANS with a CLASS statement, we can do the following.
The pandas DataFrame has a function groupby() which allows you to group the data.
Using this function, we can get the same output as above:
example_out3 = df.groupby("type")["height"].describe().reset_index()
print(example_out3)
#output:
# type count mean std min 25% 50% 75% max
#0 Cat 3.0 20.0 5.000000 15.0 17.5 20.0 22.5 25.0
#1 Dog 2.0 30.0 14.142136 20.0 25.0 30.0 35.0 40.0
#2 Pig 3.0 30.0 5.000000 25.0 27.5 30.0 32.5 35.0
To get exactly the outputted data from above, we can keep only the columns we want (mean and std), and rename those columns.
example_out3.rename(columns={"mean":"height_avg", "std":"height_std"}, inplace=True)
example_out3 = example_out3[["type","height_avg","height_std"]]
print(example_out3)
#output:
# type height_avg height_std
#0 Cat 20.0 5.000000
#1 Dog 30.0 14.142136
#2 Pig 30.0 5.000000
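An alternative that avoids the separate rename step is pandas named aggregation, where each output column is declared as `new_name=(column, statistic)`. This is a sketch rather than the approach used above, with sample data reconstructed from the output in this article:

```python
import pandas as pd

# Reconstructed sample data; the row-level pairing of values is assumed.
df = pd.DataFrame({
    "type":   ["Dog", "Dog", "Cat", "Cat", "Cat", "Pig", "Pig", "Pig"],
    "height": [20, 40, 15, 20, 25, 25, 30, 35],
    "weight": [40, 60, 20, 25, 30, 60, 70, 80],
})

# Each keyword names an output column and says which statistic of which column fills it.
by_type = (
    df.groupby("type")
      .agg(height_avg=("height", "mean"), height_std=("height", "std"))
      .reset_index()
)
print(by_type)
```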
PROC MEANS with CLASS Statement, Multiple Variables and OUTPUT Statement Equivalent in Python
Finally, if we want multiple analysis variables together with a CLASS statement, the Python approach is similar to the one above.
Below is the PROC MEANS which we will be replicating in Python:
The output from the PROC MEANS is below:
The SAS dataset which is outputted is below:
To get this same structure, we need to do a little bit more work.
The first thing we can try is to add “weight” when subsetting the DataFrame after the application of groupby():
example_out4 = df.groupby("type")[["height","weight"]].describe()
This gives us the summary statistics we want, but not in the shape we are looking for. It returns a DataFrame whose columns form a two-level MultiIndex (variable first, then statistic), which makes working with it a little more involved than the previous examples.
We can try to use the merge() function, but things get messy fast. Also, if we wanted more than two variables, we would have to merge many times.
example_out4 = example_out4["height"].reset_index().merge(example_out4["weight"].reset_index(),on="type")
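Still, this works for our example. Because both sides of the merge share the same statistic column names, merge() appends the suffixes _x (height) and _y (weight) to them, so to get the output dataset we just need to rename some columns and keep the ones we want: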
example_out4.rename(columns={"mean_x":"height_avg", "std_x":"height_std","mean_y":"weight_avg", "std_y":"weight_std"}, inplace=True)
example_out4 = example_out4[["type","height_avg","height_std","weight_avg","weight_std"]]
#output:
# type height_avg height_std weight_avg weight_std
#0 Cat 20.0 5.000000 25.0 5.000000
#1 Dog 30.0 14.142136 50.0 14.142136
#2 Pig 30.0 5.000000 70.0 10.000000
However, as I mentioned above, while the code above works, it’s messy. Check out this article for how to group by multiple columns and summarize data with pandas.
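One possible way to avoid the repeated merges is to request only the needed statistics with `agg()` and then flatten the resulting two-level column index. The sketch below is one such approach using sample data reconstructed from the output above; the `suffix` mapping is chosen here only to mirror the column names used earlier.

```python
import pandas as pd

# Reconstructed sample data; the row-level pairing of values is assumed.
df = pd.DataFrame({
    "type":   ["Dog", "Dog", "Cat", "Cat", "Cat", "Pig", "Pig", "Pig"],
    "height": [20, 40, 15, 20, 25, 25, 30, 35],
    "weight": [40, 60, 20, 25, 30, 60, 70, 80],
})

stats = df.groupby("type")[["height", "weight"]].agg(["mean", "std"])

# Columns are (variable, statistic) pairs; join them into names like height_avg.
suffix = {"mean": "avg", "std": "std"}
stats.columns = [f"{var}_{suffix[stat]}" for var, stat in stats.columns]
flat = stats.reset_index()
print(flat)
```

Adding a third variable only requires extending the column list, with no extra merge.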
I prefer the function below for finding the descriptive statistics of a DataFrame given a group variable. This function works well for relatively small datasets.
def proc_means_equiv_w_class(ds, analysis_vars, group_var):
    # analysis_vars is a space-separated string of column names, like a SAS VAR statement
    var_list = analysis_vars.split(" ")
    pieces = []
    for level in pd.unique(ds[group_var]):
        temp = ds[ds[group_var] == level]
        # describe() gives one column per variable; transpose to one row per variable
        temp2 = temp[var_list].describe().transpose()
        temp2["level"] = level
        # the missing counts align with temp2 on the variable names in the index
        temp2["nmiss"] = temp[var_list].isnull().sum()
        pieces.append(temp2.reset_index())
    # DataFrame.append was removed in pandas 2.0, so collect the pieces and concatenate once
    df = pd.concat(pieces, ignore_index=True)
    df = df.rename(columns={"25%":"p25", "75%":"p75", "50%": "median", "count":"n", "index":"var"})
    return df[['level','var','nmiss','n','mean','median','std','min','max','p25','p75']]

analysis = "height weight"
group = "type"
print(proc_means_equiv_w_class(df, analysis, group))
#output:
# level var nmiss n mean median std min max p25 p75
#0 Dog height 0 2.0 30.0 30.0 14.142136 20.0 40.0 25.0 35.0
#1 Dog weight 0 2.0 50.0 50.0 14.142136 40.0 60.0 45.0 55.0
#2 Cat height 0 3.0 20.0 20.0 5.000000 15.0 25.0 17.5 22.5
#3 Cat weight 0 3.0 25.0 25.0 5.000000 20.0 30.0 22.5 27.5
#4 Pig height 0 3.0 30.0 30.0 5.000000 25.0 35.0 27.5 32.5
#5 Pig weight 0 3.0 70.0 70.0 10.000000 60.0 80.0 65.0 75.0
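For larger datasets, looping over each level may become slow. A possible loop-free variant, sketched below, reshapes the data to long form with `melt()` and lets `groupby()` compute the statistics for every group and variable together. The function name `proc_means_long` and the sample data (reconstructed from the output above) are assumptions for illustration.

```python
import pandas as pd

# Reconstructed sample data; the row-level pairing of values is assumed.
df = pd.DataFrame({
    "type":   ["Dog", "Dog", "Cat", "Cat", "Cat", "Pig", "Pig", "Pig"],
    "height": [20, 40, 15, 20, 25, 25, 30, 35],
    "weight": [40, 60, 20, 25, 30, 60, 70, 80],
})

def proc_means_long(ds, analysis_vars, group_var):
    """Hypothetical helper: one row per (group level, variable)."""
    # Long form: one row per original cell, with the variable name in "var".
    long = ds.melt(id_vars=[group_var], value_vars=analysis_vars,
                   var_name="var", value_name="value")
    grouped = long.groupby([group_var, "var"])["value"]
    out = grouped.describe()
    # size() counts all rows, count() only non-missing ones.
    out["nmiss"] = grouped.size() - grouped.count()
    out = out.rename(columns={"count": "n", "25%": "p25",
                              "50%": "median", "75%": "p75"})
    return out.reset_index()

result = proc_means_long(df, ["height", "weight"], "type")
print(result)
```

Unlike the loop-based function, this version takes a list of column names and returns the groups sorted by key (Cat, Dog, Pig) rather than in order of appearance.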
I hope that this article has given you everything you need to know about converting your PROC MEANS procedure into Python code.