When working with data as a data scientist or data analyst, it’s important to be able to find the basic descriptive statistics of a set of data.

There are many major companies and industries which use SAS (banking, insurance, etc.), but with the rise of open source and the popularity of languages such as Python and R, these companies are exploring converting their code to Python.

One of the most commonly used procedures in SAS is the PROC MEANS procedure. In this article, you’ll learn the Python equivalent of PROC MEANS. The same approach applies to getting a Python equivalent of PROC SUMMARY and PROC HPSUMMARY.


When using PROC MEANS, we need to provide a dataset, class and analysis variables, statistical options, and output datasets.

Below is an example of PROC MEANS which we will replicate in Python.

Let’s say we have data such as the following:

example-data

In SAS, we can read this in using an XLSX libname statement or PROC IMPORT.

Next, we want to get some descriptive statistics using PROC MEANS.

PROC MEANS Equivalent in Python

In SAS, when we want to find the descriptive statistics of a variable in a dataset, we use the PROC MEANS procedure.

Below is the PROC MEANS I’m going to replicate in Python:

proc-means

The output from this PROC MEANS is below:

proc-means-results

To get the Python equivalent of PROC MEANS, we will use the pandas library of Python, and utilize the describe() function:

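import pandas as pd
import numpy as np

# path is assumed to already hold the folder containing example_data.xlsx
df = pd.read_excel(path + "example_data.xlsx")
# summary statistics for a single analysis variable
df["height"].describe()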

#output:
#count 8.00000
#mean 26.25000
#std   8.34523
#min  15.00000
#25%  20.00000
#50%  25.00000
#75%  31.25000
#max  40.00000
#Name: height, dtype: float64

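As you can see, the count, mean, standard deviation, minimum, and maximum match the SAS results; only the percentiles can differ. That is because the pandas describe() function computes percentiles by linear interpolation between data points, while PROC MEANS uses a different percentile definition by default. To compute a percentile with a different method, use the quantile() function, which accepts an interpolation argument.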
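To make the percentile difference concrete, here is a minimal sketch comparing pandas' linear interpolation with a hand-written percentile that averages at ties. It assumes PROC MEANS is running with its default percentile definition (QNTLDEF=5, the empirical distribution function with averaging). The helper name sas_percentile_def5 is hypothetical, and the height values are an assumed reconstruction consistent with the summary above, not the original file.

```python
import math
import pandas as pd

# Assumed height values, chosen to match the describe() output above
height = pd.Series([20, 15, 20, 25, 40, 25, 30, 35], name="height")

def sas_percentile_def5(values, p):
    """Hypothetical helper: empirical distribution function with averaging.

    Write n*p = j + g (j integer part, g fractional part). If g == 0 the
    result is the average of the j-th and (j+1)-th ordered values,
    otherwise it is the (j+1)-th ordered value.
    """
    x = sorted(v for v in values if not pd.isnull(v))
    n = len(x)
    np_ = round(n * p, 10)  # rounding guards against floating-point noise
    j = math.floor(np_)
    g = np_ - j
    if g > 0:
        return float(x[j])      # x[j] is the (j+1)-th value (0-based list)
    if j == 0:
        return float(x[0])      # p == 0: smallest value
    if j == n:
        return float(x[-1])     # p == 1: largest value
    return (x[j - 1] + x[j]) / 2

print(height.quantile(0.75))              # linear interpolation: 31.25
print(sas_percentile_def5(height, 0.75))  # averaging definition: 32.5
```

With these values the two methods agree on the 25th and 50th percentiles and differ only on the 75th, which shows that a mismatch depends on the data as well as the method.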

PROC MEANS with OUTPUT Statement Equivalent in Python

Many times, we want to take the descriptive statistics from a dataset and create a new dataset with these statistics.

With PROC MEANS, we can add an output statement and get the following outputted dataset.

proc-means-output

proc-means-output-data

Doing this in Python is super easy. All you need to do is store the result of describe() in a variable:

example_out1 = df["height"].describe()

For a single column, describe() returns a pandas Series indexed by statistic name. You can keep working with this object like any other pandas object, just like the outputted dataset in SAS.
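Because describe() on a single column returns a Series, it can help to reshape it into a one-row DataFrame that looks more like a SAS output dataset. The sketch below is one way to do that; the inline data is an assumed stand-in for example_data.xlsx, and the name example_out1_df is illustrative.

```python
import pandas as pd

# Assumed stand-in for example_data.xlsx, consistent with the statistics shown in this article
df = pd.DataFrame({
    "type":   ["Dog", "Cat", "Cat", "Cat", "Dog", "Pig", "Pig", "Pig"],
    "height": [20, 15, 20, 25, 40, 25, 30, 35],
    "weight": [40, 20, 25, 30, 60, 60, 70, 80],
})

example_out1 = df["height"].describe()

# Individual statistics can be read by name from the Series
print(example_out1["mean"])

# Reshape to a one-row DataFrame: one row per variable, one column per statistic
example_out1_df = example_out1.to_frame().T
print(example_out1_df)
```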

PROC MEANS with Multiple Variables and OUTPUT Statement Equivalent in Python

Of course, when doing data analysis, usually we want to look at multiple variables and multiple groups.

In SAS, adding another analysis variable is very easy. Below is the PROC MEANS from above with the “weight” variable now added.

proc-means-multiple-output

Here’s the output and output dataset.

proc-means-multiple-results

proc-means-multiple-data

To replicate this PROC MEANS in Python, all you need to do is add another variable when subsetting the DataFrame.

example_out2 = df[["height","weight"]].describe()

print(example_out2)

#output:
#       height    weight
#count 8.00000  8.000000
#mean 26.25000 48.125000
#std   8.34523 22.350695
#min  15.00000 20.000000
#25%  20.00000 28.750000
#50%  25.00000 50.000000
#75%  31.25000 62.500000
#max  40.00000 80.000000
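PROC MEANS also lets you request only specific statistics. A rough pandas counterpart, sketched here with assumed inline data in place of example_data.xlsx, is to pass a list of statistic names to agg() instead of calling describe().

```python
import pandas as pd

# Assumed stand-in for example_data.xlsx, consistent with the statistics shown in this article
df = pd.DataFrame({
    "type":   ["Dog", "Cat", "Cat", "Cat", "Dog", "Pig", "Pig", "Pig"],
    "height": [20, 15, 20, 25, 40, 25, 30, 35],
    "weight": [40, 20, 25, 30, 60, 60, 70, 80],
})

# Request only the statistics of interest; rows are statistics, columns are variables
stats = df[["height", "weight"]].agg(["count", "mean", "std", "min", "max"])
print(stats)

# Transpose for one row per variable, closer to the PROC MEANS listing
print(stats.T)
```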

PROC MEANS with NMISS Equivalent in Python

One thing that the describe() function does not do is calculate the number of missing values.

In SAS, the number of missing values is easily calculated with the NMISS option of PROC MEANS.

proc-means-nmiss

The output of the above PROC MEANS shows no missing values for the “height” variable:

proc-means-nmiss-results

To get the number of missing values of a series in Python, we use the isnull() and sum() functions.

nmiss = df["height"].isnull().sum()

print(nmiss)

#output:
#0
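Since the example data has no missing values, the following sketch uses a small assumed dataset with one missing height to show how the non-missing count (N) and the missing count (NMISS) relate. The DataFrame df_missing and its values are illustrative only.

```python
import numpy as np
import pandas as pd

# Illustrative data with one missing height (not the article's original file)
df_missing = pd.DataFrame({
    "height": [20.0, np.nan, 20.0, 25.0, 40.0, 25.0, 30.0, 35.0],
    "weight": [40.0, 20.0, 25.0, 30.0, 60.0, 60.0, 70.0, 80.0],
})

analysis_vars = ["height", "weight"]

# count() ignores missing values, isnull().sum() counts them
summary = pd.DataFrame({
    "n": df_missing[analysis_vars].count(),
    "nmiss": df_missing[analysis_vars].isnull().sum(),
})
print(summary)
```

The two columns always add up to the number of rows, which is a quick sanity check when porting NMISS logic.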

PROC MEANS with CLASS Statement Equivalent in Python

Next, when doing data analysis, usually, we want to find descriptive statistics by different groups.

For our data, for example, we have the “type” variable and this variable has different types of animal.

When presenting our data, we know that dogs are different than cats, and cats are different than pigs.

When creating the PROC MEANS to get the descriptive statistics by group, all we need to do is add CLASS to the PROC MEANS.

The output from this PROC MEANS is shown below:

proc-means-class-results

Here is the outputted dataset from the above PROC MEANS:

proc-means-class-data

To get the Python equivalent of PROC MEANS with a CLASS statement, we can do the following.

The pandas DataFrame has a function groupby() which allows you to group the data.

Using this function, we can get the same output as above:

example_out3 = df.groupby("type")["height"].describe().reset_index()

print(example_out3)

#output:
#   type count  mean       std  min  25%  50%  75%  max
#0   Cat   3.0  20.0  5.000000 15.0 17.5 20.0 22.5 25.0
#1   Dog   2.0  30.0 14.142136 20.0 25.0 30.0 35.0 40.0
#2   Pig   3.0  30.0  5.000000 25.0 27.5 30.0 32.5 35.0

To get exactly the outputted data from above, we can keep only the columns we want (mean and std), and rename those columns.

example_out3.rename(columns={"mean":"height_avg", "std":"height_std"}, inplace=True)

example_out3 = example_out3[["type","height_avg","height_std"]]

print(example_out3)

#output:
#    type height_avg height_std
#0    Cat       20.0   5.000000
#1    Dog       30.0  14.142136
#2    Pig       30.0   5.000000

PROC MEANS with CLASS Statement, Multiple Variables and OUTPUT Statement Equivalent in Python

Finally, if we want to summarize multiple variables by group, we can do this in Python in a similar way to the examples above.

Below is the PROC MEANS which we will be replicating in Python:

proc-means-class-multiple

The output from the PROC MEANS is below:

proc-means-class-multiple-results

The SAS dataset which is outputted is below:

proc-means-class-multiple-data

To get this same structure, we need to do a little bit more work.

The first thing we can try is simply to add “weight” when subsetting the DataFrame after the application of groupby():

example_out4 = df.groupby("type")[["height","weight"]].describe()

This gives us the summary statistics we want, but not in the layout we are looking for. The result is a single DataFrame whose columns form a two-level MultiIndex (the variable on the first level and the statistic on the second), which makes working with it a little more involved than the previous examples.
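To see what that grouped result looks like, the sketch below inspects its two column levels (variable, then statistic) and flattens them into single names such as height_mean. The inline data is an assumed stand-in for example_data.xlsx, and the name flat is illustrative.

```python
import pandas as pd

# Assumed stand-in for example_data.xlsx, consistent with the statistics shown in this article
df = pd.DataFrame({
    "type":   ["Dog", "Cat", "Cat", "Cat", "Dog", "Pig", "Pig", "Pig"],
    "height": [20, 15, 20, 25, 40, 25, 30, 35],
    "weight": [40, 20, 25, 30, 60, 60, 70, 80],
})

example_out4 = df.groupby("type")[["height", "weight"]].describe()

# Columns are (variable, statistic) pairs, e.g. ("height", "mean")
print(example_out4.columns.nlevels)
print(example_out4[("height", "mean")])

# Flatten the two levels into names like "height_mean"
flat = example_out4.copy()
flat.columns = [f"{var}_{stat}" for var, stat in flat.columns]
flat = flat.reset_index()
print(flat[["type", "height_mean", "height_std", "weight_mean", "weight_std"]])
```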

We can try to use the merge() function, but things get messy fast. Also, if we wanted to do more than 2 variables, we would have to merge many times.

example_out4 = example_out4["height"].reset_index().merge(example_out4["weight"].reset_index(),on="type")

But, this works for our example – to get the output dataset, we would just need to rename some columns and then we can get the same output dataset:

example_out4.rename(columns={"mean_x":"height_avg", "std_x":"height_std","mean_y":"weight_avg", "std_y":"weight_std"}, inplace=True)

example_out4 = example_out4[["type","height_avg","height_std","weight_avg","weight_std"]]

#output:
#   type height_avg  height_std   weight_avg   weight_std
#0   Cat       20.0    5.000000         25.0     5.000000
#1   Dog       30.0   14.142136         50.0    14.142136
#2   Pig       30.0    5.000000         70.0    10.000000

However, as I mentioned above, while the code above works, it’s messy. Check out this article for how to group by multiple columns and summarize data with pandas.
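One less messy alternative, when only a few statistics are needed, is pandas named aggregation, which names the output columns directly and avoids both the merge and the rename. This is a sketch with assumed inline data in place of example_data.xlsx; the name example_out5 is illustrative.

```python
import pandas as pd

# Assumed stand-in for example_data.xlsx, consistent with the statistics shown in this article
df = pd.DataFrame({
    "type":   ["Dog", "Cat", "Cat", "Cat", "Dog", "Pig", "Pig", "Pig"],
    "height": [20, 15, 20, 25, 40, 25, 30, 35],
    "weight": [40, 20, 25, 30, 60, 60, 70, 80],
})

# Each keyword is an output column: new_name=(source_column, statistic)
example_out5 = (
    df.groupby("type")
      .agg(
          height_avg=("height", "mean"),
          height_std=("height", "std"),
          weight_avg=("weight", "mean"),
          weight_std=("weight", "std"),
      )
      .reset_index()
)
print(example_out5)
```

Adding a third analysis variable only requires two more keyword arguments, rather than another merge.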

I prefer the function below for finding the descriptive statistics of a DataFrame given a group variable. It loops over the group levels one at a time, so it works well for relatively small datasets.

def proc_means_equiv_w_class(ds, analysis_vars, group_var):
    # analysis_vars is a space-separated string of column names
    var_list = analysis_vars.split(" ")
    pieces = []
    for level in pd.unique(ds[group_var]):
        temp = ds[ds[group_var] == level]
        # describe() has one column per variable; transpose to one row per variable
        temp2 = temp[var_list].describe().transpose()
        temp2["level"] = level
        # missing counts are aligned to the variable names in the index
        temp2["nmiss"] = temp[var_list].isnull().sum()
        temp2.reset_index(inplace=True)
        pieces.append(temp2)
    # DataFrame.append was removed in pandas 2.0, so combine the pieces with pd.concat
    df = pd.concat(pieces, ignore_index=True)
    df.rename(columns={"25%":"p25", "75%":"p75", "50%": "median", "count":"n", "index":"var"}, inplace=True)
    return df[['level','var','nmiss','n','mean','median','std','min','max','p25','p75']]

analysis = "height weight"
group = "type"

print(proc_means_equiv_w_class(df, analysis, group))

#output:
#    level      var nmiss   n mean median         std   min    max  p25   p75
#0     Dog   height     0 2.0 30.0   30.0   14.142136  20.0   40.0 25.0  35.0
#1     Dog   weight     0 2.0 50.0   50.0   14.142136  40.0   60.0 45.0  55.0
#2     Cat   height     0 3.0 20.0   20.0    5.000000  15.0   25.0 17.5  22.5
#3     Cat   weight     0 3.0 25.0   25.0    5.000000  20.0   30.0 22.5  27.5
#4     Pig   height     0 3.0 30.0   30.0    5.000000  25.0   35.0 27.5  32.5
#5     Pig   weight     0 3.0 70.0   70.0   10.000000  60.0   80.0 65.0  75.0

I hope that this article has given you everything you need to know about converting your PROC MEANS procedure into Python code.

Last Update: April 1, 2024