When working with data as a data science or data analyst, calculating frequencies is very common and something that many industries and companies utilize to compare the means of two distinct populations.

There are many major companies and industries which use SAS (banking, insurance, etc.), but with the rise of open source and the popularity of languages such as Python and R, these companies are exploring converting their code to Python.

A commonly used procedure for performing frequency analysis in SAS is the PROC FREQ procedure. In general, the two main reasons that SAS programmers use PROC FREQ are to calculate frequencies and to perform chi-square analyses on categorical variables.

In this article, you’ll learn the Python equivalent of PROC FREQ for frequency analysis and see how you can calculate frequencies and cross tabulations with Python, as well as perform chi-square analyses on your data.

PROC FREQ Equivalent in Python for Performing Frequency Analyses

First, let’s talk about how to calculate frequencies using pandas and Python.

Let’s say I have the following dataset:

   animal_type  gender  weight  age state trained
0          cat    male      10    1    CA      no
1          dog    male      20    4    FL      no
2          dog    male      30    5    NY      no
3          cat  female      40    3    FL     yes
4          cat  female      10    2    NY     yes
5          dog  female      20    4    TX     yes
6          cat  female      50    6    TX     yes
7          dog    male      60    1    CA      no
8          dog    male      70    5    NY      no
9          cat  female      80    4    FL     yes
10         cat  female      90    3    TX     yes
11         cat    male     100    2    TX      no
12         dog  female      80    4    FL      no

If I’m in SAS, to get the frequencies of the column “animal_type”, we would do the following with PROC FREQ:

proc freq one variable

The result of running this SAS code is shown below:

proc freq output

To calculate the frequencies of different levels of a variable using pandas, we can use the pandas value_counts() function.

To get the frequencies of the different values of the “animal_type” column, we can use the pandas value_counts() function with following Python code:

data["animal_type"].value_counts()

#output:
cat    7
dog    6
Name: animal_type, dtype: int64

To get the percentages of the different values of the “animal_type” column, we can pass the option “normalize=True” to the pandas value_counts() function with following Python code:

data["animal_type"].value_counts(normalize=True)

#output:
cat    0.538462
dog    0.461538
Name: animal_type, dtype: float64

To get the cumulative counts and cumulative frequencies for the different levels of a variable, we can use the following function:

def single_freq(ds,var1):
    f=ds[var1].value_counts(dropna=False)
    p=ds[var1].value_counts(dropna=False, normalize=True)
    df=pd.concat([f,p], axis=1, keys=['frequency', 'percent'])
    df["cumfrequency"] = df["frequency"].cumsum()
    df["cumpercent"] = df["percent"].cumsum()
    return df

single_freq(data,"animal_type")

#output:
     frequency   percent  cumfrequency  cumpercent
cat          7  0.538462             7    0.538462
dog          6  0.461538            13    1.000000

As shown above, the final output here is the same as the SAS output for the PROC FREQ with one analysis variable.

PROC FREQ Tables Crosstab Equivalent in Python

Many times when looking at data, we want to look at and understand the distributions of different segmentations of variables.

To do a crosstab in SAS, we just add another variable to the “tables” statement.

Let’s say we want to do a simple crosstab between the columns “animal_type” and “gender” in our example. The following SAS code will give us the cross tabulation between “animal_type” and “gender”:

proc freq crosstab

The output is shown below:

proc freq crosstab output

To do a simple cross tabulation using Python, we can use the pandas crosstab() function in the following way:

pd.crosstab(data["animal_type"],data["gender"])

#output:
gender       female  male
animal_type
cat               5     2
dog               2     4

If you want to find the percentages, we can pass the “normalize=’all'” option to the crosstab() function.

pd.crosstab(data["animal_type"],data["gender"], normalize='all')

#output:
gender         female      male
animal_type
cat          0.384615  0.153846
dog          0.153846  0.307692

We can also get the row and column percentages by passing “normalize=’index'” or passing “normalize=’columns'” to the crosstab() function:

pd.crosstab(data["animal_type"],data["gender"], normalize='index')

#output:
gender         female      male
animal_type
cat          0.714286  0.285714
dog          0.333333  0.666667

pd.crosstab(data["animal_type"],data["gender"], normalize='columns')

#output:
gender         female      male
animal_type
cat          0.714286  0.333333
dog          0.285714  0.666667

While simple crosstabs are great, we can also create a crosstab for multiple columns.

With SAS, again, it’s easy – we just need to add another variable to the tables statement.

multiple column crosstab SAS

The resulting dataset is as follows:

multiple column crosstab SAS output

Below is a function which will allow you to create a crosstab for multiple columns using pandas.

def frequency(ds, vars):
    if len(vars) > 1:
        c1 = ds[vars[0]]
        c2 = []
        for i in range(1,len(vars)):
            c2.append(ds[vars[i]])
        dfs = []
        dfs.append(pd.crosstab(c1,c2).unstack().reset_index().rename(columns={0:'Count'}))
        dfs.append(pd.crosstab(c1,c2, normalize='all').unstack().reset_index().rename(columns={0:'Percent'}))
        dfs.append(pd.crosstab(c1,c2, normalize='columns').unstack().reset_index().rename(columns={0:'Column Percent'}))
        dfs.append(pd.crosstab(c1,c2, normalize='index').unstack().reset_index().rename(columns={0:'Row Percent'}))
        dfs = [df.set_index(vars) for df in dfs]
        df = dfs[0].join(dfs[1:]).reset_index()
        return df

Here’s the output of our function which gives us the counts and percentages of each segment in our dataframe, and also the row and column percentages in our crosstab:

frequency(data,["animal_type","gender","state","trained"])

#output
   animal_type  gender state trained  Count   Percent  Column Percent  Row Percent
0          cat  female    FL      no      0  0.000000        0.000000     0.000000
1          dog  female    FL      no      1  0.076923        1.000000     0.166667
2          cat  female    FL     yes      2  0.153846        1.000000     0.285714
3          dog  female    FL     yes      0  0.000000        0.000000     0.000000
4          cat  female    NY     yes      1  0.076923        1.000000     0.142857
5          dog  female    NY     yes      0  0.000000        0.000000     0.000000
6          cat  female    TX     yes      2  0.153846        0.666667     0.285714
7          dog  female    TX     yes      1  0.076923        0.333333     0.166667
8          cat    male    CA      no      1  0.076923        0.500000     0.142857
9          dog    male    CA      no      1  0.076923        0.500000     0.166667
10         cat    male    FL      no      0  0.000000        0.000000     0.000000
11         dog    male    FL      no      1  0.076923        1.000000     0.166667
12         cat    male    NY      no      0  0.000000        0.000000     0.000000
13         dog    male    NY      no      2  0.153846        1.000000     0.333333
14         cat    male    TX      no      1  0.076923        1.000000     0.142857
15         dog    male    TX      no      0  0.000000        0.000000     0.000000

As shown above, the results between our Python crosstabs and the SAS outputs are the same.

Performing Chi-Square Analysis Using Python

Just like with PROC FREQ in SAS, we can do chi-square analysis using Python. Using the scipy.stats package, we can do one-way and two-way chi-square analysis.

In SAS, to perform a chi-square analysis, we just add the chisq option at the end of the “tables” statement.

chisquare sas

The result of the one way chi-square analysis is shown below:

chisquare sas output

From the example data above, we can do a one-way chi-square on the “animal_type” column in following Python code using the scipy.stats chisquare function.

from scipy.stats import chisquare

chisquare(data["animal_type"].value_counts())

#output:
Power_divergenceResult(statistic=0.07692307692307693, pvalue=0.7815112949987134)

The pvalue of 0.785 shows the distribution of the values of “animal_type” are not statistically different from each other (which we know from above – 7 is not much different than 6).

Chi-Square Analysis of Contingency Table Using Python

In SAS, to perform a chi-square analysis of a contingency table, this is done in the same way as above – by adding the chisq option after the tables statement.

chisquare contingency sas

The SAS output for the chi-square test of a contingency table is below:

chisquare contingency output sas

For a chi-square test of a contingency table in Python, we first need to get the crosstab of two columns and then we can pass it to the scipy.stats chi2_contingency function.

from scipy.stats import chi2_contingency

chi2_contingency(pd.crosstab(data["animal_type"],data["gender"]),correction=False)

#output:
(1.886621315192744, 0.1695834964923999, 1, array([[3.76923077, 3.23076923], [3.23076923, 2.76923077]]))
#The Chi-Square statistic is the first value, or 1.886621315192744
#The p_value is the second value, or 0.1695834964923999.

As you can verify by looking at the SAS output, the chi-square statistic and p-values are the same.

Something to know, this is just the basic two-way chi-square, what I found is that the Mantel-Haenszel Chi-Square statistic is not implemented in Python very well. The fisher’s exact statistic can be found with the scipy.stats fisher_exact() method.

Hopefully this article has helped you replicate the PROC FREQ procedure in your Python code.

Categorized in:

Python,

Last Update: February 26, 2024