When working with data as a data science or data analyst, calculating frequencies is very common and something that many industries and companies utilize to compare the means of two distinct populations.
There are many major companies and industries which use SAS (banking, insurance, etc.), but with the rise of open source and the popularity of languages such as Python and R, these companies are exploring converting their code to Python.
A commonly used procedure for performing frequency analysis in SAS is the PROC FREQ procedure. In general, the two main reasons that SAS programmers use PROC FREQ are to calculate frequencies and to perform chi-square analyses on categorical variables.
In this article, you’ll learn the Python equivalent of PROC FREQ for frequency analysis and see how you can calculate frequencies and cross tabulations with Python, as well as perform chi-square analyses on your data.
PROC FREQ Equivalent in Python for Performing Frequency Analyses
First, let’s talk about how to calculate frequencies using pandas and Python.
Let’s say I have the following dataset:
animal_type gender weight age state trained
0 cat male 10 1 CA no
1 dog male 20 4 FL no
2 dog male 30 5 NY no
3 cat female 40 3 FL yes
4 cat female 10 2 NY yes
5 dog female 20 4 TX yes
6 cat female 50 6 TX yes
7 dog male 60 1 CA no
8 dog male 70 5 NY no
9 cat female 80 4 FL yes
10 cat female 90 3 TX yes
11 cat male 100 2 TX no
12 dog female 80 4 FL no
If I’m in SAS, to get the frequencies of the column “animal_type”, we would do the following with PROC FREQ:
The result of running this SAS code is shown below:
To calculate the frequencies of different levels of a variable using pandas, we can use the pandas value_counts() function.
To get the frequencies of the different values of the “animal_type” column, we can use the pandas value_counts() function with following Python code:
data["animal_type"].value_counts()
#output:
cat 7
dog 6
Name: animal_type, dtype: int64
To get the percentages of the different values of the “animal_type” column, we can pass the option “normalize=True” to the pandas value_counts() function with following Python code:
data["animal_type"].value_counts(normalize=True)
#output:
cat 0.538462
dog 0.461538
Name: animal_type, dtype: float64
To get the cumulative counts and cumulative frequencies for the different levels of a variable, we can use the following function:
def single_freq(ds,var1):
f=ds[var1].value_counts(dropna=False)
p=ds[var1].value_counts(dropna=False, normalize=True)
df=pd.concat([f,p], axis=1, keys=['frequency', 'percent'])
df["cumfrequency"] = df["frequency"].cumsum()
df["cumpercent"] = df["percent"].cumsum()
return df
single_freq(data,"animal_type")
#output:
frequency percent cumfrequency cumpercent
cat 7 0.538462 7 0.538462
dog 6 0.461538 13 1.000000
As shown above, the final output here is the same as the SAS output for the PROC FREQ with one analysis variable.
PROC FREQ Tables Crosstab Equivalent in Python
Many times when looking at data, we want to look at and understand the distributions of different segmentations of variables.
To do a crosstab in SAS, we just add another variable to the “tables” statement.
Let’s say we want to do a simple crosstab between the columns “animal_type” and “gender” in our example. The following SAS code will give us the cross tabulation between “animal_type” and “gender”:
The output is shown below:
To do a simple cross tabulation using Python, we can use the pandas crosstab() function in the following way:
pd.crosstab(data["animal_type"],data["gender"])
#output:
gender female male
animal_type
cat 5 2
dog 2 4
If you want to find the percentages, we can pass the “normalize=’all'” option to the crosstab() function.
pd.crosstab(data["animal_type"],data["gender"], normalize='all')
#output:
gender female male
animal_type
cat 0.384615 0.153846
dog 0.153846 0.307692
We can also get the row and column percentages by passing “normalize=’index'” or passing “normalize=’columns'” to the crosstab() function:
pd.crosstab(data["animal_type"],data["gender"], normalize='index')
#output:
gender female male
animal_type
cat 0.714286 0.285714
dog 0.333333 0.666667
pd.crosstab(data["animal_type"],data["gender"], normalize='columns')
#output:
gender female male
animal_type
cat 0.714286 0.333333
dog 0.285714 0.666667
While simple crosstabs are great, we can also create a crosstab for multiple columns.
With SAS, again, it’s easy – we just need to add another variable to the tables statement.
The resulting dataset is as follows:
Below is a function which will allow you to create a crosstab for multiple columns using pandas.
def frequency(ds, vars):
if len(vars) > 1:
c1 = ds[vars[0]]
c2 = []
for i in range(1,len(vars)):
c2.append(ds[vars[i]])
dfs = []
dfs.append(pd.crosstab(c1,c2).unstack().reset_index().rename(columns={0:'Count'}))
dfs.append(pd.crosstab(c1,c2, normalize='all').unstack().reset_index().rename(columns={0:'Percent'}))
dfs.append(pd.crosstab(c1,c2, normalize='columns').unstack().reset_index().rename(columns={0:'Column Percent'}))
dfs.append(pd.crosstab(c1,c2, normalize='index').unstack().reset_index().rename(columns={0:'Row Percent'}))
dfs = [df.set_index(vars) for df in dfs]
df = dfs[0].join(dfs[1:]).reset_index()
return df
Here’s the output of our function which gives us the counts and percentages of each segment in our dataframe, and also the row and column percentages in our crosstab:
frequency(data,["animal_type","gender","state","trained"])
#output
animal_type gender state trained Count Percent Column Percent Row Percent
0 cat female FL no 0 0.000000 0.000000 0.000000
1 dog female FL no 1 0.076923 1.000000 0.166667
2 cat female FL yes 2 0.153846 1.000000 0.285714
3 dog female FL yes 0 0.000000 0.000000 0.000000
4 cat female NY yes 1 0.076923 1.000000 0.142857
5 dog female NY yes 0 0.000000 0.000000 0.000000
6 cat female TX yes 2 0.153846 0.666667 0.285714
7 dog female TX yes 1 0.076923 0.333333 0.166667
8 cat male CA no 1 0.076923 0.500000 0.142857
9 dog male CA no 1 0.076923 0.500000 0.166667
10 cat male FL no 0 0.000000 0.000000 0.000000
11 dog male FL no 1 0.076923 1.000000 0.166667
12 cat male NY no 0 0.000000 0.000000 0.000000
13 dog male NY no 2 0.153846 1.000000 0.333333
14 cat male TX no 1 0.076923 1.000000 0.142857
15 dog male TX no 0 0.000000 0.000000 0.000000
As shown above, the results between our Python crosstabs and the SAS outputs are the same.
Performing Chi-Square Analysis Using Python
Just like with PROC FREQ in SAS, we can do chi-square analysis using Python. Using the scipy.stats package, we can do one-way and two-way chi-square analysis.
In SAS, to perform a chi-square analysis, we just add the chisq option at the end of the “tables” statement.
The result of the one way chi-square analysis is shown below:
From the example data above, we can do a one-way chi-square on the “animal_type” column in following Python code using the scipy.stats chisquare function.
from scipy.stats import chisquare
chisquare(data["animal_type"].value_counts())
#output:
Power_divergenceResult(statistic=0.07692307692307693, pvalue=0.7815112949987134)
The pvalue of 0.785 shows the distribution of the values of “animal_type” are not statistically different from each other (which we know from above – 7 is not much different than 6).
Chi-Square Analysis of Contingency Table Using Python
In SAS, to perform a chi-square analysis of a contingency table, this is done in the same way as above – by adding the chisq option after the tables statement.
The SAS output for the chi-square test of a contingency table is below:
For a chi-square test of a contingency table in Python, we first need to get the crosstab of two columns and then we can pass it to the scipy.stats chi2_contingency function.
from scipy.stats import chi2_contingency
chi2_contingency(pd.crosstab(data["animal_type"],data["gender"]),correction=False)
#output:
(1.886621315192744, 0.1695834964923999, 1, array([[3.76923077, 3.23076923], [3.23076923, 2.76923077]]))
#The Chi-Square statistic is the first value, or 1.886621315192744
#The p_value is the second value, or 0.1695834964923999.
As you can verify by looking at the SAS output, the chi-square statistic and p-values are the same.
Something to know, this is just the basic two-way chi-square, what I found is that the Mantel-Haenszel Chi-Square statistic is not implemented in Python very well. The fisher’s exact statistic can be found with the scipy.stats fisher_exact() method.
Hopefully this article has helped you replicate the PROC FREQ procedure in your Python code.