To find the standard deviation of a series or a column in a DataFrame in pandas, the easiest way is to use the pandas std() function.
df["Column1"].std()
You can also use the numpy std() function, but be careful: by default numpy computes the population standard deviation (ddof=0), while pandas std() computes the sample standard deviation (ddof=1).
np.std(df["Column1"])  # Different result from the default pandas function
np.std(df["Column1"], ddof=1)  # Same result as the default pandas function
When doing data analysis, summary statistics such as the mean or median help us understand the data. Another useful summary statistic is the standard deviation, which measures how spread out the values of a variable are around their mean.
Finding the standard deviation of columns or a Series using pandas is easy. We can use the pandas std() function to find the standard deviation of a column of numbers.
Let’s say we have the following DataFrame.
import pandas as pd
import numpy as np

df = pd.DataFrame({'Name': ['Jim', 'Sally', 'Bob', 'Sue', 'Jill', 'Larry'],
                   'Weight': [160.20, 160.20, 209.45, 150.35, 187.52, 187.52],
                   'Height': [50.10, 68.94, 71.42, 48.56, 59.37, 63.42]})
print(df)
# Output:
    Name  Weight  Height
0    Jim  160.20   50.10
1  Sally  160.20   68.94
2    Bob  209.45   71.42
3    Sue  150.35   48.56
4   Jill  187.52   59.37
5  Larry  187.52   63.42
To get the standard deviation of the column “Height”, we can use the pandas std() function in the following Python code:
print(df["Height"].std())
# Output:
9.49495532726019
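The same std() call also works on several columns at once. The following is a minimal sketch that rebuilds the example DataFrame and selects the two numeric columns explicitly, so the text column "Name" is left out. The variable name `stds` is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Jim', 'Sally', 'Bob', 'Sue', 'Jill', 'Larry'],
                   'Weight': [160.20, 160.20, 209.45, 150.35, 187.52, 187.52],
                   'Height': [50.10, 68.94, 71.42, 48.56, 59.37, 63.42]})

# Select only the numeric columns, then call std() once.
# The result is a Series with one standard deviation per column.
stds = df[["Weight", "Height"]].std()
print(stds)

# Individual values can be read by column name.
print(stds["Height"])
```

Selecting the columns by name keeps the call independent of how a given pandas version treats non-numeric columns.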
Calculating the Standard Deviation of a Series with numpy
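We can also find the standard deviation of a series using the numpy std() function. If our code already works with numpy arrays, or if we are processing large amounts of data, the numpy std() function might be faster, although the difference depends on the data and is worth measuring.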
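One rough way to check whether numpy is actually faster for your data is to time both calls with the standard library's timeit module. The sketch below uses a randomly generated Series of 100,000 values as a stand-in for real data; the size, the seed, and the number of repetitions are arbitrary assumptions, and the timings will vary from machine to machine.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical test data: 100,000 normally distributed values.
rng = np.random.default_rng(0)
s = pd.Series(rng.normal(size=100_000))

# Time 100 calls of each approach. ddof=1 makes numpy compute the
# same sample standard deviation that pandas computes by default.
pandas_time = timeit.timeit(lambda: s.std(), number=100)
numpy_time = timeit.timeit(lambda: np.std(s.to_numpy(), ddof=1), number=100)

print(f"pandas: {pandas_time:.4f}s  numpy: {numpy_time:.4f}s")
```

Treat the numbers as a local measurement only, not as a general ranking of the two libraries.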
Let’s say we have the same dataset as above.
To get the standard deviation of the column “Height”, we can use the numpy std() function in the following Python code.
print(np.std(df["Height"]))
# Output:
8.667668692073754
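As you can verify for yourself, this is a different result from the pandas std() function. The reason is that the two libraries normalize differently by default: pandas std() divides the sum of squared deviations by N - 1 (ddof=1, the sample standard deviation), while numpy std() divides by N (ddof=0, the population standard deviation).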
To get the same standard deviation using both numpy and pandas, you need to pass ‘ddof=1’ to the numpy std() function.
print(np.std(df["Height"]))
print(np.std(df["Height"],ddof=1))
print(df["Height"].std())
# Output:
8.667668692073754
9.49495532726019
9.49495532726019
As you can see above, the numpy std() function returns the same result as the pandas std() function when we pass ‘ddof=1’.
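To see what ddof changes, the calculation can be written out by hand. The sketch below is only an illustration: `manual_std` is a hypothetical helper, not part of pandas or numpy. It divides the sum of squared deviations by N - ddof before taking the square root.

```python
import math

import numpy as np
import pandas as pd

heights = pd.Series([50.10, 68.94, 71.42, 48.56, 59.37, 63.42])

def manual_std(values, ddof):
    """Standard deviation with divisor N - ddof (illustrative helper)."""
    n = len(values)
    mean = sum(values) / n
    squared_deviations = sum((x - mean) ** 2 for x in values)
    return math.sqrt(squared_deviations / (n - ddof))

population_std = manual_std(heights, ddof=0)  # divisor N, numpy's default (about 8.668)
sample_std = manual_std(heights, ddof=1)      # divisor N - 1, pandas' default (about 9.495)

print(population_std, np.std(heights))
print(sample_std, heights.std())

# pandas also accepts ddof, so it can match numpy's default instead.
print(heights.std(ddof=0))
```

Passing ddof=0 to the pandas std() function is the reverse of passing ddof=1 to numpy: both make the two libraries agree.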
Hopefully this article has been helpful for you to understand how to find the standard deviation of a variable within a column or Series using pandas.