To find the variance of a series or a column in a DataFrame in pandas, the easiest way is to use the pandas var() function.

df["Column1"].var()

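You can also use the numpy var() function, but be careful: by default numpy computes the population variance (ddof=0), while pandas var() computes the sample variance (ddof=1), so the two return different results unless you set ddof explicitly.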

np.var(df["Column1"]) #Different result from default pandas function
np.var(df["Column1"],ddof=1) #Same result as default pandas function

When doing data analysis, the ability to compute different summary statistics, such as the mean or median of a variable, is very useful to help us understand the data. One such summary statistic which can be useful is the variance of a variable.

The variance is the average of the squared deviations from the mean.
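As a quick illustration of that definition, here is a minimal sketch in plain Python with no libraries. The list `heights` and the variable names are only illustrative; the values are the same as the Height column used in the example below. The sketch computes both the population variance (dividing by n) and the sample variance (dividing by n - 1), since that distinction matters later when comparing pandas and numpy.

```python
# Illustrative data: the same values as the "Height" column used later.
heights = [50.10, 68.94, 71.42, 48.56, 59.37, 63.42]

n = len(heights)
mean = sum(heights) / n

# Squared deviation of each value from the mean.
squared_deviations = [(x - mean) ** 2 for x in heights]

# Population variance: average of the squared deviations (divide by n).
population_variance = sum(squared_deviations) / n

# Sample variance: divide by n - 1 instead (one delta degree of freedom).
sample_variance = sum(squared_deviations) / (n - 1)

print(population_variance)  # roughly 75.13
print(sample_variance)      # roughly 90.15
```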

Finding the variance of columns or a Series using pandas is easy. We can use the pandas var() function to find the variance of a column of numbers.

Let’s say we have the following DataFrame.

import pandas as pd

df = pd.DataFrame({'Name': ['Jim', 'Sally', 'Bob', 'Sue', 'Jill', 'Larry'],
                   'Weight': [160.20, 160.20, 209.45, 150.35, 187.52, 187.52],
                   'Height': [50.10, 68.94, 71.42, 48.56, 59.37, 63.42] })

print(df)
# Output: 
    Name  Weight  Height
0    Jim  160.20   50.10
1  Sally  160.20   68.94
2    Bob  209.45   71.42
3    Sue  150.35   48.56
4   Jill  187.52   59.37
5  Larry  187.52   63.42

To get the variance of the column “Height”, we can use the pandas var() function in the following Python code:

print(df["Height"].var())

# Output:
90.15417666666664
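The same var() method also works on several columns at once. The following is a minimal sketch, assuming the same example data as above; selecting the numeric columns explicitly is just one way to keep the text column “Name” out of the calculation.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Jim', 'Sally', 'Bob', 'Sue', 'Jill', 'Larry'],
                   'Weight': [160.20, 160.20, 209.45, 150.35, 187.52, 187.52],
                   'Height': [50.10, 68.94, 71.42, 48.56, 59.37, 63.42]})

# Select only the numeric columns, then call var() on the resulting DataFrame.
# The result is a Series with one sample variance (ddof=1) per column.
variances = df[["Weight", "Height"]].var()

print(variances)
```

The result is indexed by column name, so variances["Height"] gives the same value as df["Height"].var().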

Calculating the Variance of a Series with numpy

We can also find the variance of a series using the numpy var() function. Depending on the complexity of our code, it might be faster to use the numpy var() function.

Let’s say we have the same dataset as above.

To get the variance of the column “Height”, we can use the numpy var() function in the following Python code.

import numpy as np

print(np.var(df["Height"]))

# Output:
75.12848055555554

As you can verify for yourself, this is a different result from the pandas var() function. The reason is that the default normalization differs between pandas and numpy. By default, pandas uses 1 delta degree of freedom (ddof=1): it divides the sum of squared deviations by n - 1, which gives an unbiased estimate of the population variance from a sample. By default, numpy uses ddof=0 and divides by n, which gives the population variance of the data itself.
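To make the difference in normalization concrete, here is a minimal sketch that goes in the other direction: instead of changing numpy, it passes ddof=0 to the pandas var() method so that pandas matches the numpy default. The Series `height` and the variable names are only illustrative.

```python
import numpy as np
import pandas as pd

# Illustrative Series with the same values as the "Height" column.
height = pd.Series([50.10, 68.94, 71.42, 48.56, 59.37, 63.42], name="Height")
n = len(height)

sample_var = height.var()            # pandas default, ddof=1 (divide by n - 1)
population_var = height.var(ddof=0)  # divide by n, like the numpy default

# numpy on the underlying array uses ddof=0 by default.
numpy_var = np.var(height.to_numpy())

print(np.isclose(population_var, numpy_var))                 # True
print(np.isclose(sample_var, population_var * n / (n - 1)))  # True
```

The last line shows that the two results differ only by the factor n / (n - 1), so the gap shrinks as the number of observations grows.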

To get the same variance using both numpy and pandas, you need to pass ‘ddof=1’ to the numpy var() function.

print(np.var(df["Height"]))
print(np.var(df["Height"],ddof=1))
print(df["Height"].var())

# Output:
75.12848055555554
90.15417666666664
90.15417666666664

As you can see above, we received the same result from the code when we pass ‘ddof=1’ to the numpy var() function.

Hopefully this article has been helpful for you to understand how to find the variance of a variable within a column or Series using pandas.

Last Update: March 20, 2024