To find the correlation between series or columns in a DataFrame in pandas, the easiest way is to use the pandas corr() function.
df["Column1"].corr(df["Column2"])
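If you want to compute the pairwise correlations between all numeric columns in a DataFrame, you can call corr() directly on the DataFrame. In pandas 2.0 and later, pass numeric_only=True if the DataFrame also contains non-numeric columns.
df.corr(numeric_only=True)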
You can also use the pandas corrwith() function to compute the correlation of the columns of a DataFrame with another Series.
df.corrwith(df2["Column"])
Finding the correlation between columns or Series using pandas is easy. We can use the pandas corr() function to find the correlation between two numeric columns of a DataFrame, or between any two Series.
Let’s say we have the following DataFrame.
import pandas as pd

df = pd.DataFrame({'Name': ['Jim', 'Sally', 'Bob', 'Sue', 'Jill', 'Larry'],
                   'Weight': [160.20, 160.20, 209.45, 150.35, 187.52, 187.52],
                   'Height': [50.10, 68.94, 71.42, 48.56, 59.37, 63.42]})
print(df)
# Output:
    Name  Weight  Height
0    Jim  160.20   50.10
1  Sally  160.20   68.94
2    Bob  209.45   71.42
3    Sue  150.35   48.56
4   Jill  187.52   59.37
5  Larry  187.52   63.42
To get the pairwise correlation between the columns “Weight” and “Height”, we can use the pandas corr() function in the following Python code:
print(df["Height"].corr(df["Weight"]))
# Output:
0.6754685833670168
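As a sanity check on what corr() reports by default, the Pearson coefficient is the covariance of the two columns divided by the product of their standard deviations. The sketch below recomputes it that way with the Series cov() and std() methods; the variable names manual and builtin are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'Weight': [160.20, 160.20, 209.45, 150.35, 187.52, 187.52],
                   'Height': [50.10, 68.94, 71.42, 48.56, 59.37, 63.42]})

# Pearson correlation = covariance / (std of x * std of y).
# cov() and std() both use the sample (n - 1) normalization, so it cancels out.
manual = df["Height"].cov(df["Weight"]) / (df["Height"].std() * df["Weight"].std())

# The default method of corr() is Pearson, so the two values should agree.
builtin = df["Height"].corr(df["Weight"])

print(manual, builtin)
```

Both numbers should agree up to floating-point rounding.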
The pandas corr() function allows us to compute a few different types of correlation: the Pearson correlation (the default), the Kendall Tau correlation, and the Spearman rank correlation. You can also pass your own function if you’d like.
To calculate the other correlation coefficients, just pass method="kendall" or method="spearman" to the corr() function.
Note that pandas relies on SciPy to compute the Kendall and Spearman coefficients between two Series, so SciPy has to be installed. You do not need to import it yourself.
df["Height"].corr(df["Weight"], method="pearson")
df["Height"].corr(df["Weight"], method="kendall")
df["Height"].corr(df["Weight"], method="spearman")
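To illustrate passing your own function, here is a minimal sketch. The callable receives the two columns as one-dimensional NumPy arrays and should return a single number. The helper name pearson_from_numpy is made up for this example; it simply recomputes the Pearson coefficient with np.corrcoef so the result can be compared with the built-in method.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Weight': [160.20, 160.20, 209.45, 150.35, 187.52, 187.52],
                   'Height': [50.10, 68.94, 71.42, 48.56, 59.37, 63.42]})

def pearson_from_numpy(x, y):
    # Hypothetical helper: np.corrcoef returns a 2x2 matrix,
    # and the off-diagonal entry is the correlation between x and y.
    return float(np.corrcoef(x, y)[0, 1])

custom = df["Height"].corr(df["Weight"], method=pearson_from_numpy)
builtin = df["Height"].corr(df["Weight"], method="pearson")

print(custom, builtin)
```

Any function with the same shape (two arrays in, one float out) can be passed the same way, which is handy for a measure pandas does not ship.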
Calculating the Correlation between Multiple Columns in pandas
There are many times when analyzing a dataset that we want to see the correlations between all variables. We can use the pandas corr() method to calculate the correlation over all columns.
Let’s say we have a DataFrame like the one above, with slightly different weights and a new column “Age”.
df = pd.DataFrame({'Name': ['Jim', 'Sally', 'Bob', 'Sue', 'Jill', 'Larry'],
                   'Weight': [130.54, 160.20, 209.45, 150.35, 117.73, 187.52],
                   'Height': [50.10, 68.94, 71.42, 48.56, 59.37, 63.42],
                   'Age': [43, 23, 71, 49, 52, 37]})
print(df)
# Output:
    Name  Weight  Height  Age
0    Jim  130.54   50.10   43
1  Sally  160.20   68.94   23
2    Bob  209.45   71.42   71
3    Sue  150.35   48.56   49
4   Jill  117.73   59.37   52
5  Larry  187.52   63.42   37
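We can get the pairwise correlation coefficients for all numeric columns by calling the corr() function on the DataFrame, which returns a correlation matrix. Because the “Name” column holds strings, pandas 2.0 and later need numeric_only=True here (the parameter is available from pandas 1.5); older versions dropped non-numeric columns automatically.
print(df.corr(numeric_only=True))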
# Output:
          Weight    Height       Age
Weight  1.000000  0.666055  0.285006
Height  0.666055  1.000000  0.053793
Age     0.285006  0.053793  1.000000
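Since the correlation matrix is itself a DataFrame, individual coefficients can be looked up by label. The sketch below is one way to do that. It selects the numeric columns with select_dtypes() before calling corr(), an approach chosen here so the example does not depend on which pandas version is installed; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Jim', 'Sally', 'Bob', 'Sue', 'Jill', 'Larry'],
                   'Weight': [130.54, 160.20, 209.45, 150.35, 117.73, 187.52],
                   'Height': [50.10, 68.94, 71.42, 48.56, 59.37, 63.42],
                   'Age': [43, 23, 71, 49, 52, 37]})

# Keep only the numeric columns, then build the correlation matrix.
corr_matrix = df.select_dtypes(include="number").corr()

# Look up a single coefficient by row label and column label.
weight_height = corr_matrix.loc["Weight", "Height"]
print(weight_height)

# The matrix is symmetric, so swapping the labels gives the same value.
height_weight = corr_matrix.loc["Height", "Weight"]
print(height_weight)
```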
Finding Correlation with pandas corrwith() function
We can also use the pandas corrwith() function to calculate the correlation between the columns of a DataFrame and a Series, or between the matching columns of two DataFrames.
Let’s say we have the same dataset from above, along with another DataFrame that holds a “Test_Score” column, and we’d like to see how that column correlates with the columns of our DataFrame from the previous example.
df = pd.DataFrame({'Name': ['Jim', 'Sally', 'Bob', 'Sue', 'Jill', 'Larry'],
                   'Weight': [130.54, 160.20, 209.45, 150.35, 117.73, 187.52],
                   'Height': [50.10, 68.94, 71.42, 48.56, 59.37, 63.42],
                   'Age': [43, 23, 71, 49, 52, 37]})
df_new = pd.DataFrame({'Test_Score': [90, 87, 92, 96, 84, 79]})
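We can find the correlation between each numeric column of df and the “Test_Score” column using the pandas corrwith() function. In pandas 2.0 and later, pass numeric_only=True so that the string column “Name” is skipped.
print(df.corrwith(df_new["Test_Score"], numeric_only=True))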
# Output:
Weight   -0.016455
Height   -0.359045
Age       0.408819
dtype: float64
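To make clear what corrwith() does with a Series, the sketch below compares it with calling corr() on each column in turn. It assumes the same data as above and selects the numeric columns with select_dtypes() first, so it does not rely on the numeric_only parameter; the names numeric, result, and manual are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Jim', 'Sally', 'Bob', 'Sue', 'Jill', 'Larry'],
                   'Weight': [130.54, 160.20, 209.45, 150.35, 117.73, 187.52],
                   'Height': [50.10, 68.94, 71.42, 48.56, 59.37, 63.42],
                   'Age': [43, 23, 71, 49, 52, 37]})
df_new = pd.DataFrame({'Test_Score': [90, 87, 92, 96, 84, 79]})

# Drop the string column before correlating.
numeric = df.select_dtypes(include="number")

# One call: correlate every column with the Test_Score Series.
result = numeric.corrwith(df_new["Test_Score"])

# Same thing spelled out: one Series.corr() call per column.
manual = pd.Series({col: numeric[col].corr(df_new["Test_Score"])
                    for col in numeric.columns})

print(result)
print(manual)
```

Rows are matched by index label, so both objects need a shared index (here the default 0 to 5) for the comparison to be meaningful.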
Hopefully this article has been helpful for you to understand how to find the correlation coefficients between columns in a DataFrame or between Series using pandas.