
How to Calculate Correlation Between Two Columns in Pandas?

Last Updated : 31 May, 2025

Correlation summarizes the strength and direction of the linear association between two quantitative variables. The correlation coefficient is denoted by r and takes values between -1 and +1. A positive value of r indicates a positive association (the variables tend to increase together), and a negative value indicates a negative association (one tends to decrease as the other increases). Let's explore several methods to calculate the correlation between columns in a pandas DataFrame.
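To make the definition concrete, here is a minimal sketch that computes r by hand as the sum of products of deviations divided by the square root of the product of the summed squared deviations. The helper name pearson_r is hypothetical (it is not a pandas function), and the two lists reuse the math and science scores from the examples in this article.

```python
import math

def pearson_r(x, y):
    """Pearson correlation of two equal-length sequences (illustrative helper)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Numerator: sum of products of deviations from the means
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Denominator terms: sums of squared deviations
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(ss_x * ss_y)

math_score = [85, 78, 92, 88, 76]
science_score = [89, 81, 94, 90, 80]

r = pearson_r(math_score, science_score)
print(round(r, 4))  # 0.9932
```

The library functions in the following sections should return the same value without the manual arithmetic.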

Using Series.corr()

Series.corr() calculates the correlation coefficient between two individual columns (Series) of a pandas DataFrame. It uses the Pearson method by default. It’s simple and quick when you want to check the correlation between just two variables.

Python
import pandas as pd

data = pd.DataFrame({
    'math_score': [85, 78, 92, 88, 76],
    'science_score': [89, 81, 94, 90, 80],
    'english_score': [78, 75, 85, 80, 72],
    'history_score': [70, 68, 80, 72, 65]
})
corr = data['math_score'].corr(data['science_score'])
print(corr)

Output
0.9931532689569343

Explanation: This code computes the correlation coefficient between math and science scores, a value between -1 and +1 that measures the strength and direction of their linear relationship. A value of +1 indicates a perfect positive correlation, -1 a perfect negative correlation, and 0 no linear correlation. The result of about 0.993 shows a very strong positive linear relationship between the two sets of scores.
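The three reference values (+1, -1 and 0) can be seen with a small made-up Series. This is only an illustrative sketch: the values in x are hypothetical and are not part of the score data above. Note that r measures linear association only, so a perfect quadratic relationship can still give r = 0.

```python
import pandas as pd

# Hypothetical values, symmetric around zero
x = pd.Series([-2, -1, 0, 1, 2])

# Exact increasing linear relationship: r is +1 (up to floating-point rounding)
r_pos = x.corr(2 * x + 1)

# Exact decreasing linear relationship: r is -1
r_neg = x.corr(-3 * x + 5)

# Perfect but non-linear (quadratic) relationship: r is 0
r_quad = x.corr(x ** 2)

print(r_pos, r_neg, r_quad)
```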

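Using DataFrame.corr()

DataFrame.corr() computes the pairwise correlation matrix (Pearson by default) for the columns of a DataFrame, making it easy to see relationships across multiple variables at once. Every column in this example is numeric. If a DataFrame also contains non-numeric columns, pass numeric_only=True, because pandas 2.0 and later no longer drop them automatically.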

Python
import pandas as pd

data = pd.DataFrame({
    'math_score': [85, 78, 92, 88, 76],
    'science_score': [89, 81, 94, 90, 80],
    'english_score': [78, 75, 85, 80, 72],
    'history_score': [70, 68, 80, 72, 65]
})
res = data.corr()
print(res)

Output

[Image: the correlation matrix printed by DataFrame.corr()]

Explanation: This code calculates the correlation matrix for all numeric columns in the DataFrame, showing the pairwise correlation coefficients between each subject's scores. Each value ranges from -1 to +1, indicating the strength and direction of the linear relationship between two columns. The matrix is symmetric, and its diagonal is 1 because each column is perfectly correlated with itself.
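Once the matrix is available, individual values can be read from it by label. The sketch below assumes pandas 1.5 or later (where the numeric_only parameter exists) and adds a hypothetical student column, which is not in the original data, to show how a text column is skipped.

```python
import pandas as pd

data = pd.DataFrame({
    'student': ['A', 'B', 'C', 'D', 'E'],  # hypothetical non-numeric column
    'math_score': [85, 78, 92, 88, 76],
    'science_score': [89, 81, 94, 90, 80],
    'english_score': [78, 75, 85, 80, 72],
    'history_score': [70, 68, 80, 72, 65]
})

# numeric_only=True leaves the text column out of the calculation
res = data.corr(numeric_only=True)

# Look up a single pair by row and column label
math_science = res.loc['math_score', 'science_score']
print(math_science)

# Correlation of every other subject with math, strongest first
math_vs_rest = res['math_score'].drop('math_score').sort_values(ascending=False)
print(math_vs_rest)
```

Reading a value with .loc gives the same number as calling Series.corr() on the two columns directly.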

Using numpy.corrcoef()

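corrcoef() from the NumPy library calculates the Pearson correlation coefficient matrix. For two one-dimensional inputs it returns a 2x2 matrix, so the coefficient between them is the off-diagonal element at index [0, 1]. It is useful when working directly with NumPy arrays or when pandas is not required.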

Python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    'math_score': [85, 78, 92, 88, 76],
    'science_score': [89, 81, 94, 90, 80],
    'english_score': [78, 75, 85, 80, 72],
    'history_score': [70, 68, 80, 72, 65]
})

corr = np.corrcoef(data['math_score'], data['english_score'])[0, 1]
print(corr)

Output
0.976632340152094

Explanation: This code calculates the Pearson correlation coefficient between math and English scores using NumPy’s corrcoef function. It returns a value between -1 and +1 that measures the strength and direction of the linear relationship between these two columns.
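To see why the [0, 1] index is needed, the following sketch inspects the full matrix that corrcoef returns, and then passes all four columns at once. The variable names pair and matrix are only illustrative.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    'math_score': [85, 78, 92, 88, 76],
    'science_score': [89, 81, 94, 90, 80],
    'english_score': [78, 75, 85, 80, 72],
    'history_score': [70, 68, 80, 72, 65]
})

# Two 1-D inputs give a 2x2 matrix: 1.0 on the diagonal, r off the diagonal
pair = np.corrcoef(data['math_score'], data['english_score'])
print(pair.shape)  # (2, 2)

# rowvar=False treats each column of the 2-D array as one variable
matrix = np.corrcoef(data.to_numpy(), rowvar=False)
print(matrix.shape)  # (4, 4)
```

With rowvar=False, the 4x4 result should hold the same values as DataFrame.corr() on this data, just without the row and column labels.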

Using scipy.stats.pearsonr()

pearsonr() from scipy.stats calculates the Pearson correlation coefficient along with a p-value for testing the null hypothesis that the two variables are uncorrelated. It is helpful if you want to know both the strength of the correlation and its statistical significance.

Python
from scipy.stats import pearsonr
import pandas as pd

data = pd.DataFrame({
    'math_score': [85, 78, 92, 88, 76],
    'science_score': [89, 81, 94, 90, 80],
    'english_score': [78, 75, 85, 80, 72],
    'history_score': [70, 68, 80, 72, 65]
})
corr, p_value = pearsonr(data['science_score'], data['history_score'])
print(corr)
print(p_value)

Output
0.9045939369328619
0.03486446724084317

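Explanation: This code calculates the Pearson correlation and p-value between science and history scores. The coefficient of about 0.905 indicates a strong positive linear relationship, and the p-value of about 0.035 is below the conventional 0.05 threshold, so the correlation would usually be called statistically significant. With only five observations, this result should be interpreted with caution.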
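As a rough check on where the p-value comes from, the sketch below recomputes it from r with the standard t-test for a correlation coefficient, which assumes the data are approximately bivariate normal. The names t_stat and p_manual are illustrative, and this is not necessarily how SciPy computes the value internally, although the two results should agree.

```python
import math

import pandas as pd
from scipy import stats

data = pd.DataFrame({
    'science_score': [89, 81, 94, 90, 80],
    'history_score': [70, 68, 80, 72, 65]
})

r, p_value = stats.pearsonr(data['science_score'], data['history_score'])

# Under the null hypothesis of no correlation, this statistic follows
# a Student's t-distribution with n - 2 degrees of freedom
n = len(data)
t_stat = r * math.sqrt((n - 2) / (1 - r ** 2))

# Two-sided p-value: probability of a |t| at least this large
p_manual = 2 * stats.t.sf(abs(t_stat), df=n - 2)

print(p_value, p_manual)
```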

