How to create a Correlation Matrix in Python by traversing through each line?

A correlation matrix is a table containing correlation coefficients for many variables. Each cell in the table represents the correlation between two variables. The value ranges from -1 to 1. A correlation matrix is used to summarize data, as a diagnostic for more advanced analyses, and as an input to more complex analyses.

A correlation matrix represents the relationships between the variables in a data set. It helps programmers analyze how the data components relate to one another. Each correlation coefficient lies between -1 and 1.

A positive value means the two variables tend to increase together, a negative value means one tends to decrease as the other increases, and a value of zero (0) indicates no linear dependency between the variables. The closer the value is to -1 or 1, the stronger the relationship.
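
To make the sign and magnitude concrete, here is a minimal sketch that computes the Pearson coefficient directly from its definition. The function name pearson and the sample lists are illustrative assumptions, not part of the dataset or of any library.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (proportional to the covariance)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (proportional to the variances)
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical sample values chosen only to illustrate the three cases
hours = [1, 2, 3, 4, 5]
rising = [2, 4, 6, 8, 10]      # increases with hours
falling = [10, 8, 6, 4, 2]     # decreases as hours increase
unrelated = [3, 1, 4, 1, 3]    # no linear trend against hours

print(pearson(hours, rising))     # approximately 1
print(pearson(hours, falling))    # approximately -1
print(pearson(hours, unrelated))  # approximately 0
```

A coefficient near -1 is therefore just as strong a relationship as one near 1; only the direction differs.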

In regression analysis, a correlation matrix serves the following purposes:

It reveals the relationships between the independent variables in the data set.
It helps in selecting significant and non-redundant variables from a data set.
It applies only to variables that are numeric or continuous.
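
As a rough illustration of how a correlation matrix can flag redundant variables, the sketch below lists column pairs whose absolute correlation exceeds a cutoff. The data frame, its column names, and the 0.9 threshold are assumptions made up for this example.

```python
import pandas as pd

# Hypothetical data: weight_kg and weight_lb carry the same information
data = pd.DataFrame({
    "weight_kg": [50, 60, 70, 80, 90],
    "weight_lb": [110.2, 132.3, 154.3, 176.4, 198.4],
    "age": [41, 23, 35, 52, 29],
})

corr = data.corr()
threshold = 0.9  # assumed cutoff; choose one that suits your analysis

# Visit each unordered pair of columns once (upper triangle of the matrix)
redundant_pairs = []
columns = list(corr.columns)
for i, first in enumerate(columns):
    for second in columns[i + 1:]:
        if abs(corr.loc[first, second]) > threshold:
            redundant_pairs.append((first, second))

print(redundant_pairs)
```

One column from each reported pair is a candidate for removal before fitting a model.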

In this article, we will show you how to create a correlation matrix using Python.

Assume we have a CSV file named starbucksMenu.csv containing sample nutritional data for a few menu items. We need to create a correlation matrix for selected columns of the dataset and plot it.

Input File Data

starbucksMenu.csv

Item Name	Calories	Fat	Carb	Fiber	Protein	Sodium

Cool Lime Starbucks Refreshers™	45	0	11	0	0	10
Evolution Fresh™ Organic Ginger Limeade	80	0	18	1	0	10
Iced Coffee	60	0	14	1	0	10
Tazo® Bottled Berry Blossom White	0	0	0	0	0	0
Tazo® Bottled Brambleberry	130	2.5	21	0	5	65
Tazo® Bottled Giant Peach	140	2.5	23	0	5	90
Tazo® Bottled Iced Passion	130	2.5	21	0	5	65
Tazo® Bottled Plum Pomegranate	80	0	19	0	0	10
Tazo® Bottled Tazoberry	60	0	15	0	0	10
Tazo® Bottled White Cranberry	150	0	38	0	0	15

Creating a Correlation Matrix

We will plot the correlation matrix for three columns of the dataset, all of which are continuous numeric variables:

Carb
Protein
Sodium

Algorithm (Steps)

Following are the steps to perform the desired task:

Import the os, pandas, NumPy, and seaborn libraries.
Read the given CSV file using the read_csv() function (it loads a CSV file as a pandas DataFrame).
Create the list of columns from the dataset for which the correlation matrix must be computed.
Create the correlation matrix using the corr() function (it calculates the pairwise correlation of the columns in a DataFrame, excluding NA/null values. Non-numeric columns cannot be correlated: older pandas versions drop them silently, while pandas 2.0 and later expect you to select the numeric columns first or pass numeric_only=True).
Print the correlation matrix of the specified columns of the dataset.
Plot the correlation matrix using the heatmap() function of the seaborn library (a heatmap draws each value as a colored cell whose shade varies with the value. Whether darker or lighter shades stand for higher values depends on the colormap, and the color bar beside the plot shows the mapping).
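
The handling of missing values and non-numeric columns in the corr() step can be sketched as follows. The tiny data frame here is hypothetical and stands in for the real file; selecting numeric columns with select_dtypes() is one way to stay independent of the pandas version, not the only one.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a text column and one missing value
menu = pd.DataFrame({
    "Item": ["A", "B", "C", "D"],
    "Carb": [11.0, 18.0, np.nan, 21.0],
    "Sodium": [10.0, 10.0, 65.0, 90.0],
})

# Keep only numeric columns so corr() never sees the text column
numeric = menu.select_dtypes(include="number")

# corr() skips the row where Carb is missing when pairing Carb with Sodium
matrix = numeric.corr()
print(matrix)
```

The resulting matrix has one row and one column per numeric variable, with 1.0 on the diagonal.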

Importing the Dataset into a Pandas Dataframe

First, we import the sample dataset (here, starbucksMenu.csv) into a pandas DataFrame and print it.

Example 1

# Import pandas module as pd using the import keyword
import pandas as pd
# Reading a dataset
givenDataset = pd.read_csv('starbucksMenu.csv')
print(givenDataset)

Output

Item Name	Calories	Fat	Carb	Fiber	Protein	Sodium

Cool Lime Starbucks Refreshers™	45	0	11	0	0	10
Evolution Fresh™ Organic Ginger Limeade	80	0	18	1	0	10
Iced Coffee	60	0	14	1	0	10
Tazo® Bottled Berry Blossom White	0	0	0	0	0	0
Tazo® Bottled Brambleberry	130	2.5	21	0	5	65
Tazo® Bottled Giant Peach	140	2.5	23	0	5	90
Tazo® Bottled Iced Passion	130	2.5	21	0	5	65
Tazo® Bottled Plum Pomegranate	80	0	19	0	0	10
Tazo® Bottled Tazoberry	60	0	15	0	0	10
Tazo® Bottled White Cranberry	150	0	38	0	0	15

Creating correlation matrix after importing the dataset

The following program creates a correlation matrix for the given dataset, prints it, and plots it as a heatmap:

Example 2

import os
import numpy as np
import pandas as pd
import seaborn
import matplotlib.pyplot as plt

# Read the dataset into a pandas DataFrame
givenDataset = pd.read_csv('starbucksMenu.csv')

# List the numeric columns to include in the correlation matrix
numericColumns = ['Carb', 'Protein', 'Sodium']

# Compute the pairwise correlation of the selected columns
correlationMatrix = givenDataset[numericColumns].corr()

# Print the correlation matrix
print(correlationMatrix)

# Plot the correlation matrix as a heatmap, writing each value in its cell
seaborn.heatmap(correlationMatrix, annot=True)

# Show the figure (needed when running as a script rather than in a notebook)
plt.show()

Output

On executing, the above program prints the correlation matrix for the Carb, Protein, and Sodium columns and displays it as an annotated heatmap.
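
As a cross-check on what corr() computes, the same coefficients can be accumulated by traversing the data one line at a time. This is a minimal sketch, not how pandas does it internally. It assumes the file is comma-separated and, so that it runs without the file, embeds a copy of the Carb, Protein, and Sodium columns from the sample data. The names CSV_TEXT, sums, products, and correlation are illustrative.

```python
import csv
import io
import math

# Inline copy of three columns of the sample data (assumed comma-separated)
CSV_TEXT = """Carb,Protein,Sodium
11,0,10
18,0,10
14,0,10
0,0,0
21,5,65
23,5,90
21,5,65
19,0,10
15,0,10
38,0,15
"""

columns = ["Carb", "Protein", "Sodium"]
n = 0
sums = {c: 0.0 for c in columns}
products = {(a, b): 0.0 for a in columns for b in columns}

# Single pass: update running sums and sums of products for every row
for row in csv.DictReader(io.StringIO(CSV_TEXT)):
    values = {c: float(row[c]) for c in columns}
    n += 1
    for a in columns:
        sums[a] += values[a]
        for b in columns:
            products[(a, b)] += values[a] * values[b]


def correlation(a, b):
    """Pearson coefficient of columns a and b from the accumulated totals."""
    cov = products[(a, b)] - sums[a] * sums[b] / n
    var_a = products[(a, a)] - sums[a] ** 2 / n
    var_b = products[(b, b)] - sums[b] ** 2 / n
    return cov / math.sqrt(var_a * var_b)


matrix = {a: {b: correlation(a, b) for b in columns} for a in columns}
for a in columns:
    print(a, [round(matrix[a][b], 3) for b in columns])
```

To run it against the real file, replace io.StringIO(CSV_TEXT) with an opened file handle. The running-sum formula is compact but can lose precision on very large values, which is one reason to prefer corr() in practice.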

In this tutorial, you learned how to compute a correlation matrix in Python using the pandas corr() method and how to use the seaborn library's heatmap() function to display it, which makes the relationships in the data easier to see at a glance.

Vikram Chiluka

Updated on: 2022-08-10T09:29:47+05:30
