
Create Correlation Matrix in Python by Traversing Each Line
A correlation matrix is a table of correlation coefficients for a set of variables. Each cell holds the correlation between two variables, a value between -1 and 1. A correlation matrix is used to summarize data, as a diagnostic for more advanced analyses, and as an input to more complex studies.
A correlation matrix represents the relationships between the variables in a data set and helps programmers analyze how the data components relate to one another. Every coefficient in it lies between -1 and 1.
A positive value means the two variables tend to increase together, a negative value means one tends to decrease as the other increases, and a value of zero (0) indicates no linear relationship between the two variables. The closer the value is to -1 or 1, the stronger the relationship.
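As a rough illustration of how the sign of the coefficient reads, the sketch below computes Pearson coefficients with NumPy's corrcoef() for three made-up series. The arrays and the pearson() helper are hypothetical and are not part of the Starbucks dataset used in this article.

```python
import numpy as np

# Hypothetical sample data, chosen only to illustrate the sign of the coefficient
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

y_up = 2 * x + 1         # rises with x
y_down = -3 * x          # falls as x rises
y_curve = (x - 3) ** 2   # depends on x, but not linearly


def pearson(a, b):
    """Return the Pearson correlation coefficient of two 1-D arrays."""
    # corrcoef returns a 2x2 matrix; the off-diagonal entry is the coefficient
    return float(np.corrcoef(a, b)[0, 1])


r_up = pearson(x, y_up)
r_down = pearson(x, y_down)
r_curve = pearson(x, y_curve)

print("rising series: ", round(r_up, 3))
print("falling series:", round(r_down, 3))
print("curved series: ", round(r_curve, 3))
```

The curved series is fully determined by x, yet its coefficient is approximately zero. This is why a coefficient near zero is best read as "no linear relationship" rather than "no relationship at all".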
A correlation matrix, often used alongside regression analysis, serves the following purposes:
- It helps recognize the relationships between the independent variables in the data set.
- It helps in selecting significant, non-redundant variables from a data set.
- It applies only to variables that are numeric or continuous.
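One possible way to use a correlation matrix for spotting redundant variables is to list the column pairs whose absolute correlation exceeds a threshold. The sketch below assumes a small hypothetical data frame; the column names, the highly_correlated_pairs() helper, and the 0.9 threshold are illustrative choices, not a standard rule.

```python
import pandas as pd

# Hypothetical data: height_in repeats height_cm in other units, so it is redundant
df = pd.DataFrame({
    "height_cm": [150, 160, 170, 180, 190],
    "height_in": [59.06, 62.99, 66.93, 70.87, 74.80],
    "weight_kg": [55, 72, 60, 90, 70],
})


def highly_correlated_pairs(frame, threshold=0.9):
    """Return column pairs whose absolute correlation exceeds the threshold."""
    corr = frame.corr().abs()
    cols = list(corr.columns)
    pairs = []
    # Visit only the upper triangle so each pair is reported once
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            if corr.iloc[i, j] > threshold:
                pairs.append((cols[i], cols[j]))
    return pairs


redundant = highly_correlated_pairs(df)
print(redundant)
```

With this data, only the two height columns should be flagged, which suggests that one of them could be dropped before further analysis.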
In this article, we will show you how to create a correlation matrix using Python.
Assume we have a CSV file named starbucksMenu.csv containing some sample data. We need to create a correlation matrix for selected columns of the dataset and plot it.
Input File Data
starbucksMenu.csv
Item Name | Calories | Fat | Carb | Fiber | Protein | Sodium |
Cool Lime Starbucks Refreshers™ | 45 | 0 | 11 | 0 | 0 | 10 |
Evolution Fresh™ Organic Ginger Limeade | 80 | 0 | 18 | 1 | 0 | 10 |
Iced Coffee | 60 | 0 | 14 | 1 | 0 | 10 |
Tazo® Bottled Berry Blossom White | 0 | 0 | 0 | 0 | 0 | 0 |
Tazo® Bottled Brambleberry | 130 | 2.5 | 21 | 0 | 5 | 65 |
Tazo® Bottled Giant Peach | 140 | 2.5 | 23 | 0 | 5 | 90 |
Tazo® Bottled Iced Passion | 130 | 2.5 | 21 | 0 | 5 | 65 |
Tazo® Bottled Plum Pomegranate | 80 | 0 | 19 | 0 | 0 | 10 |
Tazo® Bottled Tazoberry | 60 | 0 | 15 | 0 | 0 | 10 |
Tazo® Bottled White Cranberry | 150 | 0 | 38 | 0 | 0 | 15 |
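To follow along without the original file, the table above can be written to starbucksMenu.csv with pandas. This is only a convenience sketch: the item names are simplified to plain ASCII (trademark symbols dropped), and the file is created in the current working directory.

```python
import pandas as pd

# Rows copied from the table above; item names simplified to plain ASCII
rows = [
    ("Cool Lime Starbucks Refreshers", 45, 0, 11, 0, 0, 10),
    ("Evolution Fresh Organic Ginger Limeade", 80, 0, 18, 1, 0, 10),
    ("Iced Coffee", 60, 0, 14, 1, 0, 10),
    ("Tazo Bottled Berry Blossom White", 0, 0, 0, 0, 0, 0),
    ("Tazo Bottled Brambleberry", 130, 2.5, 21, 0, 5, 65),
    ("Tazo Bottled Giant Peach", 140, 2.5, 23, 0, 5, 90),
    ("Tazo Bottled Iced Passion", 130, 2.5, 21, 0, 5, 65),
    ("Tazo Bottled Plum Pomegranate", 80, 0, 19, 0, 0, 10),
    ("Tazo Bottled Tazoberry", 60, 0, 15, 0, 0, 10),
    ("Tazo Bottled White Cranberry", 150, 0, 38, 0, 0, 15),
]
columns = ["Item Name", "Calories", "Fat", "Carb", "Fiber", "Protein", "Sodium"]

menu = pd.DataFrame(rows, columns=columns)

# index=False keeps the row index out of the file
menu.to_csv("starbucksMenu.csv", index=False)

# Read the file back to confirm it round-trips
reloaded = pd.read_csv("starbucksMenu.csv")
print(reloaded.shape)
```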
Creating a Correlation Matrix
We will plot the correlation matrix for the following three columns of the dataset, which are independent continuous variables:
- Carb
- Protein
- Sodium
Algorithm (Steps)
The following steps perform the desired task:
- Import the os, pandas, NumPy, and seaborn libraries.
- Read the given CSV file using the read_csv() function, which loads a CSV file as a pandas dataframe.
- Create the list of columns from the dataset for which the correlation matrix must be computed.
- Create the correlation matrix using the corr() function, which calculates the pairwise correlation of the columns in a dataframe. NA (null) values are excluded automatically. Non-numeric columns are not part of the result: older pandas versions drop them silently, while pandas 2.0 and later raise an error unless numeric_only=True is passed or only numeric columns are selected first.
- Print the correlation matrix of the specified columns of the dataset.
- Plot the correlation matrix using the heatmap() function of the seaborn library. A heatmap draws each value as a colored cell whose shade reflects its magnitude. Which shades stand for higher values depends on the chosen color map, and a contrasting color can be used for markedly different values.
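The way corr() treats missing values and non-numeric columns can be checked on a small example. The data frame below is hypothetical. Selecting the numeric columns with select_dtypes() first is one way to get the same behavior across pandas versions, and is an assumption of this sketch rather than a step in the program that follows.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one text column and one missing value
df = pd.DataFrame({
    "label": ["a", "b", "c", "d", "e"],   # non-numeric column
    "x": [1.0, 2.0, 3.0, 4.0, np.nan],    # last value is missing
    "y": [2.0, 4.0, 6.0, 8.0, 100.0],
})

# Keep only numeric columns so corr() never sees the text column
numeric = df.select_dtypes(include="number")

# Rows where either value of a pair is missing are left out for that pair
corr = numeric.corr()
print(corr)
```

Because the last row has no x value, it is left out when x and y are compared, so the outlier 100.0 in y does not affect their coefficient.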
Importing the Dataset into a Pandas Dataframe
We first import a sample dataset (here, starbucksMenu.csv) into a pandas dataframe and print it.
Example 1
# Import the pandas module as pd
import pandas as pd

# Read the dataset
givenDataset = pd.read_csv('starbucksMenu.csv')
print(givenDataset)
Output
Item Name | Calories | Fat | Carb | Fiber | Protein | Sodium |
Cool Lime Starbucks Refreshers™ | 45 | 0 | 11 | 0 | 0 | 10 |
Evolution Fresh™ Organic Ginger Limeade | 80 | 0 | 18 | 1 | 0 | 10 |
Iced Coffee | 60 | 0 | 14 | 1 | 0 | 10 |
Tazo® Bottled Berry Blossom White | 0 | 0 | 0 | 0 | 0 | 0 |
Tazo® Bottled Brambleberry | 130 | 2.5 | 21 | 0 | 5 | 65 |
Tazo® Bottled Giant Peach | 140 | 2.5 | 23 | 0 | 5 | 90 |
Tazo® Bottled Iced Passion | 130 | 2.5 | 21 | 0 | 5 | 65 |
Tazo® Bottled Plum Pomegranate | 80 | 0 | 19 | 0 | 0 | 10 |
Tazo® Bottled Tazoberry | 60 | 0 | 15 | 0 | 0 | 10 |
Tazo® Bottled White Cranberry | 150 | 0 | 38 | 0 | 0 | 15 |
Creating correlation matrix after importing the dataset
The following program creates a correlation matrix for the given dataset, prints it, and plots it:
Example 2
import os
import numpy as np
import pandas as pd
import seaborn
import matplotlib.pyplot as plt

# Read the dataset
givenDataset = pd.read_csv('starbucksMenu.csv')

# List of columns to include in the correlation matrix
numericColumns = ['Carb', 'Protein', 'Sodium']

# Create the correlation matrix for the selected columns
correlationMatrix = givenDataset.loc[:, numericColumns].corr()

# Print the correlation matrix
print(correlationMatrix)

# Plot the correlation matrix as an annotated heatmap
seaborn.heatmap(correlationMatrix, annot=True)

# Show the figure (needed when running as a script rather than in a notebook)
plt.show()
Output
On executing, the above program will generate the following output: