For a dataset with k variables (columns) X1, X2, ..., Xk, the scatter plot matrix shows all the pairwise scatter plots between the variables in the form of a matrix.
A scatter plot matrix answers the following questions:
- Are there any pair-wise relationships between different variables? And if there are relationships, what is the nature of these relationships?
- Are there any outliers in the dataset?
- Is there any clustering by groups present in the dataset on the basis of a particular variable?
For k variables in the dataset, the scatter plot matrix contains k rows and k columns. Each cell of the matrix is a single scatter plot. The plot in row i and column j can be defined as:
- Vertical Axis: Variable Xi
- Horizontal Axis: Variable Xj
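The k x k layout can be sketched by hand with matplotlib subplots. This is only an illustration under assumptions: the data is synthetic, the names `demo`, `x1`, `x2` and `x3` are hypothetical, and the sketch uses the convention that the row's variable goes on the vertical axis and the column's variable on the horizontal axis, which is the layout `pandas.plotting.scatter_matrix` produces.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; omit in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic data (hypothetical): x2 depends on x1, x3 is independent noise.
rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
demo = pd.DataFrame({
    "x1": x1,
    "x2": 2 * x1 + rng.normal(scale=0.5, size=100),
    "x3": rng.uniform(size=100),
})

k = demo.shape[1]
fig, axes = plt.subplots(k, k, figsize=(6, 6), squeeze=False)
for i, row_var in enumerate(demo.columns):
    for j, col_var in enumerate(demo.columns):
        ax = axes[i, j]
        # Panel (i, j): row variable on the vertical axis,
        # column variable on the horizontal axis.
        ax.scatter(demo[col_var], demo[row_var], s=5)
        if i == k - 1:
            ax.set_xlabel(col_var)  # label only the bottom row
        if j == 0:
            ax.set_ylabel(row_var)  # label only the left column
fig.tight_layout()
```

With k = 3 this draws 9 panels; the diagonal panels plot each variable against itself.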
Below are some important factors to consider when plotting the scatter plot matrix:
- The plots on the diagonal are just 45-degree lines, because there we are plotting Xi vs Xi. Instead, we can plot the histogram of Xi on the diagonal or leave it blank.
- Since Xi vs Xj is equivalent to Xj vs Xi with the axes reversed, we can also omit the plots below the diagonal.
- It can be helpful to overlay a line (for example, a fitted line) on the scattered points in each plot to make the relationship easier to see.
- The idea of the pairwise plot can also be extended to other plot types, such as quantile-quantile plots or bihistograms.
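The first three points can be combined in one short sketch. It assumes synthetic data in a hypothetical DataFrame named `demo`, uses the `diagonal="hist"` option of `pandas.plotting.scatter_matrix` for the diagonal, hides the panels below the diagonal, and overlays a least-squares line from `np.polyfit` as one possible choice of overlay.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; omit in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic data (hypothetical): b is correlated with a, c is independent.
rng = np.random.default_rng(1)
a = rng.normal(size=200)
demo = pd.DataFrame({
    "a": a,
    "b": 0.8 * a + rng.normal(scale=0.4, size=200),
    "c": rng.normal(size=200),
})

# Histograms on the diagonal instead of the trivial Xi-vs-Xi plots.
axes = pd.plotting.scatter_matrix(demo, diagonal="hist", figsize=(6, 6))

cols = list(demo.columns)
for i, row_var in enumerate(cols):
    for j, col_var in enumerate(cols):
        if i > j:
            # Panels below the diagonal mirror those above it, so hide them.
            axes[i, j].set_visible(False)
        elif i < j:
            # Overlay a least-squares line on each remaining scatter panel.
            slope, intercept = np.polyfit(demo[col_var], demo[row_var], deg=1)
            xs = np.linspace(demo[col_var].min(), demo[col_var].max(), 50)
            axes[i, j].plot(xs, slope * xs + intercept, color="red")
```

Note that pandas puts the axis labels on the left column and bottom row, so hiding the lower panels also hides most of those labels; a finished figure would need them added back on the visible panels.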
Implementation
- For this implementation, we will be using the Titanic dataset, which can be downloaded from Kaggle. Before plotting the scatter matrix, we will perform some preprocessing operations on the dataframe to get it into the desired form.
Python3
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
# load titanic dataset
titanic_dataset = pd.read_csv('tested.csv.xls')
titanic_dataset.head()
# Drop columns that are not useful for the scatter matrix.
titanic_dataset.drop(['Name', 'Ticket', 'Cabin', 'PassengerId'], axis=1, inplace=True)
# check for different data types
titanic_dataset.dtypes
# print unique values of the categorical columns
titanic_dataset['Embarked'].unique()
titanic_dataset['Sex'].unique()
# Replace NAs in the numeric columns with the column mean
# (numeric_only=True skips the string columns, which have no mean)
titanic_dataset.fillna(titanic_dataset.mean(numeric_only=True), inplace=True)
# convert the string columns into integer codes for representation in
# the scatter matrix (a column must be categorical before .cat.codes works)
titanic_dataset["Sex"] = titanic_dataset["Sex"].astype("category").cat.codes
titanic_dataset["Embarked"] = titanic_dataset["Embarked"].astype("category").cat.codes
titanic_dataset.head()
# plot scatter matrix using pandas and matplotlib
survive_colors = {0: 'orange', 1: 'blue'}
pd.plotting.scatter_matrix(titanic_dataset, figsize=(20, 20), grid=True,
                           marker='o', c=titanic_dataset['Survived'].map(survive_colors))
# plot scatter matrix using seaborn
sns.set_theme(style="ticks")
sns.pairplot(titanic_dataset, hue='Survived')
   PassengerId  Survived  Pclass  Name                                          Sex     Age   SibSp  Parch  Ticket   Fare     Cabin  Embarked
0  892          0         3       Kelly, Mr. James                              male    34.5  0      0      330911   7.8292   NaN    Q
1  893          1         3       Wilkes, Mrs. James (Ellen Needs)              female  47.0  1      0      363272   7.0000   NaN    S
2  894          0         2       Myles, Mr. Thomas Francis                     male    62.0  0      0      240276   9.6875   NaN    Q
3  895          0         3       Wirz, Mr. Albert                              male    27.0  0      0      315154   8.6625   NaN    S
4  896          1         3       Hirvonen, Mrs. Alexander (Helga E Lindqvist)  female  22.0  1      1      3101298  12.2875  NaN    S
Survived      int64
Pclass        int64
Sex          object
Age         float64
SibSp         int64
Parch         int64
Fare        float64
Embarked     object
dtype: object
   Survived  Pclass  Sex  Age   SibSp  Parch  Fare     Embarked
0  0         3       1    34.5  0      0      7.8292   1
1  1         3       0    47.0  1      0      7.0000   2
2  0         2       1    62.0  0      0      9.6875   1
3  0         3       1    27.0  0      0      8.6625   2
4  1         3       0    22.0  1      1      12.2875  2
Matplotlib Scatter matrix
Seaborn Scatter matrix