Basic Analysis using Python

SECTION 1
Descriptive Statistics
Summarising your Data

Data Snapshot
Data Description: basic_salary_P3
The data has 41 rows and 7 columns.

Column       Description
First_Name   First Name
Last_Name    Last Name
Grade        Grade
Location     Location
Function     Department
ba           Basic Allowance
ms           Management Supplements

Describing Variable
#Importing Data
import pandas as pd
salary=pd.read_csv('basic_salary_P3.csv')
#Checking the variable features using the describe function
salary.describe()
                 ba            ms
count     39.000000     37.000000
mean   17209.743590  11939.054054
std     4159.515241   3223.018305
min    10940.000000   2700.000000
25%    13785.000000  10450.000000
50%    16230.000000  12420.000000
75%    19305.000000  14200.000000
max    29080.000000  16970.000000
describe() gives descriptive measures for the numeric variables.
count excludes missing values: ba has 39 and ms has 37 non-missing values out of 41 rows.
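The counts reported by describe() are lower than the number of rows whenever a column has missing values. A minimal sketch of that behaviour, using a small made-up DataFrame (the values and the name toy are assumptions for illustration, not the contents of basic_salary_P3.csv):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the salary data; values are invented
toy = pd.DataFrame({
    "ba": [12000.0, 15000.0, np.nan, 18000.0],
    "ms": [9000.0, np.nan, np.nan, 11000.0],
})

summary = toy.describe()      # descriptive measures for numeric columns
print(summary.loc["count"])   # non-missing values per column
print(toy.isna().sum())       # missing values per column
```

Comparing the count row with the number of rows is a quick way to spot columns that need dropna() before further calculations.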

Measures of Central Tendency
# Mean
print(salary.ba.mean())
17209.74
mean() gives the mean of the variable.
# Median
print(salary.ba.median())
16230
median() gives the median of the variable.
# Trimmed Mean
from scipy import stats
BasicAll=salary.ba.dropna(axis=0)
trimmed_mean=stats.trim_mean(BasicAll, 0.1)
trimmed_mean
16879
Import stats from scipy.
Missing values are removed from ba using dropna().
Here, trim_mean() excludes 10% of the observations from each end of the sorted data before computing the mean.
# Mode
print(salary.ba.mode())
NA
mode() gives us the mode of the variable.
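To see what trimming does, the sketch below compares the plain mean with a 10% trimmed mean on a small made-up sample containing one extreme value. The numbers are assumptions chosen so the arithmetic is easy to follow; they are not taken from the salary data.

```python
import numpy as np
from scipy import stats

# Invented sample with one extreme observation (100)
values = np.array([10, 12, 13, 14, 15, 16, 17, 18, 19, 100], dtype=float)

plain_mean = values.mean()

# 10% of 10 observations = 1 observation removed from each end
trimmed = stats.trim_mean(values, 0.1)

# The same result by hand: sort, drop the smallest and largest, average the rest
manual = np.sort(values)[1:-1].mean()

print(plain_mean, trimmed, manual)
```

The trimmed mean stays close to the bulk of the data, while the plain mean is pulled upwards by the single large value.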

Measures of Variation
import statistics
Import the statistics library to use functions for calculating standard deviation and variance.
Use the BasicAll object created previously for calculating standard deviation, variance and coefficient of variation.
# Standard Deviation
statistics.stdev(BasicAll)
4159.515
stdev() gives the sample standard deviation of the variable.
# Variance
statistics.variance(BasicAll)
17301567
variance() gives the sample variance of the variable.
# Coefficient of Variation
stats.variation(BasicAll)
0.23857
variation() from scipy.stats gives us the coefficient of variation.
By default it uses the population standard deviation (ddof=0), so the result is slightly smaller than stdev() divided by the mean.
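The coefficient of variation is the standard deviation divided by the mean. The sketch below, on made-up values, shows why scipy's variation() and a hand calculation based on statistics.stdev() can differ slightly: the former divides by n, the latter by n - 1. The data are an assumption for illustration only.

```python
import statistics
import numpy as np
from scipy import stats

# Invented sample: mean is 5, population standard deviation is 2
values = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# scipy default: population standard deviation (ddof=0) over the mean
cv_default = stats.variation(values)

# Sample version: statistics.stdev() divides by n - 1
cv_sample = statistics.stdev(values) / statistics.mean(values)

print(cv_default, cv_sample)
```

For a sample as small as this one the gap is visible; with 39 observations, as in BasicAll, it is much smaller.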

Skewness and Kurtosis
from scipy import stats
Using package scipy to calculate skewness and kurtosis.
# Skewness
stats.skew(BasicAll, bias=False)
0.9033507
skew() gives skewness of the variable.
bias=False corrects the calculations for statistical bias.
# Kurtosis
stats.kurtosis(BasicAll, bias=False)
0.4996513
kurtosis() gives the excess kurtosis of the variable (0 for a normal distribution).
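A short sketch of how to read these two measures, using two made-up samples (the arrays are assumptions for illustration). Symmetric data have skewness 0, a long right tail gives positive skewness, and scipy reports excess kurtosis unless fisher=False is passed.

```python
import numpy as np
from scipy import stats

symmetric = np.array([1.0, 2.0, 3.0, 4.0, 5.0])      # invented, symmetric
right_tailed = np.array([1.0, 2.0, 3.0, 4.0, 20.0])  # invented, long right tail

skew_sym = stats.skew(symmetric, bias=False)
skew_right = stats.skew(right_tailed, bias=False)

# Default (fisher=True) subtracts 3, so a normal distribution scores 0
kurt_excess = stats.kurtosis(symmetric, bias=False)
# fisher=False returns the Pearson definition, where a normal distribution scores 3
kurt_pearson = stats.kurtosis(symmetric, bias=False, fisher=False)

print(skew_sym, skew_right, kurt_excess, kurt_pearson)
```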

SECTION 2
Bivariate Analysis

Data Snapshot
Data Description: job_proficiency_P3
The data has 25 rows and 6 columns.

Column     Description
empno      Employee Number
aptitude   Aptitude Score of the Employee
testofen   Test of English
tech_      Technical Score
g_k_       General Knowledge Score
job_prof   Job Proficiency Score

Scatter Plot
#Importing the Libraries and Data
import pandas as pd
import matplotlib as mlt
import matplotlib.pyplot as plt
job=pd.read_csv('job_proficiency_P3.csv')
# Plotting Scatter plot
plt.scatter(job.aptitude,job.job_prof)
scatter() gives a scatter plot of the two variables mentioned.
c= or color= Argument to add colour

Pearson Correlation Coefficient
# Pearson Correlation Coefficient
import numpy as np
np.corrcoef(job.aptitude,job.job_prof)
array([[ 1.        ,  0.51441069],
       [ 0.51441069,  1.        ]])
corrcoef() gives the matrix of Pearson correlation coefficients for the two variables mentioned. The off-diagonal entries are the correlation between them.
Pearson Correlation Coefficient: 0.5144
There is a positive relation between aptitude and job proficiency, but the relation is of moderate degree.
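As a cross-check on the correlation matrix, the coefficient can be read from the off-diagonal entry and compared with scipy.stats.pearsonr, which also returns a p-value. The sketch below uses two short made-up arrays (x and y are assumptions, not the job proficiency data).

```python
import numpy as np
from scipy import stats

# Invented scores for five hypothetical employees
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 1.0, 4.0, 3.0, 5.0])

r_matrix = np.corrcoef(x, y)   # 2 x 2 correlation matrix
r = r_matrix[0, 1]             # off-diagonal entry: correlation of x with y

# pearsonr returns the same coefficient together with a p-value
r_scipy, p_value = stats.pearsonr(x, y)

print(r, r_scipy, p_value)
```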

ScatterPlot with Regression Line
#Importing Library Seaborn
import seaborn as sns
#Scatterplot of job proficiency against aptitude with Regression Line
sns.lmplot(x='aptitude',y='job_prof',data=job);plt.xlabel('Aptitude');plt.ylabel('Job Proficiency')
sns.lmplot Draws a scatter plot with a fitted regression line using seaborn
plt.xlabel Defines the label on the X axis
plt.ylabel Defines the label on the Y axis

[Figure: scatter plot of job proficiency against aptitude with regression line]

Scatter Plot Matrix using seaborn package
#ScatterPlot Matrix
sns.pairplot(job)

SECTION 3
Data Visualisation
Graphs in Python

Data Snapshot
Data Description: telecom_P3
The data has 1000 rows and 10 columns.

Column      Description
CustID      Customer ID
Age         Age
Gender      Gender
PinCode     PinCode
Active      Whether the customer was active in past 24 weeks or not
Calls       Number of Calls made
Minutes     Number of minutes spoken
Amt         Amount charged
AvgTime     Mean Time per call
Age_Group   Age Group of the Customer

Data Visualization
Data visualization in Python is possible thanks to matplotlib. It is a multiplatform visualization
library built on top of NumPy that works with the SciPy stack to create plots.
It provides the user with complete control over the graph and comes with two interfaces,
an object-oriented style and a MATLAB-style (pyplot) interface.
matplotlib is fairly low level and can be cumbersome to use by itself, which is why
several libraries and wrappers exist on top of its API, such as Seaborn and pandas.
Other libraries, such as Altair and Bokeh, offer alternative plotting approaches.
We will be using the pandas wrapper as a quick tool for visualizing our data and learn
about seaborn as we move on to higher level visualizations. However, the fact remains
that we will essentially be working with matplotlib for both.

Diagrams
#Importing the Libraries
import pandas as pd
import matplotlib as mlt
import matplotlib.pyplot as plt
import seaborn as sns
#Importing Data
telecom_data=pd.read_csv('telecom_P3.csv')
#Aggregate & Merge Data
working=telecom_data.groupby('Age_Group')['CustID'].count()
Counting the CustID values (customers) within each age group.
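The aggregation step can be sketched on a tiny made-up table. The age-group labels and IDs below are assumptions for illustration; the real labels in telecom_P3.csv may differ.

```python
import pandas as pd

# Hypothetical miniature of the telecom data
toy = pd.DataFrame({
    "CustID": [1, 2, 3, 4, 5],
    "Age_Group": ["18-30", "18-30", "30-45", ">45", "30-45"],
})

# Same pattern as in the slide: count customer IDs within each age group
counts = toy.groupby("Age_Group")["CustID"].count()

# value_counts() gives the same numbers, sorted here by label for comparison
same = toy["Age_Group"].value_counts().sort_index()

print(counts)
```

The result is a Series indexed by age group, which is what the bar chart and pie chart calls operate on.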

Simple Bar Chart
#Create a basic bar chart using plot function
working.plot.bar(title='Simple Bar Chart')
plot() Convenience accessor for plotting a Series or DataFrame with labels.
bar() Plots a bar chart. Can also be called by passing the argument kind='bar' to plot().
title A string argument to give the plot a title.

Simple Bar Chart
#Customizing your chart using additional arguments (both provide the same results)
plt.figure(); working.plot.bar(title='Simple Bar Chart', color='red');
plt.xlabel('Age Groups'); plt.ylabel('No. of Customers')
OR
plt.figure(); ax=working.plot.bar(title='Simple Bar Chart', color='red');
ax.set_xlabel('Age Groups'); ax.set_ylabel('No. of Customers')
plt.figure() Creates a new, empty figure for the plot.
ax Matplotlib axes object containing the actual plot (with data points).
color An argument to specify the plot colour. Accepts colour names, hex strings and colour codes.
plt.xlabel, ax.set_xlabel Function/method to specify the x label.
plt.ylabel, ax.set_ylabel Function/method to specify the y label.

Stacked Bar Chart
#Stacked Bar Chart
working2=pd.pivot_table(telecom_data, index=['Age_Group'],
columns=['Gender'], values=['CustID'], aggfunc='count')
plt.figure(); working2.plot.bar(title='Stacked Bar Chart', stacked=True);
plt.xlabel('Age Groups'); plt.ylabel('No. of Customers')
pivot_table Reshapes the data and aggregates according to the function specified. Here, we are counting customers by gender and age group.
index The column or array to group by on the rows of the pivot table (shown on the x axis).
columns The column or array to group by on the columns of the pivot table (one bar segment per value).
values Column to aggregate.
aggfunc Function to aggregate by.
stacked Returns a stacked chart. Default is False.
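A minimal sketch of what the pivot_table call returns, on a made-up six-row table (labels and IDs are assumptions). It also shows one detail worth knowing: passing values as a list, as in the slide, produces two-level column labels, while passing a plain string gives flat ones.

```python
import pandas as pd

# Hypothetical miniature of the telecom data
toy = pd.DataFrame({
    "CustID": [1, 2, 3, 4, 5, 6],
    "Age_Group": ["18-30", "18-30", "30-45", "30-45", "30-45", ">45"],
    "Gender": ["F", "M", "F", "F", "M", "M"],
})

# List arguments, as in the slide: columns become ('CustID', 'F'), ('CustID', 'M')
nested = pd.pivot_table(toy, index=["Age_Group"], columns=["Gender"],
                        values=["CustID"], aggfunc="count")

# String arguments: columns are simply 'F' and 'M'
flat = pd.pivot_table(toy, index="Age_Group", columns="Gender",
                      values="CustID", aggfunc="count")

print(nested)
print(flat)   # combinations with no rows (here '>45' and 'F') appear as NaN
```

Either form can be passed to plot.bar(); the flat version gives shorter legend entries.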

[Figure: Stacked Bar Chart]

Percentage Bar Chart
#Percentage Bar Chart
working3=working2.div(working2.sum(axis=1).astype(float), axis=0)
plt.figure(); working3.plot.bar(title='Percentage Bar Chart',
stacked=True); plt.xlabel('Age Groups'); plt.ylabel('Proportion of Customers')
Creates proportions by dividing each count by its row total (the total for that age group), so every bar sums to 1.
ax Matplotlib axes object containing the actual plot (with data points).
color An argument to specify the plot colour. Accepts colour names, hex strings and colour codes.
plt.xlabel, ax.set_xlabel Function/method to specify the x label.
plt.ylabel, ax.set_ylabel Function/method to specify the y label.
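The row-wise division can be checked on a small made-up table of counts (the numbers and labels are assumptions for illustration). sum(axis=1) adds across the gender columns, giving one total per age group, and div(..., axis=0) divides every row by its own total.

```python
import pandas as pd

# Invented counts: rows are age groups, columns are genders
counts = pd.DataFrame(
    {"F": [30, 20], "M": [10, 20]},
    index=pd.Index(["18-30", "30-45"], name="Age_Group"),
)

row_totals = counts.sum(axis=1)           # one total per age group
shares = counts.div(row_totals, axis=0)   # each row now sums to 1
percent = shares * 100                    # the same values as percentages

print(shares)
```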

[Figure: Percentage Bar Chart]

Multiple Bar Chart
#Multiple Bar Chart
plt.figure(); working2.plot.bar(title='Multiple Bar Chart');
plt.xlabel('Age Groups'); plt.ylabel('No. of Customers')
pivot_table Reshapes the data and aggregates according to the function specified.
index The column or array to group by on the rows of the pivot table (shown on the x axis).
columns The column or array to group by on the columns of the pivot table (one bar per value).
values Column to aggregate.
aggfunc Function to aggregate by.

[Figure: Multiple Bar Chart]

Pie Chart
#Pie Chart
working.plot.pie(label=('Age Groups'), colormap='brg')
pie() Creates a pie chart.
label Specifies the label to be used.
colormap String argument that specifies what colors to choose from.

Box Plot
#BoxPlot
telecom_data.Calls.plot.box(label='No. Of Calls')
box() in pandas draws a box plot.
Calls specifies the vector (column) for which the box plot needs to be plotted.
label provides a user-defined label for the variable.
color can be used to input your choice of color for the boxes.
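The numbers behind a box plot can be computed directly. The sketch below uses a made-up array (an assumption, not the Calls column) and the 1.5 x IQR rule that matplotlib applies by default to decide which points are drawn as outliers.

```python
import numpy as np

# Invented call counts with one unusually large value
calls = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 50], dtype=float)

q1, median, q3 = np.percentile(calls, [25, 50, 75])
iqr = q3 - q1                      # height of the box

# Points beyond 1.5 * IQR from the box are plotted individually as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = calls[(calls < lower_fence) | (calls > upper_fence)]

print(q1, median, q3, iqr, outliers)
```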

Box Plot
#BoxPlot using multiple variables. Here, we are plotting number of calls by age group.
telecom_data.boxplot(column='Calls', by='Age_Group', grid=False)
boxplot() in pandas draws a box plot. It's a different way of writing plot.box().
column specifies the vector (variable) for which the box plot needs to be plotted.
by specifies the vector (column) by which the distribution should be grouped.
grid can be used to remove the background grid seen in each plot.
The Y axis label can be set afterwards with plt.ylabel().

Histogram
#Histogram
telecom_data.Calls.hist(bins=12,grid=False)
hist() in pandas yields a histogram.
bins specifies the number of bars (equal-width intervals).
Axis labels can be added afterwards with plt.xlabel() and plt.ylabel().
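The bins argument is a count of intervals, not a bar width. A minimal sketch with numpy.histogram on simulated data (the random values are an assumption standing in for the Calls column) shows that bins=12 yields 12 counts and 13 equally spaced edges.

```python
import numpy as np

# Simulated stand-in for the Calls column: 1000 integers between 0 and 119
rng = np.random.default_rng(0)
calls = rng.integers(0, 120, size=1000)

counts, edges = np.histogram(calls, bins=12)

bin_width = edges[1] - edges[0]   # width follows from the data range and bin count
print(len(counts), len(edges), bin_width)
```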

Stem and Leaf Plot
#Stem Plot using matplotlib
plt.stem(telecom_data.Calls)
stem() in matplotlib draws a stem plot: a vertical line from the baseline to each value, in row order. It does not produce a text stem-and-leaf display.
telecom_data.Calls specifies the vector (variable) for which the stem plot needs to be plotted.
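matplotlib's stem() draws one vertical line per observation, which is not the same as the classic stem-and-leaf display that groups values by their leading digits. A plain-Python sketch of the classic display is given below; the function name stem_and_leaf and the sample values are hypothetical, and it assumes non-negative integers with the last digit used as the leaf.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Group non-negative integers by tens (stem) and list the units (leaves)."""
    groups = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(int(v), 10)
        groups[stem].append(leaf)
    return dict(sorted(groups.items()))

# Invented call counts
calls = [12, 15, 21, 23, 23, 30, 47]
table = stem_and_leaf(calls)

for stem, leaves in table.items():
    print(f"{stem} | {' '.join(map(str, leaves))}")
```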

Heat Map
#Importing data and aggregating calls by gender and age group
agg=pd.pivot_table(telecom_data, index=['Age_Group'],
columns=['Gender'], values=['Calls'], aggfunc='sum')
# Heat Map
ax=sns.heatmap(agg); ax.set(xlabel='Gender', ylabel='Age Group',
title='Heatmap for Number of Calls by Age & Gender'); plt.show()
ax Axes object returned by seaborn.
heatmap() Seaborn method for creating a heatmap.
ax.set Sets text data (labels and title) in the graph.
linewidths Optional heatmap() argument that adds lines between each cell. Default is zero.
