Using Pandas Library for Data Analysis in Python
Bruce Jenks
Assumptions
I will assume that you have a basic understanding of Python and coding in general. You do not need to be
an expert programmer but will need to understand basic syntax and general coding standards. Lastly, although
not required, this is geared toward those in data science or those looking to get into a data science type role
such as data analyst or data scientist.
What is pandas?
“pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and
data analysis tools for the Python programming language. The best way to think about the pandas data
structures is as flexible containers for lower dimensional data. For example, DataFrame is a container for
Series, and Series is a container for scalars. We would like to be able to insert and remove objects from these
containers in a dictionary-like fashion.” http://pandas.pydata.org/pandas-docs/stable/overview.html
Why pandas?
In short, it is a game changer and one of the most widely used tools for data munging/wrangling. It can
take a CSV file or even a SQL database table and turn it into a Python object with rows and columns
called a DataFrame. Think of Excel but with the credibility of Python. pandas is the cornerstone of
handling tabular data in Python, and you will use it for numerous data manipulation and
exploration tasks.
Installing Python
If you have used Python before, you probably already have a shell or IDE installed. For this project I will be
using Jupyter Notebook from Anaconda. This is not required but highly recommended. Below is a quick start
guide to help you get started:
Go to https://www.anaconda.com/download/ and download the Python 3.6 installer for Windows. Please see
anaconda.com for information on any issues you may experience.
Using Jupyter Notebook
Once Anaconda is downloaded and installed, you will want to create a folder for this project. I am going to create a folder
directly on my desktop and call it my-notebook:
Once you create your folder, hold down the Shift key and right-click on it to bring up the menu
below. Click on Open PowerShell window here. *Note: you can also do this in CMD*
Type jupyter notebook into PowerShell:
Hit Enter and your browser should open and display the following page. Click New and select Python 3:
You should now have the following screen. Jupyter now uses our folder as its working directory, which
helps us avoid file location problems later:
Getting Data
We are going to need some data for our project, so we will download some from Analytics Vidhya, which is a go-to
site for those interested in machine learning, data analytics, and using advanced techniques for roles in
data science.
Go to https://datahack.analyticsvidhya.com/contest/practice-problem-loan-prediction-iii/
and download the CSV file under Train File:
I will rename the file loan_data and save it to a folder on the C: drive called PythonData:
Import pandas
Before I get started, it is worth noting some resources for pandas. The best place to start learning is 10
Minutes to pandas: http://pandas.pydata.org/pandas-docs/stable/10min.html You may want to spend some
time there before going on with this tutorial, although it is not required. Once you are ready, type the following
into Python, paying attention to the location of your CSV file if it differs from mine:
Now that the dataset is loaded, we can take a look at the top rows by using the function head().
You should now see the data below. The (10) should print 10 rows:
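The loading step can be sketched roughly as follows. This is a minimal sketch, not the exact code from the original screenshot: the in-memory sample and its rows are made up so the example runs without the download, and the path in the comment simply follows the folder and file name chosen above (assuming the file kept its .csv extension).

```python
import io

import pandas as pd

# Hypothetical stand-in for the downloaded file; these rows are made up.
# With the real file you would pass its path instead, for example:
#   df = pd.read_csv("C:/PythonData/loan_data.csv")
sample_csv = io.StringIO(
    "Loan_ID,Gender,ApplicantIncome,LoanAmount,Credit_History,Loan_Status\n"
    "LP001,Male,5849,,1,Y\n"
    "LP002,Male,4583,128,1,N\n"
    "LP003,Female,3000,66,0,N\n"
)

# read_csv accepts a path or a file-like object and returns a DataFrame.
df = pd.read_csv(sample_csv)

# head(n) returns the first n rows; with fewer than n rows it returns them all.
top_rows = df.head(10)
print(top_rows)
```

Note that the empty LoanAmount field in the first row is read as NaN, which is how pandas represents a missing value.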
Let’s take a look at a summary of numerical variables by using describe():
We can make a few inferences with this data:
1. LoanAmount has 22 missing values (614 - 592).
2. Loan_Amount_Term has 14 missing values (614 - 600).
3. Credit_History has 50 missing values (614 - 564).
4. Since Credit_History has a value of 1 for those with a credit history and a value of 0 for those without, the
mean row of the last column tells us that about 84% of applicants with a recorded value do have a
credit history.
5. We can get a good idea of skew by comparing the mean to the median, which is our 50% figure.
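The arithmetic behind these inferences can be reproduced in code. The sketch below uses a small made-up DataFrame (not the loan data, so the numbers differ from those above) to show how the count, mean, and 50% rows of describe() give the missing-value counts, the share of applicants with a credit history, and a rough sense of skew.

```python
import numpy as np
import pandas as pd

# Made-up sample with the same column names as the loan data.
df = pd.DataFrame({
    "ApplicantIncome": [5849, 4583, 3000, 2583, 81000],
    "LoanAmount": [np.nan, 128.0, 66.0, 120.0, 360.0],
    "Credit_History": [1.0, 1.0, 0.0, np.nan, 1.0],
})

summary = df.describe()
print(summary)

# Missing values per column: total rows minus the non-null count from describe().
missing = (len(df) - summary.loc["count"]).astype(int)
print(missing)

# The same counts, computed directly.
print(df.isnull().sum())

# The mean of a 0/1 column is the share of 1s among the non-missing values.
credit_share = summary.loc["mean", "Credit_History"]

# A mean far above the median (the 50% row) suggests a right-skewed column.
income_mean = summary.loc["mean", "ApplicantIncome"]
income_median = summary.loc["50%", "ApplicantIncome"]
print(credit_share, income_mean, income_median)
```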
Now let’s take a look at categorical variables such as Property_Area and Credit_History by looking at their
frequency distributions using df['Property_Area'].value_counts() and
df['Credit_History'].value_counts():
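As a small illustration of value_counts(), here is a sketch on made-up rows (the category labels are assumptions for the example). By default the result is sorted from most to least frequent and missing values are left out; passing dropna=False includes them.

```python
import numpy as np
import pandas as pd

# Made-up sample; the real columns come from the loan data file.
df = pd.DataFrame({
    "Property_Area": ["Urban", "Rural", "Urban", "Semiurban", "Urban", "Rural"],
    "Credit_History": [1.0, 1.0, 0.0, np.nan, 1.0, 1.0],
})

# Frequency of each category, most frequent first.
area_counts = df["Property_Area"].value_counts()
print(area_counts)

# Missing values are excluded by default...
credit_counts = df["Credit_History"].value_counts()
print(credit_counts)

# ...and included as their own row with dropna=False.
credit_counts_with_missing = df["Credit_History"].value_counts(dropna=False)
print(credit_counts_with_missing)
```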
Distribution Analysis
With a basic idea of the data characteristics, we can now look at some distributions. We will plot a histogram of
ApplicantIncome by typing df['ApplicantIncome'].hist(bins=50):
We can clearly see that there are a few extreme values. This is why 50 bins are needed to
depict the distribution clearly.
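To see why the number of bins matters when a few values are extreme, the sketch below compares a coarse and a fine binning on made-up incomes and then draws the histogram. The data and axis labels are assumptions for illustration; the Agg backend line is only there so the code also runs as a plain script and is not needed inside Jupyter.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit inside a notebook

import numpy as np
import pandas as pd

# Made-up incomes: most are modest, one is extreme.
df = pd.DataFrame({
    "ApplicantIncome": [2500, 3000, 3200, 3500, 4000, 4200, 4500, 5000, 6000, 81000]
})

# With few bins, the extreme value stretches the range and almost
# everything lands in the first bin; more bins separate the bulk.
counts_coarse, _ = np.histogram(df["ApplicantIncome"], bins=5)
counts_fine, _ = np.histogram(df["ApplicantIncome"], bins=50)
print(counts_coarse[0], counts_fine[0])

# Series.hist draws the histogram and returns the matplotlib Axes.
ax = df["ApplicantIncome"].hist(bins=50)
ax.set_xlabel("ApplicantIncome")
ax.set_ylabel("Number of applicants")
```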
We will take this a step further to better understand the distributions by creating a box plot:
Type df.boxplot(column='ApplicantIncome'):
Our box plot confirms the existence of outliers. Since we know that there is income disparity in our society, let
us see what happens when we look at people of different education levels:
Type: df.boxplot(column='ApplicantIncome', by='Education')
The median incomes of the two groups (the line inside each box) appear to be relatively equal. However, there is a higher number of
graduates with extremely high incomes.
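The line inside each box of a box plot marks the group's median, so the comparison can be checked numerically with a groupby. The sketch below uses made-up incomes and assumes the two education labels shown; it is an illustration of the idea rather than output from the loan data.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit inside a notebook

import pandas as pd

# Made-up sample: similar typical incomes, one very high graduate income.
df = pd.DataFrame({
    "Education": ["Graduate"] * 5 + ["Not Graduate"] * 5,
    "ApplicantIncome": [3000, 3800, 4000, 5000, 40000,
                        2800, 3500, 3900, 4200, 6000],
})

# One box per education level.
df.boxplot(column="ApplicantIncome", by="Education")

# The numbers behind the plot: medians are close, while the mean and
# maximum of the first group are pulled up by the extreme value.
stats = df.groupby("Education")["ApplicantIncome"].agg(["median", "mean", "max"])
print(stats)
```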
Let’s create a histogram and box plot of LoanAmount.
Type df['LoanAmount'].hist(bins=50) and then df.boxplot(column='LoanAmount')
Again we see extreme values: LoanAmount has both missing and extreme values, while
ApplicantIncome has a few extreme values, which we will dig deeper into with variable analysis.
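The points a box plot draws beyond its whiskers can also be listed directly. A common convention, and the default whisker rule in matplotlib, flags values more than 1.5 times the interquartile range (IQR) outside the quartiles. The sketch below applies that rule to made-up loan amounts; treat the threshold as a convention, not a definition of what counts as an error in the data.

```python
import numpy as np
import pandas as pd

# Made-up loan amounts with one missing and one extreme value.
df = pd.DataFrame({
    "LoanAmount": [np.nan, 66.0, 100.0, 120.0, 128.0, 140.0, 150.0, 700.0]
})
loan = df["LoanAmount"]

# Count the missing entries.
n_missing = loan.isnull().sum()

# Quartiles ignore missing values by default.
q1 = loan.quantile(0.25)
q3 = loan.quantile(0.75)
iqr = q3 - q1

# The 1.5 * IQR fences used for box plot whiskers.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Comparisons with NaN are False, so missing rows are not flagged.
outliers = loan[(loan < lower) | (loan > upper)]
print(n_missing, lower, upper)
print(outliers)
```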
Categorical Variable Analysis
Like many people, I find Excel to be quick and easy and believe pivot tables are a great way to analyze data.
I admit I am crazy, but I really like using VBA to create my PivotTables. However, I must admit that pandas
makes this a little easier with simple code:
# Frequency of each Credit_History value, least frequent first
temp1 = df['Credit_History'].value_counts(ascending=True)
# Share of approved loans per Credit_History value (Y counts as 1, N as 0)
temp2 = df.pivot_table(values='Loan_Status', index=['Credit_History'],
                       aggfunc=lambda x: x.map({'Y': 1, 'N': 0}).mean())
print('Frequency Table for Credit History:')
print(temp1)
print('\nProbability of getting loan for each Credit History class:')
print(temp2)
Which looks like:
Now let’s create a bar chart with our pivot table:
We can see that you are 8 times more likely to get a loan if you have a valid credit history.
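One way to draw such a bar chart is sketched below. It is an assumption about the plotting step, not the code from the original screenshot, and it computes the approval rates with groupby instead of the pivot table used earlier so that the example is self-contained. The rows are made up, so the ratio printed here is not the figure quoted above.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit inside a notebook

import pandas as pd

# Made-up sample of credit history flags and loan decisions.
df = pd.DataFrame({
    "Credit_History": [0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0],
    "Loan_Status": ["N", "N", "N", "Y", "Y", "Y", "Y", "N", "Y"],
})

# Encode approvals as 1/0, then average within each Credit_History group.
approved = df["Loan_Status"].map({"Y": 1, "N": 0})
approval_rate = approved.groupby(df["Credit_History"]).mean()
print(approval_rate)

# One bar per Credit_History value.
ax = approval_rate.plot(kind="bar")
ax.set_xlabel("Credit_History")
ax.set_ylabel("Probability of getting loan")

# How many times more likely approval is with a credit history.
ratio = approval_rate.loc[1.0] / approval_rate.loc[0.0]
print(ratio)
```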
Summary
This has been a getting started guide to using pandas in Python. It gives you an idea of how powerful a tool
Python is, especially with the right library. In the following weeks I plan on producing a more advanced look
into Python and pandas. We will investigate data munging and building a predictive model.
