Using Pandas Library for Data Analysis in Python

Bruce Jenks
Assumptions
I will assume that you have a basic understanding of Python and coding in general. You do not need to be
an expert programmer, but you will need to understand basic syntax and general coding standards. Lastly, although
not required, this guide is geared toward those in data science or those looking to get into a data science role
such as data analyst or data scientist.
What is pandas?
“pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and
data analysis tools for the Python programming language. The best way to think about the pandas data
structures is as flexible containers for lower dimensional data. For example, DataFrame is a container for
Series, and Series is a container for scalars. We would like to be able to insert and remove objects from these
containers in a dictionary-like fashion.” http://pandas.pydata.org/pandas-docs/stable/overview.html
Why pandas?
In short, it is a game changer and one of the most widely used tools for data munging/wrangling. It can
take a CSV file or even a table from a SQL database and turn it into a Python object with rows and columns
called a DataFrame. Think of Excel, but with the credibility of Python. pandas is the cornerstone of
handling tabular data in Python, and you will use it for numerous data manipulation and
exploration tasks.
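To make the "CSV or SQL into a DataFrame" idea concrete, here is a minimal sketch. The two rows, the column names, and the table name people are made up for illustration; an in-memory string stands in for a CSV file and a throwaway in-memory SQLite database stands in for a real SQL database.

```python
import io
import sqlite3

import pandas as pd

# Made-up rows for illustration only; not taken from any real dataset.
csv_text = "name,income\nAnn,5000\nBob,3200\n"

# A CSV source: an in-memory string standing in for a file on disk.
from_csv = pd.read_csv(io.StringIO(csv_text))

# A SQL source: a throwaway in-memory SQLite database with one table.
conn = sqlite3.connect(":memory:")
from_csv.to_sql("people", conn, index=False)
from_sql = pd.read_sql_query("SELECT name, income FROM people", conn)
conn.close()

# Both sources end up as the same kind of object: a DataFrame.
print(type(from_csv).__name__, from_csv.shape)
print(type(from_sql).__name__, from_sql.shape)
```

Either way you get rows and columns you can filter, summarize, and plot with the same methods.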
Installing Python
If you have used Python before you probably already have a shell or IDE installed. For this project I will be
using Jupyter Notebook from Anaconda. This is not required but highly recommended. Below is a quick start
guide to help you get started:
Go to https://www.anaconda.com/download/ and download the Python 3.6 installer for Windows. Please see
anaconda.com for help with any issues you may experience.

Using Jupyter Notebook
Once Anaconda is downloaded and installed, you will want to create a folder for this project. I am going to create a folder
directly on my desktop and call it my-notebook:
Once you create your folder, hold down the Shift key and right-click on it to bring up the menu
below. Click on Open PowerShell window here. *Note: you can also do this in CMD*
Type jupyter notebook into PowerShell:

Hit Enter and your browser should open and display the following page. Click New and select Python 3:
You should now have the following screen. Because we started Jupyter from inside our folder, the notebook
is linked to it, which will allow us to avoid numerous problems:

Getting Data
We are going to need some data for our project, so we will download some from Analytics Vidhya, a go-to
site for those interested in machine learning, data analytics, and using advanced techniques for roles in
data science.
Go to https://datahack.analyticsvidhya.com/contest/practice-problem-loan-prediction-iii/
and download the CSV file under Train File:
I will rename the file loan_data and save it to a folder on the C: drive called PythonData:

Import pandas
Before I get started it is worth noting some resources for pandas. The best starting place to learn is 10
Minutes to pandas: http://pandas.pydata.org/pandas-docs/stable/10min.html You may want to spend some
time there before going on with this tutorial, although it is not required. Once you are ready, type the following
into Python, paying attention to the location of your CSV file if it is different from mine:
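A minimal sketch of the loading step looks something like this. The path C:/PythonData/loan_data.csv is my assumption based on the folder and file name chosen above (including the .csv extension), and load_loans is a hypothetical helper name, not part of pandas. The existence check is only there so the sketch does not crash if your file lives somewhere else.

```python
from pathlib import Path

import pandas as pd

# Assumed location, based on the folder and file name chosen above.
# Change this if you saved the file somewhere else.
CSV_PATH = Path("C:/PythonData/loan_data.csv")


def load_loans(path=CSV_PATH):
    """Read the loan CSV into a DataFrame (hypothetical helper)."""
    return pd.read_csv(path)


if CSV_PATH.exists():
    df = load_loans()
    print(df.shape)  # (number of rows, number of columns)
else:
    print(f"File not found: {CSV_PATH}")
```

In a notebook you can skip the helper and simply assign the result of pd.read_csv to df; the rest of this guide assumes the DataFrame is called df.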
Now that the dataset is loaded, we can take a look at the top rows by using the function head().
You should now see the data below. The (10) in df.head(10) asks for the first 10 rows:
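As a quick sketch of how head() behaves, here is a stand-in DataFrame with made-up numbers (not the real loan file). Only the column names come from the loan data.

```python
import pandas as pd

# Stand-in data with made-up values; 12 rows so head(10) has something to cut.
df = pd.DataFrame({
    "ApplicantIncome": range(1000, 13000, 1000),
    "LoanAmount": range(100, 220, 10),
})

top5 = df.head()     # with no argument, head() returns the first 5 rows
top10 = df.head(10)  # pass a number to ask for that many rows

print(len(top5), len(top10))
print(top10)
```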

Let’s take a look at a summary of numerical variables by using describe():
We can make a few inferences with this data:
1. LoanAmount has 22 missing values. (614 - 592)
2. Loan_Amount_Term has 14 missing values. (614 - 600)
3. Credit_History has 50 missing values. (614 - 564)
4. Since Credit_History has a value of 1 for those with a credit history and a value of 0 for those without, the
mean row of the last column tells us that about 84% of applicants with a recorded value do indeed have a
Credit_History.
5. We can get a good idea of skew by comparing the mean to the median, which is our 50% figure.
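The arithmetic behind these inferences can be sketched as follows. The six rows are made up so the numbers are easy to check by hand; only the column names come from the loan data.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration; NaN marks a missing entry.
df = pd.DataFrame({
    "LoanAmount": [120.0, np.nan, 66.0, 141.0, np.nan, 95.0],
    "Credit_History": [1.0, 1.0, 0.0, 1.0, np.nan, 1.0],
})

summary = df.describe()

# describe() counts only non-missing values, so rows minus count = missing.
missing_from_count = len(df) - summary.loc["count"]

# The same answer, asked for directly.
missing_direct = df.isnull().sum()

# For a 0/1 column, the mean is the share of 1s among the recorded values.
credit_share = df["Credit_History"].mean()

# A mean well above the median (the 50% row) hints at a right skew.
mean_minus_median = summary.loc["mean"] - summary.loc["50%"]

print(missing_direct)
print(credit_share)
print(mean_minus_median)
```

On the real file, the same subtraction is what gives 614 - 592 = 22 for LoanAmount.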
Now let’s take a look at the categorical values such as Property_Area and Credit_History by looking at a
frequency distribution using df['Property_Area'].value_counts()

df['Credit_History'].value_counts():
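Here is a small sketch of what value_counts() returns, using made-up rows. One detail worth knowing: missing values are left out of the counts unless you pass dropna=False.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only.
df = pd.DataFrame({
    "Property_Area": ["Urban", "Rural", "Semiurban",
                      "Urban", "Semiurban", "Semiurban"],
    "Credit_History": [1.0, 0.0, 1.0, 1.0, np.nan, 1.0],
})

# Counts per category, most frequent first.
area_counts = df["Property_Area"].value_counts()

# Missing values are excluded by default...
credit_counts = df["Credit_History"].value_counts()

# ...unless we ask for them to be counted as their own group.
credit_counts_all = df["Credit_History"].value_counts(dropna=False)

print(area_counts)
print(credit_counts_all)
```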
Distribution Analysis
With a basic idea of the data's characteristics, we can now look at some distributions. We will plot a histogram of
ApplicantIncome by typing df['ApplicantIncome'].hist(bins=50):
We can clearly see that there are a few extreme values. Note that this is the reason why 50 bins are required to
depict the distribution clearly.

We will take this a step further to better understand the distributions by creating a box plot:
Type df.boxplot(column='ApplicantIncome'):
Our box plot confirms the existence of outliers. Since we know that there is income disparity in our society, let
us see what happens when we look at people of different education levels:
Type: df.boxplot(column='ApplicantIncome', by='Education')
The median incomes of the two groups (the line inside each box) appear to be relatively equal. However, there
is a higher number of graduates with extremely high incomes.
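The plotting calls above can be put together in one runnable sketch. I am assuming matplotlib is installed (Anaconda ships with it), the eight rows are made up, and the Agg backend lines are only there so the code also runs outside Jupyter. The last few lines apply the 1.5 x IQR rule, which is matplotlib's default for deciding which points a box plot draws as outliers.

```python
import matplotlib
matplotlib.use("Agg")  # draw off-screen; not needed inside a Jupyter notebook
import matplotlib.pyplot as plt
import pandas as pd

# Made-up rows for illustration; one income is deliberately extreme.
df = pd.DataFrame({
    "ApplicantIncome": [2500, 3000, 3200, 4100, 4500, 5200, 6000, 39000],
    "Education": ["Graduate", "Not Graduate", "Graduate", "Not Graduate",
                  "Graduate", "Not Graduate", "Graduate", "Graduate"],
})

# Histogram on its own figure.
fig1, ax1 = plt.subplots()
df["ApplicantIncome"].hist(bins=50, ax=ax1)

# Box plot split by education level on a second figure.
fig2, ax2 = plt.subplots()
df.boxplot(column="ApplicantIncome", by="Education", ax=ax2)

# The same outlier rule the box plot uses by default, done by hand:
# anything above Q3 + 1.5 * IQR is drawn as an individual point.
q1 = df["ApplicantIncome"].quantile(0.25)
q3 = df["ApplicantIncome"].quantile(0.75)
upper_limit = q3 + 1.5 * (q3 - q1)
outliers = df.loc[df["ApplicantIncome"] > upper_limit, "ApplicantIncome"]

print(outliers.tolist())  # [39000]
```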

Let’s create a histogram and box plot of LoanAmount.
Type df['LoanAmount'].hist(bins=50) and then df.boxplot(column='LoanAmount')
Again we see extreme values: LoanAmount has both missing and extreme data.
ApplicantIncome also has a few extreme values, which we will dig deeper into with variable analysis.

Categorical Variable Analysis
Like many people, I find Excel to be quick and easy and believe pivot tables are a great way to analyze data.
I admit I am crazy, but I really like using VBA to create my PivotTables. However, I must admit that pandas
makes this a little easier with simple code:
temp1 = df['Credit_History'].value_counts(ascending=True)
# Map Y/N to 1/0 so the mean is the share of approved loans in each class
temp2 = df.pivot_table(values='Loan_Status', index=['Credit_History'],
                       aggfunc=lambda x: x.map({'Y': 1, 'N': 0}).mean())
print('Frequency Table for Credit History:')
print(temp1)
print('\nProbability of getting loan for each Credit History class:')
print(temp2)
Which looks like:

Now let's create a bar chart with our pivot table:
We can see that you are 8 times more likely to get a loan if you have a valid credit history.
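A "times more likely" figure like this is just one approval rate divided by the other. Here is a minimal sketch of that calculation with made-up rows, so the ratio it prints is not the real one. I use groupby here as an alternative way to get the same per-class rates the pivot table gives.

```python
import pandas as pd

# Made-up rows for illustration; the real file gives different rates.
df = pd.DataFrame({
    "Credit_History": [1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0],
    "Loan_Status":    ["Y", "Y", "Y", "N", "N", "N", "N", "Y", "Y", "N"],
})

# Turn Y/N into 1/0 so that the mean is the approval rate.
approved = df["Loan_Status"].map({"Y": 1, "N": 0})

# Approval rate for each Credit_History class.
rates = approved.groupby(df["Credit_History"]).mean()

# How many times more likely approval is with a credit history.
ratio = rates.loc[1.0] / rates.loc[0.0]

print(rates)
print(round(ratio, 1))
```

With matplotlib installed, rates.plot(kind='bar') draws these rates as a bar chart.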

Summary
This has been a getting started guide to using pandas in Python. It gives you an idea of how powerful a tool
Python is, especially with the right library. In the following weeks I plan on producing a more advanced look
into Python and pandas. We will investigate data munging and building a predictive model.
