Python for Data Analytics
Lectures 1 & 2: The Python Language and Environment
Rodrigo Belo
[email protected]
Spring 2015
Introduction
Instructor
Rodrigo Belo
Researcher at Carnegie Mellon University and at Católica-Lisbon, Portugal
PhD in Technological Change and Entrepreneurship from Carnegie
Mellon University
Research Interests: Social Networks and Technology in
Educational Settings
Background: Undergraduate degree in Computer Science and
Engineering, 5 years as Software Engineer
Email: [email protected]
Course Description
This course introduces Python as a tool to collect, process, and analyze
large data sets from a variety of sources to create information that
guides business decision making
Course Description
Students will become familiar with Python as a language and as a
platform to integrate different technologies and techniques for data
analytics, including:
Collection of online information;
Tools and strategies for data storage; and
Data analysis methods.
Course Description
Each class will start with the introduction of a concept or tool and end with
in-class hands-on exercises using example datasets.
Throughout the course students will apply these techniques to do their
Homework and their Term Project.
Learning Objectives
Upon completion of this course, the student will be able to:
Use Python as a general-purpose programming language
Collect data available online in an automated fashion
Process and store data in the appropriate format for future analysis
Apply data analytics tools to extract relevant information
Source Materials
Textbooks:
Main: McKinney (2012), Python for Data Analysis, O'Reilly
Other: Russell (2011), Mining the Social Web, O'Reilly
Online references:
Python 2 Documentation: https://docs.python.org/2/
pandas online reference:
http://pandas.pydata.org/pandas-docs/stable/
ggplot online reference: http://ggplot.yhathq.com
Grading
Individual Assignments: 40%
Assignments will be done by individual students and posted on Blackboard.
Specific assignments will appear approx. 1 week prior to due date.
Term Project: 30%
The term project will be done in teams of 2 or 3 people and will involve the
application of the methods covered in class.
Students will identify a question they would like to answer using publicly
available data, gather the data from an online source, store it and analyze
it using some of the methods shown in class.
Final Exam: 30%
May 6, 6pm
Late Work
If work is delivered t seconds late, its score is adjusted by multiplying it by
$1 - \left(\frac{t}{24 \cdot 5 \cdot 60 \cdot 60}\right)^{4}$
[Figure: maximum grade (0 to 100) as a function of the number of days late]
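The late-work multiplier can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the formula on the slide is damaged, so the minus sign and the exponent 4 are a reading of it rather than a confirmed rule, and the function name `late_multiplier` and the floor at zero are hypothetical additions.

```python
# Assumed reading of the slide: multiplier = 1 - (t / (24*5*60*60))**4
SECONDS_IN_FIVE_DAYS = 24 * 5 * 60 * 60  # 432000 seconds


def late_multiplier(t):
    """Return the score multiplier for work delivered t seconds late.

    The floor at 0.0 is an assumption so that very late work never
    receives a negative score.
    """
    fraction = float(t) / SECONDS_IN_FIVE_DAYS
    return max(0.0, 1.0 - fraction ** 4)


for days in (0, 1, 2.5, 5):
    seconds = days * 24 * 60 * 60
    print("{0} day(s) late: multiplier {1:.4f}".format(days, late_multiplier(seconds)))
```

Under this reading the penalty is small for the first day and reaches a multiplier of zero at five days.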
Basic Concepts and Environment
Why Python?
Python is one of the most popular dynamic languages, along with Ruby,
Perl, R, and others
Python has a large and active scientific computing community
Adoption of Python has increased significantly since the 2000s in both
industry and academia
Python started as a general-purpose programming language, but its data
manipulation libraries make it a first-class citizen in data manipulation
and analysis
Excellent choice as a single language for building data-centric
applications
Python as Glue
Python integrates easily with C, C++, and FORTRAN, languages in which
many routines are implemented
Most programs consist of small portions of code where most of the time is
spent, and large portions of glue code that doesn't run often
In many cases the execution time of glue code is irrelevant
Python can be used both as a prototyping language and as a
production language
Python Essentials
Some of the essential Python libraries and tools:
NumPy
SciPy
pandas
ggplot
IPython
Python Essentials: NumPy
NumPy (Numerical Python) is the foundational package for scientific
computing in Python. It provides, among other things:
A fast and efficient multidimensional array object: ndarray
Functions for performing element-wise computations with arrays or
mathematical operations between arrays
Linear algebra operations, Fourier transform, and random number
generation
Tools for connecting C, C++, and Fortran code to Python
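The NumPy features listed above can be illustrated with a short sketch. The array values are made up for the example, and `print` is called with a single parenthesized argument so the code should run under Python 2.7 as well as Python 3.

```python
import numpy as np

# Two small 2-D ndarray objects (values chosen only for illustration)
a = np.array([[1.0, 2.0], [3.0, 4.0]])
b = np.array([[0.5, 0.5], [0.5, 0.5]])

elementwise = a * b            # element-wise product
matrix_product = a.dot(b)      # matrix (linear algebra) product
determinant = np.linalg.det(a)

# Fourier transform of a unit impulse: every frequency component equals 1
spectrum = np.fft.fft(np.array([1.0, 0.0, 0.0, 0.0]))

# Random number generation, seeded so the draw is repeatable
np.random.seed(0)
samples = np.random.randn(3)

print(a.shape)
print(elementwise)
print(matrix_product)
```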
Python Essentials: SciPy
SciPy is a collection of packages addressing a number of different standard
problem domains in scientific computing:
scipy.integrate: numerical integration routines and differential
equation solvers
scipy.linalg: linear algebra routines and matrix decompositions
extending beyond those provided in numpy.linalg.
scipy.optimize: function optimizers (minimizers) and root finding
algorithms
scipy.signal: signal processing tools
scipy.sparse: sparse matrices and sparse linear system solvers
scipy.stats: standard continuous and discrete probability
distributions (density functions, samplers, continuous distribution
functions), various statistical tests, and more descriptive statistics
Python Essentials: pandas
pandas provides data structures and functions designed to make working
with structured data fast, easy and expressive
DataFrame is the primary object of this library
A two-dimensional object that resembles a table with rows and columns
meat[:5]
        date  beef  veal  pork  lamb_and_mutton  broilers  other_chicken  turkey
0 1944-01-01   751    85  1280               89       NaN            NaN     NaN
1 1944-02-01   713    77  1169               72       NaN            NaN     NaN
2 1944-03-01   741    90  1128               75       NaN            NaN     NaN
3 1944-04-01   650    89   978               66       NaN            NaN     NaN
4 1944-05-01   681   106  1029               78       NaN            NaN     NaN
Python Essentials: ggplot
ggplot is a graphics library that makes it easy to create graphics
from ggplot import *
ggplot(aes(x='date', y='beef'), data=meat) + \
    geom_line() + \
    stat_smooth(colour='blue', span=0.2)
[Figure: line plot of beef (y-axis, 0 to 3000) against date (x-axis, 1945 to 2005) with a smoothed trend line]
Python Essentials: IPython
IPython is the component that ties everything together. Aside from the
standard terminal-based shell, IPython provides:
IPython notebook: HTML notebook for connecting to IPython through
a web browser
GUI console with inline plotting, multiline editing and syntax
highlighting
Infrastructure for interactive parallel and distributed computing
Installation and Setup
Mac OS X and Linux distributions come with a Python distribution, but not
necessarily with all the required libraries
New users can install Anaconda (http://continuum.io/downloads) or
Canopy (https://store.enthought.com/downloads/)
To install IPython (and Python) follow the instructions on
http://ipython.org/install.html
You will need IPython notebook
Python 2 and Python 3
The Python community is currently undergoing a transition from the
Python 2 series of interpreters to the Python 3 series
Until the appearance of Python 3.0, new Python releases were generally
backwards compatible with existing code
The community decided that in order to move the language forward,
certain backwards incompatible changes were necessary
Python 2 and Python 3
Python 3.x is a cleaned up version of Python 2.x
Many inconsistencies were removed in the new version
2.x: print "The answer is", 2*2
3.x: print("The answer is", 2*2)
More details at
http://nbviewer.ipython.org/github/rasbt/python_reference/blob/
master/tutorials/key_differences_between_python_2_and_3.ipynb
However, there is still a considerable amount of code written in Python 2.x,
making it the de facto standard
In this course we will be using Python 2.x
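As a small illustration of the print difference, one hedged option is to build the message first and print a single parenthesized string, which Python 2.x reads as a print statement and Python 3.x reads as a function call. The helper name `answer_line` is hypothetical. Python 2.6 and later also offer `from __future__ import print_function` for the same purpose.

```python
def answer_line(value):
    """Build the full message so that print receives exactly one argument."""
    return "The answer is {0}".format(value)


message = answer_line(2 * 2)
# A single parenthesized argument is valid in both 2.x and 3.x
print(message)
```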
Integrated Development Environments (IDEs)
There are many editors and IDEs that you can use to edit Python
PyDev (plugin for Eclipse)
Python Tools for Visual Studio
PyCharm
IPython notebook
Emacs
Vim
You can find more IDEs on
https://wiki.python.org/moin/IntegratedDevelopmentEnvironments
IPython: An Interactive Computing and
Development Environment
IPython Basics: Prompt
$ ipython --pylab
Python 2.7.6 | 64-bit | (default, Jun 4 2014, 16:42:26)
Type "copyright", "credits" or "license" for more information.
IPython 2.1.0 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.
Using matplotlib backend: MacOSX
In [1]: 3 + 4
Out[1]: 7
In [2]: data = {i : randn() for i in range(8)}
In [3]: data
Out[3]:
{0: 0.36680003627745555,
1: 0.5231034512314581,
2: 0.6300895261779402,
3: -0.9115682057027865,
4: -1.7244460134107902,
5: 0.3829479256814315,
6: 0.4718660373870812,
7: -0.23438875074129756}
In [4]: data[3]
Out[4]: -0.9115682057027865
IPython Basics: Tab Completion
In [7]: da<Tab>
data
date2num
datestr2num
datetime
datetime64
datetime_as_string
datetime_data
In [7]: data
Out[7]:
{0: 0.0016908926460949773,
1: 0.39596065989527957,
2: -0.9295711814640477,
3: 2.1076302341719058,
4: -0.6391315204450737,
5: 1.7496783252859787,
6: -0.5307855278794061,
7: 0.38045583368270064}
IPython Basics: Introspection
Using a question mark (?) before or after a variable will display some
general information about the object:
In [3]: b?
Type:        list
String form: [1, 2, 3, 45]
Length:      4
Docstring:
list() -> new empty list
list(iterable) -> new list initialized from iterable's items
? can also be used before or after a function name
IPython Basics: Introspection
? has a final usage, which is for searching the IPython namespace in a
manner similar to the standard UNIX or Windows command line:
In [4]: import numpy as np
In [5]: np.*load*?
np.load
np.loads
np.loadtxt
np.pkgload
IPython Basics: The %run Command
Any file can be run as a Python program inside the environment of your
IPython session using the %run command
# ipython_script_test.py
def my_function(x, y, z):
    return (x + y) / z

aa = 5

%run ipython_script_test
print aa
print my_function(3.0, 4, 5)
5
1.4
IPython Basics: The %paste Command
The %paste command executes the code copied to the clipboard while
preserving its indentation
The following code will not work if simply pasted, because the blank line
ends the if block at the interactive prompt, so y = 8 is executed
regardless of the condition:
x = 5
y = 7
if (x > 5):
    x += 1

    y = 8
>>> x = 5
y = 7
if (x > 5):
    x += 1

    y = 8
>>> ... ... >>> >>>
>>> y
8
>>> %paste
x = 5
y = 7
if (x > 5):
    x += 1

    y = 8
## -- End pasted text --
>>> y
7
>>>
IPython Basics: Interacting with the OS
IPython provides very strong integration with the operating system shell:
Command                 Description
output = !cmd args      Run cmd and store the stdout in output
%alias alias_name cmd   Define an alias for a system (shell) command
%bookmark               Utilize IPython's directory bookmarking system
%cd directory           Change system working directory to passed directory
%pwd                    Return the current system working directory
%dirs                   Return a list containing the current directory stack
%dhist                  Print the history of visited directories
%env                    Return the system environment variables as a dict
IPython Basics: IPython GUI
Starting an IPython GUI:
ipython qtconsole --pylab=inline
IPython Basics: IPython Notebook
Starting the IPython notebook server:
ipython notebook --pylab=inline
Python Language
Python as a Calculator: Basic Math
Python can be used as a basic calculator
Addition and subtraction
print 2 + 4
print 8.1 - 5
6
3.1
Multiplication
print 5 * 4
print 3.1 * 2
20
6.2
Source: John Kitchin <[email protected]> http://kitchingroup.cheme.cmu.edu/pycse/pycse.html
Python as a Calculator: Basic Math
Integer division is not the same as float division
Float division
print 4.0 / 2.0
print 1.0/3.1
2.0
0.322580645161
Integer division
print 4 / 2
print 1/3
2
0
Be careful when performing integer division: in Python 2, dividing two
integers discards the fractional part of the result
Source: John Kitchin <[email protected]> http://kitchingroup.cheme.cmu.edu/pycse/pycse.html
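To avoid surprises with integer division, the intent can be made explicit. The sketch below uses `//` for floor division and converts one operand to `float` for true division. The variable names are illustrative, and the code is written to behave the same under Python 2.7 and Python 3.

```python
numerator, denominator = 1, 3

# Floor division: explicit about discarding the fractional part
floor_result = numerator // denominator

# True division: convert one operand to float first
float_result = float(numerator) / denominator

# Equivalent: write one operand as a float literal
mixed_result = numerator / 3.0

print(floor_result)
print(round(float_result, 4))
```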
Python as a Calculator: Basic Math
Exponentiation
print 3. ** 2
print 3**2
print 2 ** 0.5
9.0
9
1.41421356237
Source: John Kitchin <[email protected]> http://kitchingroup.cheme.cmu.edu/pycse/pycse.html
Advanced Mathematical Operations
Some more advanced mathematical operations require the numpy package
Square Root
import numpy as np
print np.sqrt(2)
1.41421356237
Source: John Kitchin <[email protected]> http://kitchingroup.cheme.cmu.edu/pycse/pycse.html
Exponential and logarithmic functions
Exponential
import numpy as np
print np.exp(1)
2.71828182846
Logarithm
import numpy as np
print np.log(10)  # natural logarithm (base e)
print np.log10(10)  # base 10
2.30258509299
1.0
Source: John Kitchin <[email protected]> http://kitchingroup.cheme.cmu.edu/pycse/pycse.html
Variable Assignment
The equal sign (=) is used to assign a value to a variable
width = 20
height = 5 * 9
width * height
900
Python Language: Types
Boolean
Python has a built-in boolean type:
print width == 20
print width == 30
True
False
Strings
Strings can be enclosed in single quotes or double quotes
Single quotes
'Hello World'
'Hello World'
'Isn\'t it nice to have a computer that talks to you?'
"Isn't it nice to have a computer that talks to you?"
Double quotes
"Hello World"
'Hello World'
"Isn't it nice to have a computer that talks to you?"
"Isn't it nice to have a computer that talks to you?"
Strings
You can concatenate strings with the + sign:
"Hello" + "World"
'HelloWorld'
aa = "Hello"
bb = "World"
aa + bb
'HelloWorld'
Strings
Strings are immutable:
aa = "Hello" + "World"
print aa
aa[5] = 'R'
HelloWorld
Traceback (most recent call last):
  File "<ipython-input-4454-3668d02c561e>", line 3, in <module>
    aa[5] = 'R'
TypeError: 'str' object does not support item assignment
Strings
You can use triple quotes for strings that span multiple lines
print """\
Hello
    World"""
Hello
    World
Triple quotes are often used to provide function documentation
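A minimal sketch of a triple-quoted string used as function documentation (a docstring) follows. The function `rectangle_area` is a hypothetical example. The docstring is stored on the function's `__doc__` attribute, which is also what `help()` and IPython's `?` display.

```python
def rectangle_area(width, height):
    """Return the area of a rectangle.

    width and height are expected to be numbers in the same unit.
    """
    return width * height


print(rectangle_area(20, 45))
# First line of the docstring, read back from the function object
print(rectangle_area.__doc__.splitlines()[0])
```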
Strings
Strings can be indexed (subscripted), with the first character having index 0
mystring = "Hello World"
print mystring[0]
print mystring[6:10]
H
Worl
There is no separate character type. A character is simply a string of size
one
Lists
Lists are a compound data type in Python
A list is written as comma-separated values (items) between
square brackets
A list may contain items of different types
squares = [1, 4, 9, 16, 25]
squares
[1, 4, 9, 16, 25]
Lists
Lists can be indexed like strings
squares = [1, 4, 9, 16, 25]
print squares[1]
print squares[-3]
print squares[-3:]
4
9
[9, 16, 25]
Lists are mutable (unlike strings)
letters = ['a', 'b', 'c', 'd', 'e', 'f', 'g']
print letters
letters[2:5] = ['C', 'D', 'E']  # replace some values
print letters
letters[2:5] = []  # now remove them
print letters
['a', 'b', 'c', 'd', 'e', 'f', 'g']
['a', 'b', 'C', 'D', 'E', 'f', 'g']
['a', 'b', 'f', 'g']
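The indexing and slicing rules for lists can be summarized in a short sketch. The variable names are illustrative: negative indexes count from the end, a slice may take a step, and a full slice makes a shallow copy that can be changed without touching the original list.

```python
letters = ['a', 'b', 'c', 'd', 'e', 'f', 'g']

last_item = letters[-1]      # negative indexes count from the end
last_three = letters[-3:]    # slice from the third-to-last item onward
every_other = letters[::2]   # a slice can take a step

shallow_copy = letters[:]    # full slice: a new list with the same items
shallow_copy[0] = 'A'        # changes the copy only

print(last_three)
print(every_other)
print(letters[0])
```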
Lists
Lists can be used as stacks:
stack = [3, 4, 5]
stack.append(6)
stack.append(7)
stack
[3, 4, 5, 6, 7]
stack.pop()
7
stack
[3, 4, 5, 6]
Tuples
A tuple is like a list, but it is written as comma-separated values without
square brackets (optionally enclosed in parentheses).
Tuples are immutable; you cannot reassign their items.
a = 3, 4, 5, [7, 8], 'cat'
print a[0], a[-1]
a[-1] = 'dog'
3 cat
Traceback (most recent call last):
  File "<ipython-input-4538-8e67474f43ae>", line 3, in <module>
    a[-1] = 'dog'
TypeError: 'tuple' object does not support item assignment
Source: John Kitchin <[email protected]> http://kitchingroup.cheme.cmu.edu/pycse/pycse.html
Sets
A set is an unordered collection with no duplicate elements
basket = ['apple', 'orange', 'apple', 'pear', 'orange', 'banana']
fruit = set(basket)  # create a set without duplicates
fruit
{'apple', 'banana', 'orange', 'pear'}
Dictionaries
A dictionary can be thought of as an unordered set of key: value pairs
phone_list = {'jack': 4123324098, 'jill': 4120294139}
phone_list
{'jack': 4123324098, 'jill': 4120294139}
phone_list['rodrigo'] = 4120293473
phone_list
{'jack': 4123324098, 'jill': 4120294139, 'rodrigo': 4120293473}
You can access all the keys and values of a dictionary:
print phone_list.keys()
print phone_list.values()
['rodrigo', 'jill', 'jack']
[4120293473, 4120294139, 4123324098]
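A few other common dictionary operations are sketched below, reusing the phone numbers from the slide. Sorting the keys is a choice made here to give a predictable order, since the dictionary itself is treated as unordered.

```python
phone_list = {'jack': 4123324098, 'jill': 4120294139}

has_jack = 'jack' in phone_list          # membership test on the keys
missing = phone_list.get('rodrigo')      # None instead of a KeyError
fallback = phone_list.get('rodrigo', 0)  # or a default of your choice

# Iterate in a predictable order by sorting the keys
for name in sorted(phone_list):
    print("{0}: {1}".format(name, phone_list[name]))

del phone_list['jill']                   # remove an entry
remaining = sorted(phone_list)
```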
Python Language: Control Structures
Control Flows
if statements
x = 42
if x > 10:
    print x
else:
    print 10
42
for statements
words = ['cat', 'window', 'defenestrate']
for w in words:
    print w, len(w)
cat 3
window 6
defenestrate 12
a = ['Mary', 'had', 'a', 'little', 'lamb']
for i in range(len(a)):
    print i, a[i]
0 Mary
1 had
2 a
3 little
4 lamb
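When both the position and the item are needed, the built-in `enumerate` is a common alternative to looping over `range(len(a))`. The sketch below produces the same pairs. The `pairs` list is only there to make the result easy to inspect.

```python
a = ['Mary', 'had', 'a', 'little', 'lamb']

pairs = []
for i, word in enumerate(a):
    # enumerate yields (index, item) tuples, starting at index 0
    pairs.append((i, word))
    print("{0} {1}".format(i, word))
```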
Python Language: Functions
Defining Functions
You can create functions using the keyword def
import numpy as np

def f(x):
    return x ** 3 - np.log(x)
print f(3)
print f(5.1)
25.9013877113
131.02175946
Source: John Kitchin <[email protected]> http://kitchingroup.cheme.cmu.edu/pycse/pycse.html
Defining Functions
Functions can receive more than one argument
def func(x, y):
    "return product of x and y"
    return x * y
print func(2, 3)
6
Source: John Kitchin <[email protected]> http://kitchingroup.cheme.cmu.edu/pycse/pycse.html
Functions - Optional and Keyword Arguments
You can create default values for arguments:
def func(a, n=2):
    "compute the nth power of a"
    return a ** n
# three different ways to call the function
print func(2)
print func(2, 3)
print func(2, n=4)
4
8
16
Source: John Kitchin <[email protected]> http://kitchingroup.cheme.cmu.edu/pycse/pycse.html
Functions - Optional and Keyword Arguments
Defining a function with two optional arguments
def func(a=1, n=2):
    "compute the nth power of a"
    return a ** n
# three different ways to call the function
print func()
print func(2, 4)
print func(n=4, a=2)
1
16
16
Source: John Kitchin <[email protected]> http://kitchingroup.cheme.cmu.edu/pycse/pycse.html
Functions - Optional and Keyword Arguments
A function can accept an arbitrary number of positional arguments
with the *args syntax:
def func(*args):
    sum = 0
    for arg in args:
        sum += arg
    return sum
print func(1, 2, 3, 4)
10
Source: John Kitchin <[email protected]> http://kitchingroup.cheme.cmu.edu/pycse/pycse.html
Functions - Optional and Keyword Arguments
A function can accept an arbitrary number of keyword
arguments with the **kwargs syntax:
def func(**kwargs):
    for kw in kwargs:
        print '{0} = {1}'.format(kw, kwargs[kw])
func(t1=6, color='blue')
color = blue
t1 = 6
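The two syntaxes can be combined in one signature, and the same stars unpack a list or a dictionary at the call site. This is a sketch with a hypothetical function named `describe`. Keyword names are sorted so that the result does not depend on dictionary ordering.

```python
def describe(*args, **kwargs):
    """Return a summary string of the positional and keyword arguments."""
    parts = [str(arg) for arg in args]
    # Sort keyword names so the output order is deterministic
    for name in sorted(kwargs):
        parts.append("{0}={1}".format(name, kwargs[name]))
    return ", ".join(parts)


summary = describe(1, 2, t1=6, color='blue')
print(summary)

# The same stars unpack a list and a dict when calling a function
unpacked = describe(*[1, 2], **{'n': 4})
print(unpacked)
```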
Lambda Functions
You can define "lambda" functions, which are also known as inline or
anonymous functions.
The syntax is lambda var: f(var)
print map(lambda x: x ** 2, [0, 1, 2])
[0, 1, 4]
Source: John Kitchin <[email protected]> http://kitchingroup.cheme.cmu.edu/pycse/pycse.html
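Lambda functions are often passed as short arguments to other functions. A common case, sketched here with illustrative data, is the `key` argument of `sorted`. Wrapping `map` in `list` is an addition that keeps the result a list under Python 3 as well.

```python
words = ['cat', 'window', 'defenestrate']

# Sort by word length, longest first, using a lambda as the sort key
by_length_desc = sorted(words, key=lambda w: len(w), reverse=True)

# list(...) makes the result a list in both Python 2 and Python 3
squares = list(map(lambda x: x ** 2, [0, 1, 2]))

print(by_length_desc)
print(squares)
```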
Nested Functions
You can nest functions inside of functions
def wrapper(x):
    a = 4
    def func(x, a):
        return a * x
    return func(x, a)
print wrapper(5)
20
Source: John Kitchin <[email protected]> http://kitchingroup.cheme.cmu.edu/pycse/pycse.html
Functional Programming Tools
filter
def f(x):
    return x % 3 == 0 or x % 5 == 0
filter(f, range(2, 25))
[3, 5, 6, 9, 10, 12, 15, 18, 20, 21, 24]
map
def cube(x): return x * x * x
map(cube, range(10))
[0, 1, 8, 27, 64, 125, 216, 343, 512, 729]
reduce
def add(x, y): return x + y
reduce(add, range(10))
45
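The same three tools can be written so that they behave identically under Python 2.7 and Python 3. The sketch assumes two adjustments: `reduce` is imported from `functools` (available there since Python 2.6), and `filter` and `map` are wrapped in `list` because Python 3 returns iterators. The function name `divisible_by_3_or_5` is illustrative.

```python
from functools import reduce  # also a builtin in Python 2


def divisible_by_3_or_5(x):
    return x % 3 == 0 or x % 5 == 0


def cube(x):
    return x * x * x


def add(x, y):
    return x + y


multiples = list(filter(divisible_by_3_or_5, range(2, 25)))
cubes = list(map(cube, range(10)))
total = reduce(add, range(10))

print(multiples)
print(cubes)
print(total)
```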
List Comprehensions
List comprehensions provide a shortcut to create lists from existing
structures:
squares = [x ** 2 for x in range(10)]
print squares
[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
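Comprehensions can also filter items and combine several loops. The sketch below, with illustrative names, shows a condition, a dictionary comprehension (available from Python 2.7), and a nested loop.

```python
# Keep only the squares of even numbers
even_squares = [x ** 2 for x in range(10) if x % 2 == 0]

# Dictionary comprehension: map each word to its length
word_lengths = {w: len(w) for w in ['cat', 'window', 'defenestrate']}

# Two loops and a condition: all index pairs with x < y
pairs = [(x, y) for x in range(3) for y in range(3) if x < y]

print(even_squares)
print(pairs)
```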
Python Language: Class System
Class Objects
Class objects support two kinds of operations: attribute references and
instantiation.
class MyClass:
    """A simple example class"""
    i = 12345
    def f(self):
        return 'hello world'
x = MyClass()
print x.i
print x.f()
12345
hello world
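Both kinds of operation can be seen in a slightly larger sketch. The class `Counter` and its members are hypothetical names chosen for illustration. `Counter.step` is an attribute reference on the class object, and `Counter(...)` is instantiation, which runs `__init__` to give each instance its own state.

```python
class Counter(object):
    """A small counter used only to illustrate class objects."""
    step = 1  # class attribute, shared by all instances

    def __init__(self, start=0):
        # instance attribute, set separately for every new object
        self.value = start

    def increment(self):
        self.value += self.step
        return self.value


first = Counter()            # instantiation with the default start
second = Counter(start=10)   # instantiation with an explicit start
first.increment()
first.increment()
second.increment()

print(Counter.step)          # attribute reference on the class object
print(first.value)
print(second.value)
```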
Exercises