Unit - III
Data Manipulation using Python
This unit covers techniques for effectively loading, storing, and manipulating in-memory
data in Python.
The topic is very broad: datasets can come from a wide range of sources and in a wide
range of formats, including collections of documents, collections of images, collections of
sound clips, collections of numerical measurements, etc.
Despite this variety, it helps to think of all data fundamentally as arrays of numbers.
For example, images, particularly digital images, can be thought of as simply two-
dimensional arrays of numbers representing pixel brightness across the area.
Sound clips can be thought of as one-dimensional arrays of intensity versus time.
Text can be converted in various ways into numerical representations, perhaps binary
digits.
For this reason, efficient storage and manipulation of numerical arrays is absolutely
fundamental to the process of doing data science.
Python has specialized tools for handling such numerical arrays: the NumPy package and
the Pandas package.
The Basics of NumPy Arrays
A few categories of basic array manipulations are:
Attributes of arrays: Determining the size, shape, memory consumption, and data types
of arrays
Indexing of arrays: Getting and setting the value of individual array elements
Slicing of arrays: Getting and setting smaller subarrays within a larger array
Reshaping of arrays: Changing the shape of a given array
Joining and splitting of arrays: Combining multiple arrays into one, and splitting one
array into many
NumPy Array Attributes
Here we define three random arrays: a one-dimensional, a two-dimensional, and a three-
dimensional array.
We use NumPy's random number generator, seeded with a set value to ensure that the
same random arrays are generated each time this code is run:
import numpy as np
rng = np.random.default_rng(seed=1701) # seed for reproducibility
x1 = rng.integers(10, size=6) # One-dimensional array
x2 = rng.integers(10, size=(3, 4)) # Two-dimensional array
x3 = rng.integers(10, size=(3, 4, 5)) # Three-dimensional array
Each array has attributes including ndim (the number of dimensions), shape (the size of
each dimension), size (the total number of elements), and dtype (the type of each element):
print("x3 ndim: ", x3.ndim)
print("x3 shape:", x3.shape)
print("x3 size: ", x3.size)
print("dtype: ", x3.dtype)
Output:
x3 ndim: 3
x3 shape: (3, 4, 5)
x3 size: 60
dtype: int64
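The memory-related attributes mentioned earlier can be inspected in the same way. The following is a minimal sketch; it uses an illustrative array named demo (not one of the arrays above) with an explicit int64 dtype so that the byte counts do not depend on the platform's default integer type.

```python
import numpy as np

# Illustrative array; the explicit dtype keeps the sizes predictable
demo = np.zeros((3, 4, 5), dtype=np.int64)

print("itemsize:", demo.itemsize, "bytes")  # size of a single element
print("nbytes:  ", demo.nbytes, "bytes")    # total size of the array data

# nbytes is simply itemsize multiplied by the number of elements
print(demo.nbytes == demo.itemsize * demo.size)
```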
Array Indexing: Accessing Single Elements
In a one-dimensional array, the ith value (counting from zero) can be accessed by
specifying the desired index in square brackets, just as with Python lists:
x1
array([9, 4, 0, 3, 8, 6])
x1[0]
9
x1[4]
8
To index from the end of the array, you can use negative indices:
x1[-1]
6
x1[-2]
8
In a multidimensional array, items can be accessed using a comma-separated (row,
column) tuple:
x2
array([[3, 1, 3, 7],
[4, 0, 2, 3],
[0, 0, 6, 9]])
x2[0, 0]
3
x2[2, 0]
0
x2[2, -1]
9
Values can also be modified using any of the preceding index notation:
x2[0, 0] = 12
x2
array([[12, 1, 3, 7],
[4, 0, 2, 3],
[0, 0, 6, 9]])
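Unlike Python lists, NumPy arrays have a fixed type. This means, for example, that if you
insert a floating-point value into an integer array, the value is silently truncated: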
x1[0] = 3.14159 # this will be truncated!
x1
array([3, 4, 0, 3, 8, 6])
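If the fractional part matters, one option is to convert the array to a floating-point type before assigning. The sketch below uses illustrative names (ints, floats) rather than the arrays defined above.

```python
import numpy as np

ints = np.array([9, 4, 0], dtype=np.int64)
ints[0] = 3.14159             # stored as 3: the array can only hold integers

floats = ints.astype(float)   # astype returns a new array with the new type
floats[0] = 3.14159           # stored as given, since the array now holds floats

print(ints)
print(floats)
```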
Array Slicing: Accessing Subarrays
Just as we can use square brackets to access individual array elements, we can also use
them to access subarrays with the slice notation, marked by the colon (:) character.
The NumPy slicing syntax follows that of the standard Python list; to access a slice of an
array x, use this:
x[start:stop:step]
If any of these are unspecified, they default to the values start=0, stop=<size of
dimension>, step=1.
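A minimal sketch of how these defaults behave, using an illustrative array named seq. Note that when step is negative, the defaults for start and stop are swapped, which is what makes reversing an array convenient.

```python
import numpy as np

seq = np.arange(10)      # the integers 0 through 9

evens = seq[::2]         # start and stop omitted: every second element
tail = seq[-3:]          # stop omitted: the last three elements
backward = seq[5::-2]    # negative step: walk backward starting at index 5

print(evens)             # [0 2 4 6 8]
print(tail)              # [7 8 9]
print(backward)          # [5 3 1]
```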
One-Dimensional Subarrays
x1
array([3, 4, 0, 3, 8, 6])
x1[:3] # first three elements
array([3, 4, 0])
x1[1:4] # middle subarray
array([4, 0, 3])
x1[1::2] # every second element, starting at index 1
array([4, 3, 6])
x1[::-1] # all elements, reversed
array([6, 8, 3, 0, 4, 3])
Multidimensional Subarrays
x2
array([[12, 1, 3, 7],
[4, 0, 2, 3],
[0, 0, 6, 9]])
x2[:2, :3] # first two rows & three columns
array([[12, 1, 3],
[ 4, 0, 2]])
x2[:3, ::2] # three rows, every second column
array([[12, 3],
[ 4, 2],
[ 0, 6]])
x2[::-1, ::-1] # all rows & columns, reversed
array([[ 9, 6, 0, 0],
[ 3, 2, 0, 4],
[ 7, 3, 1, 12]])
Accessing array rows and columns
One commonly needed routine is accessing single rows or columns of an array.
This can be done by combining indexing and slicing, using an empty slice marked by a
single colon (:):
x2[:, 0] # first column of x2
array([12, 4, 0])
x2[0, :] # first row of x2
array([12, 1, 3, 7])
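Unlike Python list slices, which are copies, NumPy array slices are views of the original
array data. Let's extract a 2×2 subarray from x2; modifying it, as done next, also changes
x2 itself: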
x2_sub = x2[:2, :2]
print(x2_sub)
[[12 1]
[ 4 0]]
x2_sub[0, 0] = 99
print(x2_sub)
[[99 1]
[ 4 0]]
print(x2)
[[99 1 3 7]
[ 4 0 2 3]
[ 0 0 6 9]]
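Creating Copies of Arrays
Because slices share data with the original array, it is sometimes useful to copy the data
explicitly. This can be done with the copy method; modifying the copy then leaves the
original array untouched: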
x2_sub_copy = x2[:2, :2].copy()
print(x2_sub_copy)
[[99 1]
[ 4 0]]
x2_sub_copy[0, 0] = 42
print(x2_sub_copy)
[[42 1]
[ 4 0]]
print(x2)
[[99 1 3 7]
[ 4 0 2 3]
[ 0 0 6 9]]
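Whether two arrays share data can also be checked directly. The sketch below uses np.shares_memory on illustrative arrays (base, view, and duplicate are names chosen for this example).

```python
import numpy as np

base = np.arange(12).reshape(3, 4)
view = base[:2, :2]               # a slice refers to base's data
duplicate = base[:2, :2].copy()   # an independent copy

print(np.shares_memory(base, view))       # True
print(np.shares_memory(base, duplicate))  # False

view[0, 0] = 99        # visible through base
duplicate[0, 1] = -1   # not visible through base
print(base[0])         # [99  1  2  3]
```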
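Reshaping of Arrays
The reshape method changes the shape of an array; the size of the original array must
match the size of the reshaped array. For example, this puts the numbers 1 through 9 in a
3×3 grid: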
grid = np.arange(1, 10).reshape(3, 3)
print(grid)
[[1 2 3]
[4 5 6]
[7 8 9]]
x = np.array([1, 2, 3])
x.reshape((1, 3)) # row vector via reshape
array([[1, 2, 3]])
x.reshape((3, 1)) # column vector via reshape
array([[1],
[2],
[3]])
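Two related conveniences are worth knowing, sketched here with illustrative names: np.newaxis inside a slice produces the same row and column vectors, and passing -1 as one dimension asks reshape to infer that dimension from the array's size.

```python
import numpy as np

vec = np.array([1, 2, 3])

row = vec[np.newaxis, :]     # same result as vec.reshape((1, 3))
col = vec[:, np.newaxis]     # same result as vec.reshape((3, 1))

flat = np.arange(12)
inferred = flat.reshape(3, -1)   # -1 is inferred as 4, since 3 * 4 == 12

print(row.shape, col.shape, inferred.shape)   # (1, 3) (3, 1) (3, 4)
```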
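Concatenation of Arrays
np.concatenate takes a tuple or list of arrays as its first argument and joins them along
an existing axis: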
x = np.array([1, 2, 3])
y = np.array([3, 2, 1])
np.concatenate([x, y])
array([1, 2, 3, 3, 2, 1])
z = np.array([99, 99, 99])
print(np.concatenate([x, y, z]))
[ 1 2 3 3 2 1 99 99 99]
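The same function works on two-dimensional arrays, where the axis argument selects the direction of the join. For arrays of mixed dimensions, np.vstack and np.hstack are often clearer. A minimal sketch with illustrative arrays:

```python
import numpy as np

grid = np.array([[1, 2, 3],
                 [4, 5, 6]])

stacked_rows = np.concatenate([grid, grid])          # axis=0 by default
stacked_cols = np.concatenate([grid, grid], axis=1)  # join side by side

with_header = np.vstack([np.array([9, 9, 9]), grid])    # add a row on top
with_extra = np.hstack([grid, np.array([[99], [99]])])  # add a column at the right

print(stacked_rows.shape, stacked_cols.shape)   # (4, 3) (2, 6)
print(with_header.shape, with_extra.shape)      # (3, 3) (2, 4)
```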
Splitting of Arrays
The opposite of concatenation is splitting, which is implemented by the
functions np.split, np.hsplit, and np.vsplit. Each takes a list of indices giving the split
points; N split points produce N + 1 subarrays:
x = [1, 2, 3, 99, 99, 3, 2, 1]
x1, x2, x3 = np.split(x, [3, 5])
print(x1, x2, x3)
[1 2 3] [99 99] [3 2 1]
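np.vsplit and np.hsplit apply the same idea to two-dimensional arrays, splitting along rows and columns respectively. A minimal sketch with an illustrative 4×4 array:

```python
import numpy as np

grid = np.arange(16).reshape(4, 4)

upper, lower = np.vsplit(grid, [2])   # split between rows 1 and 2
left, right = np.hsplit(grid, [2])    # split between columns 1 and 2

print(upper)
print(lower)
print(left)
print(right)
```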
Aggregation
NumPy has fast built-in aggregation functions for working on arrays.
Function Name    Description
np.sum           Compute sum of elements
np.prod          Compute product of elements
np.mean          Compute mean of elements
np.std           Compute standard deviation
np.var           Compute variance
np.min           Find minimum value
np.max           Find maximum value
np.argmin        Find index of minimum value
np.argmax        Find index of maximum value
np.median        Compute median of elements
np.percentile    Compute rank-based statistics of elements
np.any           Evaluate whether any elements are true
np.all           Evaluate whether all elements are true
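By default these functions aggregate over the whole array. Most of them also accept an axis argument, which names the dimension to be collapsed. A minimal sketch on an illustrative 2×3 array:

```python
import numpy as np

M = np.array([[1, 2, 3],
              [4, 5, 6]])

print(M.sum())          # 21: aggregate over every element
print(M.min(axis=0))    # [1 2 3]: collapse the rows, one minimum per column
print(M.max(axis=1))    # [3 6]: collapse the columns, one maximum per row
print(np.argmax(M))     # 5: position of the maximum in the flattened array
```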
Example: What Is the Average Height of US Presidents?
This data is available in the file president_heights.csv, which is a simple comma-
separated list of labels and values:
!head -4 data/president_heights.csv
order,name,height(cm)
1,George Washington,189
2,John Adams,170
3,Thomas Jefferson,189
import pandas as pd
data = pd.read_csv('data/president_heights.csv')
heights = np.array(data['height(cm)'])
print(heights)
[189 170 189 163 183 171 185 168 173 183 173 173 175 178 183 193 178 173
174 183 183 168 170 178 182 180 183 178 182 188 175 179 183 193 182 183
177 185 188 188 182 185]
print("Mean height: ", heights.mean())
print("Standard deviation:", heights.std())
print("Minimum height: ", heights.min())
print("Maximum height: ", heights.max())
Mean height: 179.738095238
Standard deviation: 6.93184344275
Minimum height: 163
Maximum height: 193
print("25th percentile: ", np.percentile(heights, 25))
print("Median: ", np.median(heights))
print("75th percentile: ", np.percentile(heights, 75))
25th percentile: 174.25
Median: 182.0
75th percentile: 183.0
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn; seaborn.set() # set plot style
plt.hist(heights)
plt.title('Height Distribution of US Presidents')
plt.xlabel('height (cm)')
plt.ylabel('number');
Computation on NumPy Arrays:
Universal Functions
Computation on NumPy arrays can be very fast, or it can be very slow.
The key to making it fast is to use vectorized operations, generally implemented through
NumPy's universal functions (ufuncs).
The following table lists the arithmetic operators implemented in NumPy:
Operator    Equivalent ufunc    Description
+           np.add              Addition (e.g., 1 + 1 = 2)
-           np.subtract         Subtraction (e.g., 3 - 2 = 1)
-           np.negative         Unary negation (e.g., -2)
*           np.multiply         Multiplication (e.g., 2 * 3 = 6)
/           np.divide           Division (e.g., 3 / 2 = 1.5)
//          np.floor_divide     Floor division (e.g., 3 // 2 = 1)
**          np.power            Exponentiation (e.g., 2 ** 3 = 8)
%           np.mod              Modulus/remainder (e.g., 9 % 4 = 1)
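The operators in the table apply element by element and give the same result as calling the ufunc directly. A minimal sketch with an illustrative array named vals:

```python
import numpy as np

vals = np.arange(4)    # 0, 1, 2, 3

# The operator and the ufunc it wraps produce identical results
same = np.array_equal(vals + 5, np.add(vals, 5))
print(same)            # True

# Operators can be combined, following Python's usual precedence rules
combined = -(0.5 * vals + 1) ** 2
print(combined)        # values -1, -2.25, -4, -6.25
```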
NumPy provides a large number of useful ufuncs, and some of the most useful for the
data scientist are the trigonometric functions.
Another common type of operation available as NumPy ufuncs is exponentials and
logarithms.
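A short sketch of both families, using illustrative inputs. The trigonometric results are computed to machine precision, so values that should be exactly zero may show up as very small numbers.

```python
import numpy as np

theta = np.linspace(0, np.pi, 3)   # the angles 0, pi/2, pi
sines = np.sin(theta)              # approximately 0, 1, 0
cosines = np.cos(theta)            # approximately 1, 0, -1

powers = np.array([1.0, 2.0, 3.0])
doubled = np.exp2(powers)          # 2 ** powers
recovered = np.log2(doubled)       # base-2 logarithm undoes exp2
decades = np.log10([1, 10, 100])   # base-10 logarithm

print(sines, cosines)
print(doubled, recovered, decades)
```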
Broadcasting
Another means of vectorizing operations is to use NumPy's broadcasting functionality.
Broadcasting is simply a set of rules for applying binary ufuncs (e.g., addition,
subtraction, multiplication, etc.) on arrays of different sizes.
import numpy as np
a = np.array([0, 1, 2])
b = np.array([5, 5, 5])
a + b
array([5, 6, 7])
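Broadcasting allows these types of binary operations to be performed on arrays of
different sizes. For example, we can just as easily add a scalar (think of it as a zero-
dimensional array) to an array: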
a + 5
array([5, 6, 7])
M = np.ones((3, 3))
M
array([[ 1., 1., 1.],
[ 1., 1., 1.],
[ 1., 1., 1.]])
M + a
array([[ 1., 2., 3.],
[ 1., 2., 3.],
[ 1., 2., 3.]])
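The rules compare the two shapes dimension by dimension, starting from the last one: a missing leading dimension is treated as 1, a dimension of size 1 is stretched to match the other array, and any other mismatch is an error. The sketch below, with illustrative arrays, shows both operands being stretched and a case that cannot be broadcast.

```python
import numpy as np

row = np.arange(3)                    # shape (3,)
col = np.arange(3)[:, np.newaxis]     # shape (3, 1)

# Both operands are stretched to the common shape (3, 3)
table = row + col
print(table)
# [[0 1 2]
#  [1 2 3]
#  [2 3 4]]

# Shapes (3, 2) and (3,) disagree in the last dimension (2 vs. 3),
# and neither is 1, so NumPy raises a ValueError
try:
    np.ones((3, 2)) + np.arange(3)
    compatible = True
except ValueError:
    compatible = False
print(compatible)                     # False
```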
Comparisons, Masks, and Boolean Logic
Boolean masks are used to examine and manipulate values within NumPy arrays.
Masking comes up when you want to extract, modify, count, or otherwise manipulate
values in an array based on some criterion.
For example, you might wish to count all values greater than a certain value, or perhaps
remove all outliers that are above some threshold.
In NumPy, Boolean masking is often the most efficient way to accomplish these types of
tasks.
NumPy also implements comparison operators such as < (less than) and > (greater than)
as element-wise ufuncs.
The result of these comparison operators is always an array with a Boolean data type.
All six of the standard comparison operations are available:
x = np.array([1, 2, 3, 4, 5])
x < 3 # less than
array([ True, True, False, False, False], dtype=bool)
x > 3 # greater than
array([False, False, False, True, True], dtype=bool)
x <= 3 # less than or equal
array([ True, True, True, False, False], dtype=bool)
x >= 3 # greater than or equal
array([False, False, True, True, True], dtype=bool)
x != 3 # not equal
array([ True, True, False, True, True], dtype=bool)
x == 3 # equal
array([False, False, True, False, False], dtype=bool)
It is also possible to do an element-wise comparison of two arrays, and to include
compound expressions:
(2 * x) == (x ** 2)
array([False, True, False, False, False], dtype=bool)
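A summary of the comparison operators and their equivalent ufuncs is shown here:
Operator    Equivalent ufunc
==          np.equal
!=          np.not_equal
<           np.less
<=          np.less_equal
>           np.greater
>=          np.greater_equal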
Working with Boolean Arrays
Given a Boolean array, there are a host of useful operations you can do.
rng = np.random.RandomState(0)
x = rng.randint(10, size=(3, 4))
print(x)
[[5 0 3 3]
[7 9 3 5]
[2 4 7 6]]
Counting entries
To count the number of True entries in a Boolean array, np.count_nonzero is useful:
# how many values less than 6?
np.count_nonzero(x < 6)
8
np.sum(x < 6)
8
# are there any values greater than 8?
np.any(x > 8)
True
# are there any values less than zero?
np.any(x < 0)
False
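Boolean operators
NumPy overloads Python's bitwise operators & (and), | (or), ^ (xor), and ~ (not) as
element-wise ufuncs, so comparisons can be combined into compound conditions. The
parentheses around each comparison are required, because these operators bind more
tightly than the comparisons. The example assumes that inches is a one-dimensional array
holding the daily rainfall, in inches, for each day of one year: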
print("Number days without rain: ", np.sum(inches == 0))
print("Number days with rain: ", np.sum(inches != 0))
print("Days with more than 0.5 inches:", np.sum(inches > 0.5))
print("Rainy days with < 0.2 inches :", np.sum((inches > 0) &
(inches < 0.2)))
Number days without rain: 215
Number days with rain: 150
Days with more than 0.5 inches: 37
Rainy days with < 0.2 inches : 75
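The same pattern can be tried on a small array. The values below are hypothetical, made up for this sketch (they are not the rainfall data used above), and the name sample_inches is likewise illustrative.

```python
import numpy as np

# Hypothetical daily rainfall values, in inches
sample_inches = np.array([0.0, 0.1, 0.0, 0.7, 0.3, 0.05])

# & : both conditions must hold (rainy, but less than 0.2 inches)
light = np.sum((sample_inches > 0) & (sample_inches < 0.2))

# | : either condition may hold (dry, or more than 0.5 inches)
dry_or_heavy = np.sum((sample_inches == 0) | (sample_inches > 0.5))

# ~ : negate a condition (not dry)
not_dry = np.sum(~(sample_inches == 0))

print(light, dry_or_heavy, not_dry)   # 2 3 4
```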
Boolean Arrays as Masks
A more powerful pattern is to use Boolean arrays as masks, to select particular subsets of
the data themselves.
Returning to our x array from before, suppose we want an array of all values in the array
that are less than, say, 5:
x
array([[5, 0, 3, 3],
[7, 9, 3, 5],
[2, 4, 7, 6]])
x < 5
array([[False, True, True, True],
[False, False, True, False],
[ True, True, False, False]], dtype=bool)
Now to select these values from the array, we can simply index on this Boolean array;
this is known as a masking operation:
x[x < 5]
array([0, 3, 3, 3, 2, 4])
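A mask can feed directly into an aggregation, and it can also appear on the left-hand side of an assignment to modify only the selected values. A minimal sketch that rebuilds the same array explicitly (the names grid_vals, small, and capped are illustrative):

```python
import numpy as np

grid_vals = np.array([[5, 0, 3, 3],
                      [7, 9, 3, 5],
                      [2, 4, 7, 6]])

# Aggregate only the values that satisfy the condition
small = grid_vals[grid_vals < 5]
print(small.mean())          # 2.5

# Assign through a mask: cap every value above 6 at 6
capped = grid_vals.copy()    # work on a copy so grid_vals is unchanged
capped[capped > 6] = 6
print(capped)
```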
Fancy Indexing
We saw how to access and modify portions of arrays using simple indices (e.g., arr[0]),
slices (e.g., arr[:5]), and Boolean masks (e.g., arr[arr > 0]).
Here, we'll look at another style of array indexing, known as fancy indexing.
Fancy indexing is like the simple indexing we've already seen, but we pass arrays of
indices in place of single scalars.
This allows us to very quickly access and modify complicated subsets of an array's
values.
Exploring Fancy Indexing
Fancy indexing is conceptually simple: it means passing an array of indices to access
multiple array elements at once.
For example, consider the following array:
import numpy as np
rand = np.random.RandomState(42)
x = rand.randint(100, size=10)
print(x)
[51 92 14 71 60 20 82 86 74 74]
Suppose we want to access three different elements. We could do it like this:
[x[3], x[7], x[2]]
[71, 86, 14]
Alternatively, we can pass a single list or array of indices to obtain the same result:
ind = [3, 7, 2]
x[ind]
array([71, 86, 14])
When using fancy indexing, the shape of the result reflects the shape of the index
arrays rather than the shape of the array being indexed:
ind = np.array([[3, 7],
[4, 5]])
x[ind]
array([[71, 86],
[60, 20]])
Fancy indexing also works in multiple dimensions. Consider the following array:
X = np.arange(12).reshape((3, 4))
X
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])
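As with standard indexing, the first index array refers to rows and the second to columns, and the two are paired up element by element. If one of them is turned into a column vector, the pairing follows the broadcasting rules and yields a two-dimensional result. A minimal sketch (the names row, col, pairs, and block are chosen for this example):

```python
import numpy as np

X = np.arange(12).reshape((3, 4))
row = np.array([0, 1, 2])
col = np.array([2, 1, 3])

# Paired indices: picks X[0, 2], X[1, 1], X[2, 3]
pairs = X[row, col]
print(pairs)                 # [ 2  5 11]

# Broadcast indices: every row in `row` combined with every column in `col`
block = X[row[:, np.newaxis], col]
print(block)
# [[ 2  1  3]
#  [ 6  5  7]
#  [10  9 11]]
```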
Combined Indexing
For even more powerful operations, fancy indexing can be combined with the other
indexing schemes we've seen:
print(X)
[[ 0 1 2 3]
[ 4 5 6 7]
[ 8 9 10 11]]
We can combine fancy and simple indices:
X[2, [2, 0, 1]]
array([10, 8, 9])
We can also combine fancy indexing with slicing:
X[1:, [2, 0, 1]]
array([[ 6, 4, 5],
[10, 8, 9]])
And we can combine fancy indexing with masking:
row = np.array([0, 1, 2])
mask = np.array([1, 0, 1, 0], dtype=bool)
X[row[:, np.newaxis], mask]
array([[ 0, 2],
[ 4, 6],
[ 8, 10]])
Common uses of fancy indexing include:
1. Selecting Random Points
2. Modifying Values
3. Binning Data
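The three uses listed above can be sketched as follows. All data, names, and the seed are illustrative assumptions for this example. The binning step uses np.searchsorted and np.add.at, which is one possible approach rather than the only one.

```python
import numpy as np

rng = np.random.default_rng(0)     # illustrative seed

# 1. Selecting random points: pick 20 distinct rows out of 100
points = rng.normal(size=(100, 2))
chosen = rng.choice(points.shape[0], size=20, replace=False)
selection = points[chosen]         # shape (20, 2)

# 2. Modifying values: assign to several positions at once
values = np.arange(10)
values[[8, 1, 4]] = 99

# With repeated indices, np.add.at accumulates once per occurrence
tally = np.zeros(5, dtype=int)
np.add.at(tally, [0, 0, 2], 1)     # tally becomes [2 0 1 0 0]

# 3. Binning data: find the bin of each sample, then count per bin
samples = np.array([-1.5, -0.2, 0.3, 0.4, 1.7])
edges = np.array([-2, -1, 0, 1, 2])
bin_index = np.searchsorted(edges, samples)
counts = np.zeros_like(edges)
np.add.at(counts, bin_index, 1)    # counts becomes [0 1 1 2 1]

print(selection.shape, values, tally, counts)
```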
Structured Data:
NumPy's Structured Arrays
While our data can often be well represented by a homogeneous array of values,
sometimes this is not the case.
This section demonstrates the use of NumPy's structured arrays and record arrays, which
provide efficient storage for compound, heterogeneous data.
While the patterns shown here are useful for simple operations, scenarios like this often
lend themselves to the use of Pandas DataFrames.
import numpy as np
name = ['Alice', 'Bob', 'Cathy', 'Doug']
age = [25, 45, 37, 19]
weight = [55.0, 85.5, 68.0, 61.5]
x = np.zeros(4, dtype=int)
# Use a compound data type for structured arrays
data = np.zeros(4, dtype={'names':('name', 'age', 'weight'),
'formats':('U10', 'i4', 'f8')})
print(data.dtype)
[('name', '<U10'), ('age', '<i4'), ('weight', '<f8')]
Now that we've created an empty container array, we can fill it with our lists of values:
data['name'] = name
data['age'] = age
data['weight'] = weight
print(data)
[('Alice', 25, 55.0) ('Bob', 45, 85.5) ('Cathy', 37, 68.0)
('Doug', 19, 61.5)]
The handy thing with structured arrays is that you can now refer to values either by index
or by name:
# Get all names
data['name']
array(['Alice', 'Bob', 'Cathy', 'Doug'],
dtype='<U10')
# Get first row of data
data[0]
('Alice', 25, 55.0)
# Get the name from the last row
data[-1]['name']
'Doug'
Creating Structured Arrays
Structured array data types can be specified in a number of ways.
Earlier, we saw the dictionary method:
np.dtype({'names':('name', 'age', 'weight'),
'formats':('U10', 'i4', 'f8')})
dtype([('name', '<U10'), ('age', '<i4'), ('weight', '<f8')])
For clarity, numerical types can be specified using Python types or NumPy dtypes
instead:
np.dtype({'names':('name', 'age', 'weight'),
'formats':((np.str_, 10), int, np.float32)})
dtype([('name', '<U10'), ('age', '<i8'), ('weight', '<f4')])
A compound type can also be specified as a list of tuples:
np.dtype([('name', 'S10'), ('age', 'i4'), ('weight', 'f8')])
dtype([('name', 'S10'), ('age', '<i4'), ('weight', '<f8')])
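Once a structured array is filled, its fields can be combined with Boolean masking, and the array can be viewed as a record array so that fields become attributes. The sketch below rebuilds the earlier data under the illustrative name people, using the list-of-tuples dtype with a Unicode name field.

```python
import numpy as np

people = np.zeros(4, dtype=[('name', 'U10'), ('age', 'i4'), ('weight', 'f8')])
people['name'] = ['Alice', 'Bob', 'Cathy', 'Doug']
people['age'] = [25, 45, 37, 19]
people['weight'] = [55.0, 85.5, 68.0, 61.5]

# Masking on one field, then selecting another: names of people under 30
young = people[people['age'] < 30]['name']
print(young)                     # ['Alice' 'Doug']

# Fields behave like ordinary arrays, so aggregations work as usual
print(people['weight'].mean())   # 67.5

# A record array exposes the fields as attributes instead of dictionary keys
records = people.view(np.recarray)
print(records.age)               # [25 45 37 19]
```

Attribute access on a record array is convenient, but it adds some overhead compared with looking fields up by name.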