
Thursday, August 16, 2012

Kernel Density Estimation with scipy

This post continues the previous one, where we saw how to fit two types of distribution functions (Normal and Rayleigh). This time we will see how to use Kernel Density Estimation (KDE) to estimate the probability density function. KDE is a non-parametric technique for density estimation in which a known density function (the kernel) is averaged across the observed data points to create a smooth approximation. Given the non-parametric nature of KDE, the estimator has no fixed functional form; it depends on all the data points used for the estimation. Let's see the snippet:
from scipy.stats.kde import gaussian_kde
from scipy.stats import norm
from numpy import linspace,hstack
from pylab import plot,show,hist

# creating data with two peaks
sampD1 = norm.rvs(loc=-1.0,scale=1,size=300)
sampD2 = norm.rvs(loc=2.0,scale=0.5,size=300)
samp = hstack([sampD1,sampD2])

# obtaining the pdf (my_pdf is a function!)
my_pdf = gaussian_kde(samp)

# plotting the result
x = linspace(-5,5,100)
plot(x,my_pdf(x),'r') # distribution function
hist(samp,normed=1,alpha=.3) # histogram
show()
The result should be as follows:
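By default gaussian_kde selects its bandwidth with Scott's rule. Here is a small sketch of overriding this, reusing samp and x from the snippet above (bw_method is the keyword exposed by recent scipy versions; it also accepts 'silverman' or a scalar factor):
my_pdf_narrow = gaussian_kde(samp, bw_method=0.1)         # small bandwidth: bumpier estimate
my_pdf_smooth = gaussian_kde(samp, bw_method='silverman') # rule-of-thumb bandwidth
plot(x,my_pdf_narrow(x),'g',x,my_pdf_smooth(x),'b')
show()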

Friday, July 20, 2012

Distribution fitting with scipy

Distribution fitting is the procedure of selecting a statistical distribution that best fits a dataset generated by some random process. In this post we will see how to fit a distribution using the techniques implemented in the Scipy library.
This is the first snippet:
from scipy.stats import norm
from numpy import linspace
from pylab import plot,show,hist,figure,title

# picking 150 samples from a normal distribution
# with mean 0 and standard deviation 1
samp = norm.rvs(loc=0,scale=1,size=150) 

param = norm.fit(samp) # distribution fitting

# now, param[0] and param[1] are the mean and 
# the standard deviation of the fitted distribution
x = linspace(-5,5,100)
# fitted distribution
pdf_fitted = norm.pdf(x,loc=param[0],scale=param[1])
# original distribution
pdf = norm.pdf(x)

title('Normal distribution')
plot(x,pdf_fitted,'r-',x,pdf,'b-')
hist(samp,normed=1,alpha=.3)
show()
The result should be as follows:


In the code above a dataset of 150 samples has been created using a normal distribution with mean 0 and standard deviation 1, then a fitting procedure has been applied to the data. In the figure we can see the original distribution (blue curve) and the fitted distribution (red curve), and we can observe that they are very similar.
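To put a number on the quality of the fit, here is a hedged sketch using the Kolmogorov-Smirnov test from scipy.stats (not part of the original snippet; kstest compares the sample against the fitted CDF):
from scipy.stats import kstest
D, pvalue = kstest(samp, 'norm', args=param)
print('KS statistic: %s, p-value: %s' % (D, pvalue))
A large p-value means we cannot reject the hypothesis that the sample comes from the fitted Normal distribution.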
Let's do the same with a Rayleigh distribution:
from scipy.stats import norm,rayleigh

samp = rayleigh.rvs(loc=5,scale=2,size=150) # samples generation

param = rayleigh.fit(samp) # distribution fitting

x = linspace(5,13,100)
# fitted distribution
pdf_fitted = rayleigh.pdf(x,loc=param[0],scale=param[1])
# original distribution
pdf = rayleigh.pdf(x,loc=5,scale=2)

title('Rayleigh distribution')
plot(x,pdf_fitted,'r-',x,pdf,'b-')
hist(samp,normed=1,alpha=.3)
show()
The resulting plot:


As expected, the two distributions are very close.
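As a side note, if the location of the distribution is known in advance, scipy can hold it fixed during the fit. A small sketch using the floc keyword of the fit method (assuming a scipy version that supports frozen parameters):
param_fixed = rayleigh.fit(samp, floc=5) # loc frozen at 5, only scale is estimated
print('estimated scale with loc frozen at 5: %s' % param_fixed[1])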

Sunday, July 8, 2012

Color quantization

The aim of color clustering is to produce a small set of representative colors that captures the color properties of an image. Using the small set of colors found by the clustering, a quantization process can be applied to the image to produce a new version of the image that has been "simplified," both in colors and shapes.
In this post we will see how to use the K-Means algorithm to perform color clustering and how to apply the quantization. Let's see the code:
from pylab import imread,imshow,figure,show,subplot
from numpy import reshape,uint8,flipud
from scipy.cluster.vq import kmeans,vq

img = imread('clearsky.jpg')

# reshaping the pixels matrix
pixel = reshape(img,(img.shape[0]*img.shape[1],3)).astype(float) # kmeans works on floating point values

# performing the clustering
centroids,_ = kmeans(pixel,6) # six colors will be found
# quantization
qnt,_ = vq(pixel,centroids)

# reshaping the result of the quantization
centers_idx = reshape(qnt,(img.shape[0],img.shape[1]))
clustered = uint8(centroids[centers_idx]) # cast back to 0-255 integers for imshow

figure(1)
subplot(211)
imshow(flipud(img))
subplot(212)
imshow(flipud(clustered))
show()
The result should be as follows:


We have the original image on the top and the quantized version on the bottom. We can see that the image on the bottom has only six colors. Now, we can plot the colors found with the clustering in the RGB space with the following code:
# visualizing the centroids into the RGB space
from mpl_toolkits.mplot3d import Axes3D
fig = figure(2)
ax = fig.gca(projection='3d')
ax.scatter(centroids[:,0],centroids[:,1],centroids[:,2],c=centroids/255.,s=100)

show()
And this is the result:
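As a quick check, we can also count how many pixels were assigned to each of the six colors. A small sketch reusing the qnt labels computed above:
from numpy import bincount
counts = bincount(qnt)
for i in range(len(counts)):
    print('cluster %d: %d pixels' % (i, counts[i]))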

Thursday, April 5, 2012

K-means clustering with scipy

K-means clustering is a method for finding clusters and cluster centers in a set of unlabeled data. Intuitively, we might think of a cluster as comprising a group of data points whose inter-point distances are small compared with the distances to points outside the cluster. Given an initial set of K centers, the K-means algorithm alternates between two steps:
  • for each center we identify the subset of training points (its cluster) that are closer to it than to any other center;
  • the means of each feature for the data points in each cluster are computed, and this mean vector becomes the new center for that cluster.
These two steps are iterated until the centers no longer move or the assignments no longer change, as in the from-scratch sketch below. Then, a new point x can be assigned to the cluster of the closest prototype.
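Before turning to scipy, here is a minimal numpy-only sketch of this loop (kmeans_sketch is a made-up name for illustration; scipy's implementation is more careful about initialization and convergence):
from numpy.random import rand

def kmeans_sketch(data, K, iterations=20):
    centers = data[:K].copy() # naive initialization: the first K points
    for _ in range(iterations):
        # step 1: assign each point to the closest center
        d = ((data[:,None,:] - centers[None,:,:])**2).sum(axis=2)
        labels = d.argmin(axis=1)
        # step 2: move each center to the mean of its cluster
        for k in range(K):
            if (labels == k).any():
                centers[k] = data[labels == k].mean(axis=0)
    return centers, labels

centers, labels = kmeans_sketch(rand(100,2), 2)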
The Scipy library provides a good implementation of the K-Means algorithm. Let's see how to use it:
from pylab import plot,show
from numpy import vstack,array
from numpy.random import rand
from scipy.cluster.vq import kmeans,vq

# data generation
data = vstack((rand(150,2) + array([.5,.5]),rand(150,2)))

# computing K-Means with K = 2 (2 clusters)
centroids,_ = kmeans(data,2)
# assign each sample to a cluster
idx,_ = vq(data,centroids)

# some plotting using numpy's logical indexing
plot(data[idx==0,0],data[idx==0,1],'ob',
     data[idx==1,0],data[idx==1,1],'or')
plot(centroids[:,0],centroids[:,1],'sg',markersize=8)
show()
The result should be as follows:


In this case we split the data into 2 clusters: the blue points have been assigned to the first and the red ones to the second. The squares are the centers of the clusters.
Let's try to split the data into 3 clusters:
# now with K = 3 (3 clusters)
centroids,_ = kmeans(data,3)
idx,_ = vq(data,centroids)

plot(data[idx==0,0],data[idx==0,1],'ob',
     data[idx==1,0],data[idx==1,1],'or',
     data[idx==2,0],data[idx==2,1],'og') # third cluster points
plot(centroids[:,0],centroids[:,1],'sm',markersize=8)
show()
This time the result is as follows:

Saturday, March 24, 2012

Linear regression with Numpy

A few posts ago, we saw how to use the function numpy.linalg.lstsq(...) to solve an over-determined system. This time, we'll use it to estimate the parameters of a regression line.
A linear regression line is of the form w1x + w2 = y and it is the line that minimizes the sum of the squares of the distances from each data point to the line. So, given n pairs of data (xi, yi), the parameters that we are looking for are w1 and w2, which minimize the error

$$E(w_1, w_2) = \sum_{i=1}^{n} (w_1 x_i + w_2 - y_i)^2$$

and we can compute the parameter vector w = (w1, w2)^T as the least-squares solution of the following over-determined system

$$\begin{pmatrix} x_1 & 1 \\ \vdots & \vdots \\ x_n & 1 \end{pmatrix} \begin{pmatrix} w_1 \\ w_2 \end{pmatrix} = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}$$
Let's use numpy to compute the regression line:
from numpy import arange,array,ones,linalg
from pylab import plot,show

xi = arange(0,9)
A = array([ xi, ones(9)])
# linearly generated sequence
y = [19, 20, 20.5, 21.5, 22, 23, 23, 25.5, 24]
w = linalg.lstsq(A.T,y)[0] # obtaining the parameters

# plotting the line
line = w[0]*xi+w[1] # regression line
plot(xi,line,'r-',xi,y,'o')
show()
We can see the result in the plot below.



Update: the same result can be achieved using the function scipy.stats.linregress (thanks ianalis!):
from numpy import arange
from pylab import plot,show
from scipy import stats

xi = arange(0,9)
# linearly generated sequence
y = [19, 20, 20.5, 21.5, 22, 23, 23, 25.5, 24]

slope, intercept, r_value, p_value, std_err = stats.linregress(xi,y)

print 'r value', r_value
print 'p value', p_value
print 'standard deviation', std_err

line = slope*xi+intercept
plot(xi,line,'r-',xi,y,'o')
show()

Wednesday, September 14, 2011

Uncertainty principle and spectrogram with pylab

The Fourier transform does not give any information about the time at which a frequency component occurs. One approach that provides time resolution of the spectrum is the Short Time Fourier Transform (STFT). Here a moving window is applied to the signal and the Fourier transform is applied to the signal within the window as the window is moved [Ref]. The choice of the window is very important for the performance of the STFT in practice. Since the STFT simply applies the Fourier transform to pieces of the time series of interest, a drawback is that it cannot resolve events that occur closer together in time than the width of the window; in that case, the limited time resolution of the Fourier transform shows up again. In general, one cannot achieve simultaneous time and frequency resolution because of the Heisenberg uncertainty principle. In the field of particle physics, an elementary particle does not have a precise position and a precise momentum at the same time: the better one determines the position of the particle, the less precisely its momentum is known at that time, and vice versa. For signal processing, this rule translates into the fact that a signal does not simultaneously have a precise location in time and a precise frequency [Ref].

The library pylab provides the function specgram(...) to compute the spectrogram of a signal using the STFT. The following script uses that function to show the spectrogram of a signal with different window sizes:
from scipy.io.wavfile import read
from pylab import plot,show,subplot,specgram

# Open the Homer Simpson voice: "Ummm, Crumbled up cookie things."
# from http://www.thesoundarchive.com/simpsons/homer/mcrumble.wav
rate,data = read('mcrumble.wav') # reading

subplot(411)
plot(range(len(data)),data)
subplot(412)
# NFFT is the number of data points used in each block for the FFT
# and noverlap is the number of points of overlap between blocks
specgram(data, NFFT=128, noverlap=0) # small window
subplot(413)
specgram(data, NFFT=512, noverlap=0) 
subplot(414)
specgram(data, NFFT=1024, noverlap=0) # big window

show()
This image is the result:
The picture shows that, as the number of data points used for each Fourier transform block changes, the spectrogram trades definition in time for definition in frequency: small windows localize events in time but blur the frequencies, while big windows do the opposite.
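To put rough numbers on this trade-off, a small sketch reusing the rate read from the wav file above (these are the standard STFT resolution formulas: a block of NFFT samples spans NFFT/rate seconds and separates frequencies rate/NFFT Hz apart):
for NFFT in (128, 512, 1024):
    print('NFFT=%d: time resolution %.4f s, frequency resolution %.1f Hz'
          % (NFFT, NFFT/float(rate), rate/float(NFFT)))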

Thursday, September 8, 2011

Sound Synthesis

Physically, sound is an oscillation of a mechanical medium that makes the surrounding air also oscillate and transport the sound as a compression wave. Mathematically, the oscillations can be described as

$$y(t) = \sin(2 \pi f t)$$

where t is the time and f the frequency of the oscillation. Sound on a computer is a sequence of numbers and in this post we will see how to generate a musical tone with numpy and how to write it to a wav file. Each musical note vibrates at a particular frequency and the following script contains a function to generate a note (note(...)); we'll use this function to generate the A tone, creating an oscillation with f = 440 Hz.
from scipy.io.wavfile import write
from numpy import linspace,sin,pi,int16
from pylab import plot,show,axis

# tone synthesis
def note(freq, duration, amp=1, rate=44100):
 t = linspace(0,duration,int(duration*rate))
 data = sin(2*pi*freq*t)*amp
 return data.astype(int16) # two byte integers

# A tone, 2 seconds, 44100 samples per second
tone = note(440,2,amp=10000)

write('440hzAtone.wav',44100,tone) # writing the sound to a file

plot(linspace(0,2,2*44100),tone)
axis([0,0.4,-15000,15000])
show()
Now we can play the file 440hzAtone.wav with an external player. This plot shows a part of the signal generated by the script:
And using the plotSpectrum function defined in a previous post we can verify that 440 Hz is the fundamental frequency of the tone.
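As a further sketch, the note(...) function above can be chained to produce a short sequence of tones (the frequencies below are rounded equal-tempered values for A4, B4 and C5, and melody.wav is just a hypothetical file name):
from numpy import hstack
melody = hstack([note(440,0.5,amp=10000),  # A4
                 note(494,0.5,amp=10000),  # B4
                 note(523,0.5,amp=10000)]) # C5
write('melody.wav',44100,melody)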

Wednesday, August 3, 2011

How to plot the frequency spectrum with scipy

Spectrum analysis is the process of determining the frequency domain representation of a time domain signal and most commonly employs the Fourier transform. The Discrete Fourier Transform (DFT) is used to determine the frequency content of signals and the Fast Fourier Transform (FFT) is an efficient method for calculating the DFT. Scipy implements FFT and in this post we will see a simple example of spectrum analysis:
from numpy import sin, linspace, pi
from pylab import plot, show, title, xlabel, ylabel, subplot
from scipy import fft, arange

def plotSpectrum(y,Fs):
 """
 Plots a Single-Sided Amplitude Spectrum of y(t)
 """
 n = len(y) # length of the signal
 k = arange(n)
 T = n/Fs
 frq = k/T # two sides frequency range
 frq = frq[range(n/2)] # one side frequency range

 Y = fft(y)/n # fft computing and normalization
 Y = Y[range(n/2)]
 
 plot(frq,abs(Y),'r') # plotting the spectrum
 xlabel('Freq (Hz)')
 ylabel('|Y(freq)|')

Fs = 150.0  # sampling rate
Ts = 1.0/Fs # sampling interval
t = arange(0,1,Ts) # time vector

ff = 5 # frequency of the signal
y = sin(2*pi*ff*t)

subplot(2,1,1)
plot(t,y)
xlabel('Time')
ylabel('Amplitude')
subplot(2,1,2)
plotSpectrum(y,Fs)
show()
The program shows the following figure: on top we have a plot of the signal and on the bottom its frequency spectrum.
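If only the dominant frequency is needed rather than the whole plot, here is a compact self-contained sketch using numpy's fft module (same 5 Hz sine as above):
from numpy import sin, linspace, pi, argmax
from numpy.fft import fft, fftfreq

Fs = 150.0
t = linspace(0, 1, int(Fs), endpoint=False)
y = sin(2*pi*5*t)

frq = fftfreq(len(y), d=1.0/Fs) # frequency of each fft bin
Y = fft(y)
print('dominant frequency: %.1f Hz' % abs(frq[argmax(abs(Y))]))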

Thursday, June 2, 2011

Integration of a function using quad from scipy

Example of integration of a function using scipy:
from scipy.integrate import quad

f = lambda x: 4-x*x # function to integrate
p = lambda x: (-x*x*x)/3 + 4*x # the primitive of f

# let's try to integrate f in the interval [0 2]
r = quad(f,0,2)   # integration using quad
r_analytic = p(2.0)-p(0.0) # analytic integration

print 'Results'
print r[0]
print r_analytic
Numerical and analytic result compared:
Results
5.33333333333
5.33333333333
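quad also handles improper integrals. A quick sketch with an infinite upper bound (the integrand is the Gaussian e^(-x^2), whose integral over [0, inf) is exactly sqrt(pi)/2):
from scipy.integrate import quad
from numpy import exp, sqrt, pi, inf

r_inf = quad(lambda x: exp(-x*x), 0, inf)
print(r_inf[0])   # numerical result, about 0.886227
print(sqrt(pi)/2) # exact value for comparison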

Saturday, May 28, 2011

Data fitting using fmin

We have already seen how to find the minimum of a function using fmin; in this example we will see how to use it to fit a set of data with a curve that minimizes an error function:
from pylab import *
from numpy import *
from numpy.random import normal
from scipy.optimize import fmin

# parametric function, x is the independent variable
# and c are the parameters.
# it's a polynomial of degree 2
fp = lambda c, x: c[0]+c[1]*x+c[2]*x*x
real_p = rand(3)

# error function to minimize (sum of the absolute errors)
e = lambda p, x, y: (abs((fp(p,x)-y))).sum()

# generating data with noise
n = 30
x = linspace(0,1,n)
y = fp(real_p,x) + normal(0,0.05,n)

# fitting the data with fmin
p0 = rand(3) # initial parameter value
p = fmin(e, p0, args=(x,y))

print 'estimated parameters: ', p
print 'real parameters: ', real_p

xx = linspace(0,1,n*3)
plot(x,y,'bo', xx,fp(real_p,xx),'g', xx, fp(p,xx),'r')

show()
The following figure will be shown: in green the original curve used to generate the noisy data, in blue the noisy data, and in red the curve found by the minimization process:

The parameters are also printed:
Optimization terminated successfully.
         Current function value: 0.861885
         Iterations: 77
         Function evaluations: 146
estimated parameters:  [ 0.92504602  0.87328979  0.64051926]
real parameters:  [ 0.86284356  0.95994753  0.67643758]
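As a variant, swapping the absolute error for the squared error turns this into a classic least-squares fit; a one-line change reusing fp, p0, x and y from the snippet above:
e2 = lambda p, x, y: ((fp(p,x)-y)**2).sum() # squared error
p2 = fmin(e2, p0, args=(x,y))
print('least-squares parameters: %s' % p2)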

Tuesday, May 17, 2011

How to interpolate a set of points

The purpose of this example is to show how to interpolate a set of points (x,y) using the function interp1d provided by scipy.
import scipy.interpolate as sp
import numpy
import pylab

# 50 points of sin(x) in [0 10]
xx = numpy.linspace(0, 10, 50)
yy = numpy.sin(xx)

# 10 samples of sin(x) in [0 10]
x = numpy.linspace(0, 10, 10)
y = numpy.sin(x)

# interpolation
fl = sp.interp1d(x, y,kind='linear')
fc = sp.interp1d(x, y,kind='cubic')

# fl and fc are the interpolating functions
# defined in the interval [0 10]
# fl uses linear interpolation
# and fc uses cubic interpolation

xnew = numpy.linspace(0, 10, 50)
pylab.subplot(211)
# the real sin(x) function plot
pylab.plot(xx, yy)
pylab.legend(['sin(x)'], loc='best')
pylab.subplot(212)
# the interpolation
pylab.plot(x, y, 'o', xnew, fl(xnew), xnew, fc(xnew))
pylab.legend(['sample', 'linear', 'cubic'], loc='lower left')
pylab.show()
The script will show the following figure:
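One caveat: by default the interpolating functions raise a ValueError when evaluated outside the sampled interval. A small sketch reusing x and y from above (bounds_error and fill_value are keywords of interp1d):
fe = sp.interp1d(x, y, kind='linear', bounds_error=False, fill_value=0.0)
print(fe(15.0)) # prints 0.0 instead of raising an error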