Thursday, August 16, 2012
Kernel Density Estimation with scipy
This post continues the last one, where we saw how to fit two types of distribution functions (Normal and Rayleigh). This time we will see how to use Kernel Density Estimation (KDE) to estimate the probability density function. KDE is a non-parametric technique for density estimation in which a known density function (the kernel) is averaged across the observed data points to create a smooth approximation. Given the non-parametric nature of KDE, the estimator has no fixed functional form but depends on all the data points used for the estimation. Let's see the snippet:
from scipy.stats.kde import gaussian_kde
from scipy.stats import norm
from numpy import linspace,hstack
from pylab import plot,show,hist

# creating data with two peaks
sampD1 = norm.rvs(loc=-1.0,scale=1,size=300)
sampD2 = norm.rvs(loc=2.0,scale=0.5,size=300)
samp = hstack([sampD1,sampD2])

# obtaining the pdf (my_pdf is a function!)
my_pdf = gaussian_kde(samp)

# plotting the result
x = linspace(-5,5,100)
plot(x,my_pdf(x),'r') # distribution function
hist(samp,normed=1,alpha=.3) # histogram
show()
The result should be as follows:
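As a side note, the smoothness of the estimate depends on the kernel bandwidth. A quick sketch of how to experiment with it, reusing the samp and x arrays from the snippet above (the bw_method argument requires SciPy 0.11 or later):
# smaller bandwidth -> wigglier estimate, larger -> smoother
my_pdf_narrow = gaussian_kde(samp, bw_method=0.1)
my_pdf_wide = gaussian_kde(samp, bw_method=0.5)
plot(x,my_pdf_narrow(x),'g')
plot(x,my_pdf_wide(x),'b')
show()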
Friday, July 20, 2012
Distribution fitting with scipy
Distribution fitting is the procedure of selecting a statistical distribution that best fits a dataset generated by some random process. In this post we will see how to fit a distribution using the techniques implemented in the Scipy library.
This is the first snippet:
from scipy.stats import norm
from numpy import linspace
from pylab import plot,show,hist,figure,title

# picking 150 samples from a normal distribution
# with mean 0 and standard deviation 1
samp = norm.rvs(loc=0,scale=1,size=150)

param = norm.fit(samp) # distribution fitting
# now, param[0] and param[1] are the mean and
# the standard deviation of the fitted distribution
x = linspace(-5,5,100)
# fitted distribution
pdf_fitted = norm.pdf(x,loc=param[0],scale=param[1])
# original distribution
pdf = norm.pdf(x)

title('Normal distribution')
plot(x,pdf_fitted,'r-',x,pdf,'b-')
hist(samp,normed=1,alpha=.3)
show()
The result should be as follows:
In the code above a dataset of 150 samples has been created using a normal distribution with mean 0 and standard deviation 1; then a fitting procedure has been applied to the data. In the figure we can see the original distribution (blue curve) and the fitted distribution (red curve), and we can observe that they are really similar. Let's do the same with a Rayleigh distribution:
from scipy.stats import rayleigh
from numpy import linspace
from pylab import plot,show,hist,title

samp = rayleigh.rvs(loc=5,scale=2,size=150) # samples generation

param = rayleigh.fit(samp) # distribution fitting
x = linspace(5,13,100)
# fitted distribution
pdf_fitted = rayleigh.pdf(x,loc=param[0],scale=param[1])
# original distribution
pdf = rayleigh.pdf(x,loc=5,scale=2)

title('Rayleigh distribution')
plot(x,pdf_fitted,'r-',x,pdf,'b-')
hist(samp,normed=1,alpha=.3)
show()
The resulting plot:
As expected, the two distributions are very close.
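To go beyond the visual comparison, we could quantify the goodness of fit with a Kolmogorov-Smirnov test. A minimal sketch, reusing samp and param from the Rayleigh snippet above:
from scipy.stats import kstest
# D is the K-S statistic, pval the p-value; a large p-value means we
# cannot reject the hypothesis that samp follows the fitted distribution
D, pval = kstest(samp, 'rayleigh', args=param)
print 'KS statistic', D
print 'p-value', pval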
Sunday, July 8, 2012
Color quantization
The aim of color clustering is to produce a small set of representative colors that captures the color properties of an image. Using the small set of colors found by the clustering, a quantization process can be applied to the image to find a new version of the image that has been "simplified," both in colors and shapes.
In this post we will see how to use the K-Means algorithm to perform color clustering and how to apply the quantization. Let's see the code:
from pylab import imread,imshow,figure,show,subplot
from numpy import reshape,flipud
from scipy.cluster.vq import kmeans,vq

img = imread('clearsky.jpg')
# reshaping the pixels matrix
pixel = reshape(img,(img.shape[0]*img.shape[1],3))

# performing the clustering
centroids,_ = kmeans(pixel,6) # six colors will be found
# quantization
qnt,_ = vq(pixel,centroids)

# reshaping the result of the quantization
centers_idx = reshape(qnt,(img.shape[0],img.shape[1]))
clustered = centroids[centers_idx]

figure(1)
subplot(211)
imshow(flipud(img))
subplot(212)
imshow(flipud(clustered))
show()
The result should be as follows:
We have the original image on the top and the quantized version on the bottom. We can see that the image on the bottom has only six colors. Now, we can plot the colors found with the clustering in the RGB space with the following code:
# visualizing the centroids into the RGB space
from mpl_toolkits.mplot3d import Axes3D
fig = figure(2)
ax = fig.gca(projection='3d')
ax.scatter(centroids[:,0],centroids[:,1],centroids[:,2],c=centroids/255.,s=100)
show()
And this is the result:
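It can also be instructive to check how many pixels end up in each cluster. A quick sketch, reusing the qnt array computed above:
from numpy import bincount
counts = bincount(qnt) # pixels assigned to each of the six centroids
for i,c in enumerate(counts):
    print 'color', i, ':', c, 'pixels'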
Thursday, April 5, 2012
K-means clustering with scipy
K-means clustering is a method for finding clusters and cluster centers in a set of unlabeled data. Intuitively, we might think of a cluster as comprising a group of data points whose inter-point distances are small compared with the distances to points outside of the cluster. Given an initial set of K centers, the K-means algorithm alternates between the following two steps (a minimal numpy sketch of the iteration follows the list below):
- for each center we identify the subset of training points (its cluster) that is closer to it than any other center;
- the means of each feature for the data points in each cluster are computed, and this mean vector becomes the new center for that cluster.
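Here is the minimal numpy sketch of the two-step iteration mentioned above (illustrative only: naive initialization and a fixed number of rounds, no convergence check):
from numpy import argmin
from numpy.random import rand

def kmeans_sketch(data, k, n_iter=10):
    # naive initialization: take the first k points as centers
    centers = data[:k].copy()
    for _ in range(n_iter):
        # step 1: assign each point to the closest center
        dists = ((data[:, None, :] - centers[None, :, :])**2).sum(axis=2)
        labels = argmin(dists, axis=1)
        # step 2: recompute each center as the mean of its cluster
        for j in range(k):
            if (labels == j).any():
                centers[j] = data[labels == j].mean(axis=0)
    return centers, labels

# example: 100 random 2D points, 2 clusters
centers, labels = kmeans_sketch(rand(100, 2), 2)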
The Scipy library provides a good implementation of the K-Means algorithm. Let's see how to use it:
from pylab import plot,show
from numpy import vstack,array
from numpy.random import rand
from scipy.cluster.vq import kmeans,vq

# data generation
data = vstack((rand(150,2) + array([.5,.5]),rand(150,2)))

# computing K-Means with K = 2 (2 clusters)
centroids,_ = kmeans(data,2)
# assign each sample to a cluster
idx,_ = vq(data,centroids)

# some plotting using numpy's logical indexing
plot(data[idx==0,0],data[idx==0,1],'ob',
     data[idx==1,0],data[idx==1,1],'or')
plot(centroids[:,0],centroids[:,1],'sg',markersize=8)
show()
The result should be as follows:
In this case we split the data into 2 clusters; the blue points have been assigned to the first and the red ones to the second. The squares are the centers of the clusters.
Let's try to split the data into 3 clusters:
# now with K = 3 (3 clusters)
centroids,_ = kmeans(data,3)
idx,_ = vq(data,centroids)

plot(data[idx==0,0],data[idx==0,1],'ob',
     data[idx==1,0],data[idx==1,1],'or',
     data[idx==2,0],data[idx==2,1],'og') # third cluster points
plot(centroids[:,0],centroids[:,1],'sm',markersize=8)
show()
This time the result is as follows:
Saturday, March 24, 2012
Linear regression with Numpy
A few posts ago, we saw how to use the function numpy.linalg.lstsq(...) to solve an over-determined system. This time, we'll use it to estimate the parameters of a regression line.
A linear regression line is of the form w1*x + w2 = y, and it is the line that minimizes the sum of the squares of the distances from each data point to the line. So, given n pairs of data (xi, yi), the parameters that we are looking for are w1 and w2, which minimize the error
E = sum_i (w1*xi + w2 - yi)^2
and we can compute the parameter vector w = (w1, w2)^T as the least-squares solution of the over-determined system A*w = y, where the i-th row of A is (xi, 1) and y is the vector of observed values.
Let's use numpy to compute the regression line:
from numpy import arange,array,ones,linalg
from pylab import plot,show

xi = arange(0,9)
A = array([ xi, ones(9)])
# linearly generated sequence
y = [19, 20, 20.5, 21.5, 22, 23, 23, 25.5, 24]
w = linalg.lstsq(A.T,y)[0] # obtaining the parameters

# plotting the line
line = w[0]*xi+w[1] # regression line
plot(xi,line,'r-',xi,y,'o')
show()
We can see the result in the plot below.
You can find more about data fitting using numpy in the following posts. Update: the same result can be achieved using the function scipy.stats.linregress (thanks ianalis!):
from numpy import arange
from pylab import plot,show
from scipy import stats

xi = arange(0,9)
# linearly generated sequence
y = [19, 20, 20.5, 21.5, 22, 23, 23, 25.5, 24]

slope, intercept, r_value, p_value, std_err = stats.linregress(xi,y)

print 'r value', r_value
print 'p_value', p_value
print 'standard error', std_err

line = slope*xi+intercept
plot(xi,line,'r-',xi,y,'o')
show()
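For completeness, numpy.polyfit returns the same coefficients in a single call. A quick sketch, reusing xi and y from the snippets above:
from numpy import polyfit
w1, w0 = polyfit(xi, y, 1) # degree-1 fit: coefficients from highest degree down
print 'slope', w1
print 'intercept', w0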
Wednesday, September 14, 2011
Uncertainty principle and spectrogram with pylab
The Fourier transform does not give any information on the time at which a frequency component occurs. One approach which can give information on the time resolution of the spectrum is the Short Time Fourier Transform (STFT). Here a moving window is applied to the signal and the Fourier transform is applied to the signal within the window as the window is moved [Ref]. The choice of window is very important with respect to the performance of the STFT in practice. Since the STFT simply applies the Fourier transform to pieces of the time series of interest, a drawback is that it will not be able to resolve events that happen within the width of the window. In this case, the lack of time resolution of the Fourier transform is still present. In general, one cannot achieve arbitrarily fine time and frequency resolution at the same time because of the Heisenberg uncertainty principle. In the field of particle physics, an elementary particle does not have a precise position and momentum at the same time: the better one determines the position of the particle, the less precisely its momentum is known at that time, and vice versa. For signal processing, this rule translates into the fact that a signal does not simultaneously have a precise location in time and a precise frequency [Ref].
The library pylab provides the function specgram(...) to compute the spectrogram of a signal using the STFT. The following script uses that function to show the spectrogram of a signal with different window sizes:
from scipy.io.wavfile import read
from pylab import plot,show,subplot,specgram

# Open the Homer Simpson voice: "Ummm, Crumbled up cookie things."
# from http://www.thesoundarchive.com/simpsons/homer/mcrumble.wav
rate,data = read('mcrumble.wav') # reading

subplot(411)
plot(range(len(data)),data)

subplot(412)
# NFFT is the number of data points used in each block for the FFT
# and noverlap is the number of points of overlap between blocks
specgram(data, NFFT=128, noverlap=0) # small window
subplot(413)
specgram(data, NFFT=512, noverlap=0)
subplot(414)
specgram(data, NFFT=1024, noverlap=0) # big window

show()
This image is the result: the picture shows that, by changing the number of data points used for each Fourier transform block, the spectrogram loses resolution either in frequency or in time.
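The tradeoff can be made concrete: each block of NFFT samples gives a frequency resolution of rate/NFFT Hz and covers a time span of NFFT/rate seconds, so one resolution improves only at the expense of the other. A quick sketch, assuming rate from the script above:
# frequency resolution and block duration for each window size used above
for nfft in (128, 512, 1024):
    print 'NFFT=%d: df=%.1f Hz, block=%.1f ms' % (nfft, float(rate)/nfft, 1000.0*nfft/rate)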
Thursday, September 8, 2011
Sound Synthesis
Physically, sound is an oscillation of a mechanical medium that makes the surrounding air oscillate as well, transporting the sound as a compression wave. Mathematically, the oscillations can be described as
y(t) = A*sin(2*pi*f*t)
where t is the time, A the amplitude, and f the frequency of the oscillation. Sound on a computer is a sequence of numbers, and in this post we will see how to generate a musical tone with numpy and how to write it to a wav file. Each musical note vibrates at a particular frequency, and the following script contains a function to generate a note (note(...)); we'll use this function to generate the A tone, creating an oscillation with f = 440 Hz.
from scipy.io.wavfile import write
from numpy import linspace,sin,pi,int16
from pylab import plot,show,axis

# tone synthesis
def note(freq, duration, amp=1, rate=44100):
    t = linspace(0,duration,duration*rate)
    data = sin(2*pi*freq*t)*amp
    return data.astype(int16) # two byte integers

# A tone, 2 seconds, 44100 samples per second
tone = note(440,2,amp=10000)

write('440hzAtone.wav',44100,tone) # writing the sound to a file

plot(linspace(0,2,2*44100),tone)
axis([0,0.4,15000,-15000])
show()
Now we can play the file 440hzAtone.wav with an external player. This plot shows a part of the signal generated by the script:
And using the plotSpectrum function defined in a previous post we can verify that 440 Hz is the fundamental frequency of the tone.
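The note function also lends itself to quick experiments; for instance, summing three notes approximates an A major chord. This is an illustrative extension of the script above (frequencies rounded to the nearest Hz, file name hypothetical):
# A major chord: A4, C#5, E5; amplitudes kept small so the int16 sum cannot overflow
chord = note(440,2,amp=3000) + note(554,2,amp=3000) + note(659,2,amp=3000)
write('chordA.wav',44100,chord)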
Wednesday, August 3, 2011
How to plot the frequency spectrum with scipy
Spectrum analysis is the process of determining the frequency domain representation of a time domain signal and most commonly employs the Fourier transform. The Discrete Fourier Transform (DFT) is used to determine the frequency content of signals and the Fast Fourier Transform (FFT) is an efficient method for calculating the DFT. Scipy implements FFT and in this post we will see a simple example of spectrum analysis:
from numpy import sin, linspace, pi
from pylab import plot, show, title, xlabel, ylabel, subplot
from scipy import fft, arange

def plotSpectrum(y,Fs):
    """
    Plots a Single-Sided Amplitude Spectrum of y(t)
    """
    n = len(y) # length of the signal
    k = arange(n)
    T = n/Fs
    frq = k/T # two sides frequency range
    frq = frq[range(n/2)] # one side frequency range

    Y = fft(y)/n # fft computing and normalization
    Y = Y[range(n/2)]

    plot(frq,abs(Y),'r') # plotting the spectrum
    xlabel('Freq (Hz)')
    ylabel('|Y(freq)|')

Fs = 150.0  # sampling rate
Ts = 1.0/Fs # sampling interval
t = arange(0,1,Ts) # time vector

ff = 5 # frequency of the signal
y = sin(2*pi*ff*t)

subplot(2,1,1)
plot(t,y)
xlabel('Time')
ylabel('Amplitude')
subplot(2,1,2)
plotSpectrum(y,Fs)
show()
The program shows the following figure: on top, a plot of the signal; on the bottom, the frequency spectrum.
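As a quick sanity check, the dominant frequency can be read directly off the spectrum. A minimal sketch, assuming y, Fs, fft and arange from the script above are still in scope:
from numpy import argmax
n = len(y)
Y = abs(fft(y)/n)[range(n/2)] # one-sided spectrum, as in plotSpectrum
frq = (arange(n)/(n/Fs))[range(n/2)]
print 'dominant frequency:', frq[argmax(Y)], 'Hz' # expected: 5.0 Hz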
Thursday, June 2, 2011
Integration of a function using quad from scipy
Example of integration of a function using scipy:
from scipy.integrate import quad

f = lambda x: 4-x*x # function to integrate
p = lambda x: (-x*x*x)/3 + 4*x # the primitive of f

# let's try to integrate f in the interval [0 2]
r = quad(f,0,2) # integration using quad; r[0] is the value, r[1] the estimated error
r_analytic = p(2.0)-p(0.0) # analytic integration

print 'Results'
print r[0]
print r_analytic
Numerical and analytic result compared:
Results
5.33333333333
5.33333333333
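quad also handles improper integrals with infinite limits. A small sketch integrating a Gaussian, whose exact value over [0, inf) is sqrt(pi)/2:
from scipy.integrate import quad
from numpy import exp, inf, sqrt, pi

r = quad(lambda x: exp(-x*x), 0, inf) # infinite upper limit
print 'numerical:', r[0]
print 'analytic: ', sqrt(pi)/2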
Saturday, May 28, 2011
Data fitting using fmin
We have already seen how to find the minimum of a function using fmin; in this example we will see how to use it to fit a set of data with a curve by minimizing an error function:
from pylab import *
from numpy import *
from numpy.random import normal
from scipy.optimize import fmin

# parametric function, x is the independent variable
# and c are the parameters.
# it's a polynomial of degree 2
fp = lambda c, x: c[0]+c[1]*x+c[2]*x*x
real_p = rand(3)

# error function to minimize
e = lambda p, x, y: (abs((fp(p,x)-y))).sum()

# generating data with noise
n = 30
x = linspace(0,1,n)
y = fp(real_p,x) + normal(0,0.05,n)

# fitting the data with fmin
p0 = rand(3) # initial parameter value
p = fmin(e, p0, args=(x,y))

print 'estimated parameters: ', p
print 'real parameters: ', real_p

xx = linspace(0,1,n*3)
plot(x,y,'bo', xx,fp(real_p,xx),'g', xx, fp(p,xx),'r')
show()
The following figure will be shown: in green the original curve used to generate the noisy data, in blue the noisy data, and in red the curve found by the minimization process:
The parameters will also be printed:
Optimization terminated successfully.
         Current function value: 0.861885
         Iterations: 77
         Function evaluations: 146
estimated parameters:  [ 0.92504602  0.87328979  0.64051926]
real parameters:  [ 0.86284356  0.95994753  0.67643758]
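A variation worth trying is a squared-error cost instead of the absolute-error cost used above, which penalizes large residuals more heavily. A sketch reusing fp, p0, x and y from the script above:
# same fit, but minimizing the sum of squared residuals
e2 = lambda p, x, y: ((fp(p,x)-y)**2).sum()
p2 = fmin(e2, p0, args=(x,y))
print 'squared-error fit: ', p2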
Tuesday, May 17, 2011
How to interpolate a set of points
The purpose of this example is to show how to interpolate a set of points (x,y) using the function interp1d provided by scipy.
import scipy.interpolate as sp
import numpy
import pylab

# 50 points of sin(x) in [0 10]
xx = numpy.linspace(0, 10, 50)
yy = numpy.sin(xx)

# 10 samples of sin(x) in [0 10]
x = numpy.linspace(0, 10, 10)
y = numpy.sin(x)

# interpolation
fl = sp.interp1d(x, y, kind='linear')
fc = sp.interp1d(x, y, kind='cubic')

# fl and fc are the interpolating functions
# defined in the interval [0 10]
# fl uses linear interpolation
# and fc uses cubic interpolation

xnew = numpy.linspace(0, 10, 50)

pylab.subplot(211)
# the real sin(x) function plot
pylab.plot(xx, yy)
pylab.legend(['sin(x)'], loc='best')

pylab.subplot(212)
# the interpolation
pylab.plot(x, y, 'o', xnew, fl(xnew), xnew, fc(xnew))
pylab.legend(['sample', 'linear', 'cubic'], loc='lower left')
pylab.show()
The script will show the following figure:
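To see how much better the cubic interpolant tracks the true function, we can compare maximum absolute errors. A quick sketch reusing the objects defined above:
# maximum absolute error of each interpolant against the true sin(x)
err_l = numpy.max(numpy.abs(fl(xnew) - numpy.sin(xnew)))
err_c = numpy.max(numpy.abs(fc(xnew) - numpy.sin(xnew)))
print 'max linear error:', err_l
print 'max cubic error: ', err_c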