
Friday, June 4, 2021

The Central Limit Theorem, a hands-on introduction

The central limit theorem can be informally summarized in a few words: the sum of x1, x2, ... xn samples from the same distribution is normally distributed, provided that n is big enough and that the distribution has a finite variance. To show this experimentally, let's define a function that sums n samples from the same distribution 100000 times:
import numpy as np
import scipy.stats as sps
import matplotlib.pyplot as plt

def sum_random_variables(*args, sp_distribution, n):
    # returns 100000 sums, each the sum of n samples
    # drawn from sp_distribution with parameters args
    v = [sp_distribution.rvs(*args, size=100000) for _ in range(n)]
    return np.sum(v, axis=0)
This function takes as input the parameters of the distribution, the function that implements the distribution, and n. It returns an array of 100000 elements, where each element is the sum of n samples. Given the Central Limit Theorem, we expect the values in output to be normally distributed if n is big enough. To verify this, let's consider a beta distribution with parameters alpha=1 and beta=2, run our function with increasing n, and plot the histogram of the values in output:
plt.figure(figsize=(9, 3))
N = 5
for n in range(1, N):
    plt.subplot(1, N-1, n)
    s = sum_random_variables(1, 2, sp_distribution=sps.beta, n=n)
    plt.hist(s, density=True)
plt.tight_layout()
On the far left we have the histogram with n=1, the one with n=2 right next to it, and so on until n=4. With n=1 we have the original distribution, which is heavily skewed. With n=2 the distribution is less skewed. When we reach n=4 the distribution is almost symmetrical, resembling a normal distribution.

Let's do the same experiment using a uniform distribution:
plt.figure(figsize=(9, 3))
for n in range(1, N):
    plt.subplot(1, N-1, n)
    s = sum_random_variables(1, 1, sp_distribution=sps.beta, n=n)
    plt.hist(s, density=True)
plt.tight_layout()
Here, for n=2 the distribution is already symmetrical, resembling a triangle, and increasing n further brings it closer to the shape of a Gaussian.

The same behaviour can be shown for discrete distributions. Here's what happens if we use the Bernoulli distribution:
plt.figure(figsize=(9, 3))
for n in range(1, N):
    plt.subplot(1, N-1, n)
    s = sum_random_variables(.5, sp_distribution=sps.bernoulli, n=n)
    plt.hist(s, bins=n+1, density=True, rwidth=.7)
plt.tight_layout()
We see again that for n=2 the distribution starts to be symmetrical and that the shape of a Gaussian is almost clear for n=4.
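To make the comparison with a Gaussian more explicit, we can go back to the uniform case, increase n, and overlay the density of a normal distribution with the same mean and standard deviation as the sums. Here is a minimal sketch (the choice of n=30 and of the number of bins is arbitrary):
s = sum_random_variables(1, 1, sp_distribution=sps.beta, n=30)
plt.hist(s, bins=30, density=True, alpha=.5)
x = np.linspace(s.min(), s.max(), 200)
# normal density with the same mean and standard deviation as the sums
plt.plot(x, sps.norm.pdf(x, loc=s.mean(), scale=s.std()), 'r')
plt.show()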

Tuesday, March 17, 2020

Ridgeline plots in pure matplotlib

A Ridgeline plot (also called Joyplot) allows us to compare several statistical distributions. In this plot each distribution is shown with a density plot, and all the distributions are aligned to the same horizontal axis and, sometimes, presented with a slight overlap.

There are many options to make a Ridgeline plot in Python (joypy being one of them) but I decided to make my own function using matplotlib to have full flexibility and minimal dependencies:
from scipy.stats import gaussian_kde
from scipy.stats import norm
import numpy as np
import matplotlib.pyplot as plt

def ridgeline(data, overlap=0, fill=True, labels=None, n_points=150):
    """
    Creates a standard ridgeline plot.

    data, list of lists.
    overlap, overlap between distributions. 1 max overlap, 0 no overlap.
    fill, matplotlib color to fill the distributions.
    n_points, number of points to evaluate each distribution function.
    labels, values to place on the y axis to describe the distributions.
    """
    if overlap > 1 or overlap < 0:
        raise ValueError('overlap must be in [0 1]')
    xx = np.linspace(np.min(np.concatenate(data)),
                     np.max(np.concatenate(data)), n_points)
    curves = []
    ys = []
    for i, d in enumerate(data):
        pdf = gaussian_kde(d)
        y = i*(1.0-overlap)
        ys.append(y)
        curve = pdf(xx)
        if fill:
            plt.fill_between(xx, np.ones(n_points)*y, 
                             curve+y, zorder=len(data)-i+1, color=fill)
        plt.plot(xx, curve+y, c='k', zorder=len(data)-i+1)
    if labels:
        plt.yticks(ys, labels)
The function takes as input a list of datasets, where each dataset contains the values to derive a single distribution. Each distribution is estimated using Kernel Density Estimation, just as we've seen previously, and plotted with an increasing y offset.

Let's generate data from a few normal distributions with different means and have a look at the output of the function:
data = [norm.rvs(loc=i, scale=2, size=50) for i in range(8)]
ridgeline(data, overlap=.85, fill='y')


Not too bad, we can clearly see that each distribution has a different mean. Let's apply the function to real world data:
import pandas as pd
data_url = 'ftp://aftp.cmdl.noaa.gov/products/trends/co2/co2_weekly_mlo.txt'
co2_data = pd.read_csv(data_url, sep='\s+', comment='#', na_values=-999.99,
                       names=['year', 'month', 'day', 'decimal', 'ppm', 
                       'days', '1_yr_ago',  '10_yr_ago', 'since_1800'])
co2_data = co2_data[co2_data.year >= 2000]
co2_data = co2_data[co2_data.year != 2020]

plt.figure(figsize=(8, 10))
grouped = [(y, g.ppm.dropna().values) for y, g in co2_data.groupby('year')]
years, data = zip(*grouped)
ridgeline(data, labels=years, overlap=.85, fill='tomato')
plt.title('Distribution of CO2 levels per year since 2000',
          loc='left', fontsize=18, color='gray')
plt.gca().spines['left'].set_visible(False)
plt.gca().spines['right'].set_visible(False)
plt.gca().spines['top'].set_visible(False)
plt.xlabel('ppm')
plt.xlim((co2_data.ppm.min(), co2_data.ppm.max()))
plt.ylim((0, 3.1))
plt.grid(zorder=0)
plt.show()


In the snippet above we downloaded the measurements of the concentration of CO2 in the atmosphere (the same data was also used here) and grouped the values by year. Then, we generated a Ridgeline plot that shows the distribution of CO2 levels each year since 2000. We can easily see that the average concentration rose gradually from 370 ppm to 420 ppm over the years observed. We also note that the span of each distribution is approximately 10 ppm.

Sunday, August 11, 2019

Visualizing distributions with scatter plots in matplotlib

Let's say that we want to study the time between the end of a marked point and the next serve in a tennis game. After gathering our data, the first thing we can do is draw a histogram of the variable we are interested in:

import pandas as pd
import matplotlib.pyplot as plt

url = 'https://raw.githubusercontent.com/fivethirtyeight'
url += '/data/master/tennis-time/serve_times.csv'
event = pd.read_csv(url)

plt.hist(event.seconds_before_next_point, bins=10)
plt.xlabel('Seconds before next serve')
plt.show()


The histogram reveals some interesting aspects of the distribution: we can see that the data is slightly skewed to the right and that, on average, the server takes about 20 seconds. However, we can't tell how many times the serve happens before 10 seconds or after 35. Of course, one could increase the number of bins of the histogram, but this would lead to a chart which is not particularly elegant and might hide some other details.
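For instance, this is what the same data looks like with a finer binning (the number of bins here is an arbitrary choice):
plt.hist(event.seconds_before_next_point, bins=50)
plt.xlabel('Seconds before next serve')
plt.show()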

To have a better understanding of the situation we can draw a scatter plot of the variable we are studying:
import numpy as np
from scipy.stats import gaussian_kde

def distribution_scatter(x, symmetric=True, cmap=None, size=None):
    """
    Plot the distribution of x showing all the points.
    The x axis represents the samples in x
    and the y axis is a function of the probability of x
    and a random assignment.
    
    Returns the position on the y axis.
    """
    pdf = gaussian_kde(x)
    w = np.random.rand(len(x))
    if symmetric:
        w = w*2-1
    pseudo_y = pdf(x) * w
    if cmap:
        plt.scatter(x, pseudo_y, c=x, cmap=cmap, s=size)
    else:
        plt.scatter(x, pseudo_y, s=size)
    return pseudo_y


In this chart each sample is represented by a point and the spread of the points in the y direction depends on the probability of occurrence. In this case we can easily see that 4 serves happened before 10 seconds and 3 after 35.
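The chart described above can be obtained with a single call to the function (a minimal usage sketch):
py = distribution_scatter(event.seconds_before_next_point)
plt.xlabel('Seconds before next serve')
plt.show()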

Since we're not really interested in the values on the y axis but only in the spread, we can remove the axis and add a few details about the outliers to enrich the chart:

url = 'https://raw.githubusercontent.com/fivethirtyeight'
url += '/data/master/tennis-time/serve_times.csv'
event = pd.read_csv(url)

plt.figure(figsize=(7, 11))
title = 'Time in seconds between'
title += '\nend of marked point and next serve'
title += '\nat 2015 French Open'
plt.title(title, loc='left', fontsize=18, color='gray')
py = distribution_scatter(event.seconds_before_next_point, cmap='cool');


cut_h = np.percentile(event.seconds_before_next_point, 98)
outliers = event.seconds_before_next_point > cut_h


ha = {True: 'right', False: 'left'}
for x, y, c in zip(event[outliers].seconds_before_next_point,
                   py[outliers],
                   event[outliers].server):
    plt.text(x, y+.0005, c,
             ha=ha[x<0], va='bottom', fontsize=12)

plt.xlabel('Seconds before next serve', fontsize=15)
plt.gca().spines['left'].set_visible(False)
plt.gca().spines['right'].set_visible(False)
plt.gca().spines['top'].set_visible(False)
plt.yticks([])
plt.xticks(np.arange(5, 41, 5))
plt.xlim([5, 40])
plt.show()



Friday, May 17, 2019

Feelings toward immigration of people from other EU Member States in November 2018

In this post we will see a snippet showing how to plot part of the results of the Eurobarometer survey released last March. In particular, we will focus on the responses to the following question:
Please tell me whether the following statement evokes a positive or negative feeling for you: Immigration of people from other EU Member States.
The data from the main spreadsheet, reporting the results country by country, was isolated in a csv file (then uploaded on GitHub) so that it could be easily loaded into Pandas as follows:
import pandas as pd

# github gist
gist = 'https://gist.githubusercontent.com/JustGlowing/'
gist += '2c25b9b153192baf573ce3b744ea6a65/raw/'
gist += '5f3888f7f42caca58b2418ec5822425083b6d559/'
gist += 'immigration_from_EU_eurobarometer_2018.csv'
df = pd.read_csv(gist, index_col=0)
df = df[df.index.map(lambda x: '\n' not in x)]
df.sort_values(by=["Total 'Positive'"], inplace=True)

# from https://ec.europa.eu/eurostat/statistics-explained/index.php
country_names = {'BE' : 'Belgium',
'BG' : 'Bulgaria',
'CZ' : 'Czechia',
'DK' : 'Denmark',
'DE' : 'Germany',
'EE' : 'Estonia',
'IE' : 'Ireland',
'EL' : 'Greece',
'ES' : 'Spain',
'FR' : 'France',
'HR' : 'Croatia',
'IT' : 'Italy',
'CY' : 'Cyprus',
'LV' : 'Latvia',
'LT' : 'Lithuania',
'LU' : 'Luxembourg',
'HU' : 'Hungary',
'MT' : 'Malta',
'NL' : 'Netherlands',
'AT' : 'Austria',
'PL' : 'Poland',
'PT' : 'Portugal',
'RO' : 'Romania',
'SI' : 'Slovenia',
'SK' : 'Slovakia',
'FI' : 'Finland',
'SE' : 'Sweden',
'UK' : 'United Kingdom'}

df.index = df.index.map(country_names.get)
The idea is to create a bar chart with two sides, positive responses on the right and negative on the left. To do this, we can use the function barh, and its left argument can be used to stack the two subsets of responses ("Fairly positive/negative" and "Very positive/negative"). The xticks also need to be adapted to reflect that the left side of the axis doesn't report values below zero. Here's the snippet:
import matplotlib.pyplot as plt
import numpy as np

country_idx = range(len(df))

plt.figure(figsize=(11, 14))
plt.barh(country_idx, df['Fairly positive'],
         color='deepskyblue',label='Fairly positive')
plt.barh(country_idx, df['Very positive'], left=df['Fairly positive'],
         color='dodgerblue', label='Very positive')
plt.barh(country_idx, -df['Fairly negative'],
         color='tomato', label='Fairly negative')
plt.barh(country_idx, -df['Very negative'], left=-df['Fairly negative'],
         color='firebrick', label='Very negative')

plt.yticks(country_idx, df.index)
plt.xlim([-100, 100])
plt.xticks(np.arange(-100, 101, 25), np.abs(np.arange(-100, 101, 25)))
plt.ylim([-.5, len(df)-.5])
title = 'Feelings toward immigration of people from\n'
title += 'other EU Member States in November 2018'
plt.title(title)
xlbl = 'negative            <<<       % responses       >>>            positive'
plt.xlabel(xlbl)
plt.legend(loc='lower right')

bbox_props = dict(fc="white", ec="k", lw=2) 
plt.text(-95, 27, 'twitter: @justglowing \nhttps://glowingpython.blogspot.com',
         ha="left", va="center", size=11, bbox=bbox_props)
plt.show()


From the chart we note that the percentage of positive responses per country is mostly above 50%, while the negative ones reach 50% in only two cases. We also see that Ireland and Sweden are the countries with the most positive responses, while Czechia (yes, that's the Czech Republic :) is the country with the most negative responses, though Cypriots gave a similar number of "Very negative" responses.

Wednesday, November 26, 2014

Comparing strikers statistics

Here we compare the scoring statistics of four of the best strikers of recent football history: Del Piero, Trezeguet, Ronaldo and Vieri. The statistics that we will look at are the scoring trajectory, the scoring rate and the number of appearances.
To compute these values we need to scrape the career statistics (number of goals and appearances per season) from the Wikipedia pages of the players:
from bs4 import BeautifulSoup
from urllib2 import urlopen
import numpy as np

def get_total_goals(url):
    """
    Given the url of a wikipedia page about a football striker
    returns three numpy arrays:
    - years, each element corresponds to a season
    - appearances, contains the number of appearances each season
    - goals, contains the number of goals scored each season
    
    Unfortunately this function is able to parse 
    only the pages of a few strikers.
    """
    soup = BeautifulSoup(urlopen(url).read())
    table = soup.find("table", { "class" : "wikitable" })
    years = []
    apps = []
    goals = []
    for row in table.findAll("tr"):
        cells = row.findAll("td")
        if len(cells) > 1:
            years.append(int(cells[0].text[:4]))
            apps.append(int(cells[len(cells)-2].text))
            goals.append(int(cells[len(cells)-1].text))
    return (np.array(years),
            np.array(apps, dtype='float'),
            np.array(goals))

ronaldo = get_total_goals('http://en.wikipedia.org/wiki/Ronaldo')
vieri = get_total_goals('http://en.wikipedia.org/wiki/Christian_Vieri')
delpiero = get_total_goals('http://en.wikipedia.org/wiki/Alessandro_Del_Piero')
trezeguet = get_total_goals('http://en.wikipedia.org/wiki/David_Trezeguet')
Now we are ready to compute our statistics. For each statistic we will produce an interactive chart using plotly.

Scoring trajectory

import plotly.plotly as py
from plotly.graph_objs import *
py.sign_in("sexyusername", "mypassword")

data = Data([
    Scatter(x=delpiero[0], y=np.cumsum(delpiero[2]),
            name='Del Piero', mode='lines'),
    Scatter(x=trezeguet[0], y=np.cumsum(trezeguet[2]),
            name='Trezeguet', mode='lines'),
    Scatter(x=ronaldo[0], y=np.cumsum(ronaldo[2]),
            name='Ronaldo', mode='lines'),
    Scatter(x=vieri[0], y=np.cumsum(vieri[2]),
            name='Vieri', mode='lines'),
])

layout = Layout(
    title='Scoring Trajectory',
    xaxis=XAxis(title='Year'),
    yaxis=YAxis(title='Cumulative goals'),
    legend=Legend(x=0.0,y=1.0))

fig = Figure(data=data, layout=layout)

py.iplot(fig, filename='cumulative-goals')
The scoring trajectory is given by the yearly cumulative total of goals scored. From the scoring trajectories we can see that Ronaldo was a goal machine from his first professional season and that his worst period was from 1999 to 2001. Del Piero and Trezeguet have the longest careers (and they're still playing!). Vieri had the shortest career, but it's impressive to see that the number of goals he scored increased almost constantly from 1996 to 2004.

Scoring rate

data = Data([
    Bar(
        x=['Ronaldo', 'Vieri', 'Trezeguet', 'Del Piero'],
        y=[np.sum(ronaldo[2])/np.sum(ronaldo[1]), 
           np.sum(vieri[2])/np.sum(vieri[1]),
           np.sum(trezeguet[2])/np.sum(trezeguet[1]),
           np.sum(delpiero[2])/np.sum(delpiero[1])]
    )
])
py.iplot(data, filename='goal-average')
The scoring rate is the number of goals scored divided by the number of appearances. Ronaldo has a terrific 0.67 scoring rate, meaning that, on average, he scored more than three goals every five games. Vieri and Trezeguet have a very similar scoring rate, almost one goal every two games, while Del Piero has 0.40, two goals every five games.

Appearances

data = Data([
    Bar(
        x=['Del Piero', 'Trezeguet', 'Ronaldo', 'Vieri'],
        y=[np.sum(delpiero[1]),
           np.sum(trezeguet[1]),
           np.sum(ronaldo[1]),
           np.sum(vieri[1])]
    )
])
py.iplot(data, filename='appearances')
The number of Del Piero's appearances on a football field is impressive: at the moment I'm writing, he has played 773 games. None of the other players has played even 70% of the games played by the Italian numero 10.

Friday, October 17, 2014

Andrews curves

Andrews curves are a method for visualizing multidimensional data by mapping each observation onto a function. This function is defined as

f_x(\theta) = \frac{x_1}{\sqrt{2}} + x_2 \sin(\theta) + x_3 \cos(\theta) + x_4 \sin(2\theta) + x_5 \cos(2\theta) + \ldots
It has been shown that Andrews curves are able to preserve means, distances (up to a constant) and variances, which means that curves that are close together suggest that the corresponding data points are also close together. Now, we will demonstrate the effectiveness of Andrews curves on the iris dataset (which we already used here). Let's create a function to compute the values of the function given a single sample:
import numpy as np
def andrew_curve4(x,theta):
    # iris has four dimensions
    base_functions = [lambda x : x[0]/np.sqrt(2.), 
                      lambda x : x[1]*np.sin(theta), 
                      lambda x : x[2]*np.cos(theta), 
                      lambda x : x[3]*np.sin(2.*theta)]
    curve = np.zeros(len(theta))
    for f in base_functions:
        curve = curve + f(x)
    return curve
At this point we can load the dataset and plot the curves for a subset of samples:
samples = np.loadtxt('iris.csv', usecols=[0,1,2,3], delimiter=',')
#samples = samples - np.mean(samples)
#samples = samples / np.std(samples)
classes = np.loadtxt('iris.csv', usecols=[4], delimiter=',',dtype=np.str)
theta = np.linspace(-np.pi,np.pi,100)
import pylab as pl
for s in samples[:20]: # setosa
    pl.plot(theta, andrew_curve4(s,theta), 'r')

for s in samples[50:70]: # versicolor
    pl.plot(theta, andrew_curve4(s,theta), 'b')

for s in samples[100:120]: # virginica
    pl.plot(theta, andrew_curve4(s,theta), 'g')

pl.xlim(-np.pi,np.pi)
pl.show()


In the plot above, each color represents a class and we can easily see that the lines representing samples from the same class have similar curves.

Tuesday, April 22, 2014

Parameters selection with Cross-Validation

Most pattern recognition techniques have one or more free parameters, and choosing them for a given classification problem is often not a trivial task. In real applications we only have access to a finite set of examples, usually smaller than we would like, and we need to test our model on samples not seen during the training process. A model that simply memorized the samples it has seen would have a very good score but would definitely fail to predict unseen data. This situation is called overfitting and to avoid it we need to apply an appropriate validation procedure to select the parameters. A tool that can help us solve this problem is Cross-Validation (CV). The idea behind CV is simple: the data are split into train and test sets several consecutive times and the averaged value of the prediction scores obtained with the different sets is the evaluation of the classifier.
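As a minimal illustration of this idea (a sketch that uses the same, now outdated, sklearn API as the rest of this post), cross_val_score performs the splits and returns one score per split:
from sklearn.datasets import load_digits
from sklearn.naive_bayes import BernoulliNB
from sklearn.cross_validation import cross_val_score

digits = load_digits()
scores = cross_val_score(BernoulliNB(), digits.data, digits.target, cv=5)
print(scores)         # one score per split
print(scores.mean())  # the averaged evaluation of the classifier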
Let's see a simple example where a smoothing parameter for a Bayesian classifier is selected using the capabilities of the sklearn library.
To begin, we load one of the test datasets provided by sklearn (the same used here) and hold out 33% of the samples for the final evaluation:
from sklearn.datasets import load_digits
data = load_digits()
from sklearn.cross_validation import train_test_split
X,X_test,y,y_test = train_test_split(data.data,data.target,
                                     test_size=.33,
                                     random_state=1899)
Now, we import the classifier we want to use (a Bernoulli Naive Bayes in this case), specify a set of values for the parameter we want to choose and run a grid search:
import numpy as np
from sklearn.naive_bayes import BernoulliNB
# test the model for alpha = 0.1, 0.2, ..., 1.0
parameters = [{'alpha': np.linspace(0.1, 1, 10)}]

from sklearn.grid_search import GridSearchCV
clf = GridSearchCV(BernoulliNB(), parameters, cv=10, scoring='f1')
clf.fit(X,y) # running the grid search
The grid search has evaluated the classifier for each value specified for the parameter alpha using the CV. We can visualize the results as follows:
from pylab import subplot, plot, show
res = zip(*[(f1m, f1s.std(), p['alpha']) 
            for p, f1m, f1s in clf.grid_scores_])
subplot(2,1,1)
plot(res[2],res[0],'-o')
subplot(2,1,2)
plot(res[2],res[1],'-o')
show()

The plots above show the average score (top) and the standard deviation of the score (bottom) for each value of alpha. Looking at the graphs, it seems plausible that a small alpha could be a good choice.
We can also see that using the alpha value that gave the best results in CV on the test set we held out at the beginning gives results that are similar to the ones obtained during the CV stage:
from sklearn.metrics import f1_score
print 'Best alpha in CV = %0.01f' % clf.best_params_['alpha']
final = f1_score(y_test,clf.best_estimator_.predict(X_test))
print 'F1-score on the final testset: %0.5f' % final
Best alpha in CV = 0.1
F1-score on the final testset: 0.85861

Wednesday, February 26, 2014

Terms selection with chi-square

In Natural Language Processing, the identification of the most relevant terms in a collection of documents is a common task. It can produce meaningful insights about the data and it can also be useful to improve classification performance and computational efficiency. A popular measure of relevance for terms is the χ2 statistic. To compute it we can convert the terms of our document collection into features of a vectorial model, then χ2 can be computed as follows:

\chi^2(f, t) = \frac{N (AD - CB)^2}{(A+B)(A+C)(B+D)(C+D)}

where f is a feature (a term in this case), t is a target variable that we usually want to predict, A is the number of times that f and t co-occur, B is the number of times that f occurs without t, C is the number of times that t occurs without f, D is the number of times neither t nor f occurs, and N is the number of observations.
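Just to make the formula concrete, here is a small sketch that computes the statistic for a single term from the four counts (the counts below are made up for illustration, not taken from the newsgroups data used later):
def chi2_term(A, B, C, D):
    # A: f and t together, B: f without t,
    # C: t without f, D: neither f nor t
    N = float(A + B + C + D)
    return N * (A * D - C * B) ** 2 / ((A + B) * (A + C) * (B + D) * (C + D))

print(chi2_term(30, 10, 20, 940))  # toy counts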

Let's see how χ2 can be used through a simple example. We load some posts from 4 different newsgroup categories using the sklearn interface:
from sklearn.datasets import fetch_20newsgroups
 # newsgroups categories
categories = ['alt.atheism','talk.religion.misc',
              'comp.graphics','sci.space']

posts = fetch_20newsgroups(subset='train', categories=categories,
                           shuffle=True, random_state=42,
                           remove=('headers','footers','quotes'))
From the posts loaded, we build a vectorial model using all the terms in the document collection except the stop words:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(lowercase=True,stop_words='english')
X = vectorizer.fit_transform(posts.data)
Now, X is a document-term matrix where the element Xi,j is the frequency of the term j in the document i. The features are given by the columns of X and we want to compute χ2 between the categories of interest and each feature in order to figure out the most relevant terms. This can be done as follows:
from sklearn.feature_selection import chi2
# compute chi2 for each feature
chi2score = chi2(X,posts.target)[0]
To have a visual insight, we can plot a bar chart where each bar shows the χ2 value computed above:
from pylab import barh,plot,yticks,show,grid,xlabel,figure
figure(figsize=(6,6))
wscores = zip(vectorizer.get_feature_names(),chi2score)
wchi2 = sorted(wscores,key=lambda x:x[1]) 
topchi2 = zip(*wchi2[-25:])
x = range(len(topchi2[1]))
labels = topchi2[0]
barh(x,topchi2[1],align='center',alpha=.2,color='g')
plot(topchi2[1],x,'-o',markersize=2,alpha=.8,color='g')
yticks(x,labels)
xlabel('$\chi^2$')
show()



We can observe that the terms with a high χ2 can be considered relevant for the newsgroup categories we are analyzing. For example, the terms space, nasa and launch can be considered relevant for the group sci.space. The terms god, jesus and atheism can be considered relevant for the groups alt.atheism and talk.religion.misc. And, the terms image, graphics and jpeg can be considered relevant in the category comp.graphics.

Sunday, September 15, 2013

Self Organizing Maps

Notice: For an updated tutorial on how to use minisom refer to the examples in the official documentation.
The Self Organizing Maps (SOM), also known as Kohonen maps, are a type of Artificial Neural Networks able to convert complex, nonlinear statistical relationships between high-dimensional data items into simple geometric relationships on a low-dimensional display. In a SOM the neurons are organized in a bidimensional lattice and each neuron is fully connected to all the source nodes in the input layer. An illustration of the SOM by Haykin (1999) is the following


Each neuron n has an associated vector of weights wn. The process for training a SOM involves stepping through several training iterations until the items in your dataset are learnt by the SOM. For each pattern x one neuron n will "win" (which means that wn is the weights vector most similar to x) and this winning neuron will have its weights adjusted so that it will have a stronger response to the input the next time it sees it (which means that the distance between x and wn will be smaller). As different neurons win for different patterns, their ability to recognize that particular pattern will increase. The training algorithm can be summarized as follows:
  1. Initialize the weights of each neuron.
  2. Initialize t = 0
  3. Randomly pick an input x from the dataset
  4. Determine the winning neuron i as the neuron such that

     i = \arg\min_n \lVert x - w_n \rVert

  5. Adapt the weights of each neuron n according to the following rule

     w_n(t+1) = w_n(t) + \eta(t) \, h_{i,n}(t) \, (x - w_n(t))

  6. Increment t by 1
  7. if t < tmax go to step 3
Here η(t) is called the learning rate and h(i) is called the neighborhood function, which has high values for i and for the neurons close to i on the lattice (a Gaussian centered on i is a good example of a neighborhood function). As t increases, η decreases and h shrinks its spread. This way, at each training step the weights of the neurons close to the winning one are adjusted to have a stronger response to the current pattern. After the training process, the locations of the neurons become ordered and a meaningful coordinate system for the input features is created on the lattice. So, if we consider the coordinates of the associated winning neuron for each pattern, the SOM forms a topographic map of the input patterns.
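To make the update rule above concrete, here is a minimal NumPy sketch of a single training step (a toy illustration, not the actual MiniSom code; the lattice size, the neighborhood width and the learning rate in the usage lines are arbitrary choices):
import numpy as np

def som_step(weights, x, eta, sigma):
    # weights has shape (rows, cols, dim); x has shape (dim,)
    dists = np.linalg.norm(weights - x, axis=2)
    # step 4: coordinates of the winning neuron
    win = np.unravel_index(np.argmin(dists), dists.shape)
    rows, cols = np.indices(dists.shape)
    # Gaussian neighborhood centered on the winner
    h = np.exp(-((rows - win[0])**2 + (cols - win[1])**2) / (2.0 * sigma**2))
    # step 5: move every neuron toward x, proportionally to eta and h
    return weights + eta * h[:, :, None] * (x - weights)

# toy usage: one update of a 5-by-5 lattice of 4-dimensional weights
w = np.random.rand(5, 5, 4)
w = som_step(w, np.random.rand(4), eta=0.5, sigma=1.0)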

MiniSom is a minimalistic and NumPy-based implementation of the SOM. I made it during the experiments for my thesis in order to have a fully hackable SOM algorithm and lately I decided to release it on GitHub. The next part of this post will show how to train MiniSom on the Iris dataset and how to visualize the result. The first step is to import and normalize the data:
from numpy import genfromtxt,array,linalg,zeros,apply_along_axis

# reading the iris dataset in the csv format    
# (downloaded from http://aima.cs.berkeley.edu/data/iris.csv)
data = genfromtxt('iris.csv', delimiter=',',usecols=(0,1,2,3))
# normalization to unity of each pattern in the data
data = apply_along_axis(lambda x: x/linalg.norm(x),1,data)
The snippet above reads the dataset from a CSV file and creates a matrix where each row corresponds to a pattern. In this case, each pattern has 4 dimensions. (Note that only the first 4 columns of the file are used because the fifth column contains the labels.) The training process can be started as follows:
from minisom import MiniSom
### Initialization and training ###
som = MiniSom(7,7,4,sigma=1.0,learning_rate=0.5)
som.random_weights_init(data)
print("Training...")
som.train_random(data,100) # training with 100 iterations
print("\n...ready!")
Now we have a 7-by-7 SOM trained on our dataset. MiniSom uses a Gaussian as neighborhood function and its initial spread is specified with the parameter sigma, while the parameter learning_rate sets the initial learning rate. The training algorithm implemented decreases both parameters as training progresses. This allows rapid initial training of the neural network that is then "fine tuned" as training progresses.
To visualize the result of the training we can plot the average distance map of the weights and the coordinates of the associated winning neuron for each pattern:
from pylab import plot,axis,show,pcolor,colorbar,bone
bone()
pcolor(som.distance_map().T) # distance map as background
colorbar()
# loading the labels
target = genfromtxt('iris.csv',
                    delimiter=',',usecols=(4),dtype=str)
t = zeros(len(target),dtype=int)
t[target == 'setosa'] = 0
t[target == 'versicolor'] = 1
t[target == 'virginica'] = 2
# use different colors and markers for each label
markers = ['o','s','D']
colors = ['r','g','b']
for cnt,xx in enumerate(data):
 w = som.winner(xx) # getting the winner
 # place a marker on the winning position for the sample xx
 plot(w[0]+.5,w[1]+.5,markers[t[cnt]],markerfacecolor='None',
   markeredgecolor=colors[t[cnt]],markersize=12,markeredgewidth=2)
axis([0,som.weights.shape[0],0,som.weights.shape[1]])
show() # show the figure
The result should be like the following:


For each pattern in the dataset the corresponding winning neuron has been marked. Each type of marker represents a class of the iris data (the classes are setosa, versicolor and virginica and they are represented with red, green and blue respectively). The average distance map of the weights is used as background (the values are shown in the colorbar on the right). As expected from previous studies on this dataset, the patterns are grouped according to the class they belong to and a small fraction of Iris virginica is mixed with Iris versicolor.

For a more detailed explanation of the SOM algorithm you can look at its inventor's paper.

Friday, May 3, 2013

A new RefCard from the GlowingPython!

Check out the DZone RefCard from the GlowingPython:


This Refcard is a collection of code examples that introduces the reader to the principal Data Mining tasks using Python. In the RefCard you will find the following contents:
  • How to import and visualize data.
  • How to classify and cluster data.
  • How to discover relationships in the data using regression and correlation measures.
  • How to reduce the dimensionality of the data in order to compress and visualize the information it brings.
  • How to analyze structured data with networkx.
Each topic is covered with code examples based on four of the major Python libraries for data analysis and manipulation: numpy, matplotlib, sklearn and networkx. Here is a preview of the first two pages:


Click on the preview to get the RefCard!

Wednesday, February 6, 2013

Betweenness Centrality

In Network Analysis the identification of important nodes is a common task. There are various centrality measures that we can use, and in this post we will focus on Betweenness Centrality. We will see how this measure is computed and how to use the library networkx in order to create a visualization of the network where the nodes with the highest betweenness are highlighted. The betweenness focuses on the number of visits through the shortest paths. If a walker moves from one node to another node via the shortest path, then the nodes with a large number of visits have a higher centrality. The betweenness centrality is defined as

C_B(v) = \sum_{s \neq v \neq t} \frac{s_v(s,t)}{s(s,t)}

where s(s,t) is the total number of shortest paths from node s to node t and s_v(s,t) is the number of those paths that pass through v.
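To get a feel for this definition, we can compute the betweenness of a tiny toy graph before moving to a real network (a minimal sketch; the path graph is an arbitrary choice, its central nodes lie on most shortest paths and therefore get the highest values):
import networkx as nx

# in a path graph the central nodes lie on most shortest paths
G_toy = nx.path_graph(5)
print(nx.betweenness_centrality(G_toy))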
Let's now apply this to a real network. As a first step we have to load a sample network (yes, it's the same one used in this post):
import networkx as nx

# read the graph (gml format)
G = nx.read_gml('lesmiserables.gml',relabel=True)
Now we have a representation G of our network and we can use the function betweenness_centrality() to compute the centrality of each node. This function returns a dictionary indexed by node, where each value is the betweenness of that node; calling items() on it gives us a list of (node, centrality) tuples. We can use this information in order to trim the original network and keep only the most important nodes:
def most_important(G):
 """ returns a copy of G with
     the most important nodes
     according to the betweenness centrality """ 
 ranking = nx.betweenness_centrality(G).items()
 print ranking
 r = [x[1] for x in ranking]
 m = sum(r)/len(r) # mean centrality
 t = m*3 # threshold, we keep only the nodes with 3 times the mean
 Gt = G.copy()
 for k, v in ranking:
  if v < t:
   Gt.remove_node(k)
 return Gt

Gt = most_important(G) # trimming
And we can use the original network and the trimmed one to visualize the network as follows:
from pylab import show
# create the layout
pos = nx.spring_layout(G)
# draw the nodes and the edges (all)
nx.draw_networkx_nodes(G,pos,node_color='b',alpha=0.2,node_size=8)
nx.draw_networkx_edges(G,pos,alpha=0.1)

# draw the most important nodes with a different style
nx.draw_networkx_nodes(Gt,pos,node_color='r',alpha=0.4,node_size=254)
# also the labels this time
nx.draw_networkx_labels(Gt,pos,font_size=12,font_color='b')
show()
The graph should be like this one:


This graph is pretty interesting, indeed it highlights the nodes which are very influential on the way information spreads over the network. In the sample network each node represents a character and the connection between two characters represents their coappearance in the same chapter of the book 'Les Miserables'. Looking at the graph we can easily say which are the most important characters according to the Betweenness Centrality. We can also observe some interesting situations, like the ones of Valjean and Myriel: they are connected to groups of characters who don't have a direct connection with the main ones.

Monday, January 28, 2013

A toy Bloom Filter

A Bloom Filter is a data structure designed to tell you, rapidly and memory-efficiently, whether an element is present in a set. It is based on a probabilistic mechanism where false positive retrieval results are possible, but false negatives are not. In this post we will see a pure Python implementation of the Bloom Filter and, at the end, how to tune its parameters in order to minimize the number of false positive results.
Let's begin with a little bit of theory. The idea behind the filter is to allocate a bit vector of length m, initially all set to 0, and then choose k independent hash functions, h1, h2, ..., hk, each with range [1, m]. When an element a is added to the set, the bits at positions h1(a), h2(a), ..., hk(a) in the bit vector are set to 1. Given a query element q we can test whether it is in the set using the bits at positions h1(q), h2(q), ..., hk(q) in the vector. If any of these bits is 0 we report that q is not in the set, otherwise we report that q is. The thing we have to care about is that in the second case there remains some probability that q is actually not in the set, which could lead us to a false positive response.
The following class is a naive implementation of a Bloom Filter (pay attention: this implementation is not supposed to be suitable for production. It is made just to show how a Bloom Filter works and to study its behavior):
class Bloom:
 """ Bloom Filter """
 def __init__(self,m,k,hash_fun):
  """
   m, size of the vector
   k, number of hash fnctions to compute
   hash_fun, hash function to use
  """
  self.m = m
  # initialize the vector 
  # (attention a real implementation 
  #  should use an actual bit-array)
  self.vector = [0]*m
  self.k = k
  self.hash_fun = hash_fun
  self.data = {} # data structure to store the data
  self.false_positive = 0

 def insert(self,key,value):
  """ insert the pair (key,value) in the database """
  self.data[key] = value
  for i in range(self.k):
   self.vector[self.hash_fun(key+str(i)) % self.m] = 1

 def contains(self,key):
  """ check if key is contained in the database
      using the filter mechanism """
  for i in range(self.k):
   if self.vector[self.hash_fun(key+str(i)) % self.m] == 0:
    return False # the key doesn't exist
  return True # the key can be in the data set

 def get(self,key):
  """ return the value associated with key """
  if self.contains(key):
   try:
    return self.data[key] # actual lookup
   except KeyError:
    self.false_positive += 1
The usage of this filter is pretty easy: we have to initialize the data structure with a hash function, a value for k and the size of the bit vector, then we can start adding items as in this example:
import hashlib

def hash_f(x):
 h = hashlib.sha256(x) # we'll use sha256 just for this example
 return int(h.hexdigest(),base=16)

b = Bloom(100,10,hash_f)
b.insert('this is a key','this is a value')
print b.get('this is a key')
Now, the problem is to choose the parameters of the filter in order to minimize the number of false positive results. After inserting n elements into a table of size m using k hash functions, the probability that a particular bit is still 0 is exactly

p_0 = \left(1 - \frac{1}{m}\right)^{kn}

Hence, after the n insertions, the probability that a certain bit is 1 is

p_1 = 1 - \left(1 - \frac{1}{m}\right)^{kn}

So, for fixed parameters m and n, the value of k that minimizes the false positive probability is

k = \frac{m}{n} \ln 2 \approx 0.6931 \, \frac{m}{n}
With this in mind we can test our filter. The first thing we need is a function which tests the Bloom Filter for fixed values of m, n and k, counting the percentage of false positives:
import random

def rand_data(n, chars):
 """ generate random strings using the characters in chars """
 return ''.join(random.choice(chars) for i in range(n))

def bloomTest(m,n,k):
 """ return the percentage of false positive """
 bloom = Bloom(m,k,hash_f)
 # generating a random data
 rand_keys = [rand_data(10,'abcde') for i in range(n)]
 # pushing the items into the data structure
 for rk in rand_keys:
  bloom.insert(rk,'data')
 # adding other elements to the dataset
 rand_keys = rand_keys + [rand_data(10,'fghil') for i in range(n)]
 # performing a query for each element of the dataset
 for rk in rand_keys:
  bloom.get(rk)
 return float(bloom.false_positive)/n*100.0
If we fix m = 10000 and n = 1000, according to the equations above, the value of k which minimizes the number of false positives is around 6.9314. We can confirm this experimentally with the following test:
# testing the filter
m = 10000
n = 1000
k = range(1,64)
perc = [bloomTest(m,n,kk) for kk in k] # k is varying

# plotting the result of the test
from pylab import plot,show,xlabel,ylabel
plot(k,perc,'--ob',alpha=.7)
ylabel('false positive %')
xlabel('k')
show()
The result of the test should be as follows


Looking at the graph we can confirm that for k around 7 we have the lowest false positive percentage.
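This matches the theoretical optimum, which we can compute directly (a quick check):
from math import log

m, n = 10000, 1000
print((m / float(n)) * log(2))  # about 6.93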

Monday, January 14, 2013

Box-Muller Transformation

The Box-Muller transform is a method for generating normally distributed random numbers from uniformly distributed random numbers. The Box-Muller transformation can be summarized as follows: suppose u1 and u2 are independent random variables uniformly distributed between 0 and 1, and let

z_1 = \sqrt{-2 \ln u_1} \cos(2 \pi u_2), \qquad z_2 = \sqrt{-2 \ln u_1} \sin(2 \pi u_2)
then z1 and z2 are independent random variables with a standard normal distribution. Intuitively, the transformation maps each circle of points around the origin to another circle of points around the origin where larger outer circles are mapped to closely-spaced inner circles and inner circles to outer circles.
Let's see a Python snippet that implements the transformation:
from numpy import random, sqrt, log, sin, cos, pi
from pylab import show,hist,subplot,figure

# transformation function
def gaussian(u1,u2):
  z1 = sqrt(-2*log(u1))*cos(2*pi*u2)
  z2 = sqrt(-2*log(u1))*sin(2*pi*u2)
  return z1,z2

# uniformly distributed values between 0 and 1
u1 = random.rand(1000)
u2 = random.rand(1000)

# run the transformation
z1,z2 = gaussian(u1,u2)

# plotting the values before and after the transformation
figure()
subplot(221) # the first row of graphs
hist(u1)     # contains the histograms of u1 and u2 
subplot(222)
hist(u2)
subplot(223) # the second contains
hist(z1)     # the histograms of z1 and z2
subplot(224)
hist(z2)
show()
The result should be similar to the following:


In the first row of the graph we can see, respectively, the histograms of u1 and u2 before the transformation and in the second row we can see the values after the transformation, respectively z1 and z2. We can observe that the values before the transformation are distributed uniformly while the histograms of the values after the transformation have the typical Gaussian shape.
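As a quick sanity check (a small sketch), the transformed values should have mean close to 0 and standard deviation close to 1:
print('z1: mean=%.3f std=%.3f' % (z1.mean(), z1.std()))
print('z2: mean=%.3f std=%.3f' % (z2.mean(), z2.std()))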

Saturday, November 17, 2012

First steps with networkx

One of my favorite topics is the study of structures and, inspired by the presentation of Jacqueline Kazil and Dana Bauer at PyCon US, I started to use networkx in order to analyze some networks. This library provides a lot of facilities for the creation, the visualization and the mining of structured data. So, I decided to write this post to show the first steps to start with it. We will see how to load a network from the gml format and how to prune the network in order to visualize only the nodes with a high degree. In the following examples the coappearance network of characters in the novel Les Miserables, freely available here, will be used. In this network each node represents a character and the connection between two characters represents their coappearance in the same chapter.

Let's start with the snippets. We can load and visualize the network with the following code:
import networkx as nx
from pylab import figure, hist, show

# read the graph (gml format)
G = nx.read_gml('lesmiserables.gml',relabel=True)

# drawing the full network
figure(1)
nx.draw_spring(G,node_size=0,edge_color='b',alpha=.2,font_size=10)
show()
This should be the result:


It's easy to see that the graph is not really helpful. Most of the details of the network are still hidden and it's impossible to understand which are the most important nodes. Let's plot a histogram of the number of connections per node:
# distribution of the degree
figure(2)
d = nx.degree(G)
hist(d.values(),bins=15)
show()
The result should be as follows:


Looking at this histogram we can see that only a few characters have more than ten connections. So, we decide to visualize only those:
def trim_nodes(G,d):
 """ returns a copy of G without 
     the nodes with a degree less than d """
 Gt = G.copy()
 dn = nx.degree(Gt)
 for n in Gt.nodes():
  if dn[n] <= d:
   Gt.remove_node(n)
 return Gt

# drawing the network without
# nodes with degree less than 10
Gt = trim_nodes(G,10)
figure(3)
nx.draw(Gt,node_size=0,node_color='w',edge_color='b',alpha=.2)
show()
In the graph below we can see the final result of the analysis. This time the graph enables us to observe which are the most relevant characters and how they are related to each other according to their coappearance through the chapters.

Friday, October 12, 2012

Visualizing correlation matrices

The correlation is one of the most common and most useful statistics. A correlation is a single number that describes the degree of relationship between two variables. The function corrcoef provided by numpy returns a matrix R of correlation coefficients calculated from an input matrix X whose rows are variables and whose columns are observations. Each element of the matrix R represents the correlation between two variables and is computed as

R_{X,Y} = \frac{\mathrm{cov}(X,Y)}{\sigma_X \, \sigma_Y}

where cov(X,Y) is the covariance between X and Y, while σX and σY are the standard deviations. If N is the number of variables, then R is an N-by-N matrix. When we have a large number of variables we need a way to visualize R. The following snippet uses a pseudocolor plot to visualize R:
from numpy import corrcoef, sum, log, arange
from numpy.random import rand
from pylab import pcolor, show, colorbar, xticks, yticks

# generating some uncorrelated data
data = rand(10,100) # each row represents a variable

# creating correlation between the variables
# variable 2 is correlated with all the other variables
data[2,:] = sum(data,0)
# variable 4 is correlated with variable 8
data[4,:] = log(data[8,:])*0.5

# plotting the correlation matrix
R = corrcoef(data)
pcolor(R)
colorbar()
yticks(arange(0.5,10.5),range(0,10))
xticks(arange(0.5,10.5),range(0,10))
show()
The result should be as follows:


As we expected, the correlation coefficients for the variable 2 are higher than the others and we observe a strong correlation between the variables 4 and 8.

Saturday, September 29, 2012

Weighted random choice

Weighted random choice lets you select a random value out of a set of values using a distribution specified through a set of weights. So, given a list, we want to pick some elements from it at random, but the chance of picking a specific element must be determined by a weight. In the following code we have a function that implements the weighted random choice mechanism and an example of how to use it:
from numpy import cumsum, sort, sum, searchsorted
from numpy.random import rand
from pylab import hist,show,xticks

def weighted_pick(weights,n_picks):
 """
  Weighted random selection
  returns n_picks random indexes.
  the chance to pick the index i 
  is given by the weight weights[i].
 """
 t = cumsum(weights)
 s = sum(weights)
 return searchsorted(t,rand(n_picks)*s)

# weights, don't have to sum up to one
w = [0.1, 0.2, 0.5, 0.5, 1.0, 1.1, 2.0]

# picking 10000 times
picked_list = weighted_pick(w,10000)

# plotting the histogram
hist(picked_list,bins=len(w),normed=1,alpha=.8,color='red')
show()
The code above plots the distribution of the selected indexes:


We can observe that the chance to pick the element i is proportional to the weight w[i].
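We can check this numerically by comparing the empirical frequencies with the normalized weights (a small sketch):
import numpy as np

counts = np.bincount(picked_list)
print(counts / float(len(picked_list)))  # empirical frequencies
print(np.array(w) / np.sum(w))           # normalized weights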

Friday, September 14, 2012

Boxplot with matplotlib

A boxplot (also known as a box-and-whisker diagram) is a way of summarizing a set of data measured on an interval scale. In this post I will show how to make a boxplot with pylab using a dataset that contains the monthly totals of the number of new cases of measles, mumps, and chicken pox for New York City during the years 1931-1971. The data was extracted from the Hipel-McLeod Time Series Datasets Collection and you can download it from here in the matlab format.
Let's make a box plot of the monthly distribution of chicken pox cases:
from pylab import *
from scipy.io import loadmat

NYCdiseases = loadmat('NYCDiseases.mat') # it's a matlab file

# multiple box plots on one figure
# Chickenpox cases by month
figure(1)
# NYCdiseases['chickenPox'] is a matrix 
# with 30 rows (1 per year) and 12 columns (1 per month)
boxplot(NYCdiseases['chickenPox'])
labels = ('Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 
          'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec')
xticks(range(1,13),labels, rotation=15)
xlabel('Month')
ylabel('Chickenpox cases')
title('Chickenpox cases in NYC 1931-1971')
The result should be as follows:


On each box, the central mark is the median, the edges of the box are the lower hinge (defined as the 25th percentile) and the upper hinge (the 75th percentile), and the whiskers extend to the most extreme data points not considered outliers; the outliers are plotted individually.
Using the graph we can compare the range and distribution of the chickenpox cases for each month. We can observe that March and April are the months with the highest number of cases, but also the ones with the greatest variability. We can compare the distribution of the three diseases in the same way:
# building the data matrix
data = [NYCdiseases['measles'], 
        NYCdiseases['mumps'], NYCdiseases['chickenPox']]

figure(2)
boxplot(data)
xticks([1,2,3],('measles','mumps','chickenPox'), rotation=15)
ylabel('Monthly cases')
title('Contagious childhood disease in NYC 1931-1971')

show()
And this is the result:


Here, we can observe that the chicken pox distribution has a higher median than the other diseases. The mumps distribution seems to have a small variability compared to the other ones and the measles distribution has a low median but a very high number of outliers.

Thursday, August 16, 2012

Kernel Density Estimation with scipy

This post continues the last one, where we saw how to fit two types of distribution functions (Normal and Rayleigh). This time we will see how to use Kernel Density Estimation (KDE) to estimate the probability density function. KDE is a non-parametric technique for density estimation in which a known density function (the kernel) is averaged across the observed data points to create a smooth approximation. Given the non-parametric nature of KDE, the estimator has no fixed functional form but depends upon all the data points used for the estimation. Let's see the snippet:
from scipy.stats import gaussian_kde
from scipy.stats import norm
from numpy import linspace,hstack
from pylab import plot,show,hist

# creating data with two peaks
sampD1 = norm.rvs(loc=-1.0,scale=1,size=300)
sampD2 = norm.rvs(loc=2.0,scale=0.5,size=300)
samp = hstack([sampD1,sampD2])

# obtaining the pdf (my_pdf is a function!)
my_pdf = gaussian_kde(samp)

# plotting the result
x = linspace(-5,5,100)
plot(x,my_pdf(x),'r') # distribution function
hist(samp,normed=1,alpha=.3) # histogram
show()
The result should be as follows:

Friday, July 20, 2012

Distribution fitting with scipy

Distribution fitting is the procedure of selecting a statistical distribution that best fits a dataset generated by some random process. In this post we will see how to fit a distribution using the techniques implemented in the Scipy library.
This is the first snippet:
from scipy.stats import norm
from numpy import linspace
from pylab import plot,show,hist,figure,title

# picking 150 samples from a normal distribution
# with mean 0 and standard deviation 1
samp = norm.rvs(loc=0,scale=1,size=150) 

param = norm.fit(samp) # distribution fitting

# now, param[0] and param[1] are the mean and 
# the standard deviation of the fitted distribution
x = linspace(-5,5,100)
# fitted distribution
pdf_fitted = norm.pdf(x,loc=param[0],scale=param[1])
# original distribution
pdf = norm.pdf(x)

title('Normal distribution')
plot(x,pdf_fitted,'r-',x,pdf,'b-')
hist(samp,normed=1,alpha=.3)
show()
The result should be as follows


In the code above a dataset of 150 samples has been created using a normal distribution with mean 0 and standard deviation 1, then a fitting procedure has been applied to the data. In the figure we can see the original distribution (blue curve) and the fitted distribution (red curve), and we can observe that they are really similar.
Let's do the same with a Rayleigh distribution:
from scipy.stats import norm,rayleigh

samp = rayleigh.rvs(loc=5,scale=2,size=150) # samples generation

param = rayleigh.fit(samp) # distribution fitting

x = linspace(5,13,100)
# fitted distribution
pdf_fitted = rayleigh.pdf(x,loc=param[0],scale=param[1])
# original distribution
pdf = rayleigh.pdf(x,loc=5,scale=2)

title('Rayleigh distribution')
plot(x,pdf_fitted,'r-',x,pdf,'b-')
hist(samp,normed=1,alpha=.3)
show()
The resulting plot:


As expected, the two distributions are very close.

Friday, May 4, 2012

Analyzing your Gmail with Matplotlib

Lately, I read this post about using Mathematica to analyze a Gmail account. I found it very interesting and I worked a bit with imaplib and matplotlib to create two of the graphs they showed:
  • A diurnal plot, which shows the date and time each email was sent (or received), with years running along the x axis and times of day on the y axis.
  • And a daily distribution histogram, which represents the distribution of emails sent by time of day.
In order to plot those graphs I created three functions. The first one retrieves the headers of the emails we want to analyze:
from imaplib import IMAP4_SSL
from datetime import date,timedelta,datetime
from time import mktime
from email.utils import parsedate
from pylab import plot_date,show,xticks,date2num
from pylab import figure,hist,num2date
from matplotlib.dates import DateFormatter

def getHeaders(address,password,folder,d):
 """ retrieve the headers of the emails 
     from d days ago until now """
 # imap connection
 mail = IMAP4_SSL('imap.gmail.com')
 mail.login(address,password)
 mail.select(folder) 
 # retrieving the uids
 interval = (date.today() - timedelta(d)).strftime("%d-%b-%Y")
 result, data = mail.uid('search', None, 
                      '(SENTSINCE {date})'.format(date=interval))
 # retrieving the headers
 result, data = mail.uid('fetch', data[0].replace(' ',','), 
                         '(BODY[HEADER.FIELDS (DATE)])')
 mail.close()
 mail.logout()
 return data
The second one enables us to make the diurnal plot:
def diurnalPlot(headers):
 """ diurnal plot of the emails, 
     with years running along the x axis 
     and times of day on the y axis.
 """
 xday = []
 ytime = []
 for h in headers: 
  if len(h) > 1:
   timestamp = mktime(parsedate(h[1][5:].replace('.',':')))
   mailstamp = datetime.fromtimestamp(timestamp)
   xday.append(mailstamp)
   # Time the email arrived.
   # Note that year, month and day are not important here.
   y = datetime(2010,10,14, 
     mailstamp.hour, mailstamp.minute, mailstamp.second)
   ytime.append(y)

 plot_date(xday,ytime,'.',alpha=.7)
 xticks(rotation=30)
 return xday,ytime
And this is the function for the daily distribution histogram:
def dailyDistributioPlot(ytime):
 """ draw the histogram of the daily distribution """
 # converting dates to numbers
 numtime = [date2num(t) for t in ytime] 
 # plotting the histogram
 ax = figure().gca()
 _, _, patches = hist(numtime, bins=24,alpha=.5)
 # adding the labels for the x axis
 tks = [num2date(p.get_x()) for p in patches] 
 xticks(tks,rotation=75)
 # formatting the dates on the x axis
 ax.xaxis.set_major_formatter(DateFormatter('%H:%M'))
Now we have everything we need to make the graphs. Let's try to analyze the emails received over the last 5 years:
print 'Fetching emails...'
headers = getHeaders('[email protected]',
                      'ofcourseiamsupersexy','inbox',365*5)

print 'Plotting some statistics...'
xday,ytime = diurnalPlot(headers)
dailyDistributioPlot(ytime)
print len(xday),'Emails analysed.'
show()
The result would appear as follows



We can analyze the outgoing mails by selecting the folder '[Gmail]/Sent Mail':
print 'Fetching emails...'
headers = getHeaders('[email protected]',
                     'ofcourseiamsupersexy','[Gmail]/Sent Mail',365*5)

print 'Plotting some statistics...'
xday,ytime = diurnalPlot(headers)
dailyDistributioPlot(ytime)
print len(xday),'Emails analysed.'
show()
And this is the result: