Text classification-php-v4

Who am I?
Glenn De Backer (twitter: @glenndebacker)
Web developer @ Dx-Solutions
32 years old originally from Bruges, now
living in Meulebeke
Interested in machine learning, (board) games and
electronics, with a bit of a creative streak…
Blog: http://www.simplicity.be

What will we cover today?
What is text classification
NLP terminology
Bayes theorem
Some PHP code

What is text classification?
Text classification is the process of
assigning classes to documents
This can be done manually or
algorithmically, using machine learning
Today's talk is about classifying text
with a supervised machine learning
algorithm: Naive Bayes

Supervised vs unsupervised
machine learning?
Supervised means, in simple terms,
that we need to feed our
algorithm examples of data together
with the labels they represent

Free gift card -> spam
The server is down -> ham
Unsupervised means that we work
with algorithms that find hidden
structure in unlabelled data, for
example by clustering documents

Some possible use cases
Spam detection (classic)
Assigning categories, topics, genres, subjects, …
Determining authorship
Gender classification
Sentiment analysis
Identifying languages
…

Personal project
Nieuws zonder politiek ("News without politics")
Fun project from 2010
Related to the 589 days without an elected government.
We had a lot of politics-related non-news items that
I wanted to filter out as an experiment.
News aggregator that fetched news from different
Flemish newspapers
Classified those items into political and non-political
news

Personal project
Wuk zeg je? (West-Flemish for "What are you saying?")
Fun project released at the end of 2015
Inspired by a contest of the province of
West Flanders to find foreign words that
sound West-Flemish
Can recognise the West-Flemish dialect… but
also Dutch, French and English
Uses character n-grams instead of words

Tokenization
Before any real text processing can be done, we need to
tokenize the text.
Tokenization is the task of dividing text into words,
sentences, symbols or other elements called tokens.
In machine learning, tokens are often referred to as features.
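As a rough illustration of word tokenization in plain PHP (not the NlpTools tokenizer shown later in this talk), the sketch below splits on anything that is not a letter, digit or hyphen. The function name `tokenize` and the decision to lowercase and to keep hyphens inside words are assumptions made for this example.

```php
<?php
// Hypothetical helper (not part of NlpTools): splits text into lowercase word tokens.
function tokenize(string $text): array
{
    // Split on every run of characters that is not a letter, a digit or a hyphen.
    // strtolower() only lowercases ASCII letters, which is enough for this example.
    $tokens = preg_split('/[^\p{L}\p{N}-]+/u', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);

    // preg_split() returns false on failure (for example invalid UTF-8 input).
    return $tokens === false ? [] : $tokens;
}

print_r(tokenize("PHP is a server-side scripting language."));
```

A real tokenizer usually needs more rules (apostrophes, URLs, numbers with decimals), so treat this as a starting point only.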

N-grams
N-grams are sequences of tokens of
length N
The tokens can be words, combinations of words,
characters, …
Depending on its size, an n-gram is sometimes
called a unigram (1 item), bigram (2
items) or trigram (3 items).
Character n-grams are very well suited to
language classification
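To make the idea concrete, here is a minimal sketch that builds word n-grams from a token list and character n-grams from a string. The helper names `wordNgrams` and `charNgrams` are invented for this example and are not taken from any library.

```php
<?php
// Hypothetical helper: word n-grams from a list of tokens.
function wordNgrams(array $tokens, int $n): array
{
    if ($n < 1) {
        return [];
    }
    $ngrams = [];
    for ($i = 0; $i + $n <= count($tokens); $i++) {
        $ngrams[] = implode(' ', array_slice($tokens, $i, $n));
    }
    return $ngrams;
}

// Hypothetical helper: character n-grams from a (UTF-8) string.
function charNgrams(string $text, int $n): array
{
    if ($n < 1) {
        return [];
    }
    // Split the string into single characters, multi-byte safe.
    $chars = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);
    $ngrams = [];
    for ($i = 0; $i + $n <= count($chars); $i++) {
        $ngrams[] = implode('', array_slice($chars, $i, $n));
    }
    return $ngrams;
}

print_r(wordNgrams(['the', 'server', 'is', 'down'], 2)); // word bigrams
print_r(charNgrams('wuk', 2));                           // character bigrams
```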

Stop words
Are words (or features) that
are particularly common in a text
corpus,
for example the, and, on, in, …
Are considered uninformative
A list of stop words is used to
remove or ignore words from
the document we are processing
Removing them is optional but recommended
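Stop word removal can be sketched in a few lines of plain PHP. The helper name `removeStopWords` and the tiny stop word list are assumptions for illustration; real lists are much longer and language specific.

```php
<?php
// Hypothetical helper: drops every token that appears in the stop word list.
function removeStopWords(array $tokens, array $stopWords): array
{
    // array_flip() turns the list into a lookup table for fast isset() checks.
    $lookup = array_flip($stopWords);

    // array_values() re-indexes the filtered result from 0.
    return array_values(array_filter(
        $tokens,
        fn (string $token): bool => !isset($lookup[strtolower($token)])
    ));
}

$tokens = ['The', 'server', 'is', 'down'];
print_r(removeStopWords($tokens, ['the', 'and', 'on', 'in', 'is']));
```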

Stemming
Stemming is the process of reducing words to their word stem,
base or root.
Not a required step, but it can help by reducing the
number of features, which can improve the speed or quality
of classification
The most widely used is the Porter stemmer for English; the related
Snowball stemmers cover French, Dutch, …

Bag Of Words (BOW) model
Is a simple representation
of text features
Features can be words, combinations
of words, sounds, …
A BOW model contains a
vocabulary and a count
for each vocabulary entry

Training / test set
A training set is a collection of
labeled data used to train the classifier.

Free gift card -> spam
The server is down -> ham
A test set is labeled data that is held back
to measure the accuracy of our classifier
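One possible way to produce both sets from a single list of labeled examples is a shuffled split. The sketch below is an assumption about how you might do it, not something taken from NlpTools: the name `trainTestSplit`, the default 0.8 ratio and the optional seed are all illustrative, and three of the five examples are made up.

```php
<?php
// Hypothetical helper: shuffles labeled examples and splits them into [training set, test set].
function trainTestSplit(array $examples, float $trainRatio = 0.8, ?int $seed = null): array
{
    if ($seed !== null) {
        mt_srand($seed); // makes the shuffle repeatable
    }
    shuffle($examples);

    // Number of examples that go into the training set.
    $cut = (int) floor(count($examples) * $trainRatio);

    return [array_slice($examples, 0, $cut), array_slice($examples, $cut)];
}

$examples = [
    ['spam', 'Free gift card'],
    ['ham',  'The server is down'],
    // The three examples below are made up for illustration.
    ['spam', 'Win a free cruise'],
    ['ham',  'Meeting moved to 3pm'],
    ['spam', 'Cheap pills online'],
];

[$train, $test] = trainTestSplit($examples, 0.8, 42);
printf("training: %d, test: %d\n", count($train), count($test));
```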

A typical flow
PHP is a server-side
scripting language designed
for web development

A typical flow
PHP : 1
server-side : 1
scripting : 1 
language : 1
designed : 1
web : 1
development : 1
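The flow on these two slides (tokenize, drop stop words, count) can be sketched in plain PHP as follows. The stop word list `is`, `a`, `for` is an assumption inferred from the words missing in the counts above.

```php
<?php
$text = 'PHP is a server-side scripting language designed for web development';

// Assumed stop word list, inferred from the words missing in the slide's counts.
$stopWords = ['is', 'a', 'for'];

// 1. Tokenize on whitespace.
$tokens = preg_split('/\s+/', trim($text), -1, PREG_SPLIT_NO_EMPTY);

// 2. Remove the stop words and re-index the array.
$tokens = array_values(array_diff($tokens, $stopWords));

// 3. Count how often each remaining token occurs: this is the bag of words.
$bagOfWords = array_count_values($tokens);

print_r($bagOfWords);
```

Each of the seven remaining tokens occurs once, which matches the counts listed on the slide.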

Some history trivia
Discovered by the British
minister Thomas Bayes in the
1740s.
Rediscovered independently
by the French scholar
Pierre-Simon Laplace, who gave it
its modern mathematical
form.
Alan Turing used it to help break
the German Enigma cipher,
which had a big influence on
the outcome of World War II.

Bayes' theorem
In probability theory and statistics, Bayes'
theorem describes the probability of an
event based on conditions that might
be related to that event.
E.g. how probable it is that an article is
about sports, given certain
words that the article contains.

Naive Bayes
Naive Bayes classifiers are a family of
simple probabilistic classifiers based on
applying Bayes' theorem
The naive part is that they
assume strong independence between
features (words in our case)

Bayes and text classification
We can write the standard Bayes formula as:

P(C|D) = P(D|C) × P(C) / P(D)

Where C is the class…
and D is the document
We can drop P(D), as it is the same for every
class of a given document. This is very common when
using Naive Bayes for classification problems.

Probability of a class

P(C) = Dc / Dt

Where Dc is the number of documents in
our training set that have this class…
and Dt is the total number of documents
in our training set
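The class probability is just a ratio of document counts, so it can be computed straight from the training labels. The helper name `classPriors` and the four labels in the usage line are made up for illustration.

```php
<?php
// Hypothetical helper: P(C) = Dc / Dt for every class in a list of training labels.
function classPriors(array $labels): array
{
    $total = count($labels); // Dt: total number of training documents
    $priors = [];
    foreach (array_count_values($labels) as $class => $count) {
        // $count is Dc: the number of training documents with this class.
        $priors[$class] = $count / $total;
    }
    return $priors;
}

// One label per training document (made-up data).
print_r(classPriors(['spam', 'ham', 'ham', 'ham']));
```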

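Probability of a document
given a class

P(D|C) = P(w1, w2, …, wn | C) ≈ P(w1|C) × P(w2|C) × … × P(wn|C)

Where wx are the words of our text
What is the (joint) probability of word 1,
word 2, word 3, … given our class? Under the naive
independence assumption it is the product of the
per-word probabilities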

Enough abstract
formulas for today:
two simplified examples

We have the following data*

word      good  bad  total
server       5    6     11
crashed      2   14     16
updated      9    1     10
new          8    1      9
total       24   22     46

* in reality your data will contain a lot more words and higher counts

Sentence to classify: The server has crashed
(We applied a stop word filter that removes the words "the" and "has")

word      good  bad  total
server       5    6     11
crashed      2   14     16
…            …    …      …
total       24   22     46

Sentence to classify: The new server is updated
(We applied a stop word filter that removes the words "the" and "is")

word      good  bad  total
server       5    6     11
updated      9    1     10
new          8    1      9
…            …    …      …
total       24   22     46
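The arithmetic behind both examples can be sketched in plain PHP using the counts from the table. Two things here are assumptions, not something the slides state: the class probability P(C) is estimated from the word totals (24/46 and 22/46) because the table gives no document counts, and the helper names `scores` and `predict` are invented for this example.

```php
<?php
// Word counts per class, copied from the table.
$counts = [
    'good' => ['server' => 5, 'crashed' => 2,  'updated' => 9, 'new' => 8],
    'bad'  => ['server' => 6, 'crashed' => 14, 'updated' => 1, 'new' => 1],
];

// Hypothetical helper: unnormalised score P(C) * P(w1|C) * ... * P(wn|C) per class.
// Assumption: P(C) is estimated from word totals, since document counts are not given.
function scores(array $words, array $counts): array
{
    $grandTotal = array_sum(array_map('array_sum', $counts)); // 46
    $scores = [];
    foreach ($counts as $class => $wordCounts) {
        $classTotal = array_sum($wordCounts);                 // 24 or 22
        $score = $classTotal / $grandTotal;                   // P(C)
        foreach ($words as $word) {
            // P(word | C); a word missing from the table gives 0 here,
            // real classifiers smooth the counts to avoid that.
            $score *= ($wordCounts[$word] ?? 0) / $classTotal;
        }
        $scores[$class] = $score;
    }
    return $scores;
}

// Hypothetical helper: returns the class with the highest score.
function predict(array $words, array $counts): string
{
    $scores = scores($words, $counts);
    arsort($scores);
    return array_key_first($scores);
}

echo predict(['server', 'crashed'], $counts), "\n";        // "The server has crashed"
echo predict(['new', 'server', 'updated'], $counts), "\n"; // "The new server is updated"
```

Under these assumptions the first sentence scores about 0.009 for good and 0.083 for bad, so it is classified as bad; the second scores about 0.0136 for good and 0.0003 for bad, so it is classified as good.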

NlpTools
NlpTools is a library for natural language
processing written in PHP
Classes for classifying, tokenizing,
stemming, clustering, topic modeling, …
Released under the WTFPL license (Do
what you want)

Tokenizing a sentence
use NlpTools\Tokenizers\WhitespaceTokenizer;

// text we will be converting into tokens
$text = "PHP is a server side scripting language.";
// initialize the whitespace tokenizer
$tokenizer = new WhitespaceTokenizer();
// print array of tokens
print_r($tokenizer->tokenize($text));

Dealing with stop words
use NlpTools\Documents\TokensDocument;
use NlpTools\Tokenizers\WhitespaceTokenizer;
use NlpTools\Utils\StopWords;

// text we will be converting into tokens
$text = "PHP is a server side scripting language.";
// define a list of stop words
$stop = new StopWords(array("is", "a", "as"));
// initialize the whitespace tokenizer
$tokenizer = new WhitespaceTokenizer();
// init token document
$doc = new TokensDocument($tokenizer->tokenize($text));
// apply our stop words
$doc->applyTransformation($stop);
// print filtered tokens
print_r($doc->getDocumentData());

Stemming words
use NlpTools\Stemmers\PorterStemmer;

// init PorterStemmer
$stemmer = new PorterStemmer();
// stemming variants of upload
printf("%s\n", $stemmer->stem("uploading"));
printf("%s\n", $stemmer->stem("uploaded"));
printf("%s\n", $stemmer->stem("uploads"));
// stemming variants of delete
printf("%s\n", $stemmer->stem("delete"));
printf("%s\n", $stemmer->stem("deleted"));
printf("%s\n", $stemmer->stem("deleting"));

Classification (training 1/2)
use NlpTools\Documents\TrainingSet;
use NlpTools\FeatureFactories\DataAsFeatures;
use NlpTools\Tokenizers\WhitespaceTokenizer;

$training = array(
    array('us', 'new york is a hell of a town'),
    array('us', 'the statue of liberty'),
    array('us', 'new york is in the united states'),
    array('uk', 'london is in the uk'),
    array('uk', 'the big ben is in london'),
    // …
);
// holds our training documents
$trainingSet = new TrainingSet();
// our tokenizer
$tokenizer = new WhitespaceTokenizer();
// feature factory: uses every token of a document as a feature
$features = new DataAsFeatures();

Classification (training 2/2)
use NlpTools\Documents\TokensDocument;
use NlpTools\Models\FeatureBasedNB;

// iterate over training array
foreach ($training as $trainingDocument) {
    // add to our training set
    $trainingSet->addDocument(
        // class
        $trainingDocument[0],
        // document
        new TokensDocument($tokenizer->tokenize($trainingDocument[1]))
    );
}
// train our Naive Bayes model
$bayesModel = new FeatureBasedNB();
$bayesModel->train($features, $trainingSet);

Classification (classifying)
use NlpTools\Classifiers\MultinomialNBClassifier;

$testSet = array(
    array('us', 'i want to see the statue of liberty'),
    array('uk', 'i saw the big ben yesterday'),
    // …
);
// init our Naive Bayes classifier using the features and our model
$classifier = new MultinomialNBClassifier($features, $bayesModel);
// iterate over our test set
foreach ($testSet as $testDocument) {
    // predict the class of our sentence
    $prediction = $classifier->classify(
        array('us', 'uk'), // the classes that can be predicted (same labels as in training)
        new TokensDocument($tokenizer->tokenize($testDocument[1])) // the sentence
    );
    printf("sentence: %s | class: %s | predicted: %s\n",
        $testDocument[1], $testDocument[0], $prediction);
}

Some tips
It is best practice to split your data into a training set and a test
set instead of training on your whole dataset!
If you train and evaluate your classifier on the whole dataset, it can
look very accurate on that data but
perform badly on unseen data. This is called overfitting
in machine learning.
There is no single best split, but 80-20 (Pareto principle) or 70-30
are safe ratios.
The numbers tell the tale! There are multiple ways of measuring
how well your classifier performs, but precision and recall
are a good start: http://www.kdnuggets.com/faq/precision-recall.html
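Precision and recall for one class can be computed from the true labels of the test set and the classifier's predictions. The sketch below is plain PHP; the name `precisionRecall` and the five label pairs are made up for illustration.

```php
<?php
// Hypothetical helper: precision and recall for one class of interest ($positive).
function precisionRecall(array $actual, array $predicted, string $positive): array
{
    $tp = $fp = $fn = 0;
    foreach ($actual as $i => $label) {
        if ($predicted[$i] === $positive && $label === $positive) {
            $tp++; // true positive: predicted the class and it was right
        } elseif ($predicted[$i] === $positive) {
            $fp++; // false positive: predicted the class but it was wrong
        } elseif ($label === $positive) {
            $fn++; // false negative: missed a document of the class
        }
    }
    return [
        // Of everything predicted as $positive, how much was correct?
        'precision' => ($tp + $fp) > 0 ? $tp / ($tp + $fp) : 0.0,
        // Of everything that really is $positive, how much did we find?
        'recall'    => ($tp + $fn) > 0 ? $tp / ($tp + $fn) : 0.0,
    ];
}

// Made-up test labels and predictions.
$actual    = ['spam', 'spam', 'ham', 'ham',  'spam'];
$predicted = ['spam', 'ham',  'ham', 'spam', 'spam'];
print_r(precisionRecall($actual, $predicted, 'spam'));
```

In this made-up run there are 2 true positives, 1 false positive and 1 false negative, so both precision and recall come out at 2/3.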

Some online PHP resources
http://www.php-nlp-tools.com/ - The
homepage of NlpTools
http://www.phpir.com - Contains a lot of
tutorials regarding information retrieval in
PHP
https://github.com/camspiers/statistical-classifier - An alternative
Bayes classifier that also supports SVM

Reading material
Code examples written in Java and Python but concepts
can easily be applied in other languages…

PHP NLP projects released
as open source
php-dutch-stemmer: a PHP class that stems Dutch
words. Based on Porter's algorithm.
https://github.com/simplicitylab/php-dutch-stemmer
php-luhn-summarize: a class that provides a basic
implementation of Luhn's algorithm. This algorithm
can automatically create a summary of a given text.
https://github.com/simplicitylab/php-luhn-summarize

http://www.slideshare.net/GlennDeBacker
https://github.com/simplicitylab/Talks
https://joind.in/talk/0d9b0
