The Glowing Python: NLP

Showing posts with label NLP. Show all posts

Wednesday, November 11, 2020

Visualize the Dictionary of Obscure Words with T-SNE

I recently published on a wrapper around The Dictionary of Obscure Words (originally from this website http://phrontistery.info) for Python and in this post we'll see how to create a visualization to highlight few entries from the dictionary using the dimensionality reduction technique called T-SNE. The dictionary is available on github at this address https://github.com/JustGlowing/obscure_words and can be installed as follows:

pip install git+https://github.com/JustGlowing/obscure_words

We can now import the dictionary and create a vectorial representation of each word:

import matplotlib.pyplot as plt
import numpy as np
from obscure_words import load_obscure_words
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.manifold import TSNE

obscure_dict = load_obscure_words()
words = np.array(list(obscure_dict.keys()))
definitions = np.array(list(obscure_dict.values()))

vectorizer = TfidfVectorizer(stop_words=None)
X = vectorizer.fit_transform(definitions)

projector = TSNE(random_state=0)
XX = projector.fit_transform(X)

In the snippet above, we compute a Tf-Idf representation using the definition of each word. This gives us a vector for each word in our dictionary, but each of these vectors has many elements as the total number of words used in all the definitions. Since we can't plot all the features extracted, we reduce our data to 2 dimensions we use T-SNE. We have now a mapping that allows us to place each word in a point of a bi-dimensional space. There's one problem remaining, how can we plot the words in a way that we can still read them? Here's a solution:

from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import pairwise_distances

def textscatter(x, y, text, k=10):
    X = np.array([x, y]).T
    clustering = KMeans(n_clusters=k)
    scaler = StandardScaler()
    clustering.fit(scaler.fit_transform(X))
    centers = scaler.inverse_transform(clustering.cluster_centers_)
    selected = np.argmin(pairwise_distances(X, centers), axis=0)
    plt.scatter(x, y, s=6, c=clustering.predict(scaler.transform(X)), alpha=.05)
    for i in selected:
        plt.text(x[i], y[i], text[i], fontsize=10)

plt.figure(figsize=(16, 16))
textscatter(XX[:, 0], XX[:, 1], 
            [w+'\n'+d for w, d in zip(words, definitions)], 20)
plt.show()

In the function textscatter we segment all the points created at the previous steps in k clusters using K-Means, then we plot the word related to the center of cluster (and also its definion). Given the properties of K-Means we know that the centers are distant from each other and with the right choice of k we can maximize the number of words we can display. This is the result of the snippet above:

(click on the figure to see the entire chart)

Sunday, April 5, 2020

What makes a word beautiful?

What makes a word beautiful? Answering this question is not easy because of the inherent complexity and ambiguity in defining what it means to be beautiful. Let's tackle the question with a quantitative approach introducing the Aesthetic Potential, a metric that aims to quantify the beaty of a word w as follows:

where w⁺ is a word labelled as beautifu, w^- as ugly and the function s is a similarity function between two words. In a nutshell, AP is the difference of the average similarity to beautiful words minus the average similarity to ugly words. This metric is positive for beautiful words and negative for ugly ones.
Before we can compute the Aesthetic Potential we need a similarity function s and a set of words labeled as beautiful and ugly. The similarity function that we will use considers the similarity of two words as the maximum Lin similarity between all the synonyms in WordNet of the two words in input (I will not introduce WordNet or the Lin similarity for brevity, but the curious reader is invited to follow the links above). Here's the Python implementation:

import numpy as np
from itertools import product
from nltk.corpus import wordnet, wordnet_ic
brown_ic = wordnet_ic.ic('ic-brown.dat')

def similarity(word1, word2):
    """
    returns the similarity between word1 and word2 as the maximum
    Lin similarity between all the synsets of the two words.
    """
    syns1 = wordnet.synsets(word1)
    syns2 = wordnet.synsets(word2)
    sims = []
    for sense1, sense2 in product(syns1, syns2):
        if sense1._pos == sense2._pos and not sense1._pos in ['a', 'r', 's']:
            d = wordnet.lin_similarity(sense1, sense2, brown_ic)
            sims.append(d)            
    if len(sims) > 0 or not np.all(np.isnan(sims)):        
        return np.nanmax(sims)
    return 0 # no similarity

print('s(cat, dog) =', similarity('cat', 'dog'))
print('s(cat, bean) = ', similarity('cat', 'bean'))
print('s(coffee, bean) = ', similarity('coffee', 'bean'))

s(cat, dog) = 0.8768009843733973
s(cat, bean) = 0.3079964716744931
s(coffee, bean) = 0.788150820826125

This function returns a value between 0 and 1. High values indicate that the two words are highly similar and low values indicate that there's no similarity. Looking at the output of the function three pairs of test words we note that the function considers "cat" and "dog" fairly similar while "dog" and "bean" not similar. Finally, "coffee" and "bean" are considered similar but not as similar as "cat" and "dog".
Now we need some words labeled as beautiful and some as ugly. Here I propose two lists of words inspired by the ones used in (Jacobs, 2017) for the German language:

beauty = ['amuse',  'art', 'attractive',
          'authentic', 'beautiful', 'beauty',
          'bliss', 'cheerful', 'culture',
          'delight', 'emotion', 'enjoyment',
          'enthusiasm', 'excellent', 'excited',
          'fascinate', 'fascination', 'flower',
          'fragrance', 'good', 'grace',
          'graceful', 'happy', 'heal',
          'health', 'healthy', 'heart',
          'heavenly', 'hope', 'inspire',
          'light', 'joy', 'love',
          'lovely', 'lullaby', 'lucent',
          'loving', 'luck', 'magnificent',
          'music', 'muse', 'life',
          'paradise', 'perfect', 'perfection',
          'picturesque', 'pleasure',
          'poetic', 'poetry', 'pretty',
          'protect', 'protection',
          'rich', 'spring', 'smile',
          'summer', 'sun', 'surprise',          
          'wealth', 'wonderful']

ugly = ['abuse', 'anger', 'imposition', 'anxiety',
        'awkward', 'bad', 'unlucky', 'blind',
        'chaotic', 'crash', 'crazy',
        'cynical', 'dark', 'disease',
        'deadly', 'decrepit', 'death',
        'despair', 'despise', 'disgust',
        'dispute', 'depression', 'dull',
        'evil', 'fail', 'hate',
        'hideous', 'horrible', 'horror',
        'haunted', 'illness', 'junk',
        'kill', 'less',
        'malicious', 'misery', 'murder',
        'nasty', 'nausea', 'pain',
        'piss', 'poor', 'poverty',
        'puke', 'punishment', 'rot',
        'scam', 'scare', 'shame',
        'spoil', 'spite', 'slaughter',
        'stink', 'terrible', 'trash',
        'trouble', 'ugliness', 'ugly',
        'unattractive', 'virus']

A remark is necessary here. The AP strongly depends on these two lists and the fact that I made them on my own strongly biases the results towards my personal preferences. If you're interested on a more general approach to label your data, the work published by Westbury et all in 2014 is a good place to start.
We now have all the pieces to compute our Aesthetic Potential:

def aesthetic_potential(word, beauty, ugly):
    """
    returns the aesthetic potential of word
    beauty and ugly must be lists of words
    labelled as beautiful and ugly respectively
    """
    b = np.nanmean([similarity(word, w) for w in beauty])
    u = np.nanmean([similarity(word, w) for w in ugly])
    return (b - u)*100

print('AP(smile) =', aesthetic_potential('smile', beauty, ugly))
print('AP(conjuncture) =', aesthetic_potential('conjuncture', beauty, ugly))
print('AP(hassle) =', aesthetic_potential('hassle', beauty, ugly))

AP(smile) = 2.6615214570040195
AP(conjuncture) = -3.418813636728729e-299
AP(hassle) = -2.7675826881674497

It is a direct implementation of the equation introduced above, the only difference is that the result is multiplied by 100 to have the metric in percentage for readability purposes. Looking at the results we see that the metric is positive for the word "smile", indicating that the word tends toward the beauty side. It's negative for "hassle", meaning it tends to the ugly side. It's 0 for "conjuncture", meaning that we can consider it a neutral word. To better understand these results we can compute the metric for a set of words and plot it agains the probability of a value of the metric:

test_words = ['hate', 'rain',
         'earth', 'love', 'child',
         'sun', 'patience',
         'coffee', 'regret',
         'depression', 'obscure', 'bat', 'woman',
         'dull', 'nothing', 'disillusion',
         'abort', 'blurred', 'cruelness', #'hassle',
         'stalking', 'relevance',
         'conjuncture', 'god', 'moon',
         'humorist', 'idea', 'poisoning']

ap = [aesthetic_potential(w.lower(), beauty, ugly) for w in test_words]

from scipy.stats import norm
import matplotlib.pyplot as plt
from matplotlib.colors import to_hex, LinearSegmentedColormap, Normalize
%matplotlib inline

p_score = norm.pdf(ap, loc=0.0, scale=0.7) #params estimated on a larger sample
p_score = p_score / p_score.sum()

normalizer = Normalize(vmin=-10, vmax=10)
colors = ['crimson', 'crimson', 'silver', 'deepskyblue', 'deepskyblue']
cmap = LinearSegmentedColormap.from_list('beauty', colors=colors)

plt.figure(figsize=(8, 12))
plt.title('What makes a word beautiful?',
          loc='left', color='gray', fontsize=22)
plt.scatter(p_score, ap, c='gray', marker='.', alpha=.6)
for prob, potential, word in zip(p_score, ap, test_words):
    plt.text(prob, potential, word.lower(),
             fontsize=(np.log10(np.abs(potential)+2))*30, alpha=.8,
             color=cmap(normalizer(potential)))
plt.text(-0.025, 6, 'beautiful', va='center',
         fontsize=20, rotation=90, color='deepskyblue')
plt.text(-0.025, -6, 'ugly', va='center',
         fontsize=20, rotation=90, color='crimson')
plt.xlabel('P(Aesthetic Potential)', fontsize=20)
plt.ylabel('Aesthetic Potential', fontsize=20)
plt.gca().tick_params(axis='both', which='major', labelsize=14)
plt.gca().spines['right'].set_visible(False)
plt.gca().spines['top'].set_visible(False)
plt.show()

This chart gives us a better insight on the meaning of the values we just computed. We note that high probability values are around 0, hence most words in the vocabulary are neutral. Values above 2 and below -2 have a quite low probability, this tells us that words associated with these values have a strong Aesthetic Potential. From this chart we can see that the words "idea" and "sun" are considered beautiful while "hate" and "poisoning" are ugly (who would disagree with that :).

Wednesday, September 24, 2014

Text summarization with NLTK

The target of the automatic text summarization is to reduce a textual document to a summary that retains the pivotal points of the original document. The research about text summarization is very active and during the last years many summarization algorithms have been proposed.
In this post we will see how to implement a simple text summarizer using the NLTK library (which we also used in a previous post) and how to apply it to some articles extracted from the BBC news feed. The algorithm that we are going to see tries to extract one or more sentences that cover the main topics of the original document using the idea that, if a sentences contains the most recurrent words in the text, it probably covers most of the topics of the text. Here's the Python class that implements the algorithm:

from nltk.tokenize import sent_tokenize,word_tokenize
from nltk.corpus import stopwords
from collections import defaultdict
from string import punctuation
from heapq import nlargest

class FrequencySummarizer:
  def __init__(self, min_cut=0.1, max_cut=0.9):
    """
     Initilize the text summarizer.
     Words that have a frequency term lower than min_cut 
     or higer than max_cut will be ignored.
    """
    self._min_cut = min_cut
    self._max_cut = max_cut 
    self._stopwords = set(stopwords.words('english') + list(punctuation))

  def _compute_frequencies(self, word_sent):
    """ 
      Compute the frequency of each of word.
      Input: 
       word_sent, a list of sentences already tokenized.
      Output: 
       freq, a dictionary where freq[w] is the frequency of w.
    """
    freq = defaultdict(int)
    for s in word_sent:
      for word in s:
        if word not in self._stopwords:
          freq[word] += 1
    # frequencies normalization and fitering
    m = float(max(freq.values()))
    for w in freq.keys():
      freq[w] = freq[w]/m
      if freq[w] >= self._max_cut or freq[w] <= self._min_cut:
        del freq[w]
    return freq

  def summarize(self, text, n):
    """
      Return a list of n sentences 
      which represent the summary of text.
    """
    sents = sent_tokenize(text)
    assert n <= len(sents)
    word_sent = [word_tokenize(s.lower()) for s in sents]
    self._freq = self._compute_frequencies(word_sent)
    ranking = defaultdict(int)
    for i,sent in enumerate(word_sent):
      for w in sent:
        if w in self._freq:
          ranking[i] += self._freq[w]
    sents_idx = self._rank(ranking, n)    
    return [sents[j] for j in sents_idx]

  def _rank(self, ranking, n):
    """ return the first n sentences with highest ranking """
    return nlargest(n, ranking, key=ranking.get)

The FrequencySummarizer tokenizes the input into sentences then computes the term frequency map of the words. Then, the frequency map is filtered in order to ignore very low frequency and highly frequent words, this way it is able to discard the noisy words such as determiners, that are very frequent but don't contain much information, or words that occur only few times. And finally, the sentences are ranked according to the frequency of the words they contain and the top sentences are selected for the final summary.

To test the summarizer, let's create a function that extract the natural language from a html page using BeautifulSoup:

import urllib2
from bs4 import BeautifulSoup

def get_only_text(url):
 """ 
  return the title and the text of the article
  at the specified url
 """
 page = urllib2.urlopen(url).read().decode('utf8')
 soup = BeautifulSoup(page)
 text = ' '.join(map(lambda p: p.text, soup.find_all('p')))
 return soup.title.text, text

We can finally apply our summarizer on a set of articles extracted from the BBC news feed:

feed_xml = urllib2.urlopen('http://feeds.bbci.co.uk/news/rss.xml').read()
feed = BeautifulSoup(feed_xml.decode('utf8'))
to_summarize = map(lambda p: p.text, feed.find_all('guid'))

fs = FrequencySummarizer()
for article_url in to_summarize[:5]:
  title, text = get_only_text(article_url)
  print '----------------------------------'
  print title
  for s in fs.summarize(text, 2):
   print '*',s

And here are the results:

----------------------------------
BBC News - Scottish independence: Campaigns seize on Scotland powers pledge
* Speaking ahead of a visit to apprentices at an engineering firm in 
Renfrew, Deputy First Minister Nicola Sturgeon said: Only a 'Yes' vote will 
ensure we have full powers over job creation - enabling us to create more 
and better jobs across the country.
* Asked if the move smacks of panic, Mr Alexander told BBC Breakfast: 
I don't think there's any embarrassment about placing policies on the 
front page of papers with just days two go.
----------------------------------
BBC News - US air strike supports Iraqi troops under attack
* Gabriel Gatehouse reports from the front line of Peshmerga-held territory 
in northern Iraq The air strike south-west of Baghdad was the first taken as 
part of our expanded efforts beyond protecting our own people and humanitarian 
missions to hit Isil targets as Iraqi forces go on offence, as outlined in the 
president's speech last Wednesday, US Central Command said.
* But Iran's Supreme Leader Ayatollah Ali Khamenei said on Monday that the US 
had requested Iran's co-operation via the US ambassador to Iraq.
----------------------------------
BBC News - Passport delay victims deserve refund, say MPs
* British adult passport costs Normal service - £72.50 Check  Send - 
Post Office staff check application correct and it is sent by Special Delivery 
- £81.25 Fast-Track - Applicant attends Passport Office in person and passport 
delivered within one week - £103 Premium - Passport available for collection 
on same day applicant attends Passport Office - £128 In mid-June it announced 
that - for people who could prove they were booked to travel within seven days 
and had submitted passport applications more than three weeks earlier - there 
would be a free upgrade to its fast-track service.
* The Passport Office has since cut the number of outstanding applications to 
around 90,000, but the report said: A number of people have ended up 
out-of-pocket due to HMPO's inability to meet its service standard.
----------------------------------
BBC News - UK inflation rate falls to 1.5%
* Howard Archer, chief UK and European economist at IHS Global Insight, 
said: August's muted consumer price inflation is welcome news for consumers' 
purchasing power as they currently continue to be hampered by very 
low earnings growth.
* Consumer Price Index (CPI) inflation fell to 1.5% from 1.6% in August, 
the Office for National Statistics said.
----------------------------------
BBC News - Thailand deaths: Police have 'number of suspects'
* The BBC's Jonathan Head, on Koh Tao, says police are focussing on the 
island's Burmese community BBC south-east Asia correspondent Jonathan Head 
said the police's focus on Burmese migrants would be quite controversial as 
Burmese people were often scapegoated for crimes in Thailand.
* By Jonathan Head, BBC south-east Asia correspondent The shocking death of 
the two young tourists has cast a pall over this scenic island resort Locals 
say they can remember nothing like it happening before.

Of course, the evaluation a text summarizer is not an easy task. But, from the results above we note that the summarizer often picked quoted text reported in the original article and that the sentences picked by the summarizer often represent decent insights if we consider the title of the article.

Tuesday, July 23, 2013

Combining Scikit-Learn and NTLK

In Chapter 6 of the book Natural Language Processing with Python there is a nice example where is showed how to train and test a Naive Bayes classifier that can identify the dialogue act types of instant messages. Th classifier is trained on the NPS Chat Corpus which consists of over 10,000 posts from instant messaging sessions labeled with one of 15 dialogue act types.
The implementation of the Naive Bayes classifier used in the book is the one provided in the NTLK library. Here we will see how to use use the Support Vector Machine (SVM) classifier implemented in Scikit-Learn without touching the features representation of the original example.
Here is the snippet to extract the features (equivalent to the one in the book):

import nltk

def dialogue_act_features(sentence):
    """
        Extracts a set of features from a message.
    """
    features = {}
    tokens = nltk.word_tokenize(sentence)
    for t in tokens:
        features['contains(%s)' % t.lower()] = True    
    return features

# data structure representing the XML annotation for each post
posts = nltk.corpus.nps_chat.xml_posts() 
# label set
cls_set = ['Emotion', 'ynQuestion', 'yAnswer', 'Continuer',
'whQuestion', 'System', 'Accept', 'Clarify', 'Emphasis', 
'nAnswer', 'Greet', 'Statement', 'Reject', 'Bye', 'Other']
featuresets = [] # list of tuples of the form (post, features)
for post in posts: # applying the feature extractor to each post
 # post.get('class') is the label of the current post
 featuresets.append((dialogue_act_features(post.text),cls_set.index(post.get('class'))))

After the feature extraction we can split the data we obtained in training and testing set:

from random import shuffle
shuffle(featuresets)
size = int(len(featuresets) * .1) # 10% is used for the test set
train = featuresets[size:]
test = featuresets[:size]

Now we can instantiate the model that implements classifier using the scikitlearn interface provided by NLTK and train it:

from sklearn.svm import LinearSVC
from nltk.classify.scikitlearn import SklearnClassifier
# SVM with a Linear Kernel and default parameters 
classif = SklearnClassifier(LinearSVC())
classif.train(train)

In order to use the batch_classify method provided by scikitlearn we have to organize the test set in two lists, the first one with the train data and the second one with the target labels:

test_skl = []
t_test_skl = []
for d in test:
 test_skl.append(d[0])
 t_test_skl.append(d[1])

Then we can run the classifier on the test set and print a full report of its performances:

# run the classifier on the train test
p = classif.batch_classify(test_skl)
from sklearn.metrics import classification_report
# getting a full report
print classification_report(t_test_skl, p, labels=list(set(t_test_skl)),target_names=cls_set)

The report will look like this:

              precision    recall  f1-score   support

    Emotion       0.83      0.85      0.84       101
 ynQuestion       0.78      0.78      0.78        58
    yAnswer       0.40      0.40      0.40         5
  Continuer       0.33      0.15      0.21        13
 whQuestion       0.78      0.72      0.75        50
     System       0.99      0.98      0.98       259
     Accept       0.80      0.59      0.68        27
    Clarify       0.00      0.00      0.00         6
   Emphasis       0.59      0.59      0.59        17
    nAnswer       0.73      0.80      0.76        10
      Greet       0.94      0.91      0.93       160
  Statement       0.76      0.86      0.81       311
     Reject       0.57      0.31      0.40        13
        Bye       0.94      0.68      0.79        25
      Other       0.00      0.00      0.00         1

avg / total       0.84      0.85      0.84      1056