University College London
Masters Thesis
Exploration of a self adaptive topic
engine
Author:
Maliththa S. S. Bulathwela
Institution:
University College London
Supervisors:
Prof. John Shawe-Taylor, University College London
Dr. Martin Goodson, Qubit Digital Inc.
A thesis submitted in partial fulfilment of the requirements
for the degree of Master of Science
in
Computational Statistics and Machine Learning
Department of Computer Science
September 2014
Declaration of Authorship
I, Maliththa S. S. Bulathwela, declare that this thesis titled, 'Exploration of a self
adaptive topic engine', and the work presented in it are my own. I confirm that:
This report is submitted as part requirement for the MSc Degree in Computational
Statistics and Machine Learning at University College London.
It is substantially the result of my own work except where explicitly indicated in
the text.
The report may be freely copied and distributed provided the source is explicitly
acknowledged.
Signed:
Date:
“Our language is the reflection of ourselves. A language is an exact reflection of the
character and growth of its speakers.”
Cesar Chavez
UNIVERSITY COLLEGE LONDON
Abstract
Faculty of Engineering
Department of Computer Science
Master of Science
Exploration of a self adaptive topic engine
by Maliththa S. S. Bulathwela
The primary objective of this work is to build a reliable solution that can extract insight
from customer feedback data (natural language). The problem was initially addressed
with a topic classifier built using supervised Support Vector Machines (SVM). As the
initial data was labeled using Amazon Mechanical Turk, trust modeling work from the
literature was adapted to develop heuristics that enhance the reliability of the dataset.
The SVM models were built with different pre-processing specifications and unigram/bigram
features to empirically assess the best choice. The best models obtained 93% accuracy
on the Qubit dataset when evaluated using Hamming-loss-based misclassification error.
The latter part of the experiment was directed towards detecting emerging topics from
the dataset, as the majority of the labeled observations did not belong to any of the
pre-defined topics. A Latent Dirichlet Allocation (LDA) model was used for this task.
As a labeled dataset was already available, these labels were used to tune the model
parameters automatically. In a four-phase experiment, the first phase assessed how many
of the labeled topics could be detected using the LDA model; 4 of 10 topics were detected.
In phase 2, a simulation experiment evaluated whether labeled topics could be detected
when introduced as emerging topics; the model detected 6 of 10 topics across 10 datasets.
Using the statistical qualities of the consistent topics, a scoring function was built to
assess topic consistency. Finally, the tuned parameters and the scoring function were used
to detect an emerging topic in the unlabeled data. The detected topic appeared to relate
to features of the business, and the observations deemed to belong to it also suggest that
the topic was consistent. The results of the experiments are very promising. The findings
also suggest that further work is necessary to evaluate whether self-adapting topic engines
can be built using the techniques developed through the thesis.
Acknowledgements
I thank almighty god for blessing me with strength and wisdom to pursue my interests
and accomplish my goals.
I would like to thank my internal supervisor, Prof. John Shawe-Taylor, for being an
excellent supervisor guiding me through this journey. I would also like to thank my
co-supervisor, Dr. Martin Goodson for his invaluable supervision, guidance and support
extended to shape my skills in both theory and practice of machine learning. I thank
both of them for dedicating their time and efforts to support me.
I would also like to thank the Computer Science department at UCL, the research team at
Qubit, and all my relatives and friends for strengthening me constantly.
I would like to acknowledge the support provided by my family during the preparation
of my masters project.
Contents
Declaration of Authorship i
Abstract iii
Acknowledgements iv
Contents v
List of Figures ix
List of Tables x
Abbreviations xi
1 Introduction 1
1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Background Context . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2.1 Company . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2.2 Product . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Problem Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3.1 Analyzing what is captured . . . . . . . . . . . . . . . . . . . . . . 2
1.3.2 Changing business landscape . . . . . . . . . . . . . . . . . . . . . 3
1.4 Solution Strategy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.5 Chapter Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2 Literature Review 5
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.2 Market Analytics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.3 Big Data and Machine Learning . . . . . . . . . . . . . . . . . . . . . . . 6
2.4 Trust Modeling in Crowd sourced data . . . . . . . . . . . . . . . . . . . . 6
2.5 Natural Language Processing (NLP) . . . . . . . . . . . . . . . . . . . . . 7
2.5.1 Data Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.5.1.1 Importance . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.5.1.2 Spell Correction . . . . . . . . . . . . . . . . . . . . . . . 8
2.5.1.3 Stemming . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.5.1.4 Lemmatization . . . . . . . . . . . . . . . . . . . . . . . . 9
2.5.1.5 Stopword Removal . . . . . . . . . . . . . . . . . . . . . . 9
2.5.2 Feature Extraction: Vectorization . . . . . . . . . . . . . . . . . . 9
2.5.2.1 N-gram Bag of words . . . . . . . . . . . . . . . . . . . . 10
2.5.2.2 Term Frequency – Inverse Document Frequency (TF-IDF) 11
2.5.3 Topic Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.5.3.1 Support Vector Machines (SVM) . . . . . . . . . . . . . . 12
2.5.3.2 Naïve Bayes Classifier . . . . . . . . . . . . . . . . . . . . 13
2.5.3.3 Benefits of using SVM for text classification . . . . . . . 14
2.5.4 Latent Topic Detection . . . . . . . . . . . . . . . . . . . . . . . . 15
2.5.4.1 Latent Semantic Indexing (LSI) . . . . . . . . . . . . . . 16
2.5.4.2 Latent Dirichlet Allocation (LDA) . . . . . . . . . . . . . 16
2.5.4.3 Spectral Clustering . . . . . . . . . . . . . . . . . . . . . 18
2.5.4.4 Benefits of using LDA for latent topic detection . . . . . 19
2.6 Topic Consistency Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.7 Tools and Technologies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.7.1 Amazon Mechanical Turk (MTurk) . . . . . . . . . . . . . . . . . . 20
2.7.2 Python 2.7 (Programming Language) . . . . . . . . . . . . . . . . 21
2.7.3 Special purpose libraries . . . . . . . . . . . . . . . . . . . . . . . . 22
2.7.3.1 Pandas . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.7.3.2 Scikit-learn . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.7.3.3 Natural Language ToolKit (NLTK) . . . . . . . . . . . . 24
2.7.3.4 PyEnchant . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.7.3.5 Gensim . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3 Data Collection and Preprocessing 26
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.2 Data Collection Phase . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.2.1 Labelling the observations . . . . . . . . . . . . . . . . . . . . . . . 27
3.2.2 Initial Trust Modelling phase . . . . . . . . . . . . . . . . . . . . . 28
3.2.2.1 Topic Distribution of the dataset . . . . . . . . . . . . . . 29
3.2.2.2 Initial Sanity check . . . . . . . . . . . . . . . . . . . . . 30
3.2.2.3 Worker Scoring . . . . . . . . . . . . . . . . . . . . . . . . 31
3.2.2.4 Experience of worker . . . . . . . . . . . . . . . . . . . . 38
3.2.2.5 Unique Feedback Scoring . . . . . . . . . . . . . . . . . . 39
3.2.2.6 Time of Day . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.2.2.7 Replicated feedback scoring . . . . . . . . . . . . . . . . . 42
3.2.3 Selection and Rejection Criteria . . . . . . . . . . . . . . . . . . . . 43
3.2.4 Directions for final data collection . . . . . . . . . . . . . . . . . . 44
3.2.5 The final strategy for data collection . . . . . . . . . . . . . . . . . 45
3.2.6 Final Data collection phase . . . . . . . . . . . . . . . . . . . . . . 45
3.3 The Final Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.3.1 The topic distribution of the final dataset . . . . . . . . . . . . . . 47
3.4 Preprocessing steps . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.4.1 Text Standardization . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.4.2 Spell correction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.4.3 Stemming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.5 Preprocessing sequence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.6 Preprocessing pipelines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4 Topic Classification for Labelled Observations 51
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.2 Implementation techniques . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.2.1 Extracting Features . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.2.2 Selecting Features . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.2.3 Selecting Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.2.4 Final Process Pipeline . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.3.1 Evaluation metric . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.3.2 Unigram Model Results . . . . . . . . . . . . . . . . . . . . . . . . 54
4.3.3 Bigram Model Results . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.3.4 Unigram + Bigram Model Results . . . . . . . . . . . . . . . . . . 57
4.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
5 Topic Detection Automation 60
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
5.2 Experimentation Strategy . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
5.2.1 First Phase . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
5.2.2 Second Phase . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
5.2.3 Third Phase . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
5.2.4 Fourth Phase . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
5.3 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
5.3.1 Phase 1: Tuning LDA parameters . . . . . . . . . . . . . . . . . . 62
5.3.2 Phase 2: Simulating an emerging topic . . . . . . . . . . . . . . . . 63
5.3.3 Phase 3: Developing a Topic Consistency scoring function . . . . . 63
5.3.4 Phase 4: Final topic detection . . . . . . . . . . . . . . . . . . . . 64
5.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
5.4.1 Evaluation Metric . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
5.4.2 Phase 1 Results: All topics datasets . . . . . . . . . . . . . . . . . 66
5.4.3 Phase 2 Results: Individual Datasets . . . . . . . . . . . . . . . . . 69
5.4.3.1 Delivery . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
5.4.3.2 Images . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
5.4.3.3 Stock . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
5.4.3.4 Size . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
5.4.4 Phase 4 results: None of the above dataset . . . . . . . . . . . . . 80
5.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
6 Conclusion and Future Directions 84
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
6.2 Trust modeling for cleaner data . . . . . . . . . . . . . . . . . . . . . . . . 84
6.3 Topic classification with SVM . . . . . . . . . . . . . . . . . . . . . . . . . 86
6.4 Topic Detection with LDA . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
6.5 Potential Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
6.5.1 Topic Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
6.5.2 Topic Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
6.5.3 Crowdsourcing ++ . . . . . . . . . . . . . . . . . . . . . . . . . . 89
6.5.3.1 Using crowdsourcing to label the emerging topics auto-
matically . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
6.5.3.2 Automatically building the decision boundaries for the
new topic . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
7 A change of perspective. . . CISCO 93
7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
7.2 Goals and Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
7.2.1 Career Development . . . . . . . . . . . . . . . . . . . . . . . . . . 94
7.2.2 Industry Expertise . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
7.2.3 Learn from the organizational culture . . . . . . . . . . . . . . . . 94
7.2.4 Networking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
7.3 Background Context . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
7.3.1 Cisco . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
7.3.2 CIIP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
7.4 My role and responsibilities . . . . . . . . . . . . . . . . . . . . . . . . . . 96
7.4.1 My project . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
7.4.1.1 Problem Background . . . . . . . . . . . . . . . . . . . . 98
7.4.1.2 The goal of the research project . . . . . . . . . . . . . . 99
7.4.1.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
7.4.1.4 Project outcomes . . . . . . . . . . . . . . . . . . . . . . 102
7.5 Non technical aspects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
7.5.1 Universities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
7.5.2 Meetups . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
7.5.3 Companies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
7.5.4 Hackathons . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
7.6 Benefits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
7.6.1 Technical Benefits . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
7.6.1.1 Machine Learning and Large Scale data processing . . . . 105
7.6.1.2 Training to manage a large pool of computing resources . 106
7.6.1.3 Building Data Sensors . . . . . . . . . . . . . . . . . . . . 106
7.6.1.4 Learning about Networking and Cyber-security and In-
ternet of Things . . . . . . . . . . . . . . . . . . . . . . . 106
7.6.1.5 Building data visualization techniques . . . . . . . . . . . 107
7.6.2 Non-Technical benefits . . . . . . . . . . . . . . . . . . . . . . . . . 107
7.6.2.1 Work Ethic and Discipline . . . . . . . . . . . . . . . . . 107
7.6.2.2 Personal Skills . . . . . . . . . . . . . . . . . . . . . . . . 108
7.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
A Appendix 109
Bibliography 110
List of Figures
2.1 TF-IDF notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.2 Naïve Bayes classifier model . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.3 LDA notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.1 Topics in Initial Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.2 Worker aggregated performance over lifetime . . . . . . . . . . . . . . . . 37
3.3 Worker aggregated performance over lifetime . . . . . . . . . . . . . . . . 38
3.4 Step Histogram of Feedback difficulty score . . . . . . . . . . . . . . . . . 41
3.5 Score for observations on different time of day . . . . . . . . . . . . . . . . 41
3.6 The Score distribution of the silver set observations in final dataset . . . . 46
3.7 Topic Distribution of the final dataset . . . . . . . . . . . . . . . . . . . . 47
5.1 Topic Inference comparison at 100 passes . . . . . . . . . . . . . . . . . . 67
5.2 Topic Inference comparison at 200 passes . . . . . . . . . . . . . . . . . . 68
5.3 17 topics inferred in LDA analysis . . . . . . . . . . . . . . . . . . . . . . 69
5.4 Topic Inference comparison at 300 passes in Delivery dataset . . . . . . . 71
5.5 Topic Inference comparison at 300 passes in Images dataset . . . . . . . . 73
5.6 Topic Inference comparison at 300 passes in stock dataset . . . . . . . . . 74
5.7 Topic Inference comparison at 300 passes in size dataset . . . . . . . . . . 75
5.8 Topic Inference Histogram at 300 passes in Delivery dataset . . . . . . . . 77
5.9 Topic Inference Histogram at 300 passes in Images dataset . . . . . . . . . 77
5.10 Topic Inference Histogram at 300 passes in Size dataset . . . . . . . . . . 78
5.11 Topic Inference Histogram at 300 passes in Stock dataset . . . . . . . . . 78
6.1 Final System with future developments . . . . . . . . . . . . . . . . . . . . 92
7.1 The structure of OCTO-SBG . . . . . . . . . . . . . . . . . . . . . . . . . 97
7.2 Goals of gaining insight into IoT traffic . . . . . . . . . . . . . . . . . . . . 100
7.3 CIP Packet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
7.4 Visualization application . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
List of Tables
3.1 Legend of X-axis labels for topic histograms . . . . . . . . . . . . . . . . . 30
5.1 The number of observations in each dataset . . . . . . . . . . . . . . . . . 64
5.2 The Topic legend for X-axis in inference plots . . . . . . . . . . . . . . . . 67
5.3 Mapping of LDA topics to Labeled topics . . . . . . . . . . . . . . . . . . 69
5.4 Summary of performance statistics of emerging topics in different datasets 70
5.5 Summary statistics of Delivery dataset . . . . . . . . . . . . . . . . . . . . 72
5.6 Summary statistics of Stock dataset . . . . . . . . . . . . . . . . . . . . . 74
5.7 Summary of score statistics of emerging topics in different datasets . . . . 81
5.8 Summary statistics of None of the Above dataset . . . . . . . . . . . . . . 81
Abbreviations
API Application Programming Interface
ASCII American Standard Code for Information Interchange
BOW Bag of Words
GIGO Garbage In Garbage Out
GUI Graphical User Interface
HDP Hierarchical Dirichlet Process
HIT Human Intelligence Task
hLDA hierarchical Latent Dirichlet Allocation
IDF Inverse Document Frequency
LDA Latent Dirichlet Allocation
LSI Latent Semantic Indexing
M3N Max Margin Markov Network
MCQ Multiple Choice Questions
MedLDA Maximum entropy discrimination Latent Dirichlet Allocation
MLE Maximum Likelihood Estimate
mTurk Mechanical Turk
NLP Natural Language Processing
NLTK Natural Language ToolKit
OOP Object Oriented Programming
pLSI probabilistic Latent Semantic Indexing
sLDA supervised Latent Dirichlet Allocation
ssLDA semi-supervised Latent Dirichlet Allocation
PyPI Python Package Index
SVM Support Vector Machine
TF Term Frequency
Dedicated to my parents
Chapter 1
Introduction
1.1 Introduction
Natural Language Processing (NLP) is one of the most rigorously investigated and applied
domains in Machine Learning, and it is making a massive impact on how people approach
problems. With the dot-com boom and the big data revolution, the power of data has
never been stronger. The computer industry itself has adapted to reap the harvest of this
newborn opportunity, as one can see in how software giants like Google, Facebook and
Cisco are building more and more data products. Natural language, amongst the various
data points being captured, is a very valuable data resource that is in abundant supply.
It is the main means of human expression: people all over the world use natural language
as one way of leaving a footprint on the Internet, in the form of content, feedback,
reviews, ideas and opinions. A great deal of insight can be drawn from this data, and
companies like Google make a major fraction of their revenue by adding value to their
services through it. This thesis investigates how natural language can be used to bring
more value and insight to customer feedback analysis.
1.2 Background Context
The main input to this project is customer feedback collected from client websites that
are being supported by Qubit Digital Ltd.
1.2.1 Company
Qubit Digital develops technologies to collect and process web data. The company offers
solutions relating to tag management, marketing attribution and more, as well as market
analytics solutions relating to conversion optimization, customer intelligence and
behavioral attribution.
1.2.2 Product
The feedback collection system of Qubit captures user feedback. Initially, when a cus-
tomer visits a website, the behavioral models start predicting the next action of the
customer depending on his/her session history. When the system predicts that the cus-
tomer is going to leave the website, the website prompts the customer for feedback on
where the website can improve. Whatever customers type in that prompt is captured
and stored by the feedback collection system. This is the main data source being used
for the research project.
1.3 Problem Overview
The feedback collection system of Qubit Digital collects thousands of feedback entries
every day from the multiple client websites it serves. This feedback mainly relates to
problems with the website or valuable suggestions on how to improve the website or the
business. This information is very valuable for the improvement of the client's products
and services, and the insight is vital for market adaptation and strategic growth.
1.3.1 Analyzing what is captured
Unfortunately, the scale of the feedback collection is massive, and analyzing content
at such scale using human intelligence is infeasible both financially and operationally.
Automation is therefore essential for analyzing such a large feedback corpus. As a
starting point, clients are primarily interested in gaining insight into main topics such
as Delivery, Price, Product and Latency of the website. As manually categorizing
individual feedback entries into topics is unrealistic, an automatic topic classification
solution is necessary to learn the unique patterns in feedback and map them to different topics.
1.3.2 Changing business landscape
Even though topic categorization can be automated, the business landscape is constantly
changing. Even within the scope of an individual business entity, the topics being discussed
change as various business factors change. A static classification model cannot cope
with a data generation process that is dynamic by nature. Therefore, the topic
categorization engine should be sensitive to emerging trends and adapt itself to the changing
topics. A feature that can detect newly emerging topics is necessary to address this issue.
1.4 Solution Strategy
A supervised learning topic classifier is a very good candidate for solving the topic
categorization problem. A set of essential topics is defined initially. A human workforce
is then used to correctly classify a sample of feedback into the pre-defined topics, and
this dataset is used to train a supervised classifier that can automatically assign unseen
observations to one or more pre-defined topics.
To extend the topic model automatically, latent topic detection can be used to detect
unseen topics that are emerging in the dataset. These topics have to be searched for in
the data that does not belong to any of the pre-defined topics. Latent topic detection
is an unsupervised learning solution, so manual classification is unnecessary.
1.5 Chapter Overview
This work contributes to topic categorization and topic detection from an applied
perspective. It describes how supervised topic categorization and latent topic detection can
be used in an industry setting to automate information extraction from customer feedback
data. It also applies trust modeling heuristics to crowd-sourced datasets collected via
Mechanical Turk. The thesis also details the attempts to formulate a scoring function
that can be used to evaluate the topic consistency of an emerged topic without using
external corpora.
The chapter overview of the thesis is as follows:
Chapter 1: Introduction
This chapter introduces the background context, problem and the solution strategy
that is described in the rest of the thesis.
Chapter 2: Literature review
This chapter introduces the background knowledge explored and exploited throughout
the thesis. Details about the problem domain, application, different techniques, methods
and potential solutions are outlined. The content discusses technical details, theory,
and the current research landscape of the domain.
Chapter 3: Data Collection and Preprocessing
As the name suggests, this chapter details the data collection and data pre-processing
steps. The main focus is on the data collection process, particularly the worker
evaluation heuristics used. The preprocessing steps are also described concisely.
Chapter 4: Topic Classification for labeled observations
This chapter details how the supervised learning model was built using the labeled
dataset. Feature extraction, feature selection, parameter tuning and the evaluation
metric are discussed in detail. Finally, the results are reported along with conclusions.
Chapter 5: Topic Detection within Unlabeled Data
This chapter’s primary focus is on using unsupervised learning to detect emerging
topics from the unlabeled dataset. The experiment is carried out in 4 phases where
the initial phases focus on using the labeled data to drive the parameter tuning
process. Details of building the topic consistency measurement function are also
outlined in this chapter. The chapter concludes by reporting results and deriving
conclusions from the experiment.
Chapter 6: Conclusion and Future Directions
This chapter concludes the main body of the thesis. It starts by summarizing the work
undertaken and the results obtained, then discusses the primary conclusions of the
study in detail. The chapter ends with a description of potential future work that
could complement the work carried out in the thesis.
Chapter 7: A change of perspective. . . CISCO
This chapter summarizes the industrial project carried out with Cisco Systems in
California for one calendar year. It outlines the experience, the non-technical aspects
and the machine learning related work undertaken during this period.
Chapter 2
Literature Review
2.1 Introduction
In this chapter, the main knowledge areas related to the research project are discussed
in detail. An overview of each topic is followed by evidence from research on that topic,
supported by scientific literature.
The background domain of the research question, market analytics and insight, is
discussed at the beginning of the chapter. The use of Natural Language Processing and the
machine learning approach to it is presented in the next section, with separate focus on
data preprocessing, supervised topic modeling and latent topic detection. The opportunities
for eliminating the human in the loop in the topic detection process are also discussed in
this section, supported by research evidence. The tools used in experimentation, and the
rationale for choosing them, are discussed in the last section of the chapter.
2.2 Market Analytics
Rapidly pacing through the information age, the ability to record and store massive
amounts of data has revolutionized the way people look at things. The data-driven decision
making, pattern recognition and prediction capabilities this big data revolution has
enabled allow people to approach analytics and decision making from a very different
perspective. Along with this paradigm shift, many domains and entities have adopted big
data and machine learning to reap invaluable benefits and gain an edge. Numerous domains,
both commercial and otherwise, such as finance, e-commerce and marketing, health care,
security, physics, search and entertainment, have aggressively adopted big data and
machine learning approaches to enhance their effectiveness.
2.3 Big Data and Machine Learning
Market analytics is one of the fields that has been gaining a lot of momentum thanks
to big data and machine learning. It helps today's organizations use the mammoth number
of data points available through customer feedback to understand their customer base and
to tailor personalized products and services better and more quickly. Machine learning
and data mining are used in market analytics in many different ways.
Large organizations such as Google, Facebook and Netflix maintain their own machine
learning teams within their marketing teams. Subscription companies such as Netflix use
the vast amount of data they collect on customer buying patterns to model customer
churn (customer attrition) (Meuth et al., 2008). Amazon uses machine learning to
predict purchases and start shipping items before the purchase goes through (Spiegel et al.,
2013). This gives Amazon a competitive advantage over other online retailers by
minimizing shipping time.
During the last few years, many new companies have sprung up that enable enterprises
of all scales to apply machine learning and data mining to customer feedback data.
Companies such as Qubit Digital and Skimlinks provide Application Programming
Interfaces (APIs), toolkits and portals to capture, process and analyze customer
feedback data.
2.4 Trust Modeling in Crowd sourced data
Collective decision making (ensembles) is a well-studied area in the social choice, voting
and Artificial Intelligence domains. The Condorcet Jury Theorem (Ladha, 1992) states that
if a collection of agents takes a binary decision by majority voting, and the probability p
of an agent selecting the correct answer is greater than 0.5, then adding more agents
increases the probability of obtaining the correct decision. With recent developments in
technology, companies can now use crowdsourcing to carry out business tasks. Normally,
in cases such as intelligence testing (Anastasi and Urbina, 1997), a gold set (a set with
known answers) is used to evaluate the responses in a crowdsourcing task. Several studies
record attempts to use aggregated responses in IQ testing. The main research question in
this project is how to assess worker reliability using majority voting. Raykar et al.
(Raykar et al., 2010) propose a machine learning based method that does not model question
difficulty. Bachrach et al. (Bachrach et al., 2012) present a graphical model of question
difficulty, participant ability and the true response to grade aptitude tests without
knowing the answers. These studies show that majority voting / concordance of answers is
a very useful factor when evaluating the accuracy of a response.
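To make the theorem concrete, the following short simulation (illustrative only, not part of the thesis' experiments; the per-annotator accuracy of 0.6 is an arbitrary assumption) computes the probability that a majority of independent annotators returns the correct label:

```python
from math import comb

def majority_correct_prob(p, n):
    """Probability that a majority of n independent annotators (odd n),
    each correct with probability p, returns the correct label."""
    assert n % 2 == 1, "use an odd number of annotators to avoid ties"
    return sum(comb(n, k) * p ** k * (1 - p) ** (n - k)
               for k in range(n // 2 + 1, n + 1))

# With p > 0.5 the majority becomes more reliable as annotators are added,
# which is exactly the behaviour the Condorcet Jury Theorem describes.
for n in (1, 3, 5, 11, 21):
    print(n, round(majority_correct_prob(0.6, n), 4))
```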
2.5 Natural Language Processing (NLP)
Natural Language Processing (NLP) is an area that has been under extensive research
during the last few years. Natural language being the main form of communication among
people, a large quantity of data is generated and consumed in the form of natural language
every day, in books, articles, feedback, reviews, blog posts and numerous other forms.
Although many business organizations capture and record these data points, and although
humans can interpret natural language very accurately, analyzing this data manually is
tedious and expensive, as human effort cannot keep up with the rate at which the data is
generated. NLP is an emerging solution to this problem that uses machine learning and
pattern recognition to analyze natural language computationally.
2.5.1 Data Preprocessing
Data preprocessing is a vital part of data science. Real-world data is rarely perfect and
hence requires preparation before machine-learning algorithms can be applied. Preprocessing
is mainly used to clean the data, for example by reducing noise and improving completeness.
2.5.1.1 Importance
Natural language is unstructured and diverse. It is also unorthodox compared to the
inputs of conventional machine learning processes, since machine learning is largely a
tool for detecting patterns in numerical data. Due to the nature of natural language data,
NLP typically involves data preprocessing (Danilevsky). Although these techniques are
quite straightforward, preprocessing the text before vectorization has been shown to
improve accuracy and performance. Several techniques can be used to preprocess natural
text before vectorizing it:
• Spell Correction
• Stemming
• Lemmatization
• Stopword Removal
2.5.1.2 Spell Correction
Text data is primarily generated by humans. The amount of moderation and review
that the data undergoes before being published can vary depending on the data source.
Text from published material such as peer-reviewed articles, refereed journals, news
sources, books and other publications usually goes through a rigorous and iterative
quality assurance process (e.g. author guidelines in journal publications). But a fair
fraction of the text data generated by society, such as user feedback, complaints, tweets,
comments and reviews, does not undergo that amount of moderation and standardization.
The latter is therefore more error prone and more likely to contain human errors such as
spelling and grammatical mistakes, since the human commitment involved in producing
user-generated text is far less than for author-generated text.
Spell correction improves the accuracy (precision) of results by reducing the noise in
the document set.
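As an illustrative sketch (not the exact routine used later in the thesis), dictionary-based spell correction can be performed with the PyEnchant library introduced in Section 2.7.3.4; naively replacing an unknown token with the first suggestion is a simplification of what a careful pipeline would do:

```python
import enchant  # PyEnchant: dictionary-based spell checking

def correct_tokens(tokens, lang="en_US"):
    """Replace tokens the dictionary does not recognise with its top suggestion."""
    dictionary = enchant.Dict(lang)
    corrected = []
    for token in tokens:
        if token.isalpha() and not dictionary.check(token):
            suggestions = dictionary.suggest(token)
            corrected.append(suggestions[0] if suggestions else token)
        else:
            corrected.append(token)
    return corrected

print(correct_tokens(["delivry", "was", "realy", "slow"]))
```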
2.5.1.3 Stemming
Stemming is a method of collapsing distinct word forms: it reduces different inflected
forms of a word to a common root form. Due to grammatical effects, words undergo minor
variations and transformations when used in documents, so methods are needed to map all
these variations of a word onto one unique form that represents their similarity. Because
it is based on simple rules, word stemming is computationally inexpensive, but the gain
in accuracy is modest.
A stemming algorithm applies a selected number of word reduction rules sequentially.
The outcome of the stemming algorithm is not guaranteed to be the morphological root of
the word, but several forms of the same word are likely to reduce to the same root form.
For English, many stemmers exist, such as the Lovins stemmer (Lovins, 1968) and the Paice
stemmer (Paice, 1990). Several studies have shown empirically that the Porter stemmer
(Porter, 1980) is very effective. Stemmers for other languages also exist (xapian.org) but
are not relevant within the context of this project, as the dataset consists of customer
feedback written in English.
2.5.1.4 Lemmatization
Some irregular word forms, such as "is", "are" and "am", cannot be stemmed to a unique
root form because of their textual structure. If the accuracy of word normalization is
very important, lemmatization can be used instead of stemming. Lemmatization performs
morphological analysis on words to normalize them to their stem form. Rather than simply
applying a character reduction rule set, it analyses the full morphology of the word to
identify the "lemma" using a dictionary. This process is more accurate than stemming,
but the accuracy is gained at the expense of computational complexity.
Both stemming and lemmatization improve the recall of the results while reducing their
precision.
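A small sketch with NLTK (the toolkit listed in Section 2.7.3.3) contrasts the two normalization strategies; the example words are arbitrary and the WordNet corpus must be downloaded first:

```python
from nltk.stem import PorterStemmer, WordNetLemmatizer
# One-off setup: import nltk; nltk.download("wordnet")

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print(stemmer.stem("studies"))                  # rule-based reduction, e.g. "studi"
print(lemmatizer.lemmatize("studies"))          # dictionary lemma: "study"
print(stemmer.stem("better"))                   # irregular form is left untouched
print(lemmatizer.lemmatize("better", pos="a"))  # morphological analysis gives "good"
```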
2.5.1.5 Stopword Removal
Stop words are words that are filtered out before or after processing natural language
text data (Rajaraman and Ullman). Stop words are removed in text processing to improve
accuracy. Words can be classified as stop words for several reasons. They are words/tokens:
1. That are not useful in text processing
2. That do not have an impactful meaning
3. That are very common in a language
4. That only serve the language structure (e.g. a, is, and, this)
Eliminating stop words increases accuracy by removing uninformative tokens that are likely
to add noise to the documents; it is analogous to cleaning the dataset. It also reduces
the vocabulary of the model, which improves the performance and scalability of the
solution through better resource utilization (memory and computation). Studies show
that stop word removal positively affects the accuracy of results. Furthermore,
tailor-made stop word lists have been shown to give better performance than arbitrary
stop word lists (Xia et al., 2009).
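For completeness, a minimal stop word filter using NLTK's generic English list is sketched below; a tailor-made list, as suggested by (Xia et al., 2009), could simply be added to the stop_words set.

```python
from nltk.corpus import stopwords          # requires nltk.download("stopwords")
from nltk.tokenize import word_tokenize    # requires nltk.download("punkt")

stop_words = set(stopwords.words("english"))  # domain-specific words can be added here

def remove_stopwords(text):
    """Keep only alphabetic tokens that are not in the stop word list."""
    return [t for t in word_tokenize(text.lower())
            if t.isalpha() and t not in stop_words]

print(remove_stopwords("The delivery of this order is very slow and the site keeps crashing"))
```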
2.5.2 Feature Extraction: Vectorization
Feature extraction is the process of transforming arbitrary data such as images and text
into numerical feature vectors that are usable by machine learning algorithms. Natural
language is conventionally represented in Unicode or ASCII string format (text) in which
all the characters, words and sentences that carry information are represented.
Unfortunately, machine-learning algorithms are designed to process data in numerical form.
For this reason, string data should be transformed into numerical vectors (vectorization)
that can be processed by machine learning algorithms. The primary method used to transform
natural language into numerical vector form is the Bag of Words (BOW) technique
(Zaman et al., 2011).
2.5.2.1 N-gram Bag of words
N-gram bag of words is a feature extraction technique designed to extract word features
from text. The method builds a numerical vector that represents the presence of words
in a document: the document is represented as a multiset (Knuth, 1998) of the words it
contains. This is analogous to building histogram-like statistics of the word occurrences
in a document, and it is widely used as a feature extraction method in document
classification and clustering problems.
In most real-world applications, however, it is evident that word phrases carry much
more meaning than individual words. For instance, "time series" and "time machine" refer
to completely different concepts, which is lost if "time", "series" and "machine" are
modeled separately and independently.
There are several advantages to using word phrases for topic modeling compared to
treating words separately:
1. Word phrases are more informative: they give more information about the word sequences
2. The meaning of a phrase is not always represented by, or derivable from, its individual words
3. The context of a phrase is different depending on the word set and the order of occurrence
The term "n-gram" refers to a contiguous sequence of n units in an ordered sequence.
Depending on the application, the unit may be a word phrase (NLP), an amino acid (protein
sequencing), a user state (conversion optimization), a network flag (NetFlow
(Řehák et al., 2008)) or a base pair (DNA sequencing). The n-gram model with one word is
referred to as the "unigram" model; two- and three-item models are referred to as "bigram"
and "trigram" models respectively. N-grams preserve the temporal features of the
observations because they preserve word order, and they preserve the contextual meaning
of tokens because they preserve word phrases.
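A brief scikit-learn sketch of unigram + bigram bag-of-words extraction (the feedback snippets are invented for illustration, and a recent scikit-learn version is assumed):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["delivery was very slow",
        "slow delivery and poor packaging"]

# ngram_range=(1, 2) builds a single vocabulary of unigrams and bigrams
vectorizer = CountVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())  # e.g. "delivery", "delivery was", "slow", ...
print(X.toarray())                         # one count vector (row) per document
```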
2.5.2.2 Term Frequency – Inverse Document Frequency (TF-IDF)
TF-IDF (Wikipedia) is a numerical transformation that weights the importance of word
tokens depending on the word distribution. This statistic is often used as a weighting
factor in text analytics. TF-IDF can also be used to determine stop words for text
summarization and to create tailor-made stop word lists.
Figure 2.1: TF-IDF notation
Text analytics problems like document classification and information retrieval try to
find unique word phrases that can distinguish between different topics or concepts.
TF-IDF is calculated by multiplying two statistics:
1. Term Frequency: the frequency with which a term occurs in each document. This
up-weights terms that occur frequently in particular documents, under the assumption
that words that occur more frequently have more influence.
2. Inverse Document Frequency: the inverse of the fraction of documents in which the
term occurs. The larger the number of documents containing the term, the smaller the
IDF. This down-weights terms that are present in a wide fraction of the corpus.
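In symbols, one common formulation (the exact weighting variant in Figure 2.1 may differ) is:

\[
\mathrm{tf\text{-}idf}(t, d, D) \;=\; \mathrm{tf}(t, d)\,\cdot\,\mathrm{idf}(t, D),
\qquad
\mathrm{idf}(t, D) \;=\; \log\frac{|D|}{\lvert\{\,d \in D : t \in d\,\}\rvert},
\]

where t is a term, d a document and D the corpus.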
Depending on the application, various methods are used to calculate the TF-IDF value.
IDF is a heuristic used to increase term specificity. Although it has worked well
empirically, there are still attempts to understand the theoretical foundations of IDF
along the lines of information theory.
TF-IDF has been shown to be a very effective transformation of the bag of words in
information retrieval (Ramos). Studies show that TF-IDF captures local, document-wide
relevance of terms via TF, and corpus-wide relevance via IDF (Blei and D.).
2.5.3 Topic Classification
Topic classification is the process of categorizing content into predefined topics. This
can be done using a statistical language model that maps documents to topics; such
classifiers are called model-based topic classifiers and are very common in practice.
In model-based classification, the algorithm gathers a training set containing
observations that are already classified and uses this dataset to build a topic model.
The derived model is then used to infer topics for new documents.
There are two main machine learning techniques used to build supervised topic models:
max-margin based models and maximum likelihood based models. Max-margin models such as
Support Vector Machines (SVM) carry out pattern recognition using vector geometry in
Euclidean space; this approach tries to find the linear model that creates the classifier
with the maximum margin between classes. The latter approach, Maximum Likelihood Estimate
(MLE) based techniques such as the Naïve Bayes classifier, builds a probabilistic model
that represents likelihood. Both approaches have been used successfully with bag of words
(BOW) features (Wang and Manning, 2012).
Training max-margin classifiers is relatively less computationally expensive than training
MLE based models, because max-margin models are based on a convex optimization problem
(quadratic programming) and hence have a single optimum (no local minima). Studies have
also shown that max-margin methods are effective in both text-based classification and
regression problems (Zhu et al., 2009).
There is also extensive research on the marriage of both the Maximum Likelihood Ap-
proach and the Max Margin Method. Max Margin Markov Networks (M3N) (Taskar
et al., 2003) and Maximum Entropy Discrimination Latent Dirichlet Allocation (MedLDA)
Model (Zhu et al., 2012) are models that use the characteristics of both the approaches
together.
2.5.3.1 Support Vector Machines (SVM)
Support Vector Machines are a family of supervised learning models used for both
classification and regression. An SVM is a non-probabilistic binary linear classifier
that uses a set of labeled training observations to build a linear model for
classification. The examples are represented as data points in a Euclidean space, and
the algorithm derives the linear hyper-plane that separates the two classes of examples
with the maximum possible margin (Cortes and Vapnik, 1995). SVMs can handle non-linear
classification problems by transforming the data points into a high dimensional Hilbert
space using the kernel trick (Boser et al., 1992).
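For reference, the soft-margin primal problem that this maximum-margin search solves (Cortes and Vapnik, 1995) can be written as:

\[
\min_{\mathbf{w},\, b,\, \boldsymbol{\xi}} \;\; \tfrac{1}{2}\lVert\mathbf{w}\rVert^{2} + C\sum_{i=1}^{n}\xi_{i}
\quad\text{subject to}\quad
y_{i}\,(\mathbf{w}^{\top}\mathbf{x}_{i} + b) \,\ge\, 1 - \xi_{i},\;\; \xi_{i}\ge 0 ,
\]

where C controls the trade-off between the margin width and the training errors.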
There are several advantages of using SVMs for classification. SVMs:
• Use only a subset of observations to build the decision boundary, and are therefore
memory efficient
• Are effective in high dimensional feature spaces; their ability to learn is independent
of the dimensionality of the feature space
• Are effective even when the number of observations is smaller than the number of features
• Allow different transformations in the dual form (kernel functions: radial basis, sine,
Gaussian, string and many more)
• Offer string kernels, which come in very handy in text analytics
• Support different regularization approaches to control the overfitting problem
The disadvantages of SVMs are:
• The likelihood (confidence) estimates are not straightforward
• If the number of observations is much smaller than the number of dimensions,
performance can be poor unless the kernel and regularization are chosen carefully
2.5.3.2 Naïve Bayes Classifier
The Naïve Bayes classifier is a probabilistic approach that applies Bayes' rule to
observations while naively assuming that the features are independent of each other. It
calculates the probability of observation n having label y conditioned on the values of
the features x.
Figure 2.2: Naïve Bayes classifier model
Maximum A Posteriori (MAP) estimation is used to estimate P(Y) and P(X|Y). Naïve Bayes
computes P(Y) by counting observations in the training set. There are several types of
Naïve Bayes classifiers, such as multinomial, Bernoulli and Gaussian, depending on the
assumptions placed on the distribution P(X|Y).
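Written out, the MAP decision rule behind Figure 2.2 is:

\[
\hat{y} \;=\; \arg\max_{y}\; P(y)\prod_{j=1}^{d} P(x_{j}\mid y),
\]

where the product over the d features reflects the naive independence assumption.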
Naïve Bayes classifiers are advantageous because they:
• Are extremely simple, as only counting is involved
• Converge very quickly
• Perform reasonably well even if the independence assumption does not hold
• Provide probability estimates in a straightforward way
• Make a decent baseline classifier
The disadvantages are:
• Lack of error guarantees
• Naïve Bayes is known to be a bad estimator, so its probability values should not be
taken too seriously
2.5.3.3 Benefits of using SVM for text classification
SVM is a learning method that is well founded in theoretical understanding and analysis.
SVMs are a good learning tool for a few reasons.
Strong learning theory base: SVMs stand very strong in statistical learning theory, as
they are inspired by the concept of max-margin learning (Cristianini and Shawe-Taylor,
2000) and based on structural risk minimization.
Better parameter tuning: The margin argument also suggests heuristics for selecting
good parameter settings for the learning algorithm. (Joachims, 1997) suggests how
to select the kernel width for the Radial Basis Function (RBF) kernel, which allows
more efficient parameter tuning without using cross-validation.
Apart from SVMs' general ability to learn, this family is suitable for text
categorization for several reasons.
Effective in feature spaces with high dimensionality: As the size of the feature space
does not limit the learning ability of SVMs, they are very suitable for the text space,
where the feature space is the vocabulary and can span several thousand dimensions. The
existence of the dual form in max-margin classifiers also makes SVMs ideal for text data
with a large number of features.
Few irrelevant features: By assuming that there are only a few relevant features, it
is possible to avoid high dimensional input spaces. But in natural text, most features
(tokens) are relevant. Experiments by (Wang and Manning, 2012) using a Naïve Bayes
classifier trained on top-ranking features also show that even features ranked far down
the list carry a considerable amount of information.
Effective with sparse feature vectors: Documents produce sparse feature vectors, as
each document contains only a tiny fraction of the entire vocabulary. (Kivinen et al.,
1995) shows that additive algorithms with an inductive bias similar to SVMs are ideal
for problems with sparse observations and dense concepts.
Linearly separable problems: Topic classification often leads to linearly separable
data spaces due to the nature of natural language. Different words have different
meanings, and as topics in topic classification problems are often independent of each
other, the relevant features (words or tokens) are often different for each topic. For
this reason, text categorization problems are often linearly separable, and max-margin
approaches such as SVMs are very suitable for them.
Due to the above-mentioned theoretical reasons and the empirical evidence consistently
visible across various studies (Wang and Manning, 2012, Joachims, 1997, Kivinen et al.,
1995), Support Vector Machines are a very suitable technique for text classification.
SVMs further strengthen the case by enabling extensions such as string kernels (Lodhi
et al., 2002) and different regularization techniques to customize the solution to the
complexity of the text data.
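As a minimal sketch of how such a classifier can be assembled with scikit-learn (the documents, labels and parameters below are invented; the thesis' actual models in Chapter 4 are multi-label and evaluated with Hamming loss):

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Toy feedback snippets and single-label topics, for illustration only
docs = ["delivery took two weeks", "cannot zoom the product images",
        "item out of stock again", "checkout page is very slow"]
labels = ["delivery", "images", "stock", "latency"]

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2), stop_words="english")),
    ("svm", LinearSVC(C=1.0)),
])
pipeline.fit(docs, labels)

print(pipeline.predict(["my order still has not been delivered"]))
```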
2.5.4 Latent Topic Detection
Natural language is structured in such a way that documents carry words that relate
them to more abstract concepts called topics. Normally, a topic can be represented by a
group of words that are significantly meaningful in describing it. These topics are
assumed to be "hidden" (latent) inside corpora, and they are often unknown.
From a machine learning perspective, latent topic detection is an unsupervised learning
problem. Approaches towards it can be construed in two ways:
Dimension reduction: It can be interpreted as a dimension reduction process in which
all the words in a document are reduced to groups of highly correlated words
that belong to topics. Examples of algorithms using this approach are Latent
Dirichlet Allocation (LDA) and Latent Semantic Indexing (LSI). This approach
is fairly popular for latent topic detection and has shown very promising
results.
Clustering: Latent topic detection can also be interpreted as a clustering problem. In
this approach, documents are treated as observations and clustering can be based
on how similar documents are. The corpus is clustered into document groups that
are similar to each other.
2.5.4.1 Latent Semantic Indexing (LSI)
While TF-IDF provides a good representation of the words that are discriminative for
documents in a collection, it does not reveal much about inter-document or intra-document
statistical behavior. Information retrieval researchers suggested LSI (Deerwester et al.,
1990) to address this problem. LSI is a singular value decomposition based dimension
reduction technique that finds the linear subspaces that best preserve the variance of
the data. A significant further enhancement of the concept was achieved later with
probabilistic Latent Semantic Indexing (pLSI) (Hofmann, 1993). pLSI models each word in
a document as a sample from a mixture model: each word is generated by a single topic,
and different words in a document may be generated from different topics.
Both LSI and pLSI have their advantages:
• More sensible and informative way to model natural language text
• Can map large documents into reduced descriptions of linear subspaces
• Can capture linguistic aspects such as polysemy and synonymy
• Useful to develop generative models from corpora
There are also disadvantages:
• Singular value decomposition and probabilistic parameter training are computa-
tionally expensive
• It is possible to model text more directly by fitting the model using maximum
likelihood or Bayesian methods
2.5.4.2 Latent Dirichlet Allocation (LDA)
LDA is a probabilistic unsupervised topic detection approach that is used to detect the
hidden topics in a document corpus. It uses a probabilistic graphical model that has been
designed to represent the natural language generation process to find concepts/topics
(groups of significant words that co-occur) that are hidden in the documents.
In the graphical model of LDA, there are two repeating plates, for each document and
for each topic. The data generation model assumed in LDA is as follows:
For each document:
• The number of words N is drawn from a Poisson distribution
• The topic distribution is drawn from a Dirichlet distribution
• For each word:
– Choose a topic from the document's multinomial topic distribution
– Choose a word from the multinomial word-generation distribution conditioned
on that topic
The plate notation of the LDA graphical model is given below.
Figure 2.3: LDA notation
W is the observed word; this variable is the only one observed in documents, and the
rest are latent variables. Alpha is the parameter of the Dirichlet prior on the
per-document topic distribution. Beta is the parameter of the Dirichlet prior on the
per-topic word generation process. Theta represents the topic distribution of each
document, and Z is the topic assignment of each word in the document. The number of
topics is a variable that has to be chosen manually to suit the statistical
structure of the text and the ultimate goal.
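In symbols, writing φ_k for the per-topic word distribution (which Figure 2.3 parameterizes through β), the smoothed generative process can be summarized as:

\[
\theta_{d} \sim \mathrm{Dirichlet}(\alpha), \qquad
\phi_{k} \sim \mathrm{Dirichlet}(\beta), \qquad
z_{d,n} \sim \mathrm{Multinomial}(\theta_{d}), \qquad
w_{d,n} \sim \mathrm{Multinomial}(\phi_{z_{d,n}}).
\]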
Most natural language models deal with large vocabularies, which raises a serious issue
of sparsity: it is highly likely that new observations contain words that were never
encountered in the training corpus. As maximum likelihood estimates of multinomial
distributions assign zero probability to such words, a smoothing step can be added to the
LDA graphical model to assign probabilities to all words regardless of whether they were
encountered in the training phase.
There are very good properties of LDA that suit it to latent topic detection:
• It is an enhancement of pLSI that uses a Bayesian approach
• It is specifically designed for latent topic detection
• It is more suitable for small datasets, as the Bayesian approach addresses overfitting
• It is highly modular and can easily be changed, extended or tailored to specific
applications
• Distributed and online LDA versions help deal with time costs
There are also some disadvantages to using LDA:
• Learning the parameters is computationally expensive
• There is no guarantee of convergence to the global optimum, which demands multiple runs
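A minimal Gensim sketch (the library described in Section 2.7.3.5) of fitting and querying an LDA model; the tokenized feedback, number of topics and number of passes are placeholders for the values tuned in Chapter 5:

```python
from gensim import corpora, models

# Hypothetical preprocessed feedback: tokenised, stemmed, stop words removed
texts = [["deliveri", "slow", "week"],
         ["size", "guid", "miss"],
         ["deliveri", "late", "order"],
         ["imag", "zoom", "product"]]

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

# num_topics and passes are the kind of parameters tuned in Chapter 5
lda = models.LdaModel(corpus, id2word=dictionary, num_topics=2, passes=50)

for topic_id, words in lda.show_topics(num_topics=2, num_words=4, formatted=False):
    print(topic_id, [word for word, _ in words])

# Infer the topic mixture of an unseen feedback message
print(lda[dictionary.doc2bow(["deliveri", "slow"])])
```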
2.5.4.3 Spectral Clustering
Spectral clustering is a clustering technique that applies dimensionality reduction before
clustering the dataset. As the name suggests, spectral clustering uses the "spectrum"
(eigenvalues) of the data to do the clustering. The eigenvalues are computed from the
similarity matrix that represents the similarity between observations. Conventionally,
data is not transformed using eigenvectors before clustering, but spectral clustering
does exactly that. Spectral clustering is ideal when the structure of the cluster is
highly non-convex, or when the center and the spread of the observations are not a
trivial representation of the cluster (which is very likely for document representations).
Spectral clustering comes in two flavors, normalized and unnormalized. Unnormalized
spectral clustering uses the unnormalized graph Laplacian of the similarity matrix
(Von Luxburg, 2007). The normalized method, based on the normalized graph Laplacian,
comes in different flavors (Shi and Malik, 2000, Ng et al., 2002).
The advantages of spectral clustering are as follows:
• Suitable for clustering problems with non-trivial structure (cluster center and
spread) which is ideal for text data
• Converges to global minimum that guarantees consistent solutions
• Elegant mistake bounds and guarantees
• Kernels can be used to tackle non-linear data landscapes
Disadvantages of spectral clustering are:
• The more general nature of the algorithm does not specifically focus on text data
• Computing eigenvectors is computationally expensive
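For comparison, document clustering with this technique can be prototyped in scikit-learn; the documents below are invented, and the default RBF affinity is only one of many possible similarity choices:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import SpectralClustering

docs = ["delivery was slow", "late delivery again",
        "product images are blurry", "the product photos will not zoom"]

X = TfidfVectorizer().fit_transform(docs).toarray()

# By default an RBF similarity matrix is built from the vectors, and the
# spectrum of the corresponding graph Laplacian is used for clustering
clustering = SpectralClustering(n_clusters=2, random_state=0)
print(clustering.fit_predict(X))
```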
2.5.4.4 Benefits of using LDA for latent topic detection
After careful consideration and objective analysis, the Latent Dirichlet Allocation
algorithm is an ideal choice for latent topic detection, for several reasons. The ideal
choice should be a tool that enables state-of-the-art performance and accuracy while
leaving sufficient room for enhancing and specializing it to the task at hand. When
considering both of these requirements, LDA appears to be the most suitable of the
candidates.
The specificity of the solution: The main use case of plain LDA is latent topic detection;
LDA has been specifically designed to detect hidden topics in natural language. The
graphical model shown in figure 2.3 is specifically designed to model the different
elements of the text generation process, and the underlying distributions of the different
variables and their dependencies map directly to text data generation. For this reason,
LDA is a strong candidate for latent topic detection.
Proven for accuracy: Spectral clustering is a more generic solution that works well for
problems with statistical behavior similar to NLP problems, but LDA is specifically
built to solve the latent topic detection problem. Its closer attention to the details of
the text generation process allows LDA to deliver better performance and accuracy.
Although pLSI is more accurate than LSI, LDA is an enhancement of pLSI: the main
improvement is that LDA is the Bayesian version of pLSI. LDA therefore also handles
overfitting on small datasets better, thanks to its Bayesian formulation.
High modularity: Extending LDA is relatively straightforward, as it is a highly modular
graphical model. Correlated topic models (Blei and Lafferty, 2006) can be built by
modeling the relationships between latent topics, and hierarchical LDAs (hLDA) can be
built by modeling topics in a hierarchy using a Chinese Restaurant Process (Blei et al.,
2004). The abstract model can further be extended into the semi-supervised (Zhu et al.,
2012) and supervised learning (Blei, 2012) settings. Due to the high degree of
extensibility of the LDA model, there is more room for creativity within the solution.
Due to the above reasons, LDA is a great choice for an experimental text analytics project
of this nature. It is a model specifically built for topic detection in text data, and its high
degree of extensibility allows the tool to be tailored further to the problem at hand.
2.6 Topic Consistency Evaluation
Capturing the latent topics (set of words) has always been an interesting research area
in Natural Language Processing. Although statistical evaluation of topic models (using
model perplexity (Wallach et al., 2009)) is reasonably investigated and understood, there
has been much less work on evaluating the semantic quality of the learned topics. Some
efforts use pointwise mutual information to evaluate topic coherence. A very effective
intrinsic evaluation technique uses external resources such as WordNet, Google and
Wikipedia to predict topic coherence; these models are based on link overlap, term
co-occurrence, ontological similarity and various other features. The best performing
approach amongst these uses term co-occurrence within Wikipedia data, based on
pointwise mutual information (Newman et al., 2010).
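A rough sketch of this co-occurrence-based coherence idea is shown below; the tiny reference corpus, the topic words and the absence of smoothing are all simplifying assumptions made for illustration.

import math
from itertools import combinations

reference_docs = [
    {"price", "cheap", "discount"},
    {"price", "delivery", "cheap"},
    {"size", "fit", "small"},
]
topic_words = ["price", "cheap", "delivery"]

def doc_freq(*words):
    # number of reference documents containing all of the given words
    return sum(1 for d in reference_docs if all(w in d for w in words))

N = len(reference_docs)
pmis = []
for w1, w2 in combinations(topic_words, 2):
    joint = doc_freq(w1, w2)
    if joint == 0:
        continue                                   # a real system would smooth instead
    pmi = math.log((joint * N) / float(doc_freq(w1) * doc_freq(w2)))
    pmis.append(pmi)

coherence = sum(pmis) / len(pmis)                  # mean pairwise PMI as the topic score
print(coherence)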
2.7 Tools and Technologies
2.7.1 Amazon Mechanical Turk (MTurk)
Lately, crowdsourcing has become very popular as a way to enhance productivity and
efficiency in many fields, and data collection and classification is one of the applications
at the forefront. Many recent research initiatives use the knowledge of the crowd as one
of the core components of the project; projects such as Zooniverse (zoo, 2012) are a
good example of this trend.
MTurk is an online marketplace that enables individual entities (people and organizations)
to coordinate the use of human intelligence to get their work done. This is one of the
services offered under Amazon Web Services (https://www.mturk.com). The main application
of MTurk is to use distributed human intelligence in tasks that cannot be fulfilled by computers.
In MTurk, a “requester” advertises a job that will be offered to available “workers”.
The requester defines the interfaces and other technology required for data entry by
using the MTurk Application Programming Interface (API). Requesters can further set
qualifications and the price to be paid for the work undertaken. The “workers” browse
the advertised jobs, work on the ones they choose, and get paid for the amount of work
they complete. Mechanical Turk is very popular amongst research communities for
generating the required data economically. Various studies have shown that although
Amazon Mechanical Turk is less successful at representing specifically defined
populations, it is a great tool for data generation using random sampling
(Berinsky et al., Paolacci et al., 2010, Buhrmester et al., 2011). It is also evident that
MTurk is fairly economical for conducting surveys and collecting data, as the cost is
close to half the minimum wage in the US (Horton and Chilton, 2010).
As the project involves topic classification, Amazon MTurk is needed to carry out the
initial classification of the dataset. The feedback observations should be placed in MTurk
to get them classified into different labels. Then this labeled dataset can be used to carry
out supervised topic classification.
2.7.2 Python 2.7 (Programming Language)
Python is mainly a scripting language whose design philosophy emphasizes code
readability. It is a high-level programming language that allows programmers to express
ideas in simpler and fewer lines of code compared to programming languages such as
C++ and Java (Summerfield, 2008, Guido). The Python implementation was started in
1989, mainly led by Guido van Rossum, as a successor to the ABC language.
Python, like many other programming languages today, is a multi-paradigm programming
language. It fully supports the Object Oriented Programming (OOP) and Structured
Programming paradigms. Another main attraction of Python is its support for the
Functional Programming paradigm. The core design philosophy of the language is itself
part of the fascination with this simple yet powerful tool:
• Beautiful is better than ugly
• Explicit is better than implicit
• Simple is better than complex
• Complex is better than complicated
• Readability counts
(Peters, 2008)
Python releases come in three types: backward-incompatible versions, major feature
releases and bug fixes. Backward-incompatible versions are identified by an increase in
the first part of the version number (e.g. Python 2.x vs. Python 3.x); these releases are
not expected to work seamlessly with each other, and code has to be ported manually.
Major feature releases are identified by an increase in the second part of the version
number (e.g. Python 2.6 to 2.7); code is expected to work seamlessly among feature
releases. Bug fixes do not get different version numbers.
Python 2.7 is an excellent choice for the project for various reasons. The main rationale
behind choosing Python 2 is that Python 3 is not yet mature and libraries for Python 3
are scarce. On the other hand, Python 2 has been maturing for close to a decade and
has a wealth of special-purpose libraries. Version 2.7 is ideal because it is the latest
Python 2 version, and the support community around Python 2 is mature and dense.
The special-purpose data handling and machine learning libraries available in Python 2
(such as scikit-learn, pandas and NLTK) are vital to achieving better results.
2.7.3 Special purpose libraries
Python 2 has a large standard library. Having a massive arsenal of tools suitable for
many tasks is one of Python's greatest strengths, a property emphasized by the cliché
“Batteries Included”. A convenient aspect of Python is that it is not essential to include
the full standard library when running or embedding it. Python packages are standardized
and published at the Python Package Index (PyPI, https://pypi.python.org/pypi). As of
July 2014, there are more than 46,800 Python packages indexed in PyPI. The main
functionality covered by Python packages includes:
• System Frameworks: GUI, Web, Multimedia, DataBase
• Support Tools: System Administration, Test Suites, Documentation tools
• Scientific Computing: Numerical, Statistical, Text processing, Machine Learning,
Visualization, Image Processing
Amongst them, the scientific computing range of libraries is essential for text analytics
projects, to carry out the data handling, pre-processing, machine learning, analytical and
visualization components. Some of the libraries used for this project are outlined below.
2.7.3.1 Pandas
Pandas is an open source, high performance set of data structures and data analysis tools
developed for data analysis in the Python programming language (McKinney, 2013). This
library offers functions and data structures for manipulating and managing large datasets
with numerical tables and time-series data. Pandas was born in 2008, when Wes McKinney
started working on a faster and more flexible tool for quantitative analysis of financial
data at AQR Capital Management.
There are some great features in Pandas that make it the ideal choice as a data management
tool.
• A fast and efficient DataFrame object, similar to the DataFrame object found in R
• Intelligent data alignment and flexibility in reshaping data into different forms
• A high performance data merging and managing engine that also incorporates hierarchical indexing capabilities
• The ability to convert datasets from in-memory data structures to CSV, text, SQL and HDF5 data formats and vice versa
• Time series structures and SciPy and NumPy compatibility
The above properties make Pandas the ideal tool for handling large datasets in data
processing and machine learning tasks. The choice is complemented by the range of other
Python libraries that are compatible with pandas data structures (such as scikit-learn and NLTK).
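As an illustration of the kind of data handling described above, the following sketch loads and filters a table of classified feedback; the file name and the column names (workerId, feedback) are hypothetical and not taken from the project files.

import pandas as pd

df = pd.read_csv("mturk_results.csv")                 # load classified feedback records
df = df.drop_duplicates(subset=["workerId", "feedback"])
per_worker = df.groupby("workerId").size()            # number of HITs per worker
active = per_worker[per_worker >= 100].index          # workers with at least 100 HITs
df = df[df["workerId"].isin(active)]                  # keep only those workers' records
df.to_csv("cleaned_results.csv", index=False)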
2.7.3.2 Scikit-learn
Scikit-learn is a machine learning toolkit written in Python. It is an open source project
and is designed to operate seamlessly with the Python numerical libraries NumPy, SciPy
and matplotlib (Bressert, 2012). Scikit-learn features a range of machine learning algorithms
that enable classification, regression, clustering and dimensionality reduction. It also
features additional tools, such as model selection and data preprocessing algorithms, that
complement the machine learning offering.
Scikit-learn started as scikits.learn, a Google Summer of Code project started by David
Cournapeau. The codebase was later rewritten by other contributors. Scikit-learn has
gained wide popularity, is well maintained to this day, and exposes an elegant API to
useful machine learning algorithms (Buitinck et al., 2013).
Scikit-learn is a very useful machine learning library, with algorithms such as SVMs,
Naïve Bayes classifiers, nearest neighbors and many more. This Python library is
particularly suitable for this project as it also contains text-specific preprocessing
algorithms such as stop-word removal, n-gram bag-of-words transformation and tf-idf
vectorization.
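A minimal sketch of the text-classification stack this library provides (tf-idf features feeding a linear SVM) is shown below; the training texts and labels are invented, and the real task in this thesis is multi-label rather than the single-label setting sketched here.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline

texts = ["delivery was very late", "prices are too high", "cannot find my size"]
labels = ["Delivery", "Price", "Size"]

clf = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2), stop_words="english")),  # unigram + bigram features
    ("svm", LinearSVC()),                                                   # linear SVM classifier
])
clf.fit(texts, labels)
print(clf.predict(["the delivery cost is too high"]))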
2.7.3.3 Natural Language ToolKit (NLTK)
The Natural Language ToolKit is a computational linguistics library developed in the
Python programming language. Commonly known as NLTK, this package was initially
developed by Steven Bird, Edward Loper and Ewan Klein to support research and teaching
in NLP (Bird et al., 2008). The toolkit includes various linguistic functions such as
stemmers, lemmatizers, visualizers, graphical demonstrations and sample datasets. It is
further supported by a book explaining the concepts realized by the language processing
tools in the toolkit (Bird et al., 2009), and is also accompanied by a cookbook
(Perkins, 2010).
This toolkit is suitable for this project for numerous reasons. The main one is that it
covers a wide range of natural language processing tasks that are potential elements of
this project; it contains stemmers, lemmatizers, vectorizers and other language tools
that can transform text. The other main feature that complements this project is that
NLTK has a wrapper that allows scikit-learn classifiers to be used through its API.
2.7.3.4 PyEnchant
A main problem that arises when dealing with human-generated text is the numerous
spelling errors that add noise to the data. When humans enter text, they make mistakes
for various reasons such as literacy, ignorance and accidents, but all measures should be
taken to prevent these trivial errors from affecting the analyses. The best way to address
this issue is to use a spell checker to correct trivial errors in the document.
PyEnchant is the Python binding for the Enchant library, which is written in C.
PyEnchant is a generic spell-checking library that can be used to correct spelling errors.
The Enchant project was developed as part of the AbiWord project, and PyEnchant is
maintained by Ryan Kelly. It has the capability to load multiple backends at once, such
as Aspell, Hunspell and AppleSpell, and it lets users access the native spell checker on
various platforms (Mac OS X, Microsoft Office). It also provides the functionality to
load dictionaries, add custom words, query whether a word is spelt correctly and request
correction suggestions for a misspelt word.
The above functionality in PyEnchant enables spell checking and spell correction in
feedback text, which is prone to spelling mistakes. PyEnchant can therefore help the
analyses by correcting spelling mistakes and hence lowering the noise in the dataset.
2.7.3.5 Gensim
Gensim is an open source topic modeling and vector space modeling toolkit developed in
the Python programming language. It uses NumPy and SciPy to optimize data handling
and Cython to increase performance. The main focus of Gensim is on online algorithms
that can handle large corpora, and it has been carefully designed to address the scalability
and memory limitations that were holding back large-scale text analyses
(Řehůřek and Sojka, 2010).
Gensim provides tf-idf vectorizers and other text analysis algorithms such as random
projections, the Hierarchical Dirichlet Process (HDP), Latent Semantic Indexing (LSI) and
Latent Dirichlet Allocation (LDA). Gensim is ideal for this project as it includes all the
tools to vectorize the text and then run different latent topic detection algorithms such as
LSI and LDA. The library complements this ability by also implementing the distributed
and online versions of these algorithms; a good example is the inclusion of the online
LDA algorithm (Hoffman et al., 2010), which will be very useful for this project.
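A minimal sketch of Gensim's bag-of-words and LDA workflow is given below; the tokenized documents and the number of topics are toy values rather than the project's actual corpus or settings.

from gensim import corpora, models

docs = [["delivery", "late", "delivery"],
        ["price", "cheap", "discount"],
        ["size", "small", "fit"]]

dictionary = corpora.Dictionary(docs)                  # word <-> id mapping
corpus = [dictionary.doc2bow(d) for d in docs]         # sparse bag-of-words vectors
lda = models.LdaModel(corpus, id2word=dictionary, num_topics=2, passes=10)
print(lda.show_topics(2))                              # top words for each learned topic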
Chapter 3
Data Collection and
Preprocessing
3.1 Introduction
The first step of analyzing text-based data for topics is to collect the required data and
to preprocess it to be compatible with the analyses. The primary approach towards
building a feedback analysis engine is to build a consistent topic classifier and to build
a strong topic detector.
The classifier will enable classifying new observations into one or more already defined
topics. Building a topic classifier is a supervised learning problem. Labeled observations
are required to solve supervised learning problems. The topic detector will enable the
engine to learn new emerging topics in the system on the go. This system is required to
gain more insight into the meaning of documents that do not belong to any of the topics
already defined in the classifier. This is an unsupervised learning problem as the system
has no prior knowledge about what topics are present in the system (new topics are not
defined). The observations that are being classified as “None of the above” would be
ideal for such a task.
Once the data is collected, it has to go through several preprocessing steps to filter
and improve the effectiveness of the data items. Preprocessing enables extracting extra
information, structuring information better and reducing the background noise in the data
before it is analyzed. In this chapter, a detailed discussion of the data collection process
and its results is presented, complemented by an explanation of the pre-processing steps
used to enhance the user feedback data.
3.2 Data Collection Phase
The initial dataset for building the text analytics engine was extracted from the feedback
collection system implemented by Qubit Digital. This system prompts a client company's
customers for valuable feedback about the company's products and services while they are
browsing its website. Feedback given by the customer is then stored in a database, and
the customer feedback observations for this study are extracted from that database.
3.2.1 Labelling the observations
As the project involves a supervised learning problem, the observations have to be labeled
first. Amazon Mechanical Turk was used to label the observations. This is due to two
main reasons:
1. Enable maximum utilization of the budget
2. Reasonable reliability for the cost incurred in data collection
10 topics were initially defined for the study, namely Delivery, Images, Latency, Navigation,
More Functions, Price, Products, Range, Size and Stock. An additional category called
“None of the above” was added for observations that do not belong to any of the groups.
Altogether there are 11 labels in the classification task.
The classification task was designed to be carried out in the form of a Multiple Choice
Question (MCQ) survey format where the worker has to pick one or more choices (topics)
that he/she believes the observation (feedback text) belongs to. The HIT (Amazon,
2014) would start once the worker starts the job. A feedback observation from the
dataset will be presented to the worker with the 11 topic choices. The user will have
to pick one or more topics (checkboxes are presented) and submit the job. Then he/she
will be presented with another HIT. Most workers will complete multiple HITs in one
session.
As mentioned in section 2.7.1, there is a major concern about using Amazon Mechanical
Turk for dataset classification. As the compensation for labour is relatively low, workers
are likely to be motivated to finish HITs faster in order to earn more money by completing
a larger number of HITs. In the context of MCQ surveys, this might motivate a worker to
select choices at random so as to complete more HITs per unit time. Researchers should
make sure that the workers do not compromise accuracy for high throughput.
In order to address this issue, a trust based heuristic approach is used to quantify the
reliability of workers. This approach will be discussed in detail in section 2.4.
The data collection approach used to generate the relevant dataset for the task is as
follows. First, feedback observations were randomly selected from the customer feedback
database. After this, the data classification task is launched using Amazon MTurk. The
data classification task is run in two major phases:
Initial Trust Modelling phase: In the initial trust modelling phase, the primary focus is on
understanding the reliability of the data classification process. An initial sample of
the selected observations is used to analyze the behavior of the data in terms of worker
accuracy and reliability in MTurk. Depending on the behavior of the workers, a set
of heuristics and metrics is derived to further assess and refine the reliability of
the data collection process. More details about how the trust modelling heuristics
are derived are given in section 2.4.
Controlled Data Collection: Once the heuristics are in place, the full dataset is classified
using MTurk. This process can be assessed continuously while data classification is going
on, so it is possible to control the reliability of the data. Once the data is classified,
the derived metrics can be used to assess the final dataset and clean it further. The
final data collection process and the post-collection cleaning phase are described in
section 3.2.6.
3.2.2 Initial Trust Modelling phase
When using MTurk, requesters should take precautions to make sure that they get
clean data for their analyses. Like every computer system, supervised learning algorithms
follow the Garbage In, Garbage Out (GIGO) principle. Considering the topic
classification task at hand, the cleanliness and accuracy of the labeled data are therefore
extremely vital to the ultimate accuracy of the topic classification engine. The primary
focus of the initial phase is to understand the behavior of worker reliability in MTurk.
One reliable and realistic approach to understanding worker reliability is to run an
empirical analysis of the process, that is, to launch a pilot job and find out how
successfully the HITs run. The structure of MTurk is important for planning the
investigation: in Amazon MTurk, workers log in from all over the world to commit to
HITs and carry them out, and these workers are very likely to be independent of each
other.
The dataset generation process for the initial phase is as follows:
• Initially, 10,000 unique records were randomly selected from the user feedback
database to create a dataset
• Each unique observation in the dataset was replicated twice to generate 20,000
more observations. The dataset therefore consisted of 30,000 observations, with
3 replicates of each unique observation.
• Then the dataset was thoroughly shuffled to randomize the sequence of records.
• This dataset will be called “The Silver Sample”
Ultimately, this dataset was published on MTurk as the initial classification job. Once
20,000 observations were complete, the job was put on hold for a couple of days. This
was done deliberately to attract a new set of workers and thereby increase the diversity
and independence of the results. Once all 30,000 observations had been classified, the
results were collected and analyzed.
The resulting data consisted of 2 files containing 19,998 and 10,000 records respectively.
The records contain numerous fields, of which the most important were extracted and written to
a different file. The new dataset consisted of the worker ID, feedback text and the
classification. Using this data, several analyses related to the topic distribution, worker
scores and question scores were carried out.
3.2.2.1 Topic Distribution of the dataset
When analyzing the topics, we used the ordering of the topics in the User Interface
(UI) in order to see whether the selections are biased towards the top, middle or bottom
topic choices in the UI. The topics are listed in table 3.1.
From figure 3.1, it is evident that most topics are uniformly distributed, except topics
3, 7, 9 and 10. This is natural, as user feedback can come on any topic and it is likely
that some topics are discussed more than others. The observation makes even more sense
given that the topics discussed most often are namely Price, Navigation and More Functions.
It is also realistic that the highest number of records belongs to the “None of the above”
category: customers express feedback about numerous other topics, and do so more
frequently.
The above observations are an early indication that the workers are not biased towards the
obvious choices in the UI, which is a good indicator of their reliability.
Ref Number Topic/Label
0 Delivery and Delivery cost
1 Page loading time, errors
2 Images
3 Price
4 Products
5 Product Range
6 Size
7 Navigation, Search and Design
8 Stock Availability
9 More site functionality
10 None of the above
Table 3.1: Legend of X-axis labels for topic histograms
Figure 3.1: Topic Distribution of the initial dataset
3.2.2.2 Initial Sanity check
As an initial sanity check, a “golden set” of 100 records was randomly sampled from the
dataset and classified manually to check the accuracy of the classification process. This
is a good way to make a quick check of whether the classification process is accurate.
Although a sample size of 100 is reasonably small, in terms of confidence, for a population
of 30,000 records, it is a quick measure that helps gain more knowledge at very little cost
in time and effort. Around 90% of the records in the randomly selected sample were
correctly classified, so it is sensible to take this as another good indicator that the initial
data classification process was successful. With only around 10% misclassifications in the
random sample, it is highly probable that the misclassifications are not deliberate, and
with the accuracy level obtained it is highly unlikely that the workers are motivated by
making fast money and picking choices at random. For this reason, a soft assumption can
be made that the workers are all motivated towards classifying the data correctly and
that the misclassifications are mainly due to misunderstanding the context or other human
error beyond their control.
3.2.2.3 Worker Scoring
As the Mechanical Turk job was set up so that three independent workers classify each
feedback, majority voting can be used to verify the correct answer and hence measure the
consistency of the workers. Since majority voting can partially verify the accuracy of a
classification, the concordance of topic choices for a single observation can be used as a
measure of confidence: more concordance gives more confidence in the worker
classifications. The underlying assumptions for such a claim are:
If a worker selects topic choices that are similar to those of other independent workers
who evaluate the same record,
• The selection is more likely to be correct, as two other independent workers have
chosen the same selection. The odds of such an occurrence happening randomly are
very small when there are 11 choices.
• The worker is very likely to possess more ability in finding the right topics, as he or
she has demonstrated the ability to pick the correct answer. Over repeated choices,
consistently picking the correct answer is clear evidence of the worker's ability to
classify correctly.
On the contrary, if the worker selects topic choices that are different from other workers
who evaluate the same question,
• The worker is very likely to possess less ability in finding the right topics and more
likely to be misclassifying.
• However, this doesn't necessarily mean they are misclassifying, only that it is
unlikely that they are classifying correctly.
With the above-mentioned logical framework in mind, a concordance score based worker
scoring algorithm was built in order to evaluate the reliability of different workers.
In order to formulate a scoring function for worker reliability, the following lists have to
be generated first:
1. RECORDS1: that contains the list of unique feedbacks in the dataset
• Consists of two columns feedback, Indices[index1, index2, index3]
• First column represents the unique feedback record
• Second column gives a list of indices where this feedback record is repeated
(as 3 replicates are present, each index list will have 3 entries)
2. WORKERS1: the unique list of worker IDs in the dataset
• Consists of two columns workerID,Score
• First column represents the unique worker ID given by MTurk to each worker
• Second column holds the computed score for each worker
The algorithm for scoring is presented in pseudo code below:
Algorithm 3.1:
foreach unique feedback in RECORDS1:
load the replicate observations using the indices
create a label count histogram for label occurrences (in all observations)
// to normalize the label count for the number of workers
foreach entry in label histogram:
divide entry by number of workers for that feedback (1,2 or 3)
end for
// now we have normalized scores for each class within the feedback entry
// now lets start scoring workers
foreach observation having feedback value:
initiate tempScore = 0.0
foreach label in the chosen observation:
tempScore += (Normalized label count from histogram)
end for
// normalize the tempScore for number of labels in that observation
tempScore /= (number of labels in observation)
Add the normalized tempScore to the worker’s score in WORKERS1
end for
end for
count the observation entries per worker
normalize the worker score by number of jobs he/she did
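The following is only a compact sketch of algorithm 3.1, not the project code; the observation structure (a dict with worker, feedback and labels fields) is an assumption made for illustration.

from collections import Counter, defaultdict

def score_workers(observations):
    # group replicate observations by their feedback text (the RECORDS1 structure)
    records = defaultdict(list)
    for obs in observations:
        records[obs["feedback"]].append(obs)

    totals = defaultdict(float)                 # accumulated score per worker (WORKERS1)
    jobs = defaultdict(int)                     # number of HITs per worker
    for replicates in records.values():
        hist = Counter(l for obs in replicates for l in obs["labels"])
        n_workers = len(replicates)
        norm_hist = {l: c / float(n_workers) for l, c in hist.items()}
        for obs in replicates:
            # average normalized label count over the labels this worker chose
            score = sum(norm_hist[l] for l in obs["labels"]) / len(obs["labels"])
            totals[obs["worker"]] += score
            jobs[obs["worker"]] += 1

    # normalize each worker's score by the number of jobs he/she completed
    return {w: totals[w] / jobs[w] for w in totals}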
The Python listing below is taken from the project code base; it is the script that computes the LDA topic consistency scores used in the later topic-detection experiments.
import gensim
import pandas as pd
import sklearn
import numpy
import sys
import qbPreprocess as qbPre
import qbPrepare as qbPrepare
import qbReliability as qbRel

# ##############################################################################
# this function computes the inference probability of each topic for each document
def doTopicAnalysis(path, topic, ordData):
    lda = gensim.models.ldamodel.LdaModel.load('data/results/resultSet_{0}'.format(path))
    numTopics = lda.num_topics  # no. of topics in the saved model
    dictionary = gensim.corpora.Dictionary.load('bow.dict')
    mm = gensim.corpora.MmCorpus('data/corpora/corpus_{0}.mm'.format(topic))
    topicDensity = numpy.zeros([len(mm), numTopics])  # to store the average probability values
    y = qbPrepare.generateY(ordData)
    classes = qbPrepare.yTransformer.classes_
    i = 0
    for feedback in mm:
        doc_bow = lda[feedback]          # compute probabilities
        doc_bow = numpy.array(doc_bow)   # convert to numpy array
        vec = numpy.zeros(numTopics)     # new fresh vector
        for entry in doc_bow:
            vec[int(entry[0])] = entry[1]  # fill the vector with probabilities
        # normalize the probability vector
        norm = sum(vec)
        vec /= norm
        maxP = numpy.argmax(vec)
        topicDensity[i, :] = vec  # assign value
        i += 1
    scatter = numpy.zeros([len(classes), numTopics])
    i = 0
    for topic in classes:
        indices = ordData[ordData.answer == topic].index
        tempMat = topicDensity[indices, :]
        tempVec = numpy.mean(tempMat, axis=0)
        scatter[i, :] = tempVec
        i += 1
    reducedTopicDensity = topicDensity
    temp = lda.show_topics(numTopics)
    file = open('data/topics/resultSet_{0}.txt'.format(path), "w")
    i = 0
    for t in temp:
        file.write('topic:{0}==================================\n{1}\n\n'.format(i, t))
        i += 1
    file.close()
    # print reducedTopicDensity.shape
    return reducedTopicDensity, temp, scatter

# ########### main script #####################################################
# This script computes the topic consistency scores
# main parameters
threshold = 0.9  # threshold for record specificity
ci = 1.0         # confidence interval scale
path = sys.argv[1]
topic = path.split('_')[0]
ordData = qbPre.readDataFrame('data/write/dataSet_None_{0}.csv'.format(topic), None, 0)
# Load the statistics and probabilities
topicDensity, topics, scatter = doTopicAnalysis(path, topic, ordData)
numTopics = topicDensity.shape[1]
# table to store supervised task based results (precision and recall)
vals = numpy.zeros([numTopics, 2])
# table to store the unsupervised task based results
# (lda topic number, positive score, negative score, aggregated score)
valsUn = numpy.zeros([numTopics, 4])
docCount = len(ordData[ordData.answer == topic])
wholeCount = len(ordData)
diff = len(ordData[ordData.answer == 'None'])
# foreach topic
for i in xrange(0, numTopics):
    topicVec = topicDensity[:, i]  # choose inference values for the topic
    wholeCount = len(topicDensity)
    docs = topicVec[topicVec >= threshold]  # docs that are top percentile positives
    docsP = numpy.where(topicVec >= threshold)
    # indices of top percentile positives
    docsIndex = list(docsP)[0]
    notDocs = topicVec[topicVec < 1.0 - threshold]  # docs that are least percentile negatives
    # indices of docs that belong to the same labeled topic
    realIndex = numpy.array(ordData[ordData.answer == topic].index)
    # true positives >> intersection
    tPostives = numpy.intersect1d(docsIndex, realIndex)
    # normalise probabilities in relation to the full distribution
    # (sum of probabilities of all documents equals 1)
    noralised = True  # parameter can be True/False
    if noralised:
        docs /= numpy.sum(topicVec)
        notDocs /= numpy.sum(topicVec)
    valsUn[i, 0] = i                     # LDA topic number
    valsUn[i, 1] = numpy.sum(docs)       # total score of positive documents
    notDocs = 1.0 - notDocs              # linear negative transformation
    valsUn[i, 2] = numpy.sum(notDocs)    # total score of negative documents
    valsUn[i, 3] = valsUn[i, 1] + valsUn[i, 2]  # aggregate of positive and negative score
    # compute precision and recall using labelled dataset for comparison
    precision = float(len(tPostives)) / float(len(docsIndex))
    recall = float(len(tPostives)) / float(len(realIndex))
    # store values for particular topic
    vals[i, 0] = precision
    vals[i, 1] = recall
# normalise the unsupervised positive/negative scores vertically
dNormalised = True  # parameter can be True/False
if dNormalised:
    valsUn[:, 1] /= numpy.sum(valsUn[:, 1])  # positive proportions
    valsUn[:, 2] /= numpy.sum(valsUn[:, 2])  # negative proportions
    valsUn[:, 3] = 2 * valsUn[:, 1] + valsUn[:, 2]  # weighted proportions
# topics with scores
maxP = numpy.argmax(vals[:, 0])      # topic with highest precision
maxPUn = numpy.argmax(valsUn[:, 1])  # topic with highest positive score
# you only need this if the positive values are too close to each other ...
# select the second highest positives
tempValsUn = numpy.delete(valsUn, maxPUn, 0)   # delete highest from the table
max2PUn = numpy.argmax(tempValsUn[:, 1])       # select highest after that -> 2nd
vec = vals[maxP, :]          # load the supervised record for the most precise LDA topic
vecUn = valsUn[maxPUn, :]    # load unsupervised record for the most positive LDA topic
# compare the overall score of the 2 most positive LDA topics
comparer = numpy.zeros([2, 4])
comparer[0, :] = vecUn                     # record with highest positive score
comparer[1, :] = tempValsUn[max2PUn, :]    # record with 2nd highest positive score
# compute standard deviation of positive scores among LDA topics
std = numpy.std(valsUn[:, 1])
step2 = False
maxP2Un = int(comparer[1, 0])
vec2Un = valsUn[maxP2Un, :]
# if difference between first 2 records is smaller than scaled s.d.:
if (comparer[0, 1] - comparer[1, 1]) < ci * std:
    step2 = True
# write stats to file
file = open('data/stats/statistics_{0}.txt'.format(path), "w")
file.write('\n')
for i in xrange(0, numTopics):
    file.write('topic:{0}: {1}\n\n'.format(i, topics[i]))
    file.write('topic:\t{0}\t | precision:\t{1}\t | recall:\t{2}\t\n'.format(i, vals[i, 0] * 100.0, vals[i, 1] * 100.0))
    file.write('topic:\t{0}\t | docs:\t{1}\t | notDocs:\t{2}\t | summary:\t{3}\n\n'.format(i, valsUn[i, 1], valsUn[i, 2], valsUn[i, 3]))
file.write('\n\n Most Consistent topic \n\n')
file.write('topic:\t{0}\t | precision:\t{1}\t | recall:\t{2}\t\n'.format(maxP, vec[0] * 100.0, vec[1] * 100.0))
file.write('\n\n Results \n\n')
file.write('topic:\t{0}\t | docs:\t{1}\t | notDocs:\t{2}\t | summary:\t{3}\n'.format(maxPUn, vecUn[1], vecUn[2], vecUn[3]))
file.write('topic:\t{0}\t | docs:\t{1}\t | notDocs:\t{2}\t | summary:\t{3}\n\n'.format(maxP2Un, vec2Un[1], vec2Un[2], vec2Un[3]))
if step2:
    file.write('values too close!!')
sortedArr = qbRel.pickObs(topicDensity, 10, maxPUn)
topNdocs = ordData.declaration[sortedArr]
file.write('\n\n Top {0} observations \n\n'.format(10))
for i in topNdocs:
    file.write('{0}\n\n'.format(i))
file.close()
When analyzing the results, all workers who had completed fewer than 100 jobs were
ignored, as their observations provide very limited information compared to workers who
had done more jobs. Under this elimination criterion, 10 of the 37 unique workers
remained. The table in figure 3.2 provides summary statistics for those workers.
Figure 3.2: Worker aggregated performance over lifetime
The explanation of the fields is as follows:
Worker ID: the unique worker ID assigned by MTurk (used for privacy reasons)
Jobs: the number of total HITs completed
Max Ratio: the number of times the worker scored the highest score / number of total jobs
Min Ratio: number of time the worker scored least score/ number of total jobs
Mean: the average topic selection of the overall lifetime
Mode: The most frequently selected Topic
Mode Ratio: the ratio (%) the worker selected the most frequent label classification
Median: The median topic selection of the overall lifetime
From this table, we can observe that the majority of the workers have obtained scores close
to 80%, which means that most of their classifications agree with those of other independent
workers. Another observation is that all the workers whose scores are above 80% also have
good Min Ratio values. The similarity between the Mean and Median among all workers
suggests that the feedback text is being distributed uniformly among them for classification.
3.2.2.4 Experience of worker
Further analysis was carried out to see whether the lifetime of the worker (0%-100%) has
an impact on worker performance. Figure 3.3 plots how different workers maintain their
aggregate normalized accuracy score over their lifetime of classifying the silver set.
Figure 3.3: Worker aggregated performance over lifetime
The plot in figure 3.3 depicts how the score of different workers changes with respect to
the number of HITs they carry out. Each line on the plot represents a distinct worker.
To normalize across workers, the number of HITs was converted to a percentage completed
(X-axis), so all workers are represented in terms of their lifetime progress. The Y-axis is
the normalized aggregated score; the Y value represents the exact trust score the worker
has accumulated over his/her lifetime. The smoother lines with detailed movements
represent workers who have done more HITs, while the more abrupt lines with long
straight-line segments represent workers who have done a small number of HITs.
There are several important observations that can be drawn from this plot.
• The workers who have responded to fewer questions tend to do badly
• Most workers tend to do well from the beginning
• The majority of workers achieve better than 75% accuracy
3.2.2.5 Unique Feedback Scoring
Just like assessing worker reliability, it is also possible to assess the complexity of a
feedback observation. By using the same concordance based approach as earlier, the
difficulty level of observations can be measured. The underlying assumption for this
claim is:
If an observation gets the same topic selection from multiple independent workers,
• The selection is more likely to be correct, as three independent workers have chosen
the same selection. Odds are very small for such an occurrence to happen randomly
when there are 11 choices.
• The observation is very likely to be easy to understand as multiple workers have
been able to select the correct classification for it
On the contrary, if an observation gets different topic selections from multiple inde-
pendent workers,
• The workers are very likely to possess less ability in finding the right topics and
more likely to be misclassifying the particular observation
• This also suggests that it is highly unlikely that they are classifying correctly hence
making the observation confusing and difficult.
With the above framework in mind, a concordance score based observation scoring algo-
rithm was built in order to evaluate the difficulty of different feedback observations. In
order to develop the difficulty scores for feedbacks, the following lists have to be defined
first,
1. RECORDS1: that contains the list of unique feedbacks in the dataset
• Consists of two columns feedback, Indices[index1, index2, index3]
• First column represents the unique feedback record
• Second column gives a list of indices where this feedback record is repeated
(as 3 replicates are present, each index list will have 3 entries)
The algorithm for scoring is presented in pseudo code below:
Algorithm 3.2:
foreach unique feedback in RECORDS1:
load the replicate observations using the indices
create a label count histogram for label occurrences (in all observations)
// to normalize the label count for the number of workers
foreach entry in label histogram:
divide entry by number of workers for that feedback (1,2 or 3)
end for
// now we have normalized scores for each class within the feedback entry
// now lets start scoring the feedback value
initiate tempScore = 0.0
foreach label in the label count histogram,
tempScore += (normalized label count from histogram)
end for
// normalize the tempScore for number of labels in that feedback value
tempScore /= (number of labels in histogram)
end for
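A compact sketch of algorithm 3.2 for a single unique feedback is given below; the replicate label lists are toy values used only to illustrate the scoring.

from collections import Counter

def feedback_score(replicates):
    # `replicates` is a list of label lists, one per worker, e.g. [["Price"], ["Price", "Size"]]
    hist = Counter(l for labels in replicates for l in labels)
    n_workers = float(len(replicates))
    norm_hist = {l: c / n_workers for l, c in hist.items()}
    # average the normalized counts over the distinct labels chosen for this feedback
    return sum(norm_hist.values()) / len(norm_hist)

print(feedback_score([["Price"], ["Price"], ["Price", "Size"]]))  # high concordance scores close to 1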
As in the scoring of workers, the observations of workers who had completed fewer
than 100 HITs were discarded. Figure 3.4 shows the histogram of the distribution of
concordance-based difficulty amongst the 10,000 unique feedback observations.
In figure 3.4, the X-axis represents the normalized score for feedbacks. It can span from
0.33 (1/3) – 1.0. The Y-axis represents the frequency of different scores in the dataset.
The histogram takes a step shape, as this is a cumulative histogram.
From figure 3.4, it is observed that most of the scores received are greater than 0.5.
It is also evident that the majority (close to 50%) of feedbacks have received a score
of 1.0, which means they have got full concordance. This means that there is at least
50% chance that a feedback will receive full concordance. It is also strongly visible that
almost 75% of the feedback observations have scored at least 0.5. These statistics are
supportive evidence towards the consistency of the classification process.
Figure 3.4: Step Histogram of Feedback difficulty score
3.2.2.6 Time of Day
It is also worthwhile to see whether the time of day has an impact on the classification.
Sometimes the error rate may not be due to cognition or understanding, but due to
drowsiness, tiredness and other factors. It is therefore interesting to analyze the worker
statistics and plot the worker performance on the silver set in terms of the time of day
they responded. Figure 3.5 shows the exact score of each observation at different times
of day.
Figure 3.5: Score for observations on different time of day (0001h-2400h)
Figure 3.5 plots the score for each observation against the time of day they were captured.
Each dot on the scatter plot represents a distinct observation. The X axis represents
the time of day they were captured (0000h-2400h). The Y-axis represents the score
obtained by the observation using algorithm 3.2. From this plot, it is observable that
there are no trends suggesting a correlation between the hour of day and the performance
of the workers; the scores appear to be scattered uniformly throughout the timespan.
Therefore, it is evident that the time of day has no significant effect on the score of the
observation.
3.2.2.7 Replicated feedback scoring
A selection criterion is formulated to select the most reliable classification amongst the
3 replicates. Rather than depending on the earlier derived worker reliability metric
to compare the 3 replicates, it is more sensible to use criteria that are dependent on
concordance. As in the arguments above, more concordance gives more confidence in
the correctness. The underlying assumptions for this claim are as follows:
• 3 replicates can have different label combinations
• False positive classifications have a significant effect on training as it affects the
decision boundary
• False negatives do not have such an impact as a missing classification doesn’t
impact the decision boundary like a misclassification
• For example, if example A is classified [Price] and example B is classified [Price,
Shipping], the function should score example A more. This is because if label
shipping is a classification mistake, it will have greater negative impact on the
classification model.
The algorithm for scoring is presented in pseudo code below:
Algorithm 3.3:
foreach unique feedback in RECORDS1:
load the replicate observations using the indices
create a label count histogram for label occurrences (in all observations)
// to normalize the label count for the number of workers
foreach entry in label histogram:
divide entry by number of workers for that feedback (1,2 or 3)
end for
// now we have normalized scores for each class within the feedback entry
// now lets start scoring workers
foreach observation having feedback value:
initiate tempScore = 0.0
foreach label in the chosen observation:
tempScore += (normalized label count from histogram)
end for
// normalize the tempScore for number of labels in that observation
tempScore /= (number of labels in observation)
end for
// after scoring, pick the highest scoring observation
compare the tempScore earned by each observation
select the observation with highest tempScore for the final dataset
end for
This algorithm allows each observation to be scored and the most reliable observation,
in terms of label concordance, to be selected.
3.2.3 Selection and Rejection Criteria
The aforementioned analyses give a very good understanding of how the data is generated.
These results can be used to define a few observation selection and rejection criteria.
First, a few rejection criteria were defined to make sure that only the observations
classified by the most reliable workers are selected.
Reject all observations classified by workers who have completed fewer than 100 HITs:
There are a few motivations behind this decision.
• Figure 3.3 suggests that the workers who have completed very few tasks tend to
underperform.
• Lack of continuous involvement may suggest that the workers are not committed
enough to the job.
Reject the first 5% of classifications by all selected workers: Figure 3.3 in section 3.2.2.4
clearly shows that all workers tend to do badly at the beginning of the task, before their
cumulative score starts to climb. The figure further shows that the pivoting point falls
within the 0%-5% interval of their lifetime. The 0%-5% interval can therefore be identified
as a “burn-in” period during which the worker gets accustomed to the nature of the job
and adapts to it.
Reject observations from workers who had a final aggregate score of less than 75%:
Figure 3.3 suggests that all workers with a cumulative score below 75% have been
misclassifying continuously.
Once the observations are eliminated from the initial dataset on the basis of worker
performance, a fair fraction of replicated observations classified by reliable workers still
remains. Rather than depending on the earlier-derived worker reliability metric to compare
the 3 replicates, it is more sensible to use algorithm 3.3 to score the observations.
Algorithm 3.3 uses label concordance to weight observations, which ensures that the
observation with the highest concordance score is selected for the final dataset that will be
used to train the supervised learning model.
As figure 3.5 suggests no time-of-day-based trends, no observations were discarded on this
basis.
3.2.4 Directions for final data collection
Although it is possible to apply the same data collection technique for the remaining data
collection phase, it is very expensive, as 3 HITs have to be allocated for each unique
observation. The remaining budget was sufficient for 45,000 unique HITs; if the earlier
method were used, only 15,000 unique observations would be obtained.
The initial study is clear evidence that worker reliability and performance are satisfactory.
Enforcing measures as strong as those of the initial data collection phase would therefore
be a waste of resources. For this reason, a more lenient data collection methodology was
adopted for the final data collection phase.
Figure 3.4 clearly shows that more than 5,000 unique observations obtained full
concordance. These observations received the same combination of labels in 3 independent
classifications carried out by 3 independent workers. According to the argument in
section 3.2.2.3, a random occurrence of this nature is highly unlikely given that there are
11 choices. For this reason, these observations are assumed to be correctly classified.
Although this is a soft assumption, its validity is highly probable. As this is a soft
assumption, this data will be called the “Silver Sample” in upcoming sections.
These fully concordant observations (observations that received the same classification
in all 3 replicates) are used to keep track of the performance of the final data collection
phase. For the final data collection phase, 40,000 new feedback entries were randomly
selected from the feedback database. No replicates were generated from this dataset.
After that, 5,000 observations were randomly picked from the set of fully concordant
observations. Then, these 5,000 fully concordant observations (Silver Sample) were
sprinkled into the 40,000 record dataset. Both the new records and the silver sample
records were merged into one dataset and shuffled. Shuffling guarantees that the silver
set is properly mixed with the dataset. Therefore, all workers are equally likely to classify
Silver Sample observations during the final data collection process.
This allows the requester to continuously analyze the overall health of the data collection
process by assessing how the current workers are classifying the Silver Sample.
3.2.5 The final strategy for data collection
The final strategy for data collection can be summarized as follows:
1. Publish the classification job in MTurk using the newly created dataset (40,000
new records + 5,000 silver sample)
2. Set a Bachelor's degree as the minimum required qualification for workers
3. Constantly check how the workers are responding to the Silver Sample examples
4. If the performance on the Silver Sample is bad, hold the job.
5. Then restart the classification job after a few days with a new set of workers
3.2.6 Final Data collection phase
Once the dataset was published for classification, 15,000 observations were classified in
2 days. The analysis showed that workers were classifying the silver sample observations
consistently. Therefore the job was not halted. Once the remaining 30,000 records were
classified, results were downloaded and analyzed.
The figure below shows that the workers in the final classification task performed very well.
Figure 3.6: The Score distribution of the silver set observations in final dataset
Figure 3.6 plots the cumulative histogram of the score distribution of the final dataset. The
plot considers the 5,000 observations from the silver sample that were sprinkled into the
final dataset. The X-axis represents the similarity score between the old workers and
the new workers. A score of 0.0 means the new worker classified the observation
completely differently from the 3 replicates of the initial phase, while a score of 1.0 means
the new worker classified the observation exactly as the replicates did. Any score in
between represents a partially similar classification. From figure 3.6, more than 83%
agreement between the new workers' opinions and the replicates is evident. These very
good results strongly suggest that the data classification process is very consistent.
3.3 The Final Dataset
The final Dataset was prepared in the following manner:
• Altogether 50,000 unique feedback records were classified (10,000 + 40,000 - 5,000)
• Observations from the Silver Sample were added
• From partially concordant observations in the initial dataset, the most reliable
observations were selected using Algorithm 3.3 scoring function
• The 40,000 unreplicated observations from the final data collection phase were
added to the dataset.
• Then the observations classified by unreliable workers were discarded using the
criteria specified in section 3.2.3
• Multiple duplicate records were detected across the two classified datasets; only the
most reliable classification of each duplicate was kept, chosen using algorithm 3.3
• 5,133 records were discarded due to above reasons
• The final dataset consists of 44,877 unique observations
3.3.1 The topic distribution of the final dataset
Figure 3.7: Topic Distribution of the final dataset
From the histogram in figure 3.7, it is evident that there is no user interface bias in
the final dataset. The large fraction of “None of the above” observations supports the
trend observed in the initial data collection stage (figure 3.1), and the high frequencies of
Price and Navigation also follow figure 3.1. Therefore, it can be concluded that the data
collection was executed well.
3.4 Preprocessing steps
Once the data is collected and cleaned, the dataset should be prepared for learning
tasks. A number of pre-processing steps are necessary in order to prepare the dataset for
supervised topic classification. The text preprocessing steps used are standardization,
spell correction and stemming, in that order. These steps have to be applied sequentially
for best results.
3.4.1 Text Standardization
Text standardization is used to make sure that all text complies with a single standard.
This is important to ensure that tokens which differ from each other only because of
capitalization, punctuation and other grammatical conventions are collapsed together. It
also helps to make the tokenization process less cumbersome by eliminating exceptional
uses of characters. As part of text standardization, several rules were applied to the text
at hand.
1. All text was converted to Unicode encoding (UTF-8). This is important to convert
all the special character and non-Unicode data into Unicode data which is easy to
manage. Standardization enhances safety by eliminating probable exceptions due
to different encodings
2. All text was converted to lower case. This step is important for word standardiza-
tion.
3. All the special characters in the dataset were removed: any character other than
letters and numbers was replaced by a space. This is important, as special characters
are likely to affect the word tokenization process, and they can also cause runtime
exceptions while transforming data.
Once the three standardization steps above are applied to the dataset, the result is a
UTF-8 encoded text dataset containing only lower-case letters and numbers; any
non-alphanumeric character has been removed, leaving a very clean dataset. Standardization
plays a vital role in text preprocessing, as special characters and improper encoding can
cause many problems in spell correction, stemming and text tokenization.
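A small sketch of these standardization rules is given below; restricting the output to ASCII letters and digits follows the rules above, and the sample sentence is invented.

import re

def standardize(text):
    if isinstance(text, bytes):
        text = text.decode("utf-8", errors="replace")    # force a single (UTF-8) encoding
    text = text.lower()                                   # lower-case everything
    return re.sub(r"[^a-z0-9]+", " ", text).strip()       # replace non-alphanumerics with spaces

print(standardize("Great site!!! But the delivery-cost is WAY too high..."))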
3.4.2 Spell correction
Spell correction is required for this dataset, as the records are customer-generated feedback.
The end users who generate feedback have no formal obligation to create accurate content,
and factors such as the user's level of commitment or urgency lead to trivial spelling errors.
These errors have to be corrected as part of pre-processing to increase the cleanliness of the
dataset; spell correction removes a reasonable amount of avoidable noise from the text data.
For this study, the PyEnchant spell-checking library is used. As the target customer base of
the feedback collection system is Great Britain, the British English dictionary (“en-GB”) was
used as the spell correction reference (AbiWord, 2005).
For spell correction, a misspelled word is always replaced by the first suggestion from the
PyEnchant library (http://pythonhosted.org/pyenchant/api/enchant.checker.html). Although
this may occasionally lead to another erroneous suggestion, it most often suggests the correct
word and therefore helps to clean the dataset reasonably well. When a misspelled word returns
no suggestions, the word is left as it is, uncorrected.
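A sketch of this correction pass using PyEnchant is shown below; tokenization by simple whitespace splitting is an assumption, since the exact tokenization at this stage is not specified above.

import enchant

d = enchant.Dict("en_GB")                                # British English dictionary

def correct_spelling(text):
    corrected = []
    for word in text.split():
        if word and not d.check(word):
            suggestions = d.suggest(word)
            word = suggestions[0] if suggestions else word   # keep the word if no suggestion
        corrected.append(word)
    return " ".join(corrected)

print(correct_spelling("the delivry was very slo"))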
3.4.3 Stemming
Stemming acts as a dimensionality reduction technique that reduces different variants of
the same word to its stem form. The Porter stemming algorithm (Porter, 1980) has been
used to stem the feedback dataset in this study, as it is widely used and known to give
good results. The Porter stemmer is located in the nltk.stem.porter module of the NLTK 2
Python library. No special parameter changes were imposed on the stemming procedure.
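A minimal sketch of this step with NLTK's Porter stemmer (the helper name is illustrative):

from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

def stem_tokens(tokens):
    # Reduce each token to its Porter stem, e.g. "running" -> "run"
    return [stemmer.stem(token) for token in tokens]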
3.5 Preprocessing sequence
The above-mentioned preprocessing steps should be applied in a particular sequence to get
the best results. First, the text standardization step is carried out. This eliminates all
characters except alphanumeric characters and converts the whole corpus to lower case.
Once standardization is complete, the spell correction step is run. As the standardization
step has removed all special characters, there is very little chance of exceptions being
thrown in this phase. During the spell correction phase, erroneous words are replaced with
the closest suggestion from the British English dictionary; if no suggestions are found, the
word is left unchanged. As there are no special characters in the corpus, spell correction is
less challenging. This is why standardization should be carried out before spell correction.
After spell correction, the stemming step is run. As spelling has already been corrected,
stemming will consistently reduce words to their root forms; this is why spell correction has
to be carried out before stemming. Once stemming is complete, the corpus has passed
through all three pre-processing steps.
1 http://pythonhosted.org/pyenchant/api/enchant.checker.html
2 http://www.nltk.org/api/nltk.stem.html
3.6 Preprocessing pipelines
Although the three preprocessing steps mentioned in section 3.4 can generate a cleaner
corpus with reduced vocabulary, spell correction and stemming may discard a fair
amount of diversity from the corpus. For example, there might be special spelling trends
that are not found in standard dictionaries but should be captured. Also, the different
variants of words may be important for better determination of topics. For these reasons,
spell correction and stemming might sometimes harm topic categorization.
Taking these possibilities into account, three pipelines were used to preprocess the corpus
for topic classification. Each pipeline specification is identified by a bit string of length 3,
where each position holds 1 or 0: the first bit is 1 if standardization is applied, the second
if spell correction is applied, and the third if stemming is applied. The three specifications
consist of gradually progressing combinations of preprocessing steps; a sketch combining
the steps is given after the specifications below. Text standardization is fundamental to
preprocessing and is hence applied in all 3 specifications.
Pipeline Specification 100:
1. Only applies text standardization
Pipeline Specification 110:
1. Applies text standardization
2. Then applies spell correction
Pipeline Specification 111:
1. Applies text standardization
2. Is followed by spell correction
3. Finally, stemming is applied
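The three specifications can be composed from the helper functions sketched in sections
3.4.1-3.4.3. The sketch below assumes a simple whitespace tokenizer and is illustrative
rather than the thesis's actual pipeline code.

def preprocess(text, spell_correct=True, stem=True):
    # Specification 100: standardization only; 110 adds spell correction;
    # 111 additionally applies stemming, always in this order
    text = standardize(text)                        # section 3.4.1
    tokens = text.split()                           # whitespace tokenization
    if spell_correct:
        tokens = [correct_word(t) for t in tokens]  # section 3.4.2
    if stem:
        tokens = stem_tokens(tokens)                # section 3.4.3
    return " ".join(tokens)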
Chapter 4
Topic Classification for Labelled
Observations
4.1 Introduction
The dataset has already been classified using MTurk and cleaned, and the final dataset has
been prepared for topic categorization. This labeled dataset is now used to carry out topic
categorization using supervised learning methods. The idea is to start from a linear Support
Vector Machine (SVM) and extend to the kernelized dual form if needed, as justified in
section 2.4.3.3. Kernelization and the extension to the dual form are only necessary if the
simple linear primal form is unable to achieve promising results. Due to the near-linearly
separable nature of natural language features, it is unlikely that the dual form will be
necessary.
4.2 Implementation techniques
There are several considerations when using SVMs for topic categorization. Feature
extraction is an important component: features are mainly extracted in the form of tokens
and then transformed using the TF-IDF transformation. Choosing parameters also plays a
vital part in achieving good results in SVM classification; parameters are involved in the
feature selection and classification phases. In addition, sampling the training and test sets
and cross-validation also play a role in enhancing the reliability of the results.
4.2.1 Extracting Features
First, stop words are removed from the corpus, eliminating trivial words from the dataset.
As described in section 2.4.3, the n-gram Bag of Words model is a very popular method of
text tokenization. Using this feature extraction technique, Unigram, Bigram and
Unigram+Bigram tokenizations are used after stop word removal. Unigram features capture
the effect of individual words in the dataset, while Bigram features retain word order
information. Once tokenization is complete, the TF-IDF transformation is used to weight the
different words; TF-IDF lends high discriminative power to words that occur in a limited
number of documents.
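In scikit-learn, this feature extraction corresponds roughly to a TfidfVectorizer with
different ngram_range settings. The sketch below is an assumed expression of the setup,
not the thesis's actual code; the built-in English stop word list stands in for whatever
stop word list was used.

from sklearn.feature_extraction.text import TfidfVectorizer

NGRAM_SPECS = {
    "unigram": (1, 1),          # individual words only
    "bigram": (2, 2),           # consecutive word pairs only
    "unigram+bigram": (1, 2),   # both feature types together
}

def build_features(documents, spec="unigram"):
    # Stop word removal, tokenization and TF-IDF weighting in one step
    vectorizer = TfidfVectorizer(stop_words="english",
                                 ngram_range=NGRAM_SPECS[spec])
    return vectorizer.fit_transform(documents), vectorizer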
4.2.2 Selecting Features
Regularization plays a big role in feature selection, as it enables model selection by
controlling over-fitting. Regularization is theoretically justified as it imposes Occam's
Razor on the solution model; the Bayesian point of view is similar to imposing specific
priors on the model parameters. Lasso (L1) and Ridge (L2) are the two main regularization
methods applied in supervised learning models. Lasso regularization is used in this
particular problem for feature selection (a configuration sketch follows the list below).
This decision is inspired by a number of properties of Lasso regularization:
• Lasso sample complexity grows only logarithmically with the number of irrelevant
features
• This logarithmic dependence on the input dimension matches the best-known bounds for
various feature selection contexts (Ng, 2004)
• Lasso is well suited to text data, which has a large feature space with many irrelevant
features
• Lasso drives the weights of irrelevant features to exactly zero, which allows complete
elimination of irrelevant features from the model
• Convergence to zero weight is suitable for natural language as irrelevant tokens
can be ignored completely
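In scikit-learn, Lasso-style feature selection can be obtained by giving a linear SVM an L1
penalty, as in the sketch below; this is an illustrative assumption about the configuration
rather than the thesis's exact setup.

from sklearn.svm import LinearSVC

# The L1 penalty drives the weights of irrelevant token features to exactly zero,
# which acts as feature selection; C is the regularization parameter tuned later.
classifier = LinearSVC(penalty="l1", dual=False, C=0.7)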
4.2.3 Selecting Parameters
Certain parameters can be set in SVMs. They are:
• Multi-label vs. multi-class classification: multi-label classification assumes that each
observation can belong to one or more labels (classes). One vs. Rest is the primary method
used for multi-label classification: a binary classifier is built for each label, and each
classifier decides whether an observation carries that label (see the sketch after this list).
• Multi-class classification assumes that each observation belongs to exactly one of multiple
labels (classes). It can be implemented using two main methods. The One vs. Rest classifier
builds N binary classifiers for N classes/labels; in the multi-class scenario, the most
confident label is chosen for the observation. The One vs. One classifier builds N(N-1)/2
classifiers, one for each pair of the N classes/labels. The Crammer and Singer approach
(Crammer and Singer, 2001) is another strategy for multi-class SVM classification.
• In this work, the One vs. Rest classifier is used to conduct multi-label classification,
since each observation can belong to one or more classes/labels.
• Regularization parameter: C is the main parameter of the primal form SVM. C controls
the penalty for misclassification. A bigger C penalizes misclassifications more heavily,
leading to more complex decision boundaries that can over-fit; a smaller C tolerates more
misclassifications and can under-fit. The preferred technique for choosing C is grid search.
• In the scenario where the dual form of the SVM is used, a new choice comes into play,
as a kernel function must be selected. The choice of kernel function is itself a parameter,
and different factors are considered when choosing the right kernel for measuring similarity.
• With the choice of different kernels, additional kernel-specific parameters are added to
the model. For instance, the polynomial kernel adds parameters for the degree of the
polynomial and the trade-off between lower and higher order terms.
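For the multi-label case described in the first bullet, the linear SVM can be wrapped in a
One vs. Rest classifier. The sketch below, including the MultiLabelBinarizer and the toy
label sets, is an assumed setup rather than the thesis code.

from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

# Each observation may carry one or more topic labels (toy example)
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform([["Delivery"], ["Price", "Navigation"]])

# One binary classifier per topic; each decides independently per label
ovr = OneVsRestClassifier(LinearSVC(penalty="l1", dual=False, C=0.7))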
4.2.4 Final Process Pipeline
Multiple experiments were run on topic classification using SVM as the primary learning
algorithm. Three corpora were created from the dataset using the preprocessing pipelines
specified in section 3.6 (specifications 100, 110 and 111). Briefly, the specifications are:
• 100 (Text Standardization)
• 110 (Text Standardization + Spell Correction)
• 111 (Text Standardization + Spell Correction + Stemming)
For each corpus, feature extraction was carried out using the 3 tokenization specifications
described in section 4.2.1. 60% of the total observations were randomly selected as the
training set and the remaining 40% was used as the test set. To ensure the statistical
reliability of the results, 5-fold cross-validation was used during model training. The values
0.1, 0.5, 0.7, 1.0, 1.3 and 2.0 were used to tune the regularization parameter C.
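A hedged sketch of this training loop (60/40 split, 5-fold cross-validation, grid search over
C, scored with Hamming loss) is shown below; X and Y are assumed to be the TF-IDF feature
matrix and the binarized label matrix from the earlier sketches, and the variable names are
illustrative.

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import make_scorer, hamming_loss

# 60% of the observations for training, the remaining 40% held out as the test set
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, train_size=0.6, random_state=0)

param_grid = {"estimator__C": [0.1, 0.5, 0.7, 1.0, 1.3, 2.0]}
search = GridSearchCV(ovr, param_grid, cv=5,
                      scoring=make_scorer(hamming_loss, greater_is_better=False))
search.fit(X_train, Y_train)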
4.3 Results
Once topic classification is carried out, the results have to be assessed before selecting the
best model. The accuracy of the different models was assessed using misclassification error,
which gives a more intuitive and realistic understanding of the results.
4.3.1 Evaluation metric
As the topic categorization problem is treated as a multi-label classification problem, the
Hamming distance between the actual and predicted topic label matrices was used to assess
misclassification error. Hamming loss (Tsoumakas and Katakis, 2007) computes the fraction
of labels that were incorrectly classified.
This is less strict than metrics such as the Jaccard similarity score or subset accuracy; under
subset accuracy, the labels predicted for an observation must be exactly the same as the true
labels for the observation to be counted as correctly classified.
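A toy illustration of the Hamming loss computation with scikit-learn, using made-up label
indicator matrices:

import numpy as np
from sklearn.metrics import hamming_loss

# 3 observations, 4 topics; rows are multi-label indicator vectors
y_true = np.array([[1, 0, 0, 1],
                   [0, 1, 0, 0],
                   [1, 1, 0, 0]])
y_pred = np.array([[1, 0, 0, 0],
                   [0, 1, 0, 0],
                   [1, 0, 0, 0]])

# 2 of the 12 label positions disagree, so the loss is about 0.167
print(hamming_loss(y_true, y_pred))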
4.3.2 Unigram Model Results
The Unigram-based classification model tokenizes the corpus on a word-by-word basis, and
words are treated as independent features. The primary motivation behind building this
model is to analyze whether individual word features have an effect on the topic
classification model. As mentioned above, the models were trained on corpora preprocessed
under the 3 specifications outlined in section 3.6.
The results show that with regularization parameter C = 0.7, 2 of the 3 classification models
achieve their best accuracy. The classification model trained with preprocessing
specification 111 performs best.
Figure 4.1: Best results for Unigram model in specification 111
Topic/Label precision recall f1-score support
Delivery 0.96 0.85 0.90 1106
Images 0.84 0.71 0.77 905
Latency 0.79 0.50 0.61 730
MoreFunc 0.70 0.40 0.51 2187
Navigation 0.79 0.62 0.69 3298
Price 0.91 0.83 0.87 1975
Products 0.69 0.21 0.32 780
Range 0.68 0.36 0.47 1298
Size 0.83 0.77 0.80 971
Stock 0.82 0.58 0.68 752
avg / total 0.80 0.59 0.67 14002
Table 4.1: Topic-wise classifier results breakdown (Unigram 111)
From figure 4.1, it is evident that the best accuracy over the tuning parameter C is obtained
when C = 0.7. It is also evident that all C values give very promising Hamming loss
accuracy results with the standardization + spell correction + stemming specification. The
5-fold cross-validated average Hamming loss accuracy of this model is 93.9%.
The One-vs-Rest result breakdown for the overall classifier is given in table 4.1.
From table 4.1, it is observable that Delivery and Price have performed extremely well in
terms of both precision and recall, and the overall average precision of the individual topic
classifiers is 80%. The evidence shows that almost all the topics perform fairly well with a
linear separator. The support column also shows that the maximum number of support data
points used in the SVM is 3298. For each One vs. Rest classifier, these numbers of supports
are very good values, as they are around 10% of the training set
except for one occurrence. As the training set is 60% of the corpus, 10% of the training
set is approximately 2,700 (45,000 × 60% × 10%).
4.3.3 Bigram Model Results
Bigram tokenization is different from Unigram tokenization in that the Bigram model
tokenizes two consecutive words together as one token. This method preserves word order
information. In some applications, word order features can be very important, as some words
do not make sense individually; the very phrase “Natural Language Processing” is a great
example of this. The bigram features were also fit to corpora processed with the
preprocessing specifications outlined in section 3.6.
Results show that bigram models perform best when the regularization parameter C = 1.0
in all 3 specifications. The results also show that Hamming loss accuracy increases gradually
as additional preprocessing steps are applied to the corpus. According to the results,
specification 111 (Standardization + Spell correction + Stemming) performs best among
bigram models.
Figure 4.2: Best results for Bigram model in specification 111
From the grid-search results in figure 4.2, it is evident that the model performs best when
C = 1.0 with bigram features. The average accuracy using Hamming loss misclassification
error is 92.3% when C = 1.0. Table 4.2 breaks down the accuracy of the label-wise
One-vs-Rest classifiers.
From table 4.2, it is evident that the Delivery and Price classifiers also perform well in the
Bigram feature space. However, recall values are not very promising in the bigram
Topic/Label precision recall f1-score support
Delivery 0.95 0.62 0.75 1130
Images 0.83 0.44 0.58 925
Latency 0.83 0.34 0.48 724
MoreFunc 0.68 0.26 0.37 2223
Navigation 0.79 0.41 0.54 3366
Price 0.91 0.57 0.70 1947
Products 0.67 0.14 0.23 759
Range 0.66 0.17 0.27 1317
Size 0.84 0.49 0.62 934
Stock 0.82 0.41 0.55 711
avg / total 0.79 0.39 0.52 14036
Table 4.2: Topic wise classifier results breakdown of Bigram 111
space in comparison to the unigram space. The numbers of supports are also very
satisfactory, as they are all around the 10% mark.
4.3.4 Unigram + Bigram Model Results
Unigram features identify the importance of individual words, while the Bigram model
preserves word order and emphasizes its importance to model building. Having both unigram
and bigram features in the feature vector enables the final model to use both of these
aspects of the corpus. The main downsides of this approach are:
• The nature of feature extraction double counts words in unigram and bigram forms
• Unigram + Bigram features lead to a very large vocabulary
From the results for unigram + bigram features, it is evident that the Hamming loss based
accuracy gradually increases with increasing levels of preprocessing.
Figure 4.3 shows that specification 111 performs best. When C = 0.7, the model outperforms
other parameter values in all preprocessing specifications. Table 4.3 outlines the
One-vs-Rest result breakdown of the label classifiers.
From table 4.3, it is evident that the Delivery and Price classifiers perform fairly well in
this feature space as well. Recall values are close to the performance in the unigram space.
The numbers of supports are also very satisfactory, as they are all around the 10%
Figure 4.3: Best results for Unigram + Bigram model in specification 111
Topic/Label precision recall f1-score support
Delivery 0.95 0.86 0.90 1087
Images 0.86 0.69 0.77 919
Latency 0.78 0.48 0.60 717
MoreFunc 0.70 0.39 0.50 2229
Navigation 0.79 0.60 0.68 3347
Price 0.91 0.81 0.86 1970
Products 0.69 0.18 0.28 779
Range 0.69 0.34 0.46 1305
Size 0.82 0.77 0.79 901
Stock 0.83 0.60 0.70 745
avg / total 0.80 0.58 0.66 13999
Table 4.3: Topic-wise classifier results breakdown (Unigram + Bigram 111)
mark. This feature space holds the smallest total number of supports, although it only beats
the unigram space by 3 support data points.
4.4 Conclusion
The Support Vector Machine (SVM) algorithm is ideal for the topic classification phase,
mainly due to its suitability for natural text data. The near-linearly separable nature of text
and the suitability of feature selection techniques like Lasso for sparse feature vectors make
SVM an ideal choice. Preprocessing steps also play a vital role for text data; therefore, 3
preprocessing specifications were used to generate 3 corpora from the same user feedback.
Results consistently show that specification 111 (word standardization + spell correction +
word stemming) has performed best in classification.
When comparing the 3 feature extraction techniques, it is evident that Unigram features
and Unigram + Bigram features perform reasonably better. The Bigram-only feature space
performs fairly poorly in comparison; therefore, choosing bigram features alone is not ideal.
Compared to Unigram features, Unigram + Bigram features take longer to train the model;
the difference in training time is close to 0.5 seconds. As the Unigram feature space is
smaller and generates the best results, Unigram features are the best choice for topic
classification using SVMs. Grid search in the unigram feature space provides empirical
evidence that 0.7 is the best choice for the regularization parameter C.
Finally, the SVM model trained on unigram features of the corpus obtained with
preprocessing specification 111 performs best for topic classification. The smaller vocabulary
leads to a shorter training time, and the model achieves 93.9% Hamming loss-based accuracy.
The precision and recall for individual topics are also best in this model, and the number of
support vectors per label is very satisfactory.
Chapter 5
Topic Detection Automation
5.1 Introduction
From the results of figure 3.2 in Section 3.2.2.1 and figure 3.7 in Section 3.3.1, it is evident
that a large fraction of observations does not belong to any of the pre-defined topic labels.
As the results of the topic classification process are satisfactory in terms of the business
requirements, it is worthwhile to attempt to build a topic detection framework that enables
the engine to detect emerging topics that are not already labeled in the dataset. This is
useful for the organization, as it helps it get an idea of the other topics customers give
feedback about, widening the quality of insight clients gain from the feedback data. In the
long run, topic categorization and detection will serve as a robust unified system that adapts
to the dynamic feedback generation landscape.
The machine learning problem at hand in a latent topic detection process is an un-
supervised learning problem. As detailed in section 2.4, it can be viewed as either
a clustering or a dimensionality reduction problem. After careful analysis of the techniques
available for topic detection, LDA is the ideal language model to use for topic detection
due to the reasons detailed in section 2.4.4.4.
The primary objective of the project is to automate the topic detection process as
much as possible. As there is already labeled data in abundance available from the
Mturk classified dataset, there is an opportunity to use this labeled dataset to explore a
potential automation strategy for parameter tuning. The designed experiment attempts
to use the labeled data to tune the ideal set of parameters for extracting topics from the
data at hand. It goes a step further to define a strategy for automatically identifying
strong/consistent topics within the dataset. The experimentation strategy below gives a
better idea of the strategy devised.
5.2 Experimentation Strategy
There are several considerations when using LDA in topic detection. Feature extraction
is an important component of it. Similar to techniques in classification, features are
extracted in the form of tokens. Choosing parameters also plays a vital part in achiev-
ing good results in LDA. As LDA addresses an unsupervised learning problem, tuning
parameters is not as straightforward as in a supervised learning problem, where quantitative
evaluation techniques are readily available. In the case of topic detection, a fairly popular
technique is to manually assess the topic concepts being detected.
5.2.1 First Phase
With labeled data available, there is an opportunity to use the statistical behavior of these
labels to direct the parameter search in the unsupervised case. The method section below
discusses the techniques and rationale in detail. The labeled dataset has 11 topics/labels,
including the “None of the above” observation set. This dataset was used in the first phase
to tune the parameters to detect some of the labeled topics. This phase helps to understand
the ideal LDA parameters appropriate for the statistical nature of the corpus at hand. The
results of the topic inference are evaluated against the labeled dataset; the evaluation metric
is described in section 5.4.1.
5.2.2 Second Phase
The second phase focuses on tuning the threshold parameters that are important in
detecting new topics consistently. For this, 10 distinct datasets are generated using the
primary dataset where each dataset will contain the “None of the above” observations
and observations belonging to one of the 10 remaining labels. This phase of the experiment
trains an LDA model on each of the mini corpora. The objective is to extract the labeled
topic from each dataset.
5.2.3 Third Phase
Using the results from phase 2, a scoring function is built to measure the consistency
of the strongest emerging topic. The same evaluation metric used in phase 2 is used to
direct the experiment along the right path. Parameters such as the specificity threshold of
the scoring function and the consistency evaluation technique have to be tuned.
5.2.4 Fourth Phase
The fourth and final phase of the experiment uses the verified parameters to run topic
detection on the “None of the above” observation set only. Results are manually assessed
using human intelligence and reported.
5.3 Method
Before initiating the major experiment, text has to be preprocessed. Due to the con-
sistent results obtained in all classification experiments, preprocessing specification 111
outlined in section 3.6 is used for feedback preprocessing. After the text undergoes stan-
dardization, spell correction and stemming, the transformed dataset, with stop words
removed, is used to train the LDA. The final dataset used for LDA is unlabeled, as the
labels are discarded before training the model.
5.3.1 Phase 1: Tuning LDA parameters
The primary focus of Phase 1 was to tune the Dirichlet prior parameters for the dataset.
The full dataset is used for this purpose. The parameters being tuned are:
Eta (η): the hyper-parameter of the Dirichlet prior that influences the sparsity of the
topic-word distribution. The bigger the eta, the denser the distribution. A popular value
for this hyper-parameter is 1/(number of LDA topics). In this experiment, grid search was
carried out over the range [0.001, 0.01, 0.03, 0.05, 0.08, 0.09, 0.095, 0.099, 0.1, 0.115,
0.12, 0.13, 0.2, 0.5, 1.0, 10.0].
Alpha (α): the hyper-parameter of the Dirichlet prior that influences the sparsity of the
document-topic distribution. The bigger the alpha, the denser the distribution. A popular
value for this hyper-parameter is 1/(number of LDA topics). In this experiment, grid search
was carried out over the range [0.001, 0.01, 0.03, 0.05, 0.08, 0.09, 0.095, 0.099, 0.1, 0.115,
0.12, 0.13, 0.2, 0.5, 1.0, 10.0].
Number of Passes: the number of iterations the LDA will run. More iterations tune the
probability distributions further, as each iteration step optimizes the distribution
parameters. It is important to run a sufficient number of passes so that the final parameters
converge. Performance was evaluated for several numbers of passes: 50, 70, 100 and 200.
In phase 1, a multilevel grid search is used to train multiple models in parallel. The grid
search permutes the above parameters to find the set that best regenerates the expected
results. Once the models are built, the inference part of LDA is used to infer topic
distributions for individual observations; this topic inference is used to evaluate the
performance of the model. In simpler words, the expected result is the model that recovers
the largest number of topics from the initially labeled dataset.
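A hedged sketch of this grid search with gensim is given below. The toy corpus, the subset
of grid values and the variable names are illustrative assumptions; in the thesis the corpus
would be the full preprocessed feedback dataset and the full parameter ranges listed above
would be searched.

from gensim import corpora, models

# Placeholder for the preprocessed, tokenized feedback corpus
tokenized_docs = [["free", "deliveri", "charg"],
                  ["price", "cheaper", "store"],
                  ["pictur", "imag", "model"]]

dictionary = corpora.Dictionary(tokenized_docs)
bow_corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]

inferences = {}
for alpha in [0.001, 0.03, 0.1, 0.5]:        # subset of the full grid
    for eta in [0.001, 0.03, 0.1, 0.5]:
        lda = models.LdaModel(bow_corpus, id2word=dictionary, num_topics=17,
                              alpha=alpha, eta=eta, passes=200)
        # Per-document topic inference, later compared against the MTurk labels
        inferences[(alpha, eta)] = [lda.get_document_topics(bow) for bow in bow_corpus]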
5.3.2 Phase 2: Simulating an emerging topic
The primary objective of the second phase of the experiment is to use the available
labeled data to tune the threshold parameters for topic detection. The typical scenario
is the one where a new topic starts trending in the dynamic feedback collection system.
It is necessary to build a framework that automatically identifies this topic. According
to the scenario outlined, the following can be assumed:
• The candidate data pool doesn’t belong to any of the pre-defined topics
• Usually, the candidate data pool doesn’t contain multiple strong topics trending
at the same time
• Therefore the new emerging topic represents a fair fraction of the unclassified
observations
The best approach to simulating this scenario with the data at hand is to treat each labeled
topic as a new trending topic within the unclassified observations. To simulate this, 10
distinct datasets are generated. Each dataset contains the “None of the Above” observations
and the observations belonging to one of the remaining topics. The topic labels are
discarded before training the models.
Table 5.1 outlines the number of observations in each dataset.
As the ideal parameters for this phase have already been found in phase 1, the chosen
parameter combination can be used to train the LDA model on each dataset.
5.3.3 Phase 3: Developing a Topic Consistency scoring function
The performance of the model is assessed by how well the final LDA model identifies the
labeled topic in each dataset. Using these results, the threshold parameters and the final
topic consistency scoring are derived.
Topic/Label Positive Documents Negative Documents Total Documents
Delivery 2863 2863 5726
Images 2292 2292 4584
Latency 1848 1848 3696
MoreFunc 4387 4387 8774
Navigation 7627 7627 15254
Price 5520 5520 11040
Products 1826 1826 3652
Range 3092 3092 6184
Size 2078 2078 4156
Stock 1463 1463 2926
Table 5.1: The number of observations in each dataset
The statistical behavior of the results from phase 2 is used to develop a scoring function
that enables detecting consistent topics. Furthermore, the inference probabilities are used
to plot histograms to understand the inference score distribution. This score is used to build
an ideal scoring function that can identify consistent topics. The results are evaluated using
the precision and recall of the inferred documents against the labeled dataset.
5.3.4 Phase 4: Final topic detection
The final phase uses the parameters and scoring function investigated in the earlier phases
to detect emerging topics in the unclassified dataset. The “None of the above” observation
set is modeled using LDA with the parameters chosen in phase 1. Once the topics are
detected, the scoring function is used with the threshold parameters to detect the most
consistent topic. The chosen topic and the observations inferred to belong to that topic
are manually assessed.
5.4 Results
Once the topic detection experiment was carried out in the phases described above, the
results were recorded and evaluated. The labeled dataset was used to evaluate the results of
the first three phases, as they were conducted with the use of the labeled dataset. The final
phase was a completely unsupervised phase that was qualitatively evaluated using human
expert opinion.
5.4.1 Evaluation Metric
As mentioned above, phases 1 and 2 of the experiment were designed with the use of the
labeled dataset. The objective of these phases is to tune the parameters of the model to
regenerate the results of the labeled dataset, that is, to use machine learning to reproduce
the results of the human intelligence task as closely as possible. The best way to evaluate
this is to compare against the labeled dataset itself.
The final outcome of phase 1 is the same set of observations from the labeled dataset
with topic inference probability values. If the inferred probability of an LDA-generated
topic is high for a fair set of observations that belong to the same topic in the labeled
dataset, it is highly likely that the LDA-generated topic resembles the manually labeled
topic. The underlying assumptions for such a claim are:
• It is more likely for labeled topics to be detected, as the dataset contains observations
that belong to pre-defined labels
• It is more likely for a labeled topic to emerge, as those observations are present in
reasonable proportions in the dataset.
• Given such a topic emerges as one of the LDA topics, the observations that belong to
that label in the labeled dataset are very likely to get high inference probability values
• Therefore most of the observations that belong to this particular topic will have
high probability for the same inferred topic and low probability values for other
topics.
• Similarly, observations that do not belong to that label in the labeled dataset will
have low probability values for the particular inferred topic
• If an emerged LDA topic shows the above behavior,
– It represents a topic that maps to one of the topics in the label dataset
– It shows specificity as it only shows good inference probability for that par-
ticular emerged topic
– This is a potential candidate for a consistently emerged topic
In order to evaluate the above behavior, the result set from the LDA is transformed. Firstly,
the results dataset is partitioned into 11 groups of observations, where each group contains
the observations that belong to the same topic/label in the labeled dataset. That is
done using the labeling in the labeled dataset. Once this is done, the mean inferred
probability is computed for each group of observations for each LDA topic.
Given there are:
• N total observations
• L number of labels in the labeled dataset
• K number of LDA inferred topics
This generates an N × K inference matrix, with p(n,k) representing the probability of
observation n belonging to LDA topic k.
After partitioning this inference matrix into L partitions using the labeled dataset, the mean
is computed for each LDA topic within each partition. The final outcome is an L × K matrix,
where p(l,k) represents the mean probability that an observation in group l belongs to LDA
topic k.
Phase 2 uses the same evaluation metric outlined above. The only difference is that 2 labels
are used instead of L:
1. None of the above: the label containing observations that do not belong to any of the
pre-defined topics
2. One of remaining: the label containing observations belonging to one of the remaining
topics
The mean inference for observations in these two label groups is then assessed.
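A small numpy sketch of this evaluation computation, assuming the N × K inference matrix
and the per-observation labels are available; the function name and toy values are
illustrative.

import numpy as np

def mean_inference_by_label(inference, labels):
    # Collapse an N x K inference matrix into an L x K matrix of mean
    # topic-inference probabilities, one row per labeled group
    labels = np.asarray(labels)
    groups = sorted(set(labels.tolist()))
    return np.vstack([inference[labels == g].mean(axis=0) for g in groups])

# Toy example with 3 observations, 2 LDA topics and 2 labels
inference = np.array([[0.9, 0.1], [0.8, 0.2], [0.2, 0.8]])
labels = ["Delivery", "Delivery", "None"]
print(mean_inference_by_label(inference, labels))   # [[0.85 0.15], [0.2 0.8]]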
5.4.2 Phase 1 Results: All topics datasets
This matrix is plotted to investigate topic consistency. The following plots depict how the
most consistent set of parameters emerges, replicating the result behavior described above.
The X-axis in the figures represents the labels found in the labeled dataset; the legend is
given in table 5.2. The Y-axis represents the mean “normalized” probability.
Ref Number Topic/Label
0 Delivery and Delivery cost
1 Images
2 Page loading time, errors (Latency)
3 More site functionality
4 Navigation, Search and Design
5 None of the above
6 Price
7 Products
8 Product Range
9 Size
10 Stock Availability
Table 5.2: The Topic legend for X-axis in inference plots
Figure 5.1: Topic Inference comparison at 100 passes
As the topic inference module in LDA assesses the topic distribution for each observation
independently, the probability values are normalized in terms of the topic distribution per
observation; that is, the sum of all normalized probability values for each observation adds
up to 1.0.
From figure 5.1, it is evident that 3 topics are emerging from the dataset: topics 5, 7 and 12.
Topic 5 shows good inference probabilities on Images, topic 7 aligns with Stock and topic 12
with Delivery.
Figure 5.2 shows the inference performance when the same set of parameters was trained
for 200 passes on the same dataset. More passes help guarantee that the distributions
converge.
Figure 5.2: Topic Inference comparison at 200 passes
Figure 5.2 presents the topic distributions when the model was trained for 200 passes. It is
evident that 4 topics are emerging in this run: topics 6, 7, 8 and 9. Figure 5.3 details the 17
topics detected with LDA for the analysis outlined in figure 5.2.
Mapping figure 5.2 and figure 5.3 together, there is strong evidence that the 4 emerging
topics map well to labels from the labeled dataset. Considering the outcomes of the other
parameter combinations, it is evident that alpha = 0.03 and eta = 0.5 is the most suitable
combination of parameters.
The detected topics map to labels as shown in table 5.3.
Figure 5.3: 17 topics inferred in LDA analysis
Topic in LDA model Topic/Label
6 Stock Availability
7 Size
8 Delivery and Delivery cost
9 Images
Table 5.3: Mapping of LDA topics to Labeled topics
As seen in table 5.3, the parameters are effective, as the model is able to detect 4/10 topics
within the dataset. This is very promising, as the objective of the parameter tuning
experiment is to steer the unsupervised learning process towards predefined results. The
above results suggest that this parameter combination is well suited to model the word and
topic distributions of the dataset at hand.
5.4.3 Phase 2 Results: Individual Datasets
In this phase, 10 datasets are used to simulate the scenario where an individual topic is
introduced to the dataset as a new emerging topic. An LDA model is trained on each dataset
separately to quantitatively analyze the results. The model is trained with the parameter set
found to be successful in phase 1. The computing power required for this phase is
significantly smaller than in phase 1, as the grid search step is not needed here.
Dataset Topic Rank LDA topic Ref. precision recall
Delivery 1 3 0.89 0.57
2 5 0.31 0.04
Images 1 1 0.86 0.55
2 4 0.74 0.24
Latency 1 0 0.87 0.50
2 4 0.44 0.12
MoreFunc 1 2 0.67 0.22
2 1 0.83 0.23
Navigation 1 1 0.67 0.30
2 5 0.72 0.24
Price 1 2 0.19 0.05
2 5 0.90 0.26
Products 1 3 0.50 0.14
2 4 0.66 0.26
Range 1 1 0.89 0.51
2 5 0.28 0.08
Size 1 5 0.80 0.61
2 2 0.19 0.02
Stock 1 2 0.79 0.61
2 0 0.58 0.09
Table 5.4: Summary of performance statistics of emerging topics in different datasets
The models were trained with 300 passes/iterations to guarantee convergence of the
probabilities.
Visual plots are the most intuitive form of evidence that this approach leads to good results.
In these plots, the labeled topics are placed in alphabetical order along the X-axis. The
visual evidence is further backed by the topic word lists and the precision/recall statistics
of the comparison with the labeled results. Table 5.4 summarizes the precision and recall
statistics obtained for the best 2 emerging topics in each dataset.
The full experiment results can be found in the Appendix. The following sections describe
some of the good result sets obtained from the afore-mentioned experiment.
5.4.3.1 Delivery
The following figure represents the mean normalized probability obtained from the Delivery
dataset. This dataset contains observations that belong to the Delivery label and the None
of the above label. The X-axis is mapped alphabetically as:
0: Delivery
1: None
Figure 5.4: Topic Inference comparison at 300 passes in Delivery dataset
Figure 5.4 gives strong evidence that topic 3, among the other topics, is strongly mapped
to the Delivery topic. It shows a very high mean inference probability for the document
collection belonging to the Delivery label and a small mean probability for other documents,
and no other inferred topic shows this behavior. Table 5.5 below outlines the topic statistics
for this model.
Table 5.5 clearly shows that topic 3 has 89% precision and 57% recall, compared to the rest
of the candidate LDA topics, which have around 30% precision and around 4% recall. This
gives strong evidence that topic 3 maps directly to the topic labeled “Delivery” in the
labeled dataset. Topic 3 detected using the LDA model is as follows:
LDA topic ref. precision recall Positive Score Negative Score
0 0.29 0.03 0.14 0.18
1 0.03 0.01 0.16 0.17
2 0.37 0.06 0.16 0.17
3 0.89 0.57 0.22 0.11
4 0.32 0.04 0.15 0.18
5 0.31 0.04 0.17 0.18
Table 5.5: Summary statistics of Delivery dataset
0.041*deliveri + 0.034*free + 0.022*ship + 0.016*store + 0.013*order +
0.011*pleas + 0.011*cheaper + 0.010*charg + 0.010*make + 0.009*deliv
The above topic clearly contains the keywords most common in observations related to
deliveries. Therefore, one can be very confident that the topic detection model has worked
very well for the Delivery dataset.
5.4.3.2 Images
The following figure represents the mean normalized probability obtained from the Images
dataset. This dataset contains observations that belong to the Images label and the None of
the above label. The X-axis is mapped alphabetically as:
0: Images
1: None
Figure 5.5 shows a strong emergence of topic 1 mapping to the Images topic in the labeled
dataset. The behavior is very similar to that of the Delivery topic presented in figure 5.4,
giving strong evidence that emerged topic 1 depicts the topic “Images”.
Topic 1 detected using the LDA model is as follows:
0.018*pictur + 0.013*like + 0.013*shoe + 0.013*imag + 0.013*view + 0.013*prod-
uct + 0.012*cloth + 0.012*color + 0.011*model + 0.011*item
The above topic clearly contains keywords such as picture, image, view and colour that are
most common in observations related to Images. Therefore, one can be very confident that
the topic detection model has performed well on the Images dataset.
Figure 5.5: Topic Inference comparison at 300 passes in Images dataset
5.4.3.3 Stock
The following figure represents the mean normalized probability obtained from the Stock
dataset. This dataset contains observations that belong to the Stock label and the None of
the above label. The X-axis is mapped alphabetically as:
0: None
1: Stock
Figure 5.6 looks reasonably different from figures 5.4 and 5.5 because the label None is
plotted at position 0 on the X-axis here. The plot shows that topic 2 has a very high mean
inferred probability for documents that belong to the class Stock in the labeled dataset,
while the same topic shows a very small probability for the None group. All the other LDA
topics show behavior similar to figures 5.4 and 5.5. This is strong evidence that LDA topic 2
in the plot resembles the label “Stock”.
Table 5.6 below outlines the topic statistics for this model.
Figure 5.6: Topic Inference comparison at 300 passes in stock dataset
LDA topic ref. precision recall Positive Score Negative Score
0 0.58 0.09 0.18 0.18
1 0.22 0.02 0.15 0.18
2 0.79 0.61 0.25 0.09
3 0.18 0.02 0.16 0.19
4 0.36 0.01 0.09 0.19
5 0.06 0.01 0.16 0.16
Table 5.6: Summary statistics of Stock dataset
5.4.3.4 Size
The following figure represents the mean normalized probability obtained from the Size
dataset. This dataset contains observations that belong to the Size label and the None of
the above label. The X-axis is mapped alphabetically as:
0: None
1: Size
Figure 5.7: Topic Inference comparison at 300 passes in size dataset
Figure 5.7 looks reasonably different from figures 5.4 and 5.5 but similar to figure 5.6,
because the label None is plotted at position 0 on the X-axis here. The plot shows that topic
5 has a very high mean inferred probability for documents that belong to the class Size in
the labeled dataset, while the same topic shows a very small probability for the None group.
All the other LDA topics show behavior similar to figures 5.4 and 5.5. This is strong
evidence that LDA topic 5 in the plot resembles the label “Size”.
One can see in phases 1 and 2 that the most consistent topic, the one mapping to a labeled
topic in the labeled dataset, leads to high inference probabilities for its respective
observations, compared to very low values for the rest of the records. Therefore, it is
reasonable to use this behavior to distinguish the consistent topic from the rest: higher
probability values give more confidence in topic consistency. The underlying assumptions
for such a claim are:
• A consistent topic will have a reasonable quantity of observations that belong to the
topic; therefore, the number of documents belonging to the consistent topic is likely to
be high
• Due to the unsupervised nature of the learning process, the inferred probabilities will
not be extremely accurate compared with the labeled mapping
• Observations that belong to the consistent topic will have high inference probability
over that topic. The inference probability of observations belonging to a consistent
topic is more likely to be high
• The inference probability for documents that do not belong to this topic is likely
to be quite small.
Due to these assumptions, the consistency scoring function should consider the following
factors:
• The probabilities of the documents belonging to a LDA detected topic
• Specificity is important in selecting observations for scoring
• The number of documents belonging to that topic
• The probabilities of documents that do not belong to that topic
By considering the above factors, the final consistency scoring function was designed. In
order to formulate the scoring function, the following structure has to be generated first:
TOPIC SCORES: contains the inferred topic distribution for each observation in the dataset.
It:
• Consists of N rows, one for each observation
• Consists of K columns, one for each LDA topic
• Where P(n,k) represents the inference probability of observation n belonging to LDA
topic k
The above phenomenon can be empirically observed by plotting the histograms of prob-
ability values for topic datasets.
From figures 5.8, 5.9, 5.10 and 5.11, one can clearly see that 0.9 is a suitable threshold for
separating positive and negative documents. Documents with an inference probability
exceeding 0.9 are considered to belong to an LDA topic, and documents with an inference
probability lower than 0.1 are considered not to belong to that topic. The probabilities can
then be used to score the topics. From the same figures it is evident
Figure 5.8: Topic Inference Histogram at 300 passes in Delivery dataset
Figure 5.9: Topic Inference Histogram at 300 passes in Images dataset
Figure 5.10: Topic Inference Histogram at 300 passes in Size dataset
Figure 5.11: Topic Inference Histogram at 300 passes in Stock dataset
that the most consistent topic has a large cumulative frequency (area under the histogram)
for documents that belong to the topic (positive documents). In other words, the most
consistent topic gains a high frequency of documents that obtain high probability values for
that particular topic. Therefore, the sum of positive probabilities is a good indicator of
topic consistency.
In order to formulate a scoring function for topic consistency, the following data is used
as input:
• DATASET: consists of all the documents and their MTurk label
• TOPIC DENSITY: contains the inference probability values for each topic for each
document
– Consists of K columns, one for each LDA topic
– Consists of N rows, one for each document
A scoring algorithm was developed using this mechanism. This algorithm is outlined in
pseudo code.
Algorithm 5.1:
select TrueDocs where Docs belong to the labeled topic in DATASET
foreach topic t in LDA model:
    select TopicVector of probabilities for topic t from TOPIC DENSITY
    select PositiveDocs where probability >= 0.9
    select NegativeDocs where probability <= 0.1
    select TruePositives where (PositiveDocs intersection TrueDocs)
    normalise PositiveDocs in terms of TopicVector
    normalise NegativeDocs in terms of TopicVector
    compute sum of scores for PositiveDocs
    transform NegativeDocs score with linear negative transformation
    compute sum of scores for NegativeDocs
    compute precision for topic t
    compute recall for topic t
    store PositiveScore, NegativeScore for topic t
normalise PositiveScore over all topics
select topic with highest PositiveScore
select topic with second highest PositiveScore
if the scores are more than 1 standard deviation apart:
    report highest topic as the consistent topic
else:
    report both highest and second highest topics
As algorithm 5.1 depicts, the score of the positive documents is used to automate the
selection of the consistent topic. The phase 3 experiment was run with a threshold value of
0.9, with the scoring function based on the sum of scores. Frequency information is lost if
the mean score is used, because the score gets normalized regardless of how many positive
observations are present; the sum is suitable for this reason. Table 5.7 outlines the 2 most
consistent LDA topics for all 10 datasets.
A Python sketch following algorithm 5.1 is outlined below; the variable names and exact
data structures are illustrative assumptions rather than the thesis implementation.
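The sketch uses numpy and assumes a boolean mask marking the labeled documents:

import numpy as np

def score_topics(topic_density, true_doc_mask, hi=0.9, lo=0.1):
    # topic_density: N x K matrix of normalized inference probabilities
    # true_doc_mask: boolean array marking documents carrying the labeled topic
    n_docs, n_topics = topic_density.shape
    results = []
    for t in range(n_topics):
        vector = topic_density[:, t]
        positive = vector >= hi                    # documents assigned to topic t
        negative = vector <= lo                    # documents clearly not in topic t
        true_pos = positive & true_doc_mask
        pos_score = vector[positive].sum() / n_docs            # normalized positive sum
        neg_score = (1.0 - vector[negative]).sum() / n_docs    # negative linear transform
        precision = true_pos.sum() / max(positive.sum(), 1)
        recall = true_pos.sum() / max(true_doc_mask.sum(), 1)
        results.append((t, pos_score, neg_score, precision, recall))
    # Report the topic with the highest positive score; if the runner-up is within
    # one standard deviation, report both candidates
    results.sort(key=lambda r: r[1], reverse=True)
    scores = np.array([r[1] for r in results])
    if len(results) > 1 and scores[0] - scores[1] <= scores.std():
        return results[:2]
    return results[:1]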
Once topic detection automation was run on the 10 different datasets, the labeled topic was
automatically detected in 6/10 datasets, and the labeled topic was among the 2 most
consistent topics in all 10 datasets. One can conclude that these results are very satisfactory
for using information from the same dataset to evaluate topic consistency. The ideal
threshold for specificity is 0.9, and the sum of scores is a good metric to assess topic
consistency, as it represents both the frequency and the score.
5.4.4 Phase 4 results: None of the above dataset
When LDA is run on the “None of the above” dataset, topics are derived and manually
evaluated. The score statistics of the detected topics are outlined in table 5.8.
It is clearly evident from Table 5.8 that topic 2 is the most consistent topic in the topic
pool. Topic 2 detected using the LDA model is as follows:
0.036*model + 0.029*cloth + 0.014*fine + 0.014*look + 0.013*wear + 0.013*like
+ 0.012*item + 0.010*ok + 0.010*store + 0.009*better
Dataset Topic Rank LDA topic Ref. Positive Score Precision (%) Difference
Latency 1 0 0.23 86.94 ≤ 1 S.D.
2 3 0.20 16.67
Images 1 1 0.24 85.67 ≤ 1 S.D.
2 3 0.21 6.36
Delivery 1 3 0.22 89.32 > 1 S.D.
2 4 0.17 32.33
MoreFunc 1 2 0.19 66.74 ≤ 1 S.D.
2 1 0.18 83.33
Products 1 3 0.19 50.10 ≤ 1 S.D.
2 3 0.19 50.10
Size 1 5 0.24 80.54 > 1 S.D.
2 2 0.18 19.47
Range 1 1 0.22 89.20 ≤ 1 S.D.
2 4 0.20 17.40
Price 1 2 0.20 19.08 ≤ 1 S.D.
2 4 0.19 26.90
Navigation 1 1 0.22 67.46 > 1 S.D.
2 4 0.18 49.43
Stock 1 2 0.25 79.05 > 1 S.D.
2 0 0.18 57.39
Table 5.7: Summary of score statistics of emerging topics in different datasets
LDA topic ref. Positive Score Negative Score
0 0.18 0.17
1 0.15 0.18
2 0.20 0.16
3 0.18 0.14
4 0.16 0.17
5 0.12 0.18
Table 5.8: Summary statistics of None of the Above dataset
The top 10 observations inferred to fall into the above topic are as follows:
1. improv have a mini clip of a model wear the item of cloth on a runway
so we can see how it look on also the way the item move
2. a pictur of a model wear each piec of cloth it s easier to judg how i d
look on you and what size you should get
3. more detail about the product mayb a video of model wear so we can
understand the textur and how the cloth sit on a bodi
4. i think you should add some model to display the real sight of the cloth
wear on the bodi
5. would like to see the cloth on real peopl to get a better idea of how
they look netaport is a fab exampl of thi
6. could mayb use real life model to model cloth to get a better idea of
how garment would look
7. i like the fact that you can read other peopl s review about the cloth
and there are lot of differ imag of the model includ a video
8. put video of model wear the cloth so the buyer know how it would look
like in real
9. improv get new male model who want to look at bali autonom with
a beard on a fashion site and that other guy that remind me of that
other annoy actor daniel day lewi find some good look chap that look
like they re in echo the funnymen or felt or someth cheer from detroit
10. i think it d be great if you could have the cloth on real peopl and mayb
have a catwalk like the asp websit
The most consistent topic found in phase 4 appears to be related to features. The top 10
observations provide further evidence of this: most of them talk about adding a model and a
real-life image of the product to enhance the experience. Technically, this topic can be
treated as a sub-topic under the More Functions label, but that decision is highly subjective.
Another observation from these results is that there are very similar feedback entries within
the top 10 observations. It is highly likely that these entries were collected from the same
customer during a single web session; according to how the feedback collection system works
(section 1.2.2), a particular customer is likely to be asked for feedback several times.
5.5 Conclusion
From the emerging topic detection experiments, a few conclusions can be drawn. Each
phase of the multi-stage experiment focuses on achieving a particular goal:
1. Tune the LDA parameters for the dataset
2. Tune threshold parameters for emerging topic simulation
3. Derive the suitable scoring function for topic consistency
4. Detect an emerging topic from the None of the above dataset
Phase 1 has been very successful, as the LDA model manages to detect 4/10 pre-labeled
topics by tuning the alpha and eta parameters. Alpha = 0.03 and Eta = 0.5 is the ideal
combination of prior distribution hyper-parameters for the dataset at hand. Phases 2 and 3
also bring successful results, with 6/10 datasets managing to detect the most consistent
emerging topic as the pre-labeled topic for that dataset. The threshold for selecting negative
and positive examples proves effective when set to 0.9. The positive score, based on the
number of positive examples and the inference probability, also performs well. In the final
phase, the new topic “Features” emerges from the “None of the above” dataset, strongly
supported by observations talking about adding a real model and a catwalk. Although the
topic “Features” can be categorized under the “More functions” label, it can also be treated
as an independent label. This final point will be discussed further in the future work section.
Overall, one can conclude that the results from all phases of the latent topic detection
experiments are promising and fruitful. These results give a promising avenue towards
automated emerging topic detection. They also provide a good approach to devising
self-sufficient topic consistency measurement metrics that are independent of external
corpora.
Chapter 6
Conclusion and Future Directions
6.1 Introduction
After careful reference to literature around the text analytics sphere, data was recorded,
manually classified and transformed into carefully preprocessed corpora. Using these
corpora, topic classification models and emerging topic detection pipelines were built
and evaluated for different applications relating to the research problem at hand. After
carrying out a full topic classification task and a 4-phased experiment on automating
the emerging topic detection task, the main research findings are detailed below:
• Initial sanity checks and trust modeling heuristics, such as assessing worker reliability,
can help clean up a crowd-sourced dataset.
• Text pre-processing plays a vital role in achieving best results in topic classification.
• Support Vector Machines (SVM) algorithm with Lasso regularization is a very
good supervised learning technique to use for topic categorization.
• Latent Dirichlet Allocation (LDA) is a very effective probabilistic language model
to detect emerging topics in a text corpus.
• Text labeling can be utilized as an effective tool to drive the parameter tuning in
an unsupervised learning setting.
6.2 Trust modeling for cleaner data
It is very important to understand the statistical behavior of a dataset if crowd-sourcing is
used for classifying data. The possibility that a worker’s motivation is centered on the
monetary reward can bias the worker towards doing more work per unit time, compromising
the accuracy of the resulting work. Therefore, it is essential to devise a method for keeping
track of worker reliability.
Initial sanity checks have shown that the topic distribution of the resultant dataset is
satisfactory. The analyses do not indicate any biases towards the UI structure of the data
collection process. The concordance factor based on the ensemble effect / majority voting
is a very good tool for building heuristics around worker reliability. By making multiple
workers label the same observations, it is possible to automatically estimate the correct
labels for those observations. This information can then be used to assess the reliability of
workers by comparing how different workers respond to these observations.
By observing the results, one can conclude that the overall data collection process is
well executed. More than 80% of the old feedbacks concord perfectly (100%) with the
new ones. Results also suggest the following facts about the data collection phase:
1. Workers who perform a small number of jobs usually underperform
2. Most workers do better than 75% lifetime reliability score
3. Most workers tend to take about 5% of their lifetime to train the job (Burnout)
4. Time of day has no effect on the worker performance
Taking the above findings into consideration, the following conclusions can be considered
justifiable in order to make the dataset cleaner.
1. Workers who haven’t completed more than 100 HITs are unreliable.
2. Removing the first 5% of the HITs from each worker will eliminate the worker
training errors.
3. When multiple opinions are present for a single observation, the most concordant
instance is most likely to be the correct one.
As the dataset is generated from customer feedback, the text is error-prone. One can therefore conclude that text standardization and spell correction are necessary to restore consistency and accuracy in the dataset. Reducing words to their root form also increases accuracy by collapsing multiple forms of the same word into one unique form. As lemmatization requires more computing and memory resources, the simpler stemming approach is ideal. It can also be concluded that the preprocessing sequence is very important for correct execution of the data enrichment process; the right sequence also helps avoid runtime exceptions due to logical errors.
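As an illustrative sketch only, the preprocessing sequence described above (standardization, then spell correction, then stemming) could be wired up as follows. The correct_spelling function is a hypothetical placeholder, since the exact dictionary-based corrector is not reproduced here; the stemmer is NLTK's Porter stemmer.

```python
import re
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def correct_spelling(token):
    """Hypothetical placeholder for the dictionary-based spell corrector."""
    return token

def preprocess(text):
    # 1. Standardize: lower-case and strip non-alphabetic characters.
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    tokens = text.split()
    # 2. Spell-correct each token before stemming, so the stemmer sees
    #    dictionary words rather than misspellings.
    tokens = [correct_spelling(t) for t in tokens]
    # 3. Stem: cheaper than lemmatization, as argued above.
    return [stemmer.stem(t) for t in tokens]
```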
6.3 Topic classification with SVM
SVM is a great algorithm for topic classification, as text feature spaces are mostly linearly separable. The modularity and extensibility of SVMs complement their suitability for this project. After evaluating the topic classification results, one can observe that specification 111 (text standardization + spell correction + word stemming) performs best in all experiments. It can therefore be concluded that the above specification is the ideal preprocessing pipeline for topic classification. From the experimental results, it is also evident that unigram features perform best in topic classification. Lasso regularization is suitable for Bag of Words feature spaces as it shrinks most weights to exactly zero, which is analogous to keyword selection for topics. According to the results, the best model is the unigram feature model with pre-processing specification 111. The ideal regularization weight for this model is C = 0.7. The Hamming loss based accuracy for this model is 94%. The precision and recall on individual topics are also best in this model, averaging 80% precision and 60% recall. The linear space itself achieves satisfactory results in the classification task, so other kernel functions are unnecessary. The number of support vectors on each label is also very satisfactory in this model. One can therefore confidently conclude that this is the ideal SVM model for topic classification in the dataset at hand.
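A minimal scikit-learn sketch of this configuration (unigram bag-of-words, L1-regularized linear SVM with C = 0.7, evaluated via Hamming loss) is shown below. The inputs texts and Y (a binary topic indicator matrix) are assumed to exist, and the one-vs-rest wrapper is an illustrative choice rather than the exact multi-label setup used in the thesis.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC
from sklearn.multiclass import OneVsRestClassifier
from sklearn.metrics import hamming_loss
from sklearn.model_selection import train_test_split

def train_topic_classifier(texts, Y):
    """texts: preprocessed feedback strings; Y: binary topic indicator matrix."""
    vectorizer = CountVectorizer(ngram_range=(1, 1))        # unigram features only
    X = vectorizer.fit_transform(texts)
    X_tr, X_te, Y_tr, Y_te = train_test_split(X, Y, test_size=0.2, random_state=0)

    # L1 (Lasso) penalty drives most keyword weights to exactly zero.
    clf = OneVsRestClassifier(LinearSVC(penalty="l1", dual=False, C=0.7))
    clf.fit(X_tr, Y_tr)

    accuracy = 1 - hamming_loss(Y_te, clf.predict(X_te))
    return clf, vectorizer, accuracy
```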
6.4 Topic Detection with LDA
Once classification of the topics is complete, emerging topic detection is an ideal extension to this project. This is mainly due to the large fraction of unclassified observations in the dataset (observations labeled "None of the above", which do not belong to any of the predefined topics). From the empirical results, it is highly evident that using pre-labeled data to tune the model parameters is a very effective approach. Both the phase 1 and phase 2 experiments show that using pre-defined topics to steer an LDA model is an effective technique for tuning the hyper-parameters of the model. The LDA model in phase 1 manages to detect 4/10 topics in the dataset. Phase 2 manages to detect the emerging topic in 6/10 datasets where the scenario was simulated. From both experiments, one can conclude that alpha = 0.03 and eta = 0.5 form a good combination of hyper-parameters for this corpus. These hyper-parameters are ideal for detecting topics whose word-to-topic weights resemble those of the labeled topics. The results also show that the model converges by 200 passes. From the analysis of the histograms, it is evident that there are unique characteristics in the inference histogram that can be utilized to identify the most consistent topic. It is therefore also possible to conclude that inference probability and the frequency of high-probability observations
are important factors in evaluating the consistency of an LDA topic. When using inference probabilities to score topics, an inference probability of 0.9 is the most suitable threshold for selecting observations that belong to a particular LDA topic.
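A hedged gensim sketch of this setup is given below, using the hyper-parameters identified above (alpha = 0.03, eta = 0.5, 200 passes, 6 topics as in the final phase) and the 0.9 threshold for collecting observations that belong to a topic. The tokenized_docs input is assumed to be the preprocessed corpus.

```python
from gensim import corpora, models

def fit_lda(tokenized_docs, num_topics=6):
    """Fit LDA with the hyper-parameters found above (alpha=0.03, eta=0.5, 200 passes)."""
    dictionary = corpora.Dictionary(tokenized_docs)
    bow_corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]
    lda = models.LdaModel(bow_corpus, id2word=dictionary, num_topics=num_topics,
                          alpha=0.03, eta=0.5, passes=200)
    return lda, dictionary, bow_corpus

def docs_for_topic(lda, bow_corpus, topic_id, threshold=0.9):
    """Indices of documents whose inference probability for `topic_id` exceeds the threshold."""
    selected = []
    for i, bow in enumerate(bow_corpus):
        for t, p in lda.get_document_topics(bow, minimum_probability=0.0):
            if t == topic_id and p > threshold:
                selected.append(i)
    return selected
```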
In the final phase of the experiment, the unclassified observations ("None of the above") were used to train an LDA model with 6 topics. The model, accompanied by the developed scoring function, was able to detect the topic "Features". The highest-scoring observations for this topic also provide further evidence that this topic talks about features of the website. Although one can objectively argue that the newly emerged topic falls under the already labeled topic "More Functions", one can also argue otherwise; this is highly subjective and depends on the objective. However, a noteworthy observation from phase 2 of the latent topic detection experiments (single emerging topic simulation) is that all 6 topics that successfully emerged from the LDA simulations are finely defined topics with a specific scope.
Delivery, Images, Latency, Stock and Size refer to very specific, narrowly scoped topics. On the other hand, More Functions and Product are vaguer topics that represent more general concepts with a wider scope. It is therefore possible to conclude that although LDA accompanied by the scoring function helps to find consistently emerging topics, having more specific topics with a narrow scope helps to achieve better results. This behavior is also evident in the topic classification task: well-scoped topics tend to have better precision than the others.
Finally, looking at all the results and findings, one can confidently conclude that the experiments undertaken in this project were well executed and obtained very satisfactory results.
6.5 Potential Future work
The work undertaken in this project has led to valuable insights within both the topic classification and topic detection tasks. The results have shown that the dataset consists of fairly linearly separable text spaces. The detection phase has uncovered a lot of insight into how a partially labeled dataset can be used to steer the unsupervised learning task at hand. It unveils techniques for using the available labeling to tune the parameters of the unsupervised learning task automatically, without an aimless search for ideal parameter sets.
6.5.1 Topic Classification
There is potential work in classification that can complement the topic categorization and add more value to the feedback content. Sentiment analysis is a sensible future avenue to enrich the content. For a company focused on market analytics and conversion optimization, rather than just knowing what topics their customers are talking about, it is better to get an understanding of whether the feedback is positive or negative. For example, it adds more value to be able to tell clients that customers are expressing negative feedback about Delivery services than merely to say they are talking about Delivery services; it gives a more actionable and descriptive picture. Sentiment analysis is a rigorous research area that is being extensively investigated at present (Wang and Manning, 2012; Glorot et al., 2011). Sentiment analysis also poses more complex research questions than topic categorization, as word sequences and other grammatical features may be significant. There is therefore also potential to devise String Kernels (Lodhi et al., 2002) to capture these features. The heuristics used in evaluating worker reliability can also be used in assessing the labeled dataset required for supervised sentiment analysis.
Before moving towards topic detection, a noteworthy observation in both the topic classification and topic detection tasks is that the models tend not to perform very well on vague topics compared to more specific topics. This poses the interesting question of whether vague topics should be refined further. This is another worthwhile question that has to be investigated to improve the performance of the models.
As the topic classification task is a multi-label classification problem, it is also possible to evaluate the feasibility of using structured topic models (Taskar et al., 2003). This would enable building relationships between topics to gain more insight into co-occurring topics in the dataset.
6.5.2 Topic Detection
In terms of topic detection, there is a spectrum of creative work that can add more value and improve the topic detection process. Taking the vagueness problem into consideration, hierarchical topic models (Blei et al., 2004) can be considered for the latent topic detection task. This would enable building more precise topic models that can break down vague topics into more specific topic concepts that are structured hierarchically.
The topic detection results also suggest a strong presence of detected topics representing sentiments. These topics emphasize keywords such as Love, Like, Good, Very and Everything. Some evidence for this observation is found in:
1. Topic 16 in figure 5.3
2. Topic 0 in Delivery dataset : Appendix
3. Topic 2 in Latency dataset : Appendix
4. Topic 2 in the Navigation dataset, etc.
It would be interesting to see how including words that relate predominantly to sentiment in the stop-word list would impact the outcome of the topic detection process. By doing this, the keywords that associate mainly with sentiment would be removed before the topic detection process starts.
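As a simple sketch of this idea, sentiment-heavy words could be appended to the standard stop-word list before tokens are filtered. The word list below is illustrative only, not the one that would actually be used.

```python
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)  # ensure the standard English list is available

# Illustrative sentiment-heavy words to suppress; not the exact list used in the thesis.
sentiment_words = {"love", "like", "good", "great", "very", "everything", "nice"}
stop_words = set(stopwords.words("english")) | sentiment_words

def filter_tokens(tokens):
    """Drop standard and sentiment-related stop words before LDA training."""
    return [t for t in tokens if t not in stop_words]
```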
6.5.3 Crowdsourcing ++
Throughout the thesis, crowdsourcing has been used very effectively, with techniques and heuristics to refine results. A wonderful feature that is very useful in crowdsourcing information is the ensemble effect, which enables finding the right answer via majority voting. In taking this work forward, this effect is a great tool to employ in order to extend the topic detection framework.
6.5.3.1 Using crowdsourcing to label the emerging topics automatically
Once a new topic has emerged from the dataset, majority voting can be used to choose a label for the new topic. As seen in section 5.4.4, the top-scoring observations for the emerging topic relate to an abstract concept that can be described in one word. A good method to get crowd opinion on the topic label is listed below:
1. Compute inference for the most consistent topic within unclassified documents
2. Select the documents that exceed the inference probability P of belonging to that
topic
3. Select N workers
4. For each worker,
(a) Randomly select K observations from the high-inference observation set
(b) Present them to the worker via an MTurk HIT
(c) Ask the worker to respond by submitting one word that explains the common concept within the K observations
An example scenario for this process is as follows: from the dataset, select the observations with inference probability > 0.9 for the consistent topic. Select 10 workers and give each of them 3 randomly selected observations from the above set. Then ask them to submit a word that describes all 3 observations.
If this is done enough times, one word amongst the multiple responses will emerge as the label for the topic. It is possible to investigate how to select the parameters P, N and K to automate this process. Such a study would automate the topic labeling process with minimal cost.
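A minimal sketch of the sampling and majority-vote steps of this procedure is shown below, with parameter values mirroring the example scenario above. It assumes the worker responses have already been collected; the actual HIT-creation call to the MTurk API is not shown.

```python
import random
from collections import Counter

def build_label_tasks(high_inference_docs, n_workers=10, k=3, seed=0):
    """Assign each of N workers K randomly chosen high-inference observations."""
    rng = random.Random(seed)
    return [rng.sample(high_inference_docs, k) for _ in range(n_workers)]

def choose_topic_label(responses):
    """Pick the most frequent one-word label returned by the workers."""
    counts = Counter(r.strip().lower() for r in responses)
    label, votes = counts.most_common(1)[0]
    return label, votes

# e.g. responses collected from 10 MTurk HITs:
# choose_topic_label(["Features", "features", "functions", "features", ...])
```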
6.5.3.2 Automatically building the decision boundaries for the new topic
Once the topic label is selected, the next objective is to automatically classify documents under the new label. A decision boundary is needed to do this classification. As there are no pre-labeled observations to train a classifier, the labeled data has to be generated, and automating this process would greatly reduce the human effort involved. From the results outlined in section 5.4.3, the histograms in figures 5.8-5.11 show how the inference probabilities are distributed for the consistent topic. These results suggest a potential technique to automate the new label classification process. The process is outlined below:
outlined below:
1. Compute inference for the most consistent topic within unclassified documents
2. Select the documents that exceed the inference probability P of belonging to that
topic as positive documents
3. Select the documents that have lower inference probability than (1−P) as negative
documents
4. Use a semi-supervised technique such as Semi-Supervised Latent Dirichlet Allocation (ssLDA) or label propagation to infer the labels of the documents in between (a sketch of this step follows the list).
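A minimal sketch of step 4, using scikit-learn's LabelSpreading (a variant of the label propagation approach mentioned above, named here as a swap-in rather than the method used in the thesis), is given below. X is assumed to be a document-feature matrix and inference the per-document probability of belonging to the consistent topic; unlabeled documents are marked with -1 as scikit-learn expects.

```python
import numpy as np
from sklearn.semi_supervised import LabelSpreading

def propagate_new_topic_labels(X, inference, p=0.9):
    """Seed positives (inference > p) and negatives (inference < 1 - p),
    leave the rest unlabeled (-1), then propagate labels to the documents in between."""
    inference = np.asarray(inference)
    y = np.full(len(inference), -1)            # -1 marks unlabeled documents
    y[inference > p] = 1                       # confident members of the new topic
    y[inference < (1 - p)] = 0                 # confident non-members
    model = LabelSpreading(kernel="knn", n_neighbors=7)
    model.fit(X, y)
    # transduction_ holds the inferred label for every document;
    # label_distributions_ gives per-document confidences that can drive
    # the crowdsourcing step described below.
    return model.transduction_, model.label_distributions_
```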
From the results of the semi-supervised method, the observations that are close to the
decision boundary can be identified. One can use crowdsourcing to classify these obser-
vations to gain more confidence on the decision boundary. The ideal process is outlined
below:
1. Compute confidence for each observation
2. Select observations with lowest confidence probabilities
3. For each observation:
(a) Select a worker
(b) Present him/her with the observation
(c) Present the newly labeled topic as one of the choices
(d) Present “None of the above” as one of the choices
(e) List some of the remaining labels as alternate choices
(f) Ask the worker to select the most appropriate label for the observation
This process will allow the system to automatically gather more information about doubtful observations near the decision boundary and then automatically update the model with the new information. If collecting feedback from a single worker is unreliable, it is possible to use the same majority voting technique by collecting multiple opinions.
Research and investigation are necessary to understand how to implement such an automated system successfully. Once understood, such a system would immensely complement the text analytics engine that has been developed throughout this thesis. If successfully implemented, the complete system would not require any human intervention at all (except for sanity checks that have to be carried out from time to time).
Figure 6.1 shows how the whole ecosystem would work together as a cycle. The steps labeled green have already been developed through the thesis. The steps labeled black are the potential future work that would complete the self-managed text analytics engine. Such an engine would automatically adapt to newly emerging topics and grow continuously with no human intervention.
Figure 6.1: Final System with future developments
Chapter 7
A change of perspective. . . CISCO
7.1 Introduction
Being offered the chance to join Cisco Systems on a 12-month internship is surely a life-changing event. Being offered it at their headquarters in San Jose, California is a dream almost too good to be true. UCL students are privileged to have this opportunity to pursue an international internship with Cisco Systems in Silicon Valley for one calendar year. Through this year, they are assigned to one of the teams in Cisco and challenged as a normal employee, giving them first-hand experience of working in a large corporation. An offering like this is the perfect opportunity for a graduate student like me to jump-start my career.
7.2 Goals and Objectives
Having finished the taught modules of my master's course in Computational Statistics and Machine Learning, my primary objectives for the Cisco internship were not solely limited to gaining more practical experience in machine learning and data science. There is a massive data center startup ecosystem in London, and UCL graduates are always in demand. Having worked in several research internships while in London, I was also eager to gain full experience in the role of a professional. My primary objectives for the internship were the following:
• Career development
• Industry experience
• Experience Organizational culture
• Networking
7.2.1 Career Development
As a recent graduate preparing for a career path, an organization like Cisco is a perfect place to get that training. The opportunity to work and train with experienced and talented teams gives you training that is very hard to get from a small or medium-sized organization. A company like Cisco gives you a very good framework for engineering within the well-defined processes and management procedures that are vital to address the complexity of the work. There are pre-defined specifications for reports, processes, product life cycles and agile standards that are practiced every day in the office. This experience is very important for growing as a team player and improving your development process. The work is also of a completely different scale from my previous experiences with budding startups. Gaining experience and training in this new domain is one of the main reasons I chose Cisco.
7.2.2 Industry Expertise
Cisco is a key player in the IT industry and a trendsetter and global leader in network infrastructure. Having the chance to work in such a company and play a vital part in realizing their next-generation products is an opportunity that gives me a wealth of experience. I was chosen to work in a Research and Advanced Development team to do research on the Internet of Things (IoT). The Internet of Things being a very recent and booming field, I saw this as an opportunity to develop a unique expertise at such an early stage of my career. Gaining domain expertise in using machine learning for IoT and cyber security as a whole was one of the key objectives of my internship at Cisco.
7.2.3 Learn from the organizational culture
Having worked in several startups and research institutes before, I was very curious to experience and understand the corporate culture of a large corporation like Cisco. Companies like Cisco have to deal with a great amount of complexity when engineering products, due to the many moving parts they have to consider when changing products and services. Due to the scale of the work and workforce involved, a more hierarchical organizational structure is also evident in large corporations. I was always curious to understand and work in such a setting to get a better understanding of handling large-scale projects with high impact and broad scope. One needs to develop the work ethic and decision-making needed to coexist in a complex system like that. Another primary objective of my internship was to develop the work ethic, decision-making and professionalism that would help me grow as a professional in the industry.
7.2.4 Networking
Cisco and Silicon Valley are great places to meet and associate with like-minded people. Cisco itself gives me exposure to many engineers, data scientists, managers and scientists who actively take part in developing next-generation products and services. In addition to this, the Silicon Valley culture enables networking with industry professionals via the wide range of meetups, hackathons, tradeshows, conferences and other social events hosted by the mass of tech companies in the area. Social interaction between engineers in the area is very active, and they share similar technological and engineering interests.
7.3 Background Context
At Cisco, I was assigned to one of the Chief Technology Officers' teams. I was involved in the work for a whole calendar year under the Cisco International Internship Program (CIIP).
7.3.1 Cisco
Cisco Systems, Inc. is a multinational organization headquartered in San Jose, California. Cisco primarily makes network infrastructure equipment and provides services around that domain. The main services provided by Cisco are based around secure collaboration, data centre management, cybersecurity and communication. Cisco's business model is mainly focused on large-scale enterprise customers and large organizations such as governments. Cisco operates numerous engineering offices, with a presence in all corners of the globe.
7.3.2 CIIP
The Cisco International Internship Program (CIIP) is a one-year internship program through which students from 12 international partner universities are brought to live and work in the USA. Students are selected for the program via interviews, and candidates range from undergraduate to doctoral level. During the internship, Cisco takes care of flights, accommodation, wages and other formalities such as bank accounts and visa proceedings. Interns are given an initial induction and then placed in their groups to work for one year as part of that functional unit. The most amazing part of this internship is that the one-year commitment enables the team to use the intern in long-term, high-impact projects. Teams also see interns more as team members than as interns, which challenges the intern to live up to industry expectations. Apart from the technical assistance, the program also gives the international interns a fully-fledged cultural experience in the USA by organizing different events from time to time. The nature of the system allows interns to deliver high-impact outcomes (patents, research papers, etc.), which positions CIIP among the most prestigious internship programs within Cisco.
7.4 My role and responsibilities
At Cisco, I was assigned to the Office of the CTO: Security Business Group (OCTO-SBG) as a Research and Advanced Development Intern. The OCTO-SBG team is headed by the CTO of the Security organization at Cisco. This team mainly focuses on driving thought leadership and strategic development in the Security organization. The main outcomes of the team are Proof of Concept (PoC) prototypes that demonstrate the technical feasibility of new ideas. These PoCs are then reviewed by vice presidents in the organization, who sponsor them from their budgets for productization.
The team consists of:
• Cisco Fellows and Distinguished Engineers who are involved in thought leadership within the organization
• Research and Advanced Development Engineers who help the Distinguished Engineers transform their Proof of Concept ideas into working prototypes
• A Chief of Staff who manages the staff of the CTO (which is the team)
Throughout the internship I worked with one of the Distinguished Engineers in the team to develop data sensors and analytics for Internet of Things (IoT) traffic. Figure 7.1 shows the structure of OCTO-SBG.
Figure 7.1: The structure of OCTO-SBG
7.4.1 My project
Information security and threat defense has come a long way since the Internet's first launch in the late 1960s. The number of wireless devices (not including mobile phones) that operate without human intervention (such as weather centers, smart meters and Point of Sale units) is expected to grow to 1.5 billion by 2014 (Lien et al., 2011). With the miniaturization of devices, the increase in computational power and the advances in energy efficiency, this trend continues towards the Internet of Things (IoT). With this new trend towards extending smart access and control to smaller network units such as mobile devices and sensor networks, industrial and technological sectors are researching heavily into incorporating the opportunities opened up by IoT to improve their business processes and maximize performance. With such technological advances, there is an explosion of devices coming onto the Internet, as "everything" from medical wearables to power plant infrastructure is now on the network. For this reason, there is a strong need to address the security and scalability of both the devices and the Internet.
In the evolving realm of IoT, enterprises are constantly pushing the connectivity domain of the network far beyond conventional IP-based networks to gain control of and visibility into the finer details of IoT devices and sensor networks. The manufacturing sector is one of the domains reaping a very high return on investment through IoT. Cisco's Internet of Everything (IoE) value index finds that the leading nations in IoE have the longest track record in Machine-to-Machine (M2M) and mobility innovation.
7.4.1.1 Problem Background
Although extensive research and development is underway for security and threat defense within the enterprise and the general Internet, the evolution of IoT creates a whole new domain of security-related problems. As much as IoT facilitates new avenues for communication with and control of machines and sensors, it also enables new forms of attacks. As such, the need for securing these communication channels and monitoring them is heightened to allay these potential new attack vectors (Vyncke, 2013). Despite the extensive research going into securing wireless IoT networks (Booysen et al., 2012), there is a strong need to give more attention to securing M2M and Human-to-Machine (H2M) wired networks.
The main goal of M2M communications is to enable the autonomous sharing of information between electronic systems (Niyato et al., 2011). A major challenge in the IoT domain, in contrast to the Internet, is the flexibility of the tools available within this space. The wide array of communication protocols that extend beyond the TCP/IP stack within the industrial domain makes it difficult for currently available threat defense mechanisms to be applied to securing industrial networks.
Numerous feature engineering tools are being developed for threat defense in the IP space, for open source tools such as Snort. There is a high demand for extensive feature engineering for industrial protocols such as SCADA, and for building threat defense models that can analyze and monitor network traffic in IoT-enabled networks. The merging of IP-based networks with non-IP networks also creates an opportunity to understand the relationships between the two levels of the network and to validate the use of these relationships to predict network behavior. With data in abundance, there is a strong need to create a set of rules that can be used to
• gain more visibility into IoT networks in the short run
• conduct anomaly detection and threat defense in the long run
7.4.1.2 The goal of the research project
The primary goal of this research is to understand the statistical behaviors of IoT traffic
and build a threat defense framework around the traffic patterns of M2M “things”. The
objectives of this research are as follows:
• Understand the data requirement of the problem
• Build appropriate sensors to capture the required data
• Try to come up with creative ways to visualize the M2M data through the infor-
mation extracted
• Apply machine learning techniques to gain more visibility into the network
Initially, the research is launched with a full review of all the literature about threat defense in IoT that has been published since IoT gained attention. All the threat models and solutions that have been suggested can be analyzed and reviewed to satisfy objective 1. The next step is to gain an understanding of the Common Industrial Protocol (CIP), the primary network protocol under study.
The packet capture data required for the study (objectives 2 through 4) can be collected using network protocol analyzer software such as Wireshark (https://www.wireshark.org) in an IoT network setting. When the network involves an IT component as well, the analyzer software also allows capturing human application protocols such as SMTP (email). All packets can be captured in .pcap format, which can then be converted to CSV and other useful formats using a tailor-made data sensor built in C++.
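The actual sensor was built in C++ and its details are not reproduced here. Purely as an illustrative stand-in for the .pcap-to-CSV conversion step, a toy Python sketch using the scapy library (an assumption, not the tool used in the project) might look like this:

```python
import csv
from scapy.all import rdpcap   # scapy stands in for the C++/libpcap sensor here

def pcap_to_csv(pcap_path, csv_path):
    """Convert a raw .pcap capture into a simple CSV of per-packet records."""
    packets = rdpcap(pcap_path)
    with open(csv_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["timestamp", "length", "summary"])
        for pkt in packets:
            writer.writerow([float(pkt.time), len(pkt), pkt.summary()])
```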
Figure 7.2: Goals of gaining insight into IoT traffic
Once the sensor is built, data will be recorded and, as an initial step, visualization plots will be built to gain immediate visibility into the IoT network. Supervised learning is then used to build a model to predict the device type of different devices connected to the network (water pumps, actuators, Programmable Logic Controllers, etc.).
As this project mainly deals with the Internet of Things space, my mentor and I had to collaborate with several external partners of Cisco who directly vend products for enterprise manufacturing spaces. For this reason, the project was protected by several Non-Disclosure and Cisco Confidential agreements, and I am not in a position to elaborate much on the details of the project.
7.4.1.3 Results
After carefully studying the CIP protocol specification, I pointed out that there are some important attributes (features) in the protocol that allow us to understand the behaviors of devices. Figure 7.3 shows the packet dissection of the CIP protocol.
We designed a data sensor that can capture these features from the network traffic data; a patent application has already been filed for this method. Once the design was finalized, I built the sensor using C++.
Figure 7.3: CIP Packet
We then used the sensor to validate some sample traffic captures we had in the labs. Using the summary reports I built from the sensor output, we developed a visualization suite that allows an analyst
• to see the different network connections in the factory plant,
• to see the major traffic trends,
• to see the composition of the traffic.
Figure 7.4: Visualization application
Furthermore, I was involved in building a machine-learning-based classifier to identify device types from the attributes of the packets. We formulated a method to fuse the operational layer data with information layer data to derive labels for the device type. We then built a classifier that looks at the operational layer data only (which is visible to the network) and classifies devices. I am not allowed to elaborate on the technical details of the models.
7.4.1.4 Project outcomes
The following are the outcomes of the project:
Patent application:
Method for providing IoE aware Netflow
Patent application:
Method for creating IoE behavioral models using Supervised Learning techniques
Internal Whitepaper:
IoT Sensing and Device Behaviour Modelling
Technical Specification:
Device classification in Indus. networks using Machine Learning
Architectural Specification:
Architecture for IoT aggregated Netflow for CIP protocol
Report:
IoT sensing App Development Report
7.5 Non technical aspects
Apart from the technical experience I received at Cisco, working in Silicon Valley has given me a brand new attitude towards data science and the software industry. Silicon Valley operates on a completely different level from what I have witnessed in my experience in Sri Lanka and my brief work experience in London. Although the technical side of California is immense, the cultural and social experience is also noteworthy.
American culture is quite unique in comparison to European and Asian cultures. It is a very diverse culture where everybody has a place. People are very liberal and welcoming. Apart from the openness of the people and their extraordinary admiration of freedom, unique sports such as American football and ice hockey also complement the experience. The support and friendliness of the people is also very heartwarming. Living in California with 70 other international interns who crave to experience the foreign culture unveils a great opportunity to explore America while making new friends. You get the chance to explore your friends' cultures while you explore America together.
Silicon Valley is the heart of technology in the USA, and the majority of the community here is involved in the IT industry. There are many social avenues that push your knowledge and experience in Silicon Valley. Some examples are:
• Universities
• Meetups
• Companies
• Hackathons
7.5.1 Universities
The primary workforce suppliers to Silicon Valley are located right in the valley itself. Stanford University is located in Palo Alto, a town considered to be part of Silicon Valley itself, and the University of California, Berkeley (UC Berkeley) is right across the Bay. In terms of machine learning, these two universities have some of the best-known machine learning faculty in the world. Cisco itself builds research partnerships with these institutions, and these partnerships enable Cisco to gain insight into the latest research landscape and vice versa.
7.5.2 Meetups
Similar to London, meetups are very popular in Silicon Valley. There is a very dynamic and ambitious crowd constantly pushing the community to learn and educate each other by sharing their experiences with technologies. There are numerous meetups related to machine learning alone that take place on a weekly basis. Some of them are:
• SF Data Science Meetup
• Bay Area Machine Learning Meetup
• NLP Journal club
• Spark Meetup
• Bay Area NLP Meetups
• Bay Area Big Data Meetup
Apart from machine learning and big data, there are other meetups that are very useful.
Some of them are,
• Cyber security Meetup
• Bay Area IoT Meetup
• Quantum computing Meetup
This social setting is complemented by community spaces where engineering and building new things are encouraged; Hacker Dojo is a great example of this. Another amazing aspect of these meetups is that you get to meet brilliant minds. Meeting Professor Trevor Hastie at a meetup about Gradient Boosting Machines and having the privilege to ask him questions personally are some of the most memorable moments that would not have been realized if not for the international internship.
7.5.3 Companies
Silicon Valley is one of the most active tech startup regions, and most of the biggest players in the tech industry are headquartered here. Because of this, there are many promotions and events organized by these companies to grow their image and attract talent. There are events organized for young tech enthusiasts and college graduates from time to time where these companies share the sophistication of the work they do. They educate young engineers to inspire them to work with them and to build up their image. These events are very exciting and rich in knowledge at the same time. You get the opportunity to talk and discuss technical details with the very engineers who built those products.
This provided me the opportunity to understand how some companies use large-scale data processing and machine learning frameworks such as Hadoop and Spark to cope with the scale of work they do. It has personally swayed my research interests towards large-scale machine learning and large-scale natural language processing. I have seen companies such as 0xdata and Databricks use frameworks such as H2O and Spark to build systems that can work with terabytes of data on a daily basis. These personal experiences have triggered an interest in working hard to grow into an expert in large-scale machine learning.
7.5.4 Hackathons
Hackathons impact positively towards shaping a early stage professional. I have devel-
oped a lot of skills relating to but not limited to, managing strict timelines, balancing
between quality and quantity over time, teamwork, learning new things. They are a
great opportunity to be creative and think out of the box with limited resources pro-
vided. Hackathons also provide an excellent setting to learn and try new technologies.
Apart from the technological uplift, you get the opportunity to meet new people with
alike interests and see other people’s work. You also get the opportunity to show your
creations to industry experts, entrepreneurs and investors to get their valuable feedback.
7.6 Benefits
I have gained a wealth of experience and expertise in both software engineering and machine learning by pursuing this project. The benefits are both technical and non-technical.
7.6.1 Technical Benefits
Cisco International Internship Program has helped me grow as a technologist in several
ways. Some of the benefits are listed below:
7.6.1.1 Machine Learning and Large Scale data processing
Through the work I carried out at Cisco, I have learned to think outside the conventional frame of machine learning and to be creative to get the work done. For instance, initial labeling of the data need not necessarily be a human intelligence task; the device label classification is a good example of this. Another valuable experience from my work at Cisco was working with large-scale data processing. Cisco has the capacity to capture and use very large datasets that are very difficult to manage on a personal computer. The datasets can span terabytes, and sequential processing of this data would take weeks to finish. At this scale, a distributed processing paradigm is very effective. Having to work with very large datasets, I gained first-hand experience in using Apache Spark and Apache Hadoop for distributed data processing.
7.6.1.2 Training to manage a large pool of computing resources
Due to the massive scale of the work, a large spool of resources is required to handle
the datasets and computations needed to process these dataset. I have trained myself to
work in a Cisco Unified Computing System (UCS) blade servers that can spin around
20-40 virtual machines at the same time. I got the privilege to manage my own UCS
server to
• Spin Virtual Machines for
– Web servers
– Computing servers
– To install cisco products used in Proof of Concept Demos
– To build virtual clusters
• Build virtual clusters to carry out distributed computing
I have also learned how to manage my resources and maintain my computing resources
with minimal support from Lab administrator.
7.6.1.3 Building Data Sensors
Data doesn’t always come in the most usable form. The ability to do background
research and build the pre-processing tools is a good skill to have. I have learned to
build data sensors that can analyze row packet capture data and output records that
contain sensible information that can be used for model building. Working with network
traffic based anomaly detection at Cisco, I mastered the skill of using external libraries
to build the right sensors that can transform row data into more meaningful record sets.
During the process I learned a fair amount of C++ and also about libPcap, the C++
library to read row packet data.
From this experience, I learned that being a data scientist doesn’t only demand you to
gain domain knowledge, but also go an extra mile to carry out the engineering involved
before the model building process starts. It has also given me experience and exposure
to grow as a technologist who has more grasp of the engineering aspect as well.
7.6.1.4 Learning about Networking, Cyber-security and the Internet of Things
Cisco Systems mainly deals with networking infrastructure, and the OCTO-SBG team that I was placed in focuses on cyber-security. My project was mainly focused on building machine learning and analytics around enterprise IoT networks. Working in this domain, I had the opportunity to learn core networking concepts. I also had to learn about security, including anomaly detection, classification and non-machine-learning knowledge relating to types of threats, encryption, visibility and so on. Having a high focus on enterprise IoT networks such as manufacturing lines, I had the opportunity to understand the protocol specifications in that space and to start building data products around them from scratch, beginning with the sensors to capture the right data. Through this journey, I have filed 2 patent applications for the data sensor and the machine learning techniques used for analytics. I have also become one of the subject experts on the Common Industrial Protocol (CIP).
7.6.1.5 Building data visualization techniques
During my project on gaining visibility into IoT traffic, I was also involved in building an application to visualize enterprise traffic. I learned to use the D3 JavaScript library to visualize the data I generated from packet captures, and I had to read about and understand the human element involved in the process. The most important part of security analytics is to present results in a way that is intuitive to an analyst. Therefore, I had to learn how to present the extracted information in the simplest, yet most information-rich way.
7.6.2 Non-Technical benefits
I also gained many non-technical benefits during my project at Cisco. Some of them can be described as follows:
7.6.2.1 Work Ethic and Discipline
By working in a company like Cisco, I gained first-hand experience of how to operate in a large corporation. Being in a setting where you are minimally managed, I got the opportunity to train myself to be professional and deliver what is expected of me with minimal supervision. Working in a very independent setting where employees can work from a remote office or from home helped me to improve my work discipline for such a setting. Having a personal mentor and weekly meetings with my manager allowed me to always discuss my concerns and progress under their guidance.
7.6.2.2 Personal Skills
After moving to San Jose, I have had the privilege to associate and get to know a
lot of brilliant minds both professionally and personally. I have had the opportunity
to associate with them to share our technological, cultural and personal interests to
complement our knowledge. Getting to spend a year with 70 exciting interns from all over
the world coming from around 15 different nationalities itself has been an unforgettable
experience. Furthermore, my team and fellow colleagues have grown to be my personal friends. I have also had the privilege of working with several Distinguished Engineers who constantly keep in touch with me. I am confident that the personal and professional relationships I have built through this year will help me both personally and professionally in the years to come.
I have also gained a lot of insight into the latest trends in Silicon Valley, and this has had an immense impact on my career prospects. I have had great exposure to and insight into large-scale machine learning and have grown a new interest in enhancing my skillset along this line. I have also played a lead role in shaping IoT-related security products. As Internet of Things networks normally capture a massive amount of data, I am confident my growing interest in large-scale machine learning will complement further work in IoT.
7.7 Conclusion
Cisco Systems is a great choice for an international internship, as it gives a fresh graduate very valuable experience by exposing them to the large-corporation work setting. There is a lot of room to improve your software engineering process and machine learning skills. Cisco is a place where you can grow as a professional, as you can train to discipline yourself with well-defined process cycles and procedures that are put in place to manage the complexity of the work. Being an intern at Cisco, I got the opportunity to train myself in many technical aspects apart from machine learning, as I had to take part in building the data sensors and the visualisation application. Cisco also benefited me because I got the opportunity to use high-performance computing resources and manage them for my work. Beyond Cisco, this internship has given me a huge career boost with the knowledge and exposure I gained by visiting meetups, tradeshows and hackathons in Silicon Valley. I have also made a lot of friends and professional connections that will help me do great things in the future.
Appendix A
Appendix
The Test results in Phase 2 of the LDA experiment
Source Code
Bibliography
R. Meuth, P. Robinette, and D.C. Wunsch. Computational intelligence meets the netflix
prize. In Neural Networks, 2008. IJCNN 2008. (IEEE World Congress on Compu-
tational Intelligence). IEEE International Joint Conference on, pages 686–691, June
2008. doi: 10.1109/IJCNN.2008.4633869.
J.R. Spiegel, M.T. McKenna, G.S. Lakshman, and P.G. Nordstrom. Method and system
for anticipatory package shipping, December 24 2013. URL http://www.google.com/
patents/US8615473. US Patent 8,615,473.
K. K. Ladha. The Condorcet jury theorem, free speech, and correlated votes.
American Journal of Political Science, 36(3):617–634, August 1992. URL
http://www.jstor.org/discover/10.2307/2111584?uid=3738456&uid=2&uid=
4&sid=21104114670901.
A. Anastasi and S. Urbina. Psychological Testing. Prentice Hall, Upper Saddle River,
NJ, USA, 7 edition, 1997.
V. C. Raykar, S. Yu, L. H. Zhao, G. H. Valadez, C. Florin, L. Bogoni, and L. Moy.
Learning from crowds. Journal of Machine Learning Research, 11:1297–1322, April
2010. URL http://www.umiacs.umd.edu/labs/cvl/pirl/vikas/publications/
raykar_JMLR_2010_crowds.pdf.
Y. Bachrach, T. Minka, J. Guiver, and T Graepel. How to grade a
test without knowing the answers - a bayesian graphical model for adap-
tive crowdsourcing and aptitude testing. In 29th International Confer-
ence on Machine Learning, Edinburgh , Scotland, UK, 2012. ICML. url
:http://research.microsoft.com/apps/pubs/default.aspx?id=164692.
M. Danilevsky. Beyond bag-of-words: N-gram topic models.
J. B. Lovins. Development of a stemming algorithm. Translation and Computational
Linguistics, 11(1):22–31, 1968.
C. D. Paice. Another stemmer. SIGIR Forum, 24(3):56–61, 1990.
M. F. Porter. An algorithm for suffix stripping. Program, 14(3):130–137, 1980.
xapian.org. Stemming algorithms. http://xapian.org/docs/stemming.html.
A. Rajaraman and J.D Ullman. Data mining. In Mining of Massive Datasets, pages
1–17. doi:10.1017/CBO9781139058452.002. ISBN 9781139058452.
F. Xia, T. Jicun, and L. Zhihui. A text categorization method based on local docu-
ment frequency. In Fuzzy Systems and Knowledge Discovery, 2009. FSKD ’09. Sixth
International Conference on, volume 7, pages 468–471, Tianjin, China, August 2009.
IEEE.
A. N. K. Zaman, P. Matsakis, and C. Brown. Evaluation of stop word lists in text re-
trieval using latent semantic indexing. In Digital Information Management (ICDIM),
2011 Sixth International Conference on, pages 133–136, Melbourne, QLD, Australia,
September 2011. IEEE.
D. E. Knuth. Seminumerical algorithms. In Art of Programming Vol. 2, page 694.
Addison Wesley, 3 edition, 1998. Knuth also lists other names that were proposed for
multisets, such as list, bunch, bag, heap, sample, weighted set, collection, and suite.
Martin Rehák, Michal Pěchouček, Pavel Čeleda, Jiří Novotný, and Pavel Minařík. CAM-
NEP: agent-based network intrusion detection system. In Michael Berger, Bernard
Burg, and Satoshi Nishiyama, editors, AAMAS (Industry Track), pages 133–136.
IFAAMAS, 2008. ISBN 978-0-9817381-3-0. URL http://doi.acm.org/10.1145/
1402795.1402820.
Wikipedia. tf-idf. URL http://en.wikipedia.org/wiki/Tf%E2%80%93idf. Retrieved on 12 February 2014.
J. Ramos. Using tf-idf to determine word relevance in document queries. URL https://
www.cs.rutgers.edu/~mlittman/courses/ml03/iCML03/papers/ramos.pdf. Re-
trieved on 15 February 2014.
D. M. Blei and Lafferty J. D. Topic models. URL https://www.cs.princeton.edu/
~blei/papers/BleiLafferty2009.pdf. Retrieved on 15 February 2014.
Sida Wang and Christopher D. Manning. Baselines and bigrams: Simple, good sentiment
and topic classification. In ACL (2), pages 90–94. The Association for Computer Lin-
guistics, 2012. ISBN 978-1-937284-25-1. URL http://www.aclweb.org/anthology/
P12-2018.
Jun Zhu, Amr Ahmed, and Eric P. Xing. MedLDA: maximum margin supervised topic
models for regression and classification. In ICML, volume 382 of ACM International
Conference Proceeding Series, page 158. ACM, 2009. ISBN 978-1-60558-516-1. URL
http://doi.acm.org/10.1145/1553374.1553535.
B. Taskar, C. Guestrin, and D. Koller. Max-margin markov networks. In Advances in
Neural Information Processing Systems, volume 16, 2003.
J. Zhu, A. Ahmed, and E. P. Xing. Med lda : Maximum margin supervised topic models.
Journal of Machine Learning Research, 13:2237–2278, 2012.
C. Cortes and V. Vapnik. Support-vector networks. Machine Learning, 20(3):273–297,
September 1995.
Boser, Guyon, and Vapnik. A training algorithm for optimal margin classifiers. In COLT:
Proceedings of the Workshop on Computational Learning Theory, Morgan Kaufmann
Publishers, 1992.
Nello Cristianini and John Shawe-Taylor. An Introduction to Support Vector Machines
and Other Kernel-Based Learning Methods. Cambridge University Press, Cambridge,
U.K., 2000.
Thorsten Joachims. Text categorization with support vector machines: Learning with
many relevant features. Technical Report LS VIII-Report, Universität Dortmund,
Dortmund, Germany, 1997.
J. Kivinen, M. Warmuth, and P. Auer. Linear vs. logarithmic mistake bounds when few
input variables are relevant. In The perceptron algorithm vs. winnow, 1995. Conference
on Computational Learning Theory.
Huma Lodhi, Craig Saunders, John Shawe-Taylor, Nello Cristianini, and Chris Watkins.
Text classification using string kernels. Journal of Machine Learning Research, 2:419–
444, 2002.
Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, and
Richard Harshman. Indexing by latent semantic analysis. JOURNAL OF THE
AMERICAN SOCIETY FOR INFORMATION SCIENCE, 41(6):391–407, 1990.
Thomas Hofmann. Probabilistic latent semantic indexing. pages 50–57, 1993.
Ulrike Von Luxburg. A tutorial on spectral clustering. Statistics and computing, 17(4):
395–416, 2007.
J. Shi and J. Malik. Normalized cuts and image segmentation. In Transactions on
Pattern Analysis and Machine Intelligence, volume 22, pages 888–905. IEEE, 2000.
A. Ng, M. Jordan, Y. Weiss, T. Dietterich, S. Becker, and Z Ghahramani. On spectral
clustering: analysis and an algorithm. Advances in Neural Information Processing
Systems, 14:849–856, 2002.
D. M. Blei and J. D. Lafferty. Correlated topic models. Advances in Neural Information
Processing Systems, 18, 2006.
D. M. Blei, M. I. Jordan, T. L. Griffiths, and J. B. Tenenbaum. Hierarchical topic
models and the nested chinese restaurant process. In Advances in Neural Information
Processing Systems, volume 16. MIT Press, 2004.
David M. Blei. Probabilistic topic models. Commun. ACM, 55(4):77–84, 2012.
Hanna M Wallach, Iain Murray, Ruslan Salakhutdinov, and David Mimno. Evaluation
methods for topic models. 2009.
David Newman, Jey Han Lau, Karl Grieser, and Timothy Baldwin. Automatic evalua-
tion of topic coherence. In HLT-NAACL, pages 100–108. The Association for Com-
putational Linguistics, 2010. ISBN 978-1-932432-65-7.
2012. URL https://www.zooniverse.org/. Retrieved on 19 November 2013.
Adam J. Berinsky, Gregory A. Huber, and Gabriel S. Lenz. Evaluating online labor markets for experimental research: Amazon.com's Mechanical Turk. URL JournalistsResource.org. Retrieved on 18 June 2012.
Gabriele Paolacci, Jesse Chandler, and Panos Ipeirotis. Running experiments on Amazon Mechanical Turk. 2010.
Michael Buhrmester, Tracy Kwang, and Sam Gosling. Amazon's Mechanical Turk: a new source of inexpensive, yet high-quality, data? Perspectives on Psychological Science, 6(1):3–5, January 2011.
John Joseph Horton and Lydia B. Chilton. The labor economics of paid crowdsourcing.
CoRR, abs/1001.0627, 2010.
Mark Summerfield. Rapid GUI programming with Python and Qt: the definitive guide to
PyQt programming. Prentice Hall open source software development series. Prentice-
Hall, 2008.
V. R. Guido. Setl (was: Lukewarm about range literals). URL https://mail.python.
org/pipermail/python-dev/2000-August/008881.html. Retrieved 13 March 2011.
T. Peters. Pep 20- the zen of python, 2008.
W. McKinney. Python for Data Analysis. O’Reilly Media, Inc, 2013.
E. Bressert. SciPy and NumPy: an overview for developers. O’Reilly Media, Inc, 2012.
Lars Buitinck, Gilles Louppe, Mathieu Blondel, Fabian Pedregosa, Andreas Mueller,
Olivier Grisel, Vlad Niculae, Peter Prettenhofer, Alexandre Gramfort, Jaques Grobler,
Robert Layton, Jake VanderPlas, Arnaud Joly, Brian Holt, and Gaël Varoquaux. API
design for machine learning software: experiences from the scikit-learn project. CoRR,
2013.
Steven Bird, Ewan Klein, Edward Loper, and Jason Baldridge. Multidisciplinary in-
struction with the natural language toolkit. January 01 2008.
Steven Bird, Ewan Klein, and Edward Loper. Natural Language Processing with Python.
O’Reilly & Associates, Inc., pub-ORA:adr, 2009.
Jacob Perkins. Python Text Processing with NLTK 2.0 Cookbook. Packt Publishing,
2010.
Radim Řehůřek and Petr Sojka. Software framework for topic modelling with large
corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP
Frameworks, pages 45–50, Valletta, Malta, May 2010. ELRA.
Matthew D. Hoffman, David M. Blei, and Francis R. Bach. Online learning for latent
dirichlet allocation. In John D. Lafferty, Christopher K. I. Williams, John Shawe-
Taylor, Richard S. Zemel, and Aron Culotta, editors, NIPS, pages 856–864. Curran
Associates, Inc, 2010.
Amazon. Understanding hit types, 2014. URL http://docs.aws.
amazon.com/AWSMechTurk/latest/AWSMechanicalTurkRequester/Concepts_
HITTypesArticle.html. Retrieved 13 March 2011.
AbiWord. Dictionaries, 2005. URL http://www.abisource.com/~fjf/InvisibleAnt/
Dictionaries.html. Retrieved 15 March 2011.
Andrew Y. Ng. Feature selection, l1 vs. l2 regularization, and rotational invariance,
2004. URL http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.81.
145;http://www.robotics.stanford.edu/~ang/papers/icml04-l1l2.pdf.
Koby Crammer and Yoram Singer. On the algorithmic implementation of multiclass
kernel-based vector machines. Journal of Machine Learning Research, 2:265–292,
2001.
Grigorios Tsoumakas and Ioannis Katakis. Multi-label classification: An overview.
IJDWM, 3(3):1–13, 2007. URL http://dx.doi.org/10.4018/jdwm.2007070101.
Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Domain adaptation for large-
scale sentiment classification: A deep learning approach, 2011. URL http://hal.
archives-ouvertes.fr/hal-00752091.
Shao-Yu Lien, Kwang-Cheng Chen, and Yonghua Lin. Toward ubiquitous massive ac-
cesses in 3GPP machine-to-machine communications. IEEE Communications Maga-
zine, 49(4):66–74, 2011. URL http://dx.doi.org/10.1109/MCOM.2011.5741148.
Marthinus J. Booysen, John S. Gilmore, Sherali Zeadally, and Gert-Jan van Rooyen.
Machine-to-machine (M2M) communications in vehicular networks. TIIS, 6(2):529–
546, 2012. URL http://dx.doi.org/10.3837/tiis.2012.02.005.
Dusit Niyato, Xiao Lu, and Ping Wang. Machine-to-machine communications for home
energy management system in smart grid. IEEE Communications Magazine, 49(4):
53–59, 2011.

  • 5. Acknowledgements I thank almighty god for blessing me with strength and wisdom to pursue my interests and accomplish my goals. I would like to thank my internal supervisor, Prof. John Shawe-Taylor, for being an excellent supervisor guiding me through this journey. I would also like to thank my co-supervisor, Dr. Martin Goodson for his invaluable supervision, guidance and support extended to shape my skills in both theory and practice of machine learning. I thank both of them for dedicating their time and efforts to support me. I would also like to thank the Computer Science department at UCL, research team at Qubit, and all my relatives and friends for strengthening me constantly I would like to acknowledge the support provided by my family during the preparation of my masters project. iv
  • 6. Contents Declaration of Authorship i Abstract iii Acknowledgements iv Contents v List of Figures ix List of Tables x Abbreviations xi 1 Introduction 1 1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.2 Background Context . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.2.1 Company . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1.2.2 Product . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1.3 Problem Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1.3.1 Analyzing what is captured . . . . . . . . . . . . . . . . . . . . . . 2 1.3.2 Changing business landscape . . . . . . . . . . . . . . . . . . . . . 3 1.4 Solution Strategy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.5 Chapter Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 2 Literature Review 5 2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 2.2 Market Analytics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 2.3 Big Data and Machine Learning . . . . . . . . . . . . . . . . . . . . . . . 6 2.4 Trust Modeling in Crowd sourced data . . . . . . . . . . . . . . . . . . . . 6 2.5 Natural Language Processing (NLP) . . . . . . . . . . . . . . . . . . . . . 7 2.5.1 Data Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 2.5.1.1 Importance . . . . . . . . . . . . . . . . . . . . . . . . . . 7 2.5.1.2 Spell Correction . . . . . . . . . . . . . . . . . . . . . . . 8 2.5.1.3 Stemming . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 v
  • 7. Contents vi 2.5.1.4 Lemmatization . . . . . . . . . . . . . . . . . . . . . . . . 9 2.5.1.5 Stopword Removal . . . . . . . . . . . . . . . . . . . . . . 9 2.5.2 Feature Extraction: Vectorization . . . . . . . . . . . . . . . . . . 9 2.5.2.1 N-gram Bag of words . . . . . . . . . . . . . . . . . . . . 10 2.5.2.2 Term Frequency – Inverse Document Frequency (TF-IDF) 11 2.5.3 Topic Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 2.5.3.1 Support Vector Machines (SVM) . . . . . . . . . . . . . . 12 2.5.3.2 Na¨ıve Bayes Classifier . . . . . . . . . . . . . . . . . . . . 13 2.5.3.3 Benefits of using SVM for text classification . . . . . . . 14 2.5.4 Latent Topic Detection . . . . . . . . . . . . . . . . . . . . . . . . 15 2.5.4.1 Latent Semantic Indexing (LSI) . . . . . . . . . . . . . . 16 2.5.4.2 Latent Dirichlet Allocation (LDA) . . . . . . . . . . . . . 16 2.5.4.3 Spectral Clustering . . . . . . . . . . . . . . . . . . . . . 18 2.5.4.4 Benefits of using LDA for text classification . . . . . . . . 19 2.6 Topic Consistency Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . 20 2.7 Tools and Technologies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 2.7.1 Amazon Mechanical Turk (MTurk) . . . . . . . . . . . . . . . . . . 20 2.7.2 Python 2.7 (Programming Language) . . . . . . . . . . . . . . . . 21 2.7.3 Special purpose libraries . . . . . . . . . . . . . . . . . . . . . . . . 22 2.7.3.1 Pandas . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 2.7.3.2 Scikit-learn . . . . . . . . . . . . . . . . . . . . . . . . . . 23 2.7.3.3 Natural Language ToolKit (NLTK) . . . . . . . . . . . . 24 2.7.3.4 PyEnchant . . . . . . . . . . . . . . . . . . . . . . . . . . 24 2.7.3.5 Gensim . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 3 Data Collection and Preprocessing 26 3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 3.2 Data Collection Phase . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 3.2.1 Labelling the observations . . . . . . . . . . . . . . . . . . . . . . . 27 3.2.2 Initial Trust Modelling phase . . . . . . . . . . . . . . . . . . . . . 28 3.2.2.1 Topic Distribution of the dataset . . . . . . . . . . . . . . 29 3.2.2.2 Initial Sanity check . . . . . . . . . . . . . . . . . . . . . 30 3.2.2.3 Worker Scoring . . . . . . . . . . . . . . . . . . . . . . . . 31 3.2.2.4 Experience of worker . . . . . . . . . . . . . . . . . . . . 38 3.2.2.5 Unique Feedback Scoring . . . . . . . . . . . . . . . . . . 39 3.2.2.6 Time of Day . . . . . . . . . . . . . . . . . . . . . . . . . 41 3.2.2.7 Replicated feedback scoring . . . . . . . . . . . . . . . . . 42 3.2.3 Selection and Rejection Criteria . . . . . . . . . . . . . . . . . . . . 43 3.2.4 Directions for final data collection . . . . . . . . . . . . . . . . . . 44 3.2.5 The final strategy for data collection . . . . . . . . . . . . . . . . . 45 3.2.6 Final Data collection phase . . . . . . . . . . . . . . . . . . . . . . 45 3.3 The Final Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 3.3.1 The topic distribution of the final dataset . . . . . . . . . . . . . . 47 3.4 Preprocessing steps . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 3.4.1 Text Standardization . . . . . . . . . . . . . . . . . . . . . . . . . . 48 3.4.2 Spell correction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48 3.4.3 Stemming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
  • 8. Contents vii 3.5 Preprocessing sequence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49 3.6 Preprocessing pipelines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50 4 Topic Classification for Labelled Observatoins 51 4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51 4.2 Implementation techniques . . . . . . . . . . . . . . . . . . . . . . . . . . 51 4.2.1 Extracting Features . . . . . . . . . . . . . . . . . . . . . . . . . . 52 4.2.2 Selecting Features . . . . . . . . . . . . . . . . . . . . . . . . . . . 52 4.2.3 Selecting Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . 52 4.2.4 Final Process Pipeline . . . . . . . . . . . . . . . . . . . . . . . . . 53 4.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54 4.3.1 Evaluation metric . . . . . . . . . . . . . . . . . . . . . . . . . . . 54 4.3.2 Unigram Model Results . . . . . . . . . . . . . . . . . . . . . . . . 54 4.3.3 Bigram Model Results . . . . . . . . . . . . . . . . . . . . . . . . . 56 4.3.4 Unigram + Bigram Model Results . . . . . . . . . . . . . . . . . . 57 4.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58 5 Topic Detection Automation 60 5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60 5.2 Experimentation Strategy . . . . . . . . . . . . . . . . . . . . . . . . . . . 61 5.2.1 First Phase . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61 5.2.2 Second Phase . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61 5.2.3 Third Phase . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61 5.2.4 Fourth Phase . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62 5.3 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62 5.3.1 Phase 1: Tuning LDA parameters . . . . . . . . . . . . . . . . . . 62 5.3.2 Phase 2: Simulating an emerging topic . . . . . . . . . . . . . . . . 63 5.3.3 Phase 3: Developing a Topic Consistency scoring function . . . . . 63 5.3.4 Phase 4: Final topic detection . . . . . . . . . . . . . . . . . . . . 64 5.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64 5.4.1 Evaluation Metric . . . . . . . . . . . . . . . . . . . . . . . . . . . 65 5.4.2 Phase 1 Results: All topics datasets . . . . . . . . . . . . . . . . . 66 5.4.3 Phase 2 Results: Individual Datasets . . . . . . . . . . . . . . . . . 69 5.4.3.1 Delivery . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71 5.4.3.2 Images . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72 5.4.3.3 Stock . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73 5.4.3.4 Size . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74 5.4.4 Phase 4 results: None of the above dataset . . . . . . . . . . . . . 80 5.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82 6 Conclusion and Future Directions 84 6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84 6.2 Trust modeling for cleaner data . . . . . . . . . . . . . . . . . . . . . . . . 84 6.3 Topic classification with SVM . . . . . . . . . . . . . . . . . . . . . . . . . 86 6.4 Topic Detection with LDA . . . . . . . . . . . . . . . . . . . . . . . . . . . 86 6.5 Potential Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87 6.5.1 Topic Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
  • 9. Contents viii 6.5.2 Topic Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88 6.5.3 Crowdsourcing ++ . . . . . . . . . . . . . . . . . . . . . . . . . . 89 6.5.3.1 Using crowdsourcing to label the emerging topics auto- matically . . . . . . . . . . . . . . . . . . . . . . . . . . . 89 6.5.3.2 Automatically building the decision boundaries for the new topic . . . . . . . . . . . . . . . . . . . . . . . . . . . 90 7 A change of perspective. . . CISCO 93 7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93 7.2 Goals and Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93 7.2.1 Career Development . . . . . . . . . . . . . . . . . . . . . . . . . . 94 7.2.2 Industry Expertise . . . . . . . . . . . . . . . . . . . . . . . . . . . 94 7.2.3 Learn from the organizational culture . . . . . . . . . . . . . . . . 94 7.2.4 Networking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95 7.3 Background Context . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95 7.3.1 Cisco . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95 7.3.2 CIIP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95 7.4 My role and responsibilities . . . . . . . . . . . . . . . . . . . . . . . . . . 96 7.4.1 My project . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96 7.4.1.1 Problem Background . . . . . . . . . . . . . . . . . . . . 98 7.4.1.2 The goal of the research project . . . . . . . . . . . . . . 99 7.4.1.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100 7.4.1.4 Project outcomes . . . . . . . . . . . . . . . . . . . . . . 102 7.5 Non technical aspects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102 7.5.1 Universities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103 7.5.2 Meetups . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103 7.5.3 Companies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104 7.5.4 Hackathons . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105 7.6 Benefits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105 7.6.1 Technical Benefits . . . . . . . . . . . . . . . . . . . . . . . . . . . 105 7.6.1.1 Machine Learning and Large Scale data processing . . . . 105 7.6.1.2 Training to manage a large pool of computing resources . 106 7.6.1.3 Building Data Sensors . . . . . . . . . . . . . . . . . . . . 106 7.6.1.4 Learning about Networking and Cyber-security and In- ternet of Things . . . . . . . . . . . . . . . . . . . . . . . 106 7.6.1.5 Building data visualization techniques . . . . . . . . . . . 107 7.6.2 Non-Technical benefits . . . . . . . . . . . . . . . . . . . . . . . . . 107 7.6.2.1 Work Ethic and Discipline . . . . . . . . . . . . . . . . . 107 7.6.2.2 Personal Skills . . . . . . . . . . . . . . . . . . . . . . . . 108 7.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108 A Appendix 109 Bibliography 110
  • 10. List of Figures 2.1 TF-IDF notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 2.2 Na¨ıve Bayes classifier model . . . . . . . . . . . . . . . . . . . . . . . . . . 13 2.3 LDA notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 3.1 Topics in Initial Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 3.2 Worker aggregated performance over lifetime . . . . . . . . . . . . . . . . 37 3.3 Worker aggregated performance over lifetime . . . . . . . . . . . . . . . . 38 3.4 Step Histogram of Feedback difficulty score . . . . . . . . . . . . . . . . . 41 3.5 Score for observations on different time of day . . . . . . . . . . . . . . . . 41 3.6 The Score distribution of the silver set observations in final dataset . . . . 46 3.7 Topic Distribution of the final dataset . . . . . . . . . . . . . . . . . . . . 47 5.1 Topic Inference comparison at 100 passes . . . . . . . . . . . . . . . . . . 67 5.2 Topic Inference comparison at 200 passes . . . . . . . . . . . . . . . . . . 68 5.3 17 topics inferred in LDA analysis . . . . . . . . . . . . . . . . . . . . . . 69 5.4 Topic Inference comparison at 300 passes in Delivery dataset . . . . . . . 71 5.5 Topic Inference comparison at 300 passes in Images dataset . . . . . . . . 73 5.6 Topic Inference comparison at 300 passes in stock dataset . . . . . . . . . 74 5.7 Topic Inference comparison at 300 passes in size dataset . . . . . . . . . . 75 5.8 Topic Inference Histogram at 300 passes in Delivery dataset . . . . . . . . 77 5.9 Topic Inference Histogram at 300 passes in Images dataset . . . . . . . . . 77 5.10 Topic Inference Histogram at 300 passes in Size dataset . . . . . . . . . . 78 5.11 Topic Inference Histogram at 300 passes in Stock dataset . . . . . . . . . 78 6.1 Final System with future developments . . . . . . . . . . . . . . . . . . . . 92 7.1 The structure of OCTO-SBG . . . . . . . . . . . . . . . . . . . . . . . . . 97 7.2 Goals of gaining insight into IoT traffic . . . . . . . . . . . . . . . . . . . . 100 7.3 CIP Packet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101 7.4 Visualization application . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101 ix
  • 11. List of Tables 3.1 Legend of X-axis labels for topic histograms . . . . . . . . . . . . . . . . . 30 5.1 The number of observations in each dataset . . . . . . . . . . . . . . . . . 64 5.2 The Topic legend for X-axis in inference plots . . . . . . . . . . . . . . . . 67 5.3 Mapping of LDA topics to Labeled topics . . . . . . . . . . . . . . . . . . 69 5.4 Summary of performance statistics of emerging topics in different datasets 70 5.5 Summary statistics of Delivery dataset . . . . . . . . . . . . . . . . . . . . 72 5.6 Summary statistics of Stock dataset . . . . . . . . . . . . . . . . . . . . . 74 5.7 Summary of score statistics of emerging topics in different datasets . . . . 81 5.8 Summary statistics of None of the Above dataset . . . . . . . . . . . . . . 81 x
  • 12. Abbreviations
API Application Programming Interface
ASCII American Standard Code for Information Interchange
BOW Bag of Words
GIGO Garbage In Garbage Out
GUI Graphical User Interface
HDP Hierarchical Dirichlet Process
HIT Human Intelligence Task
hLDA hierarchical Latent Dirichlet Allocation
IDF Inverse Document Frequency
LDA Latent Dirichlet Allocation
LSI Latent Semantic Indexing
M3N Max Margin Markov Network
MCQ Multiple Choice Questions
MedLDA Maximum entropy discrimination Latent Dirichlet Allocation
MLE Maximum Likelihood Estimate
mTurk Mechanical Turk
NLP Natural Language Processing
NLTK Natural Language ToolKit
OOP Object Oriented Programming
pLSI probabilistic Latent Semantic Indexing
sLDA supervised Latent Dirichlet Allocation
ssLDA semi-supervised Latent Dirichlet Allocation
PyPI Python Package Index
SVM Support Vector Machine
TF Term Frequency
  • 13. Dedicated to my parents
  • 14. Chapter 1 Introduction
1.1 Introduction
Natural Language Processing (NLP) is one of the most rigorously investigated and applied domains in Machine Learning, and it is making a massive impact on how people approach problems. With the dot-com boom and the big data revolution, the power of data has never been greater. The computer industry itself has adapted to reap the harvest of this new opportunity, as software giants like Google, Facebook and Cisco build more and more data products. Natural language, amongst the various data points being captured, is a very valuable data resource that is in abundant supply. It is the main medium of human expression: people all over the world leave a footprint on the Internet in the form of content, feedback, reviews, ideas and opinions. A great deal of insight can be drawn from this data, and companies like Google earn a major fraction of their revenue by adding value to their services through it. Throughout this thesis, it is investigated how natural language can be used to bring more value and insight to customer feedback analysis.
1.2 Background Context
The main input to this project is customer feedback collected from client websites supported by Qubit Digital Ltd.
  • 15. 1.2.1 Company
Qubit Digital develops technologies to collect and process web data. The company offers solutions relating to tag management, marketing attribution and related areas. It also offers market analytics solutions for conversion optimization, customer intelligence and behavioural attribution.
1.2.2 Product
The feedback collection system of Qubit captures user feedback. When a customer visits a website, behavioural models start predicting the customer's next action from his or her session history. When the system predicts that the customer is about to leave the website, the website prompts the customer for feedback on where it can improve. Whatever the customer types into that prompt is captured and stored by the feedback collection system. This is the main data source used for the research project.
1.3 Problem Overview
The feedback collection system of Qubit Digital collects thousands of feedback entries every day from the multiple client websites it serves. These entries mainly describe problems with the website or offer valuable suggestions on how to improve the website or the business. This information is very valuable for the improvement of the client's products and services, and the insight is vital for market adaptation and strategic growth.
1.3.1 Analyzing what is captured
Unfortunately, the scale of the feedback collection is massive, and analyzing content at such scale using human intelligence is infeasible both financially and operationally. Automation is therefore an essential feature in analyzing such a big feedback corpus. As a starting point, clients are primarily interested in gaining insight into main topics such as Delivery, Price, Product and Latency of the website. As manually categorizing different feedback entries into topics is unrealistic, an automatic topic classification solution is necessary to learn the unique patterns in feedback and map them to different topics.
  • 16. 1.3.2 Changing business landscape
Although topic categorization is automated, the business landscape is constantly changing. Even within the scope of an individual business entity, the topics being discussed change as various business factors change. A static classification model cannot cope with a data generation process that is dynamic by nature. The topic categorization engine should therefore be sensitive to emerging trends and adapt itself to the changing topics. A feature that can detect newly emerging topics is necessary to address this issue.
1.4 Solution Strategy
To solve the topic categorization problem, a supervised topic classifier is a very good candidate. A set of essential topics is defined initially. A human workforce is then used to correctly classify a sample of feedback entries into the pre-defined topics. This dataset can then be used to train a supervised learning classifier that automatically assigns unseen observations to one or more pre-defined topics. To extend the topic model automatically, latent topic detection can be used to detect unseen topics that are emerging in the dataset. These topics have to be searched for in the data that does not belong to any of the pre-defined topics. Latent topic detection is an unsupervised learning solution, so manual classification is unnecessary.
1.5 Chapter Overview
This work contributes to topic categorization and topic detection on an application basis. It describes how supervised topic categorization and latent topic detection can be used in an industry setting to automate information extraction from customer feedback data. It also applies trust modelling heuristics when using crowd-sourced datasets collected through Mechanical Turk. The thesis further details the attempts to formulate a scoring function that can be used to evaluate the topic consistency of an emerged topic without using external corpora. The chapter overview of the thesis is as follows:
Chapter 1: Introduction. This chapter introduces the background context, the problem and the solution strategy that is described in the rest of the thesis.
  • 17. Chapter 2: Literature Review. This chapter introduces the background knowledge that is explored and exploited throughout the thesis. Details about the problem domain, background, application, and the different techniques, methods and potential solutions are outlined. The content discusses technical details, theory, and the current research landscape of the domain.
Chapter 3: Data Collection and Preprocessing. As the name suggests, technical details regarding the data collection and data preprocessing steps are given in this chapter. The main focus is on the data collection process, particularly the worker evaluation heuristics used. The preprocessing steps are also described concisely.
Chapter 4: Topic Classification for Labelled Observations. This chapter details how the supervised learning model was built using the labelled dataset. Feature extraction, feature selection, parameter tuning and the evaluation metric are discussed in detail. Finally, the results are reported together with the conclusion.
Chapter 5: Topic Detection within Unlabelled Data. This chapter's primary focus is on using unsupervised learning to detect emerging topics in the unlabelled dataset. The experiment is carried out in four phases, where the initial phases use the labelled data to drive the parameter tuning process. Details of building the topic consistency measurement function are also outlined in this chapter. The chapter concludes by reporting results and deriving conclusions from the experiment.
Chapter 6: Conclusion and Future Directions. This chapter concludes the thesis. It starts by summarizing the work undertaken and the results obtained, then discusses the primary conclusions of the study in detail. The chapter ends with a description of potential future work that can complement the work carried out in the thesis.
Chapter 7: A change of perspective... CISCO. This chapter summarizes the industrial project carried out with Cisco Systems in California for one calendar year. It outlines the experience, the non-technical aspects and the machine learning related work undertaken during this period.
  • 18. Chapter 2 Literature Review
2.1 Introduction
In this chapter, the main knowledge areas related to the research project are discussed in detail. An overview of each topic is followed by evidence from research on that topic, supported by scientific literature. The background domain of the research question, market analytics and insight, is discussed at the beginning of the chapter. The machine learning approach to Natural Language Processing is presented next, with separate focus on data preprocessing, supervised topic modelling and latent topic detection. The opportunities for eliminating the human in the loop in the topic detection process are also discussed, supported with research evidence. The tools used in experimentation, and the rationale for choosing them, are discussed in the last section of the chapter.
2.2 Market Analytics
Moving rapidly through the information age, being able to record and store massive amounts of data has revolutionized the way people look at things. The data-driven decision making, pattern recognition and prediction capabilities that the big data revolution has enabled allow people to approach analytics and decision making from a very different perspective. Along with this paradigm shift, many domains and entities have adopted big data and machine learning to reap invaluable benefits and gain an edge. Numerous domains, both commercial and otherwise, such as Finance, E-commerce and Marketing, Health care, Security, Physics, Search and
  • 19. Entertainment, have aggressively adopted big data and machine learning approaches to enhance their effectiveness.
2.3 Big Data and Machine Learning
Market analytics is one of the fields that has been gaining a lot of momentum thanks to big data and machine learning. Market analytics helps today's organizations use the mammoth number of data points available through customer feedback to understand their customer base and to tailor personalized products and services better and quicker. Machine learning and data mining are used in market analytics in many different ways. Large organizations such as Google, Facebook and Netflix maintain their own machine learning teams within their marketing teams. Subscription companies such as Netflix use the vast amount of data they collect on customer buying patterns to model customer churn (customer attrition) (Meuth et al., 2008). Amazon uses machine learning to predict purchases so that shipping can start before the purchase goes through (Spiegel et al., 2013), which gives Amazon a competitive advantage over other online retailers by minimizing shipping time. During the last few years, many new companies have sprung up to enable enterprises of all scales to apply machine learning and data mining to customer feedback data. Companies such as Qubit Digital and Skimlinks provide Application Programming Interfaces (APIs), toolkits and portals to capture, process and analyze customer feedback data.
2.4 Trust Modeling in Crowd-sourced Data
Collective decision making (ensembles) is a well-studied area in the social choice, voting and Artificial Intelligence domains. The Condorcet Jury Theorem (Ladha, 1992) states that if a collection of agents takes a binary decision by majority voting, and the probability p of an agent selecting the correct answer is > 0.5, adding more agents increases the probability of obtaining the correct decision. With recent developments in technology, companies can now use crowdsourcing to carry out business tasks. Normally, in cases such as intelligence testing (Anastasi and Urbina, 1997), a gold set (a set with known answers) is used to evaluate the responses in a crowdsourcing task, and several studies have attempted to use aggregated responses in IQ testing. The main research question here is how to assess worker reliability using majority voting. Raykar et al. (Raykar et al., 2010) propose a machine learning based method that does not model question difficulty. Bachrach et al. present a graphical model of question difficulty, participant ability and the true response that grades aptitude tests without knowing the answers (Bachrach et al., 2012). These studies show that majority voting and concordance of answers are very useful factors when evaluating the accuracy of a response.
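To make the Condorcet argument concrete, the short sketch below (an illustration added here, not code from the thesis) computes the probability that a simple majority of n independent annotators returns the correct label when each annotator is correct with probability p; the assumed per-annotator accuracy of 0.65 is purely hypothetical.

from scipy.stats import binom

def majority_vote_accuracy(p, n):
    """P(a simple majority of n independent annotators is correct),
    with each annotator correct with probability p and n odd."""
    k = n // 2 + 1                 # smallest winning number of correct votes
    return binom.sf(k - 1, n, p)   # P(number of correct votes >= k)

# Accuracy grows with n whenever p > 0.5, as the theorem predicts
for n in (1, 3, 5, 11, 21):
    print(n, round(majority_vote_accuracy(0.65, n), 3))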
  • 20. 2.5 Natural Language Processing (NLP)
Natural Language Processing (NLP) is an area that has been under extensive research during the last few years. Natural language being the main form of communication among people, a large quantity of data is generated and consumed in natural language every day, in the form of books, articles, feedback, reviews, blog posts and numerous other formats. Although many business organizations capture and record these data points, and although humans can interpret natural language very accurately, analyzing this data manually is a tedious and expensive job, because human effort cannot keep up with the rate at which data is generated. NLP addresses this problem by using machine learning and pattern recognition to analyze natural language computationally.
2.5.1 Data Preprocessing
Data preprocessing is a vital part of data science. Real-world data is rarely perfect and hence requires preparation before machine learning algorithms are applied. Preprocessing is mainly used to clean data by reducing noise, improving completeness and so on.
2.5.1.1 Importance
Natural language is unstructured and diverse. It is also unorthodox compared to conventional machine learning inputs, as machine learning is primarily a tool for detecting patterns in numerical data. Due to the nature of natural language data, NLP involves data preprocessing (Danilevsky). Although the techniques are quite straightforward, preprocessing the text before vectorization has been shown to improve accuracy and performance. Several techniques can be used to preprocess natural text before vectorizing it. They are:
• Spell correction
  • 21. • Stemming
• Lemmatization
• Stop word removal
2.5.1.2 Spell Correction
Text data is primarily generated by humans. The amount of moderation and review that data undergoes before being published varies with the data source. Text from published material such as peer-reviewed articles, refereed journals, news sources and books usually goes through a rigorous and iterative quality assurance process (e.g. author guidelines in journal publications). But a fair fraction of the text generated by society, such as user feedback, complaints, tweets, comments and reviews, does not undergo this amount of moderation and standardization. The latter is therefore more error prone and more likely to contain human errors such as spelling and grammatical mistakes, since the care invested in user-generated text is much lower than in author-generated text. Spell correction improves the accuracy (precision) of results by reducing the noise in the document set.
2.5.1.3 Stemming
Stemming is a method of collapsing distinct word forms: different inflected forms of a word are reduced to a common root form. Due to grammatical effects, words undergo minor variations and transformations when used in documents, so methods are needed to map all these variations onto one unique form that represents their similarity. Because it is based on simple rules, word stemming is computationally inexpensive, but the gain in accuracy is modest. A stemming algorithm applies a selected number of word reduction rules sequentially. The outcome is not guaranteed to converge to the morphological root of the word, but several forms of the same word are likely to reduce to the same root form. For the English language there are many stemmers, such as the Lovins stemmer (Lovins, 1968) and the Paice stemmer (Paice, 1990), and several studies have empirically shown the Porter stemmer (Porter, 1980) to be very effective. Stemmers for other languages also exist (xapian.org) but are not relevant within the context of this project, as the dataset consists of customer feedback written in English.
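As a concrete illustration of these two steps, the snippet below sketches spell correction with PyEnchant and stemming with NLTK's Porter stemmer, both of which are among the libraries introduced later in this chapter; the example tokens and the choice of the first spelling suggestion are illustrative assumptions, not the exact pipeline used in this work.

import enchant                      # PyEnchant spell-checking dictionary
from nltk.stem import PorterStemmer

checker = enchant.Dict("en_GB")
stemmer = PorterStemmer()

def normalise(tokens):
    """Replace misspelt tokens with the first dictionary suggestion,
    then collapse every token to its Porter stem."""
    out = []
    for tok in tokens:
        if tok.isalpha() and not checker.check(tok):
            suggestions = checker.suggest(tok)
            if suggestions:
                tok = suggestions[0]
        out.append(stemmer.stem(tok.lower()))
    return out

print(normalise(["Delivry", "was", "delayed", "again"]))
# e.g. ['deliveri', 'wa', 'delay', 'again']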
  • 22. 2.5.1.4 Lemmatization
Some irregular word forms, such as "is", "are" and "am", cannot be stemmed to a unique root form because of their textual structure. If the accuracy of word normalization is very important, lemmatization can be used instead of stemming. Lemmatization performs morphological analysis on words to normalize them to their base form. Rather than simply applying a set of character reduction rules, it analyses the full morphology of the word to identify its "lemma" using a dictionary. This process is more accurate than stemming, but the accuracy is gained at the expense of computational complexity. Both stemming and lemmatization tend to improve the recall (sensitivity) of the results while reducing their precision.
2.5.1.5 Stopword Removal
Stop words are words that are filtered out before or after processing natural language text data (Rajaraman and Ullman). Stop words are removed in text processing to improve accuracy. Words can be classified as stop words for several reasons. They are words/tokens:
1. that are not useful in text processing,
2. that do not have an impactful meaning,
3. that are very common in a language, or
4. that only support the language structure ("a", "is", "and", "this", etc.).
Eliminating stop words increases accuracy by removing uninformative tokens that are likely to add noise to the documents; it is analogous to cleaning the dataset. It also reduces the vocabulary of the model, which improves the performance and scalability of the solution through better resource utilization (memory and computation). Studies show that stop word removal positively affects the accuracy of results. Furthermore, tailor-made stop word lists have been shown to give better performance than arbitrary stop word lists (Xia et al., 2009).
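A minimal sketch of these two steps with NLTK (one of the libraries used in this project) is shown below; it assumes the WordNet and stopword corpora have been downloaded via nltk.download, and the example sentence is invented for illustration.

from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words("english"))

tokens = ["the", "images", "are", "loading", "slowly", "on", "this", "page"]

# Lemmatize each token first as a noun, then as a verb, and drop stop words
lemmas = [lemmatizer.lemmatize(lemmatizer.lemmatize(t), pos="v") for t in tokens]
content = [t for t in lemmas if t not in stop_words]
print(content)   # e.g. ['image', 'load', 'slowly', 'page']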
  • 23. 2.5.2 Feature Extraction: Vectorization
Feature extraction is the process of transforming arbitrary data, such as images and text, into numerical feature vectors that are usable by machine learning algorithms. Natural language is conventionally represented in Unicode or ASCII string format (text), in which all the characters, words and sentences that carry information are represented. Unfortunately, machine learning algorithms are designed to process data in numerical form. For this reason, string data has to be transformed into numerical vectors (vectorization) that can be processed by machine learning algorithms. The primary method used to transform natural language into numerical vector form is the Bag of Words (BOW) technique (Zaman et al., 2011).
2.5.2.1 N-gram Bag of Words
The n-gram bag of words is a feature extraction technique designed to extract word features from text. The method builds a numerical vector that represents the presence of words in a document. In this representation, a document is represented as a multiset (Knuth, 1998) of the words it contains, which is analogous to building histogram-like statistics of the word occurrences in a document. This method is widely used for feature extraction in document classification and clustering problems. In most real-world applications, however, word phrases carry much more meaning than individual words. For instance, "time series" and "time machine" have completely different meanings that are lost if "time", "series" and "machine" are modelled separately and independently. There are several advantages to using word phrases for topic modelling compared with treating words separately:
1. Word phrases are more informative: they give more information about the word sequences.
2. The meaning of a phrase is not always represented by, or derivable from, its individual words.
3. The context of a phrase differs depending on the word set and the order of occurrence.
The word "n-gram" refers to a contiguous sequence of n units in an ordered sequence. Depending on the application, a unit may be a word (NLP), an amino acid (protein sequencing), a user state (conversion optimization), a network flag (NetFlow (Rehák et al., 2008)) or a base pair (DNA sequencing). The n-gram model with one word is referred to as the "unigram" model; models over two- and three-word sequences are referred to
  • 24. as "bigram" and "trigram" models respectively. N-grams preserve temporal features of the observations because they preserve word order, and they preserve the contextual meaning of tokens because they retain word phrases.
2.5.2.2 Term Frequency – Inverse Document Frequency (TF-IDF)
TF-IDF (Wikipedia) is a numerical transformation that weights the importance of word tokens depending on how the words are distributed. This statistic is often used as a weighting factor in text analytics; it can also be used to identify stop words for text summarization and to create tailor-made stop word lists.
Figure 2.1: TF-IDF notation
Text analytics problems such as document classification and information retrieval try to find distinctive word phrases that can distinguish between different topics or concepts. TF-IDF is calculated by multiplying two statistics:
1. Term Frequency (TF): the frequency with which a term occurs in each document. This up-weights terms that occur frequently in particular documents, under the assumption that words which occur more frequently have more influence.
2. Inverse Document Frequency (IDF): the inverse of the fraction of documents in which the term occurs. The larger the number of documents containing the term, the smaller the IDF. This down-weights terms that are present in a wide fraction of the corpus.
Depending on the application, various formulae are used to calculate the TF-IDF value. IDF is a heuristic used to increase term specificity; although it has worked well empirically, there are still attempts to understand its theoretical foundations along the lines of Information Theory. TF-IDF has been shown to be a very effective transformation of bag of words in information retrieval (Ramos). Studies show that TF-IDF captures local, document-wide relevance of terms via TF and corpus-wide relevance via IDF (Blei and D.).
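The two steps above map directly onto scikit-learn, which is one of the libraries used in this project; the snippet below is a small illustrative sketch (the feedback strings are invented) that builds unigram and bigram counts and then re-weights them with TF-IDF.

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

docs = [
    "delivery was late and the delivery fee was high",
    "great images but the checkout page is very slow",
]

# Unigram + bigram bag-of-words counts
vectorizer = CountVectorizer(ngram_range=(1, 2))
counts = vectorizer.fit_transform(docs)
print(sorted(vectorizer.vocabulary_)[:5])   # a mix of unigram and bigram features

# Re-weight the raw counts with TF-IDF
tfidf = TfidfTransformer().fit_transform(counts)
print(tfidf.shape)   # (2 documents, number of unigram and bigram features)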
  • 25. 2.5.3 Topic Classification
Topic classification is the process of categorizing content into predefined topics. This can be done using a statistical language model that maps documents to topics. Such classifiers are called model-based topic classifiers and they are very common in practice. In model-based classification, the algorithm gathers a training set containing observations that are already classified and uses this dataset to build a topic model; the derived model is then used to infer topics for new documents. There are two main machine learning techniques used to build supervised topic models: max-margin based models and maximum likelihood based models. Max-margin models such as Support Vector Machines (SVM) carry out pattern recognition using vector geometry in Euclidean space; this approach tries to find the linear model that separates the classes with the maximum margin. The latter approach, Maximum Likelihood Estimate (MLE) based techniques such as the Naïve Bayes classifier, builds a probabilistic model that represents the likelihood. Both approaches have been used successfully with bag of words (BOW) features (Wang and Manning, 2012). Training max-margin classifiers is relatively less computationally expensive than training MLE based models, because max-margin models are based on a convex optimization problem (quadratic programming) and hence have a single minimum (no local minima). Studies have also shown that max-margin methods are effective in both text-based classification and regression problems (Zhu et al., 2009). There is also extensive research on combining the maximum likelihood approach and the max-margin method: Max Margin Markov Networks (M3N) (Taskar et al., 2003) and the Maximum Entropy Discrimination Latent Dirichlet Allocation (MedLDA) model (Zhu et al., 2012) use the characteristics of both approaches together.
2.5.3.1 Support Vector Machines (SVM)
Support Vector Machines are a family of supervised learning models used for both classification and regression. The basic SVM is a non-probabilistic binary linear classifier that uses a set of labelled training observations to build a linear model for classification. The examples are represented as data points in a Euclidean space, and the algorithm derives the linear hyperplane that separates the two classes of examples with the maximum possible margin (Cortes and Vapnik, 1995). SVMs can handle non-linear classification problems by transforming the data points into a high-dimensional Hilbert space using the kernel trick (Boser et al., 1992).
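As a concrete sketch of how such a classifier can be assembled with scikit-learn, the snippet below combines TF-IDF features with linear SVMs in a one-vs-rest arrangement so that a feedback entry can receive more than one topic; the feedback strings and topic names are invented for illustration and do not reflect the actual Qubit data or the models tuned later in this work.

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

# Toy labelled feedback; each entry may belong to one or more topics
feedback = ["delivery took two weeks", "cannot find my size anywhere",
            "late delivery and nothing in stock"]
topics = [["delivery"], ["size"], ["delivery", "stock"]]

binarizer = MultiLabelBinarizer()
y = binarizer.fit_transform(topics)

model = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2), stop_words="english")),
    ("svm", OneVsRestClassifier(LinearSVC(C=1.0))),
])
model.fit(feedback, y)

# Predict topic indicators for an unseen feedback entry
print(binarizer.inverse_transform(model.predict(["my delivery never arrived"])))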
  • 26. There are several advantages to using SVMs for classification. SVMs:
• use only a subset of the observations to build the decision boundary, and are therefore memory efficient;
• are effective in high-dimensional feature spaces, since their ability to learn is largely independent of the dimensionality of the feature space;
• are effective when the number of observations is smaller than the number of features;
• allow different transformations through the dual form (kernel functions: radial basis, sin, Gaussian, string and many more), and string kernels come in very handy in text analytics;
• support different regularization approaches to control the overfitting problem.
The disadvantages of SVMs are:
• likelihood (confidence) estimates are not straightforward to obtain;
• if the number of observations is very small compared with the number of dimensions, performance can be poor.
2.5.3.2 Naïve Bayes Classifier
The Naïve Bayes classifier is a probabilistic approach that applies Bayes' rule to an observation while naively assuming that the features are independent of each other. It calculates the probability of observation n having label y conditioned on the values of its features x.
Figure 2.2: Naïve Bayes classifier model
Maximum A Posteriori (MAP) estimation is used to estimate P(Y) and P(X|Y); Naïve Bayes computes P(Y) by counting observations in the training set. There are several types of Naïve Bayes classifiers, such as Multinomial, Bernoulli and Gaussian, depending on the assumptions placed on the distribution P(X|Y).
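Since Figure 2.2 appears only as an image, the classification rule under the naive independence assumption can be written out explicitly; this is the standard textbook formulation rather than a reproduction of the figure:

\hat{y} = \arg\max_{y} \; P(y) \prod_{i=1}^{d} P(x_i \mid y)

where d is the number of features, P(y) is the class prior estimated by counting, and each P(x_i | y) is the per-feature likelihood whose assumed form (multinomial, Bernoulli or Gaussian) gives the classifier its variant name.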
  • 27. Naïve Bayes classifiers are advantageous because they:
• are extremely simple, since only counting is involved;
• converge very quickly;
• perform reasonably well even if the independence assumption does not hold;
• give probability estimates in a straightforward way;
• make a decent baseline classifier.
The disadvantages are:
• a lack of error guarantees;
• Naïve Bayes is known to be a poor estimator, so its probability values are not taken too seriously.
2.5.3.3 Benefits of using SVM for text classification
The SVM is a learning method well founded in theoretical understanding and analysis, and SVMs are a good learning tool for a few reasons.
Strong learning theory base: The SVM stands very strong in statistical learning theory, as it is inspired by the concept of max-margin learning (Cristianini and Shawe-Taylor, 2000) and is based on structural risk minimization.
Better parameter tuning: The margin argument also suggests heuristics for selecting good parameter settings for the learning algorithm. Joachims (1997) suggests how to select the kernel width for the Radial Basis Function (RBF) kernel, which allows more efficient parameter tuning without cross-validation.
Apart from the SVM's general ability to learn, this family is suitable for text categorization for several reasons.
Effective in feature spaces with high dimensionality: As the size of the feature space does not limit the learning ability of SVMs, they are very suitable for text, where the feature space is the vocabulary and can span several thousand dimensions. The existence of the dual form in max-margin classifiers also makes SVMs ideal for text data with a large number of features.
Few irrelevant features: One way to avoid high-dimensional input spaces is to assume that only a few features are relevant, but in natural text most features (tokens) are relevant. Experiments by Wang and Manning (2012),
  • 28. training Naïve Bayes models on top-ranking features, show that even features ranked far down the list carry a considerable amount of information.
Effective with sparse feature vectors: Documents produce sparse feature vectors, as each document contains only a tiny fraction of the entire vocabulary. Kivinen et al. (1995) show that additive algorithms with an inductive bias similar to that of SVMs are ideal for problems with sparse observations and dense concepts.
Linearly separable problems: Topic classification often leads to linearly separable data spaces due to the nature of natural language. Different words have different meanings, and because the topics in a topic classification problem are often independent of each other, the relevant features (words or tokens) are usually different for each topic. For this reason, text categorization problems are often linearly separable, and max-margin approaches such as SVMs are very well suited to them.
Because of the above theoretical reasons and the empirical evidence consistently visible across studies (Wang and Manning, 2012, Joachims, 1997, Kivinen et al., 1995), Support Vector Machines are a very suitable technique for text classification. SVMs further strengthen the case by allowing extensions such as string kernels (Lodhi et al., 2002) and different regularization techniques that tailor the solution to the complexity of the text data.
2.5.4 Latent Topic Detection
Natural language is structured in such a way that documents carry words that relate them to more abstract concepts called topics. Normally, a topic can be represented by a group of words that are significantly meaningful in describing it. These topics are assumed to be "hidden" (latent) inside corpora, and they are often unknown. From a machine learning perspective, latent topic detection is an unsupervised learning problem, and approaches to it can be construed in two ways:
Dimension reduction: Latent topic detection can be interpreted as a dimension reduction process in which all the words in a document are reduced to groups of highly correlated words that belong to topics. Examples of algorithms using this approach are Latent Dirichlet Allocation (LDA) and Latent Semantic Indexing (LSI). This approach is fairly popular for latent topic detection and has shown very promising results.
  • 29. Clustering: Latent topic detection can also be interpreted as a clustering problem. In this approach, documents are treated as observations and clustering is based on how similar the documents are; the corpus is clustered into groups of documents that resemble each other.
2.5.4.1 Latent Semantic Indexing (LSI)
While TF-IDF provides a good representation of the words that are discriminative for documents in a collection, it does not provide much information about inter-document or intra-document statistical behaviour. Information retrieval researchers have suggested LSI (Deerwester et al., 1990) to address this problem. LSI is a singular value decomposition based dimension reduction technique that finds the linear subspaces that best preserve the variance of the data. A significant further enhancement of the concept was later achieved with probabilistic Latent Semantic Indexing (pLSI) (Hofmann, 1993). pLSI models each word in a document as a sample from a mixture model: each word is generated by a single topic, and different words in a document may be generated by different topics. Both LSI and pLSI have their advantages:
• a more sensible and informative way to model natural language text;
• they can map large documents into reduced descriptions in linear subspaces;
• they can capture linguistic aspects such as polysemy and synonymy;
• they are useful for developing generative models from corpora.
There are also disadvantages:
• singular value decomposition and probabilistic parameter training are computationally expensive;
• it is possible to model text more directly by fitting a model using maximum likelihood or Bayesian methods.
2.5.4.2 Latent Dirichlet Allocation (LDA)
LDA is a probabilistic unsupervised topic detection approach that is used to detect the hidden topics in a document corpus. It uses a probabilistic graphical model that has been
  • 30. designed to represent the natural language generation process, in order to find concepts/topics (groups of significant words that co-occur) that are hidden in the documents. In the graphical model of LDA there are repeating plates, one for each document and one for each topic. The data generation model assumed in LDA is as follows. For each document:
• the number of words N is drawn from a Poisson distribution;
• the topic distribution is drawn from a Dirichlet distribution;
• for each word,
– choose a topic from a multinomial distribution, and
– choose a word from the multinomial word generation distribution conditioned on that topic.
The plate notation of the LDA graphical model is given below.
Figure 2.3: LDA notation
W is the specific word observation; this variable is observed in the documents, while the rest are latent variables. Alpha (α) is the parameter of the Dirichlet prior on the per-document topic distribution. Beta (β) is the parameter of the Dirichlet prior on the per-topic word generation process. Theta (θ) represents the topic distribution of each document, and Z is the topic assignment for each word in the document. The number of topics is a variable that has to be chosen manually to suit the statistical structure of the text and the ultimate goal.
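Written out, the generative story above corresponds to the standard per-document joint distribution of LDA (given here in its textbook form, not reproduced from Figure 2.3):

p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta) \, p(w_n \mid z_n, \beta)

where θ is the document's topic mixture, z_n is the topic assigned to the n-th word and w_n is the observed word; marginalizing over θ and z and taking the product over documents gives the likelihood of the corpus.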
  • 31. Chapter 5. iterature Review 18 Most natural language models deal with large vocabularies. This raises a serious issue of sparsity. It is highly likely to observe new words in new observations that have never been encountered in training corpora. As Maximum likelihood estimates of multinomial distributions assign zero value to such words, a smoothing step can be added to LDA graphical model to assign probabilities to all words regardless of they were encountered in training phase. There are very good properties in LDA that suite it for latent topic detection: • It is an enhancement of pLSI that uses a Bayesian approach • It is specifically designed for Latent topic detection • More suitable for small datasets as Bayesian approach addresses over fitting • It is highly modular, can be easily changed, extended or tailored to specific appli- cations • Distributed and online LDA versions help deal with time costs There are also some disadvantages of using LDA: • The learning of parameters is computationally expensive • No guarantee to converge to global minima which demands multiple runs 2.5.4.3 Spectral Clustering Spectral clustering is a clustering technique that uses dimensionality reduction before clustering the dataset. As the name suggests, spectral clustering uses the “spectrum” (eigenvalues) of the data to do clustering. Eigenvalues are computed using the similarity matrix that represents the similarity between different observations. Conventionally, data is not transformed using Eigen vectors before clustering. But, Spectral clustering does otherwise. Spectral clustering is ideal when the structure of the cluster is highly non convex or if the center and the spread of the observations is not a trivial representation of the cluster (Which is very likely in document representations). Spectral clustering comes in two flavors Normalized and Un-normalized. Unnormal- ized spectral clustering uses the unnormalized graph Laplcian of the similarity matrix. (Von Luxburg, 2007) The normalized method based on normalized graph Laplacian comes in different flavors (Shi and Malik, 2000, Ng et al., 2002) The advantages of spectral clustering are as follows:
  • 32. Chapter 5. Literature Review 19 • Suitable for clustering problems with non-trivial structure (cluster center and spread), which is ideal for text data • Converges to a global minimum, which guarantees consistent solutions • Elegant mistake bounds and guarantees • Kernels can be used to tackle non-linear data landscapes Disadvantages of Spectral Clustering are: • The more general nature of the algorithm doesn’t specifically focus on text data • Computing eigenvectors is computationally expensive 2.5.4.4 Benefits of using LDA for text classification After careful consideration and objective analysis, the Latent Dirichlet Allocation algorithm is an ideal choice for latent topic detection. There are several reasons behind this justification. The ideal choice should be a tool that can enable state of the art performance and accuracy while leaving sufficient room for enhancing and specializing it to the task at hand. When considering both these requirements, LDA appears to be the most suitable amongst the candidates. The specificity of the solution: The main use case of plain LDA is latent topic detection. LDA has been specifically designed to detect hidden topics in natural language. The graphical model shown in Figure 2.3 is specifically designed to model the different elements of the text data generation process. The underlying distributions of different variables and their dependencies directly map to text data generation. Due to this reason, LDA is a strong candidate for latent topic detection. Proven for accuracy: Spectral clustering is a more generic solution that works well for problems whose statistical behavior resembles that of NLP problems. But LDA is specifically built to solve the latent topic detection problem. It pays more specific attention to the details of the text data generation process, which allows LDA to deliver better performance and accuracy. Although pLSI is more accurate than LSI, LDA is an enhancement of pLSI: the main improvement of LDA over pLSI is that LDA is the Bayesian version of pLSI. LDA also inherits better performance in handling overfitting in small datasets due to its Bayesian formulation.
  • 33. Chapter 5. iterature Review 20 High modularity: Extending LDA is relatively feasible as it is a highly modular graph- ical model. Correlated topic models (Blei and Lafferty, 2006) can be built by modeling the relationship between latent topics. By modeling topics in a hierar- chy using a Chinese Restaurant Process, hierarchical LDAs (hLDA) can be built. (Blei et al., 2004) The abstract model can further be enhanced into semi super- vised (Zhu et al., 2012) and supervised learning (Blei, 2012) spaces. Due to the high degree of extensibility of LDA model, there is more room for creativity within the solution. Due to the above reasons, LDA is a great choice for a experimental text analytics project of this nature. LDA is a more suitable model specifically built to address text data based topic detection. This choice compliments the solution by enabling high degrees of extensibility allowing to further tailor the tool to solve the problem in hand. 2.6 Topic Consistency Evaluation Capturing the latent topics (set of words) has always been an interesting research area in Natural Language Processing. Although statistical evaluation of topic models (using model perplexity (Wallach et al., 2009)) is reasonably investigated and understood, there has been much less work on evaluating sematic quality of the learned topics. Some efforts are evident in using point wise mutual information to evaluate topic coherence. A very effective intrinsic evaluation technique uses external resources such as WordNet, Google, Wikipedia to predict topic coherence. The models are based on link overlap, term co-occurrence, ontological similarity and various other features. The best performing approach amongst these is using the term co-occurrence within Wikipedia data based on point wise mutual information.(Newman et al., 2010) 2.7 Tools and Technologies 2.7.1 Amazon Mechanical Turk (MTurk) Lately, there has been a lot of popularity in using crowdsourcing to enhance productivity and efficiency in many fields. Data collection and classification is one of the applications in the forefront. A lot of recent research initiatives base the knowledge of the crowds as one of the core components of the project. Projects such as Zooniverse (zoo, 2012) is a good example for this trend.
  • 34. Chapter 5. iterature Review 21 MTurk is an online marketplace that enables individual entities (people and organiza- tions) to coordinate the use of human intelligence to get their work done. This is one of the services offered under Amazon Web Services 1. The main application of MTurk is to use distributed human intelligence in tasks that cannot be fulfilled by computers. In MTurk, a “requester” advertises a job that will be offered to available “workers”. The requester defines the interfaces and other technology required for data entry by using the MTurk Application Programming Interface (API). Requesters can further set qualifications and the price being paid for work undertaken. The “workers” will look at the jobs advertised and work on their desired jobs as they desired and get paid for the amount of work they fulfill. Mechanical Turk is very popular amongst research communities to generate the required data economically. Various studies have shown that although Amazon mechanical Turk is fairly less successful in representing specifi- cally defined populations, it is a great tool for data generation using random sampling (Berinsky et al., Paolacci et al., 2010, Buhrmester et al., 2011). It is also evident that MTurk is fairly economical in conducting surveys and collecting data as the cost is close to half the minimum wage in the US.(Horton and Chilton, 2010) As the project involves topic classification, Amazon MTurk is needed to carry out the initial classification of the dataset. The feedback observations should be placed in MTurk to get them classified into different labels. Then this labeled dataset can be used to carry out supervised topic classification. 2.7.2 Python 2.7 (Programming Language) Python is mainly a scripting language whose design philosophy emphasizes most on code readability. It is a high level programming language that allows programmers to express ideas in simpler and fewer lines of code compared to programming languages such as C++ and Java.(Summerfield, 2008, Guido) Python implementation was started in 1989 mainly led by Guido van Rossum, as a successor of ABC language. Python, like many other programming languages today, is a multi-paradigm program- ming language. It fully supports Object Oriented Programming (OOP) and Structured Programming paradigms. The main attraction of python is its support of Functional Programming paradigm. The core design philosophy of the language itself is the element of fascination about this simple yet powerful tool. • Beautiful is better than ugly • Explicit is better than implicit 1 https://www.mturk.com
  • 35. Chapter 5. iterature Review 22 • Simple is better than complex • Complex is better than complicated • Readability counts (Peters, 2008) Python releases come in 3 types. Backward incompatible versions, major feature releases and bug fixes. Backward incompatible versions are identified by the increasing first part of the version number (eg: Python 2.x vs. Python 3.x). These releases are not expected to work seamlessly with each other. The code has to be manually ported. Major feature releases are identified by increasing second part of the version(eg: Python 2.6 to 2.7). The code is expected to work seamlessly among feature releases. Bug fixes do not get different version numbers. Python 2.7 is an excellent choice for the project due to various reasons. The main rationale behind moving with Python 2 is because Python 3 is not mature and libraries around python 3 is scarce. On the other hand, Python 2 has been maturing for close to a decade and has heaps of special purpose libraries that can embed special features into it. Version 2.7 is idea because it is the latest python 2 version. The support community around python 2 is also mature and dense. The special purpose data handling and machine learning libraries available in Python 2 (such as scikit-learn, pandas, nltk and etc. . . ) are very vital to achieving better and results. 2.7.3 Special purpose libraries Python 2 has a large standard library. Having a massive arsenal of tools suitable for many tasks is one of Python’s greatest strengths. This property of Python is emphasized by using the clich´e “Batteries Included”. The amazing thing about Python is that it is not essential to include the full standard library when running or embedding it. Python packages are standardized and published at Python Package Index (PyPI) 2. As of July 2014, there is more than 46800 python libraries indexed in PyPI. The main functionality covered in python packages include: • System Frameworks: GUI, Web, Multimedia, DataBase • Support Tools: System Administration, Test Suites, Documentation tools 2 https://pypi.python.org/pypi
  • 36. Chapter 5. iterature Review 23 • Scientific Computing: Numerical, Statistical, Text processing, Machine Learning, Visualization, Image Processing Amongst them, the Scientific computing range of libraries are essential for Text analytics projects to carry out the data handling, pre-processing, machine learning, analytical and visualization components. Some of the libraries used for this project are outlined below. 2.7.3.1 Pandas Pandas is an open source, high performance set of data structures and data analysis tools developed for data analysis in Python programming language.(McKinney, 2013) This li- brary offers functions and data structures to manipulate and manage large datasets with numerical tables and time-series data. Pandas was born in 2008 when Wes McKinney started working on creating a more performing and flexible tool to perform quantitative analysis on financial data at AQR Capital Management. There are some great features in Pandas that makes it the ideal choice for a data man- agement tool. • Fast and efficient DataFrame object similar to DataFrame object found in R • Intelligent data alignment and flexibility in reshaping data into different forms • High performance data merging and managing engine that also incorporates hier- archical indexing capabilities • Ability to convert datasets from In-memory data structures to CSV, text, SQL and HDFS data formats and vice-versa • Time series structures, scipy and numpy compatibility Above properties make Pandas the ideal tool for handling large datasets for data process- ing and machine learning tasks. The choice complements the range of other python li- braries that are compatible with pandas data structures (such as scikit-learn and NLTK) 2.7.3.2 Scikit-learn Scikit-learn is a machine learning toolkit written in Python. It is an open source project and is operated to seamlessly operate with python numerical libraries Numpy, Scipy and matplotlib.(Bressert, 2012) Scikit-learn features a range of machine learning algorithms
  • 37. Chapter 5. Literature Review 24 that enable classification, regression, clustering and dimensionality reduction. It also features additional tools such as model selection and data preprocessing algorithms that complement the machine-learning offering. Scikit-learn started as scikit.learn, a Google Summer of Code project which was started by David Cournapeau. The codebase was later rewritten by other contributors. Scikit-learn is one of the libraries that has gained popularity and is being well maintained to this day. It also provides an elegant API to useful machine learning algorithms. (Buitinck et al., 2013) Scikit-learn is a very useful machine-learning library with useful algorithms such as SVMs, Naïve Bayes classifiers, Nearest Neighbor and a lot more. This Python library is particularly suitable for this project as it also contains text-specific preprocessing algorithms such as stop word deletion, n-gram bag-of-words transformation, the tf-idf vectorizer and so on. 2.7.3.3 Natural Language ToolKit (NLTK) The Natural Language ToolKit is a computational linguistics library package developed in the Python programming language. Commonly known as NLTK, this package was initially developed by Steven Bird, Edward Loper and Ewan Klein to support research and teaching in NLP. (Bird et al., 2008) This NLP toolkit includes various linguistic functions such as stemmers, lemmatizers, visualizers, graphical demonstrations and sample datasets. The toolkit is further supported with a book explaining the concepts realized via the language processing tasks in the toolkit. (Bird et al., 2009) It is also accompanied by a cookbook. (Perkins, 2010) This toolkit is suitable for this project for numerous reasons. The main one is that it covers a wide range of natural language processing tasks that are potential elements of this project. It contains stemmers, lemmatizers, vectorizers and other language tools that can transform text. The other main feature that complements this project is that NLTK has a wrapper that allows the use of scikit-learn via the API. 2.7.3.4 PyEnchant A main problem that arises when dealing with human-generated text is the numerous spelling errors that add noise to the data. When humans enter text, they make mistakes due to various reasons such as literacy, ignorance, accidents and so on, but all measures should be taken to stop these trivial errors from affecting the analyses. The best way to address this issue is to use a spell checker to correct trivial errors in the document.
  • 38. Chapter 5. Literature Review 25 PyEnchant is the Python binding of the Enchant library, which is written in C. PyEnchant is a generic spell checking library that can be used to correct spelling errors. The Enchant project was developed as part of the AbiWord project. Ryan Kelly maintains PyEnchant. It has the capability to load multiple backends at once, such as Aspell, Hunspell, AppleSpell and so on. PyEnchant makes sure that users can use the native spell checker on various platforms (Mac OS X, Microsoft Office). It also provides the user with the functionality to load dictionaries, add custom words, query if a word is spelt correctly and request spell correction suggestions for a misspelt word. The above functionality in PyEnchant enables spell checking and spell correction in feedback text, which is prone to spelling mistakes. Therefore, PyEnchant can help the analyses by correcting the spelling mistakes and hence lowering the noise in the dataset. 2.7.3.5 Gensim Gensim is an open source topic modeling and Vector Space modeling toolkit that is being developed in the Python programming language. It uses numpy and scipy to optimize data handling and Cython to increase performance. The main focus of Gensim is to build online algorithms that can handle large corpora. Gensim has been carefully designed to address the issues of scalability and memory limitations that were holding back large text analyses. (Řehůřek and Sojka, 2010) Gensim provides tf-idf vectorizers and other text analysis algorithms such as random projections, the Hierarchical Dirichlet Process (HDP), Latent Semantic Indexing (LSI) and Latent Dirichlet Allocation (LDA). Gensim is ideal for this project as it includes all the tools to vectorize the text and then run different latent topic detection algorithms such as LSI and LDA. This library complements this ability by also implementing the distributed/online versions of these algorithms. One good example of this is the inclusion of the online LDA algorithm (Hoffman et al., 2010), which will be very useful for this project.
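As a pointer to how these pieces fit together, the short sketch below shows the typical Gensim workflow of building a dictionary, converting documents to bag-of-words vectors and fitting an LDA model; the tokenised example documents and the topic count are placeholder assumptions, not the project data.

from gensim import corpora, models

# Tokenised (already preprocessed) documents; the contents are illustrative only
texts = [["delivery", "slow", "expensive"],
         ["price", "cheap", "discount"],
         ["delivery", "cost", "price"]]

dictionary = corpora.Dictionary(texts)                # map each token to an integer id
corpus = [dictionary.doc2bow(doc) for doc in texts]   # sparse bag-of-words vectors

# Fit a batch LDA model; Gensim also provides online and distributed variants
lda = models.LdaModel(corpus, id2word=dictionary, num_topics=2, passes=10)
for topic in lda.show_topics(num_topics=2):
    print(topic)

# Infer the topic mixture of an unseen document
print(lda[dictionary.doc2bow(["cheap", "delivery"])])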
  • 39. Chapter 3 Data Collection and Preprocessing 3.1 Introduction The first step of analyzing text-based data for topics is to collect the required data and to preprocess it to be compatible with the analyses. The primary approach towards building a feedback analysis engine is to build a consistent topic classifier and to build a strong topic detector. The classifier will enable classifying new observations into one or more already defined topics. Building a topic classifier is a supervised learning problem. Labeled observations are required to solve supervised learning problems. The topic detector will enable the engine to learn new emerging topics in the system on the go. This system is required to gain more insight into the meaning of documents that do not belong to any of the topics already defined in the classifier. This is an unsupervised learning problem as the system has no prior knowledge about what topics are present in the system (new topics are not defined). The observations that are being classified as “None of the above” would be ideal for such a task. Once the data is collected, it has to go through several preprocessing steps to filter and improve the effectiveness of data items. Preprocessing enables extracting extra information, better structuring information and reducing the background noise in data before data is analyzed. In this chapter, a detailed discussion will be presented about the data collection process and it results. Explaining the pre-processing steps used to enhance user feedback data will further complement it. 26
  • 40. Chapter 3. Data Collection and Preprocessing 27 3.2 Data Collection Phase The initial dataset for building the text analytics engine was extracted from the feedback collection system implemented by Qubit Digital. This system prompts customers, seeking valuable feedback about products and services, while they are browsing the company website. Feedback given by the customer is then stored in a database. The customer feedback observations for this study are extracted from the aforementioned database. 3.2.1 Labelling the observations As the project involves a supervised learning problem, the observations have to be labeled first. Amazon Mechanical Turk was used to label the observations. This is due to two main reasons: 1. Enable maximum utilization of the budget 2. Reasonable reliability for the cost incurred in data collection 10 topics were initially defined for the study, namely Delivery, Images, Latency, More Functions, Navigation, Price, Products, Range, Size and Stock. An additional category called “None of the above” was added for the observations that do not belong to any of the groups. Altogether there are 11 labels in the classification task. The classification task was designed to be carried out in the form of a Multiple Choice Question (MCQ) survey where the worker has to pick one or more choices (topics) that he/she believes the observation (feedback text) belongs to. The HIT (Amazon, 2014) starts once the worker starts the job. A feedback observation from the dataset is presented to the worker with the 11 topic choices. The worker has to pick one or more topics (checkboxes are presented) and submit the job. Then he/she is presented with another HIT. Most workers will complete multiple HITs in one session. As mentioned in Section (MTurk), there is a major concern about using Amazon Mechanical Turk for dataset classification. As the compensation for labour is relatively low, workers are likely to be motivated to finish HITs quicker in order to earn more money by completing a greater number of HITs. In the context of MCQ surveys, this might motivate the worker to randomly select choices to complete more HITs per unit time. Researchers should make sure that the workers do not compromise accuracy for high throughput.
  • 41. Chapter 3. Data Collection and Preprocessing 28 In order to address this issue, a trust based heuristic approach is used to quantify the reliability of workers. This approach will be discussed in detail in section 2.4. The data collection approach used to generate the relevant dataset for the task is as follows. First, feedback observations were randomly selected from the customer feedback database. After this, the data classification task is launched using Amazon MTurk. The data classification task is run in two major phases: Initial Trust Modelling phase: In initial trust modeling phase, primary focus is to understand the reliability of the data classification process. An initial sample of the selected observations is used to analyze the behavior of data in terms of worker accuracy and reliability in MTurk. Depending on the behavior of the workers, a set of heuristics and metrics are derived to further assess and refine the reliability of the data collection process. More details about how the trust modeling heuristics are derived is explained in section 2.4 Controlled Data Collection: Once the Heuristics are empowered, the full dataset will be classified using MTurk. This process can be assessed continuously while data classification is going on. Therefore, it is possible to control the reliability factor of data. Once the data is classified, the metrics derived can be used to assess the final dataset and clean the dataset further. The final data collection process and post data collection cleaning phase is described in section 3.2.6 3.2.2 Initial Trust Modelling phase When using MTurk, requesters should take precautions to make sure that they get clean data for analyses. Like every computer system, supervised learning algorithms too follow the Garbage In Garbage Out (GIGO) principal. Therefore considering the topic classification task at hand, the cleanliness and accuracy of the labeled data is extremely vital to ultimate accuracy of the topic classification engine. The primary focus of the Initial Phase is to understand the behavior of the worker reliability in MTurk. One reliable and realistic approach to understand the behavior of the worker reliability is to run an empirical analysis on the process. That is to launch a pilot job and find out how successful the HITs will run. The structure of MTurk is vital to planning the investigation procedure. In Amazon MTurk, workers log in from all over the world to commit to HITs and carry them out. These workers are very likely to be independent from each other. The dataset generation process for the initial phase is as follows:
  • 42. Chapter 3. Data Collection and Preprocessing 29 • Initially, 10,000 unique records were randomly selected from the user feedback database to create a dataset • Each unique observation in the dataset was replicated twice to generate 20,000 more observations. Therefore the dataset consisted of 30,000 observations each with 3 replicates of each unique observation. • Then the dataset was thoroughly shuffled to randomize the sequence of records. • This dataset will be called “The Silver Sample” Ultimately, this dataset was published MTurk as the initial classification job. Once 20,000 observations were complete, the job was kept on hold for a couple of days. This was done deliberately to attract a new set of workers and increase diversity and worker independence of results. Once all 30,000 observations have been classified, the results were collected and analyzed. Dataset consisted of 2 files that contain 19,998 and 10,000 records. The dataset consisted of numerous fields of which the most important fields were extracted and written to a different file. The new dataset consisted of the worker ID, feedback text and the classification. Using this data, several analyses related to the topic distribution, worker scores and question scores were carried out. 3.2.2.1 Topic Distribution of the dataset When Analyzing the topics, we used the ordering of the topics in the User Interface (UI) in order to see if the selections are biased towards the top, middle or the bottom selections of topic choices in the UI. The topics are as follows: From figure 3.1, it is evident that most topics are uniformly distributed except topics 3,7,9 and 10. But this is natural as user feedback can come on any topic and it is likely that some topics are discussed more than the others. The observation makes more sense, as the topics discussed more often are nominally Price, Navigation and More Functions. It is also realistic that the highest number of records belongs to “None of the above” category. This means that customers express feedback about numerous other topics and they do it more frequently. Above observation is an early indication that the workers are not biased towards the obvious choices in UI. This is a good indicator about the reliability of the workers.
  • 43. Chapter 3. Data Collection and Preprocessing 30
Ref Number | Topic/Label
0 | Delivery and Delivery cost
1 | Page loading time, errors
2 | Images
3 | Price
4 | Products
5 | Product Range
6 | Size
7 | Navigation, Search and Design
8 | Stock Availability
9 | More site functionality
10 | None of the above
Table 3.1: Legend of X-axis labels for topic histograms
Figure 3.1: Topic Distribution of the initial dataset
3.2.2.2 Initial Sanity check As an initial sanity check, a “golden set”, a completely random sample of 100 records, was selected from the dataset and classified manually to check the accuracy of the sample. This is a good way to quickly check whether the classification process is accurate. Although the sample size of 100 is reasonably small for a population of 30,000 records in terms of confidence, it is a quick measure that helps gain more knowledge at very little cost in time and effort. Around 90% of the records were correctly classified in the randomly selected sample. Therefore, it is sensible to regard this as another good indicator that the initial data classification process was successful. There are only around 10% misclassifications in the random sample, so it is highly probable that the misclassifications are not due to deliberate error. With the accuracy level obtained, it is highly unlikely that making fast money and randomly picking choices motivate the
  • 44. Chapter 3. Data Collection and Preprocessing 31 workers. Due to this reason, a soft assumption can be made that the workers are all motivated towards classifying the data correctly and the misclassifications are mainly due to misunderstanding the context or other human error beyond their control. 3.2.2.3 Worker Scoring As Mechanical Turk job was set up in such a way that three independent workers classify each feedback, majority voting can be used to verify correct answer and hence measure the consistency of the workers. As majority voting can partially verify the accuracy of the classification, concordance of topic choices per single observation can be used as a measurement of confidence level. More concordance gains more confidence on worker classifications. The underlying assumption for such a claim is: If a worker selects topic choices that are similar to those of other independent workers who evaluate the same record, • The selection is more likely to be correct, as two other independent workers have chosen the same selection. Odds are very small for such an occurrence happen randomly when there are 11 choices. • The worker is very likely to possess more ability in finding the right topics, as the worker shows to have the ability to pick the correct answer. Over continuous choice, the consistency of the correct answer is clear evidence of workers ability to classify correctly. On the contrary, if the worker selects topic choices that are different from other workers who evaluate the same question, • The worker is very likely to possess less ability in finding the right topics and more likely to be misclassifying. • But, this doesn’t necessarily mean they are misclassifying, but only that it is highly unlikely that they are classifying correctly. With the above-mentioned logical framework in mind, a concordance score based worker scoring algorithm was built in order to evaluate the reliability of different workers. In order to formulate a scoring function for worker reliability, the following lists have to be generated first:
  • 45. Chapter 3. Data Collection and Preprocessing 32 1. RECORDS1: that contains the list of unique feedbacks in the dataset • Consists of two columns feedback, Indices[index1, index2, index3] • First column represents the unique feedback record • Second column gives a list of indices where this feedback record is repeated (as 3 replicates are present, each index list will have 3 entries) 2. Workers1: unique list of worker Id s in the dataset • Consists of two columns workerID,Score • First column represents the unique worker ID given by MTurk to each worker • Second column holds the computed score for each worker The algorithm for scoring is presented in pseudo code below: Algorithm 3.1: foreach unique feedback in RECORDS1: load the replicate observations using the indices create a label count histogram for label occurrences (in all observations) // to normalize the label count for the number workers foreach entry in label histogram: divide entry by number of workers for that feedback (1,2 or 3) end for // now we have normalized scores for each class within the feedback entry // now lets start scoring workers foreach observation having feedback value: initiate tempScore = 0.0 foreach label in the chosen observation: tempScore += (Normalized label count from histogram) end for // normalize the tempScore for number of labels in that observation tempScore /= (number of labels in observation) Add the normalized tempScore to the employee’s score in WORKERS1 end for end for
  • 46. Chapter 3. Data Collection and Preprocessing 33 count the observation entries per worker normalize the worker score by number of jobs he/she did The Python implementation of algorithm 3.1 is outlined below import gensim import pandas as pd import sklearn import numpy import sys import qbPreprocess as qbPre import qbPrepare as qbPrepare import qbReliability as qbRel # ################################################################################################ # this function computes inference probability for each docuemt for each topic def doTopicAnalysis (path ,topic ,ordData ): lda = gensim.models.ldamodel.LdaModel.load(’data/results/resultSet_ {0} ’.format(path )) numTopics = lda.num_topics # No. of topics in the saved model dictionary = gensim.corpora.Dictionary.load(’bow.dict ’) mm = gensim.corpora.MmCorpus(’data/corpora/corpus_ {0}. mm’.format(topic )) topicDensity = numpy.zeros ([len(mm),numTopics ]) ## to store the average probability values y = qbPrepare.generateY(ordData) classes = qbPrepare. yTransformer .classes_ i = 0 for feedback in mm: doc_bow = lda[feedback] # compute probabilities doc_bow = numpy.array(doc_bow) # convert to numpy array vec = numpy.zeros(numTopics) # new fresh vector for entry in doc_bow: vec[int(entry [0])] = entry [1] # fill the vector with probabilieties # normalize the probability vector norm = sum(vec) vec /= norm maxP = numpy.argmax(vec) topicDensity [i ,:] = vec; # assign value i += 1
  • 47. Chapter 3. Data Collection and Preprocessing 34 scatter = numpy.zeros ([len(classes),numTopics ]) i = 0 for topic in classes: indices = ordData[ordData.answer == topic ]. index tempMat = topicDensity [indices ,:] tempVec = numpy.mean(tempMat ,axis =0) scatter[i ,:] = tempVec i += 1 reducedTopicDensity = topicDensity ; temp = lda.show_topics(numTopics) file = open(’data/topics/resultSet_ {0}. txt’.format(path), "w") i=0; for t in temp: file.write(’topic :{0}================================== n{1}nn’.format(i,t)) i += 1 file.close () # print reducedTopicDensity .shape return reducedTopicDensity ,temp ,scatter # ########### main script ########################################### ## This script computes the topic consistency scores ## main parameters threshold = 0.9; # threshold for record specificity ci = 1.0; # confidence interval scale path = sys.argv [1] topic = path.split(’_’)[0]; ordData = qbPre. readDataFrame (’data/write/ dataSet_None_ {0}. csv ’.format(topic),None ,0) # Load the statistics and probabilities topicDensity ,topics , scatter = doTopicAnalysis (path ,topic ,ordData) numTopics = topicDensity .shape [1] # table to store supervised task based results ( precision and recall ) vals = numpy.zeros ([ numTopics ,2]) # table to store the unsupervised task based results # (lda topic number , positive score , negative score , aggregated score) valsUn = numpy.zeros ([ numTopics ,4]) docCount = len(ordData[ordData.answer == topic ])
  • 48. Chapter 3. Data Collection and Preprocessing 35 wholeCount = len(ordData) diff = len(ordData[ordData.answer == ’None ’]) # foreach topic for i in xrange (0, numTopics ): topicVec = topicDensity [:,i]; # choose inference values for the topic wholeCount = len( topicDensity ) docs = topicVec[topicVec >= threshold] # docs that are top percentile positives docsP = numpy.where(topicVec >= threshold) # indices of top percentile positives docsIndex = list(docsP )[0] notDocs = topicVec[topicVec <1.0 - threshold] # docs that are least percentile negative # indices of docs that belong to the same labeled topic realIndex = numpy.array(ordData[ordData.answer == topic ]. index) # true positives >> intersection tPostives = numpy.intersect1d(docsIndex ,realIndex) # normalise probabities in relation to full distribution # sum of probabilities of all documents equals 1 noralised = True; # parameter can be True/False if noralised: docs /= numpy.sum(topicVec) notDocs /= numpy.sum(topicVec) valsUn[i,0] = i; # LDA topic number valsUn[i,1] = numpy.sum(docs ); # total score of positive documents notDocs = 1.0- notDocs; # linear negative tranformation valsUn[i,2] = numpy.sum(notDocs ); # total score of negative documents valsUn[i,3] = valsUn[i ,1]+ valsUn[i ,2]; # aggregate of positive negative score # compute precision and recall using labelled dataset for comparison precision = float(len(tPostives ))/ float(len(docsIndex )) recall = float(len(tPostives ))/ float(len(realIndex )) # store values for perticular topic vals[i ,0] = precision vals[i ,1] = recall ## normalise the unsupervised positive negative scores vertically dNormalised = True; # parameter can be True/False if dNormalised : valsUn [: ,1] /= numpy.sum(valsUn [: ,1]) # positive proportions valsUn [: ,2] /= numpy.sum(valsUn [: ,2]) # negagive proportions valsUn [: ,3] = 2* valsUn [: ,1]+ valsUn [: ,2] # weighted proportions # topic with scores maxP = numpy.argmax(vals [: ,0]) # topic with thighest precision maxPUn = numpy.argmax(valsUn [: ,1]) # topic with highest positive score
  • 49. Chapter 3. Data Collection and Preprocessing 36 ## you only need this if the positive values are too close to eachother ... # select the second highest positives tempValsUn = numpy.delete(valsUn ,maxPUn ,0) # delete highest from the table max2PUn = numpy.argmax(tempValsUn [: ,1]) # select highest after that -> 2nd vec = vals[maxP ,:] # load the supervised record for most precise LDA topic vecUn = valsUn[maxPUn ,:] # load unsupervised record for most positive LDA topic # compare the overall score on 2 most positive LDA topics comparer = numpy.zeros ([2 ,4]) comparer [0 ,:] = vecUn; # record with highest positive score comparer [1 ,:] = tempValsUn[max2PUn ,:] # record with 2nd highest positive score # compute standard diviation of positive scores among LDA topics std = numpy.std(valsUn [: ,1]) step2 = False; maxP2Un = int(comparer [1 ,0]) vec2Un = valsUn[maxP2Un ,:] # if difference between first 2 records is smaller than scaled s.d.: if (comparer [0,1]- comparer [1,1])<ci*std: step2 = True; # write stats to file file = open(’data/stats/statistics_ {0}. txt’.format(path), "w") file.write(’n’) for i in xrange (0, numTopics ): file.write(’topic :{0}: {1}nn’.format(i,topics[i])) file.write(’topic :t{0}t | precision :t{1}t | recall :t{2}tn’.format(i,vals[i ,0]*100.0 , v file.write(’topic :t{0}t | docs :t{1}t | notDocs :t{2}t | summary: t{3}nn’.format(i,va file.write(’nn Most Consistant topic nn’) file.write(’topic :t{0}t | precision :t{1}t | recall :t{2}tn’.format(maxP ,vec [0]*100.0 , vec [1 file.write(’nn Results nn’) file.write(’topic :t{0}t | docs :t{1}t | notDocs :t{2}t | summary: t{3}n’.format(maxPUn ,vec file.write(’topic :t{0}t | docs :t{1}t | notDocs :t{2}t | summary: t{3}nn’.format(maxP2Un , if step2: file.write(’values too close !!’) sortedArr = qbRel.pickObs(topicDensity ,10, maxPUn) topNdocs = ordData. declaration [sortedArr] file.write(’nn Top {0} observations nn’.format (10))
  • 50. Chapter 3. Data Collection and Preprocessing 37 for i in topNdocs: file.write(’{0}nn’.format(i)) file.close () When analyzing, all the workers that had completed fewer than 100 jobs were ignored, as these observations provide very limited information compared to workers who have done more jobs. Under this elimination criterion, 10 workers among 37 unique workers remained. The table below provides the summary statistics of those workers. Figure 3.2: Worker aggregated performance over lifetime The explanation of the fields is as follows: Worker ID: the unique worker ID assigned by MTurk for privacy reasons Jobs: the total number of HITs completed Max Ratio: number of times the worker scored the highest score / number of total jobs Min Ratio: number of times the worker scored the lowest score / number of total jobs Mean: the average topic selection over the whole lifetime Mode: the most frequently selected topic Mode Ratio: the ratio (%) with which the worker selected the most frequent label classification Median: the median topic selection over the whole lifetime From this table, we can observe that the majority of the workers have obtained scores close to 80%, which means most of their classifications agree with other independent workers. Another observation is that all the workers who have scores >80% have pleasing Min Ratio values. The similarity between Mean and Median among all workers suggests that the feedback text is being distributed uniformly among them for classification.
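For reference, a compact sketch of the concordance-based worker scoring of Algorithm 3.1 is given below. It assumes a pandas DataFrame with hypothetical columns worker_id, feedback and labels (the list of topics ticked in that HIT); the column names and the toy rows are illustrative, not taken from the actual dataset.

from collections import Counter, defaultdict
import pandas as pd

def score_workers(df):
    # Concordance-based worker scores in the spirit of Algorithm 3.1
    totals = defaultdict(float)   # summed per-observation scores per worker
    jobs = defaultdict(int)       # number of HITs completed per worker
    for _, group in df.groupby("feedback"):             # all replicates of one unique feedback
        n_workers = len(group)
        hist = Counter(l for labels in group["labels"] for l in labels)
        norm = dict((label, count / float(n_workers)) for label, count in hist.items())
        for _, row in group.iterrows():
            # mean normalised label count over the labels this worker chose
            obs_score = sum(norm[l] for l in row["labels"]) / float(len(row["labels"]))
            totals[row["worker_id"]] += obs_score
            jobs[row["worker_id"]] += 1
    # normalise each worker's aggregate score by the number of jobs he/she did
    return dict((w, totals[w] / jobs[w]) for w in totals)

# Toy example: two unique feedbacks, each classified by three workers
df = pd.DataFrame({
    "worker_id": ["A", "B", "C", "A", "B", "C"],
    "feedback": ["slow delivery"] * 3 + ["too pricey"] * 3,
    "labels": [["Delivery"], ["Delivery"], ["Delivery", "Price"],
               ["Price"], ["Price"], ["Range"]],
})
print(score_workers(df))

Fully concordant selections push a worker's score towards 1.0, while selections that disagree with the other replicates pull it down, matching the logical framework above.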
  • 51. Chapter 3. Data Collection and Preprocessing 38 3.2.2.4 Experience of worker Further analysis was carried out to see if the lifetime of the worker (0%-100%) has an impact on worker performance. Figure 3.3 plots how different workers maintain their aggregate normalized accuracy score over their lifetime classifying the silver set. Figure 3.3: Worker aggregated performance over lifetime The plot in figure 3.3 depicts how the score of different workers changes with respect to the number of HITs they carry out. Each line on this plot represents a distinct worker. To normalize across workers, the number of HITs was converted to percentage completed (X-axis); therefore all workers are represented in terms of their lifetime progress. The Y-axis is the normalized aggregated score: the Y value represents the exact trust score a worker has achieved over his/her lifetime. The smoother lines with detailed movements represent the workers who have done more HITs. The coarser lines with long straight-line segments represent workers who have done a small number of HITs. There are several important observations that can be drawn from this plot. • The workers who have responded to fewer questions tend to do badly • Most workers tend to do well from the beginning • The majority of workers tend to do better than 75% accuracy
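The curves in this figure are essentially running means of the per-HIT concordance scores plotted against the fraction of a worker's HITs completed. A minimal sketch of computing and plotting one such curve, with made-up per-HIT scores, could be:

import numpy as np
import matplotlib.pyplot as plt

def lifetime_curve(scores):
    # running aggregate score of one worker against % of their HITs completed
    scores = np.asarray(scores, dtype=float)
    positions = np.arange(1, len(scores) + 1)
    cumulative = np.cumsum(scores) / positions
    lifetime_pct = positions * 100.0 / len(scores)
    return lifetime_pct, cumulative

# made-up per-HIT concordance scores for a single worker
x, y = lifetime_curve([0.3, 0.6, 1.0, 1.0, 0.8, 1.0, 0.9, 1.0])
plt.plot(x, y)
plt.xlabel("% of lifetime completed")
plt.ylabel("aggregated normalised score")
plt.show()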
  • 52. Chapter 3. Data Collection and Preprocessing 39 3.2.2.5 Unique Feedback Scoring Just like assessing worker reliability, it is also possible to assess the complexity of a feedback observation. By using the same concordance based approach as earlier, the difficulty level of observations can be measured. The underlying assumption for this claim is: If an observation gets the same topic selection from multiple independent workers, • The selection is more likely to be correct, as three independent workers have chosen the same selection. Odds are very small for such an occurrence to happen randomly when there are 11 choices. • The observation is very likely to be easy to understand as multiple workers have been able to select the correct classification for it On the contrary, an observation gets the different topic selections from multiple inde- pendent workers, • The workers are very likely to possess less ability in finding the right topics and more likely to be misclassifying the particular observation • This also suggests that it is highly unlikely that they are classifying correctly hence making the observation confusing and difficult. With the above framework in mind, a concordance score based observation scoring algo- rithm was built in order to evaluate the difficulty of different feedback observations. In order to develop the difficulty scores for feedbacks, the following lists have to be defined first, 1. RECORDS1: that contains the list of unique feedbacks in the dataset • Consists of two columns feedback, Indices[index1, index2, index3] • First column represents the unique feedback record • Second column gives a list of indices where this feedback record is repeated (as 3 replicates are present, each index list will have 3 entries) The algorithm for scoring is presented in pseudo code below: Algorithm 3.2:
  • 53. Chapter 3. Data Collection and Preprocessing 40 foreach unique feedback in RECORDS1: load the replicate observations using the indices create a label count histogram for label occurrences (in all observations) // to normalize the label count for the number workers foreach entry in label histogram: divide entry by number of workers for that feedback (1,2 or 3) end for // now we have normalized scores for each class within the feedback entry // now lets start scoring the feedback value initiate tempScore = 0.0 foreach label in the label count histogram, tempScore += (normalized label count from histogram) end for // normalize the tempScore for number of labels in that feedback value tempScore /= (number of labels in histogram) end for As same as in scoring workers, the observations of workers who have completed fewer than 100 HITs were discarded. Figure 3.3 shows the histogram of the distribution of concordance-based difficulty amongst 10000 unique feedback observations. In figure 3.4, the X-axis represents the normalized score for feedbacks. It can span from 0.33 (1/3) – 1.0. The Y-axis represents the frequency of different scores in the dataset. The histogram takes a step shape, as this is a cumulative histogram. From figure 3.3, it is observed that most of the scores received are greater than 0.5. It is also evident that the majority (close to 50%) of feedbacks have received a score of 1.0, which means they have got full concordance. This means that there is at least 50% chance that a feedback will receive full concordance. It is also strongly visible that almost 75% of the feedback observations have scored at least 0.5. These statistics are supportive evidence towards the consistency of the classification process.
  • 54. Chapter 3. Data Collection and Preprocessing 41 Figure 3.4: Step Histogram of Feedback difficulty score 3.2.2.6 Time of Day It is also worthwhile to see if the time of day has an impact on the classification. Some- times, the error rate may not be due to cognition or understanding, but due to drowsi- ness, tire and other factors. Therefore it is interesting to analyze the worker statistics and plot the worker performance on the silver set in terms of the time of day they re- sponded. Figure 3.4 shown the exact score each observation scored on different time of day. Figure 3.5: Score for observations on different time of day (0001h-2400h)
  • 55. Chapter 3. Data Collection and Preprocessing 42 Figure 3.5 plots the score for each observation against the time of day they were captured. Each dot on the scatter plot represents a distinct observation. The X axis represents the time of day they were captured (0000h-2400h). The Y-axis represents the score obtained by the observation using algorithm 3.2.From this plot, it is observable that there are no trends suggesting correlations between the hour of day to the performance of the workers. It seems like the scores are scattered uniformly throughout the timespan. Therefore, it is evident that the time of day has no significant effect on the score of the observation. 3.2.2.7 Replicated feedback scoring A selection criterion is formulated to select the most reliable classification amongst the 3 replicates. Rather than depending on the earlier derived worker reliability metric to compare the 3 replicates, it is more sensible to use criteria that are dependent on concordance. As same as above arguments, more concordance gains more confidence in the correctness. The underlying assumption for this claim is as follows: • 3 replicates can have different label combinations • False positive classifications have a significant effect on training as it affects the decision boundary • False negatives do not have such an impact as a missing classification doesn’t impact the decision boundary like a misclassification • For example, if example A is classified [Price] and example B is classified [Price, Shipping], the function should score example A more. This is because if label shipping is a classification mistake, it will have greater negative impact on the classification model. The algorithm for scoring is presented in pseudo code below: Algorithm 3.3: foreach unique feedback in RECORDS1: load the replicate observations using the indices create a label count histogram for label occurrences (in all observations) // to normalize the label count for the number workers
  • 56. Chapter 3. Data Collection and Preprocessing 43 foreach entry in label histogram: divide entry by number of workers for that feedback (1,2 or 3) end for // now we have normalized scores for each class within the feedback entry // now lets start scoring workers foreach observation having feedback value: initiate tempScore = 0.0 foreach label in the chosen observation: tempScore += (normalized label count from histogram) end for // normalize the tempScore for number of labels in that observation tempScore /= (number of labels in observation) end for // after scoring, pick the highest scoring observation compare the tempScore earned by each observation select the observation with highest tempScore for the final dataset end for This algorithm will allow scoring each observation and to select the most reliable obser- vation in terms of label concordance. 3.2.3 Selection and Rejection Criteria Aforementioned analyses give a very good understanding about how the data is being generated. These results can be used to define a few observation selection and rejection criteria. Firstly, a few observation rejection criteria was defined to make sure only the observations classified by the most reliable workers are being selected. Reject all observations classified by workers who have completed less than 100 HITs: There are a few motivations behind this decision. • Figure 3.3 suggests that the workers who have completed a very few tasks tend to underperform. • Lack of continuous involvement may suggest that the workers are not committed enough to the job.
  • 57. Chapter 3. Data Collection and Preprocessing 44 Reject the first 5% of classifications by all selected workers: Figure 3.3 in section (SECTION) clearly shows that all workers tend to do badly at the beginning of the task before they start gaining on their cumulative score. The figure further shows that the pivoting point falls within the 0%-5% interval of their lifetime. Therefore, the interval between 0% and 5% can be identified as a “burn-in” period where the worker gets accustomed to the nature of the job and adapts to it. Reject observations from workers who had a final aggregate score of less than 75%: Figure 3.3 suggests that all workers who have a cumulative score <75% have been misclassifying continuously. Once the observations are eliminated from the initial dataset based on worker performance, a fair fraction of replicated observations that have been classified by reliable workers is still left. Rather than depending on the earlier derived worker reliability metric to compare the 3 replicates, it is more sensible to use algorithm 3.3 to score the observations. Algorithm 3.3 uses label concordance to weight observations. This will assure that the observation with the highest concordance score gets selected for the final dataset that will be used to train the supervised learning model. As figure 3.5 suggests no time-of-day based trends, no observations were discarded on this basis. 3.2.4 Directions for final data collection Although it is possible to enforce the same data collection technique for the remaining data collection phase, it is very expensive as 3 HITs have to be allocated for each unique observation. The remaining budget was sufficient for 45,000 unique HITs. If the earlier method were used, only 15,000 unique observations would be obtained. The initial study is clear evidence that worker reliability and performance are satisfactory. Therefore it would be a waste of resources to enforce measures as strong as those of the initial data collection phase; it seems like overkill. Therefore, a more lenient data collection methodology was enforced for the final data collection phase. Figure 3.4 clearly shows that more than 5,000 unique observations have obtained full concordance. These observations have obtained the same combinations of labels from 3 independent classifications carried out by 3 independent workers. According to the argument in section 3.2.2.3, a random occurrence of this nature is highly unlikely given there are 11 choices. Due to this reason, these observations are assumed to be correctly classified. Although this is a soft assumption, its validity is highly
  • 58. Chapter 3. Data Collection and Preprocessing 45 probable. As this is a soft assumption, this data will be called the “Silver Sample” in upcoming sections. These fully concordant observations (observations that received the same classification in all 3 replicates) are used to keep track of the performance of the final data collection phase. For the final data collection phase, 40,000 new feedback entries were randomly selected from the feedback database. No replicates were generated from this dataset. After that, 5,000 observations were randomly picked from the set of fully concordant observations. Then, these 5,000 fully concordant observations (Silver Sample) were sprinkled into the 40,000 record dataset. Both the new records and the silver sample records were merged into one dataset and shuffled. Shuffling guarantees that the silver set is properly mixed with the dataset. Therefore, all workers are equally likely to classify Silver Sample observations during the final data collection process. This allows the requester to continuously analyze the overall health of the data collection process by assessing how the current workers are classifying the Silver Sample. 3.2.5 The final strategy for data collection The final strategy for data collection can be summarized as follows: 1. Publish the classification job in MTurk using the newly created dataset (40,000 new records + 5,000 silver sample) 2. Set up Bachelors degree as the minimum required qualification for workers 3. Constantly check how the workers are responding to Silver Sample examples 4. If the performance for Silver Set is bad, hold the job. 5. Then restart the classification job after a few days with a new set of workers 3.2.6 Final Data collection phase Once the dataset was published for classification, 15,000 observations were classified in 2 days. The analysis showed that workers were classifying the silver sample observations consistently. Therefore the job was not halted. Once the remaining 30,000 records were classified, results were downloaded and analyzed. Figure below shows that the workers in final classification task have performed very well.
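A minimal sketch of this sprinkling step and of the ongoing health check against the silver sample is given below; the DataFrame column names and the exact-match agreement measure are illustrative assumptions rather than the project's actual code.

import pandas as pd

def build_final_batch(new_records, silver_sample, seed=7):
    # mix the silver sample into the new records and shuffle the result
    batch = pd.concat([new_records, silver_sample], ignore_index=True)
    return batch.sample(frac=1, random_state=seed).reset_index(drop=True)

def silver_agreement(results, silver_reference):
    # fraction of silver-sample observations whose new label set matches the reference labels
    merged = results.merge(silver_reference, on="feedback", suffixes=("_new", "_ref"))
    matches = merged.apply(lambda r: set(r["labels_new"]) == set(r["labels_ref"]), axis=1)
    return matches.mean()

If the agreement on the silver sample drops noticeably, the job can be held and restarted later with a fresh pool of workers, as described in the strategy below.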
  • 59. Chapter 3. Data Collection and Preprocessing 46 Figure 3.6: The Score distribution of the silver set observations in final dataset Figure 3.6 plots the cumulative histogram of the score distribution of the final dataset. The plot considers the 5,000 observations from the silver sample that were sprinkled into the final dataset. The X-axis represents the similarity score between the old workers and the new workers. A score of 0.0 means the new workers have classified the observation completely differently from the 3 replicates from the initial phase. A score of 1.0 means the new worker has classified the observation exactly as the replicates did. Any score in between represents partially similar classifications. From figure 3.6, more than 83% agreement between the new worker opinion and the replicates is evident. These very good results strongly suggest that the data classification process is very consistent. 3.3 The Final Dataset The final dataset was prepared in the following manner: • Altogether 50,000 unique feedback records were classified (10,000 + 45,000 - 5,000) • Observations from the Silver Sample were added • From partially concordant observations in the initial dataset, the most reliable observations were selected using the Algorithm 3.3 scoring function • The 40,000 unreplicated observations from the final data collection phase were added to the dataset. • Then the observations classified by unreliable workers were discarded using the criteria specified in section 3.2.4
  • 60. Chapter 3. Data Collection and Preprocessing 47 • Multiple duplicate records were detected in the two classified datasets. Only the most reliable classification for these duplicates was chosen using algorithm 3.3 • 5,133 records were discarded due to the above reasons • The final dataset consists of 44,877 unique observations 3.3.1 The topic distribution of the final dataset Figure 3.7: Topic Distribution of the final dataset From the histogram in figure 3.7, it is evident that there is no User Interface bias in the final dataset. The large fraction of “None of the above” observations supports the trend observed in the initial data collection stage (Figure 3.1). The high frequencies of Price and Navigation also follow figure 3.1. Therefore, it can be concluded that the data collection has been executed well. 3.4 Preprocessing steps Once the data is collected and cleaned, the dataset should be prepared for learning tasks. A number of pre-processing steps are necessary in order to prepare the dataset for
  • 61. Chapter 3. Data Collection and Preprocessing 48 supervised topic classification. The text preprocessing steps used are standardization, spell correction and stemming, applied in that order. These steps have to be applied sequentially for best results. 3.4.1 Text Standardization Text standardization is used to make sure that all text complies with a single standard. This is important to make sure that variations between tokens caused by capitalization, punctuation and other grammatical conventions are removed. It also helps to make the tokenization process less cumbersome by eliminating exceptional use of characters. As part of text standardizing, several rules were applied to the text at hand. 1. All text was converted to Unicode encoding (UTF-8). This is important to convert all special characters and non-Unicode data into Unicode data, which is easy to manage. Standardization enhances safety by eliminating probable exceptions due to different encodings. 2. All text was converted to lower case. This step is important for word standardization. 3. All the special characters in the dataset were removed. Any character other than letters and numbers was replaced by a space. This is important as special characters are likely to affect the word tokenization process. They can also cause runtime exceptions while transforming data. Once the three above-mentioned standardization steps are applied to the dataset, the result obtained is a UTF-8 encoded text dataset with only lower case letters and numbers; any non-alphanumeric characters are removed. This dataset is very clean. Standardization plays a vital role in text preprocessing, as special characters and improper encoding can cause a lot of problems in spell correction, stemming and text tokenization. 3.4.2 Spell correction Spell correction is required for this dataset as the records are customer-generated feedback content. Customers who generate feedback have no formal obligation to create accurate content. Factors such as the user’s level of commitment and urgency lead to trivial spelling errors. These errors have to be corrected as part of pre-processing to increase the cleanliness of the dataset. Spell correction gets rid of a reasonable amount of removable noise in text data. For this study,
• 62. Chapter 3. Data Collection and Preprocessing 49
the pyEnchant spell checking library is used. As the target customer base of the feedback collection system is Great Britain, the British English dictionary (“en-GB”) was used as the spell correction reference (AbiWord, 2005). For spell correction, a misspelled word is always replaced by the first suggestion from the PyEnchant 1 library. Although this occasionally leads to another erroneous suggestion, it produces the correct word in most cases and therefore helps to clean the dataset reasonably well. When an erroneous word returns no suggestions, the word is left uncorrected.
3.4.3 Stemming
Stemming acts as a dimension reduction technique that reduces different variants of the same word to its stem form. The Porter stemming algorithm (Porter, 1980) has been used to stem the feedback dataset in this study, as it is widely used and known to give good results. The Porter stemmer is located in the nltk.stem.porter module of the NLTK 2 Python library. No special parameter changes were imposed on the stemming procedure.
3.5 Preprocessing sequence
The above-mentioned preprocessing steps should be carried out in a particular sequence to get the best results. First, the text standardization step is carried out. This eliminates all the special characters except for alphanumeric characters. It also converts the whole corpus to lower case. Once standardization is complete, the spell correction step is run. As the standardization step has removed all the special characters, there is very little chance for exceptions to be thrown at this phase. During the spell correction phase, all the erroneous words are replaced with the closest suggestion in the British English dictionary. If no suggestions are found, the word is left unchanged. As there are no special characters in the corpus, spell correction is less challenging. This is the reason why standardization should be carried out before spell correction. After spell correction, the stemming step should be run. As the spelling has already been corrected in the dataset, stemming will consistently reduce words to their root form. Spell correction has to be carried out before stemming for this reason. Once stemming is complete, the corpus has been through all three pre-processing steps.
1 http://pythonhosted.org/pyenchant/api/enchant.checker.html
2 http://www.nltk.org/api/nltk.stem.html
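The following is a minimal sketch of how the spell-correction and stemming steps described above could be implemented with the pyEnchant and NLTK libraries referenced in the text; the function and variable names are illustrative assumptions rather than the project's actual code.

import enchant
from nltk.stem.porter import PorterStemmer

gb_dict = enchant.Dict("en_GB")   # British English reference dictionary
stemmer = PorterStemmer()

def spell_correct(tokens):
    """Replace each misspelled token with the first dictionary suggestion, if any."""
    corrected = []
    for word in tokens:
        if word and not gb_dict.check(word):
            suggestions = gb_dict.suggest(word)
            word = suggestions[0] if suggestions else word   # no suggestion: leave uncorrected
        corrected.append(word)
    return corrected

def stem(tokens):
    """Reduce every token to its Porter stem."""
    return [stemmer.stem(word) for word in tokens]

print(stem(spell_correct("the delivry charges were too high".split())))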
• 63. Chapter 3. Data Collection and Preprocessing 50
3.6 Preprocessing pipelines
Although the three preprocessing steps mentioned in section 3.4 can generate a cleaner corpus with a reduced vocabulary, spell correction and stemming may discard a fair amount of diversity from the corpus. For example, there might be special spelling trends that are not found in standard dictionaries but should be captured. Also, the different variants of words may be important for better determination of topics. For these reasons, spell correction and stemming might sometimes harm topic categorization. Taking these possibilities into account, three pipelines were used to preprocess the corpus for topic classification. A pipeline specification is identified using a bit string of length 3, which can hold the value 1 or 0 in each bit position. The first bit is 1 if standardization is applied, the second bit is 1 if spell correction is applied, and the third bit is 1 if stemming is applied. The three pipeline specifications consist of gradually progressing combinations of preprocessing steps. Text standardization is fundamental to preprocessing and hence applied in all 3 specifications.
Pipeline Specification 100:
1. Only applies text standardization
Pipeline Specification 110:
1. Applies text standardization
2. Then applies spell correction
Pipeline Specification 111:
1. Applies text standardization
2. Is followed by spell correction
3. Finally, stemming is applied
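As an illustration of how the bit-string specifications above could be composed, a small self-contained sketch is given below; the step implementations are simplified stand-ins for the standardization, spell-correction and stemming routines of sections 3.4.1–3.4.3 and are not the project's actual code.

import re
from nltk.stem.porter import PorterStemmer

_stemmer = PorterStemmer()

def _standardize(text):
    # lower-case and replace non-alphanumeric characters with spaces
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()

def _spell_correct(text):
    # toy stand-in for the pyEnchant-based correction sketched earlier
    return text.replace("delivry", "delivery")

def _stem(text):
    return " ".join(_stemmer.stem(word) for word in text.split())

STEPS = [_standardize, _spell_correct, _stem]

def preprocess(text, spec="111"):
    """Apply step i whenever bit i of the pipeline specification is '1'."""
    for bit, step in zip(spec, STEPS):
        if bit == "1":
            text = step(text)
    return text

for spec in ("100", "110", "111"):
    print(spec, "->", preprocess("The Delivry charges were HIGH!", spec))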
• 64. Chapter 4
Topic Classification for Labelled Observations
4.1 Introduction
The dataset has already been classified using MTurk and cleaned, and the final dataset has been prepared for topic categorization. Now it is time to use this labeled dataset to carry out topic categorization using supervised learning methods. The idea is to start from a linear Support Vector Machine (SVM) algorithm and extend to the kernelized dual form, as justified in section 2.4.3.3, only if needed. The kernelization and extension to the dual form are only necessary if the simple linear primal form is unable to achieve promising results. Due to the near-linear nature of natural language, it is unlikely that the dual form will be necessary.
4.2 Implementation techniques
There are several considerations that have to be made when using SVMs for topic categorization. Feature extraction is an important component: features are mainly extracted in the form of tokens and then transformed using the TF-IDF transformation. Choosing parameters also plays a vital part in achieving good results in SVM classification; parameters are involved in both the feature selection and classification phases. In addition to this, sampling the training and test sets and cross validation also play a role in enhancing the reliability of the results. 51
• 65. Chapter 4. Topic Classification for Labelled Observations 52
4.2.1 Extracting Features
At first, stop words are removed from the corpus. This eliminates trivial stop words from the dataset. As described in section 2.4.3, the n-gram Bag of Words is a very popular method of text tokenization. Using this feature extraction technique, unigram, bigram and unigram+bigram tokenizations are used after stop word removal. Unigram features capture the effect of individual words in the dataset, while bigram features allow the features to retain the sequence information about words. Once tokenization is complete, the TF-IDF transformation is used to weight the effect of different words. TF-IDF lends high discriminative power to words that occur in a limited number of documents.
4.2.2 Selecting Features
Regularization plays a big role in feature selection as it enables model selection by controlling overfitting. Regularization is theoretically justified as it tries to impose Occam's Razor on the solution model; the Bayesian point of view on this is similar to imposing specific priors on the model parameters. Lasso (L1) and Ridge (L2) are the two main regularization methods applied in supervised learning models. Lasso regularization is used in this particular problem for feature selection. This decision is inspired by a number of properties of Lasso regularization:
• Lasso's sample complexity grows only logarithmically with the number of irrelevant features
• This logarithmic dependence on the input dimension matches the best-known bounds for various feature selection settings (Ng, 2004)
• It is well suited to text data, which has a very large dimension space with many irrelevant features
• Lasso drives the weights of all irrelevant features to exactly zero, which allows complete elimination of irrelevant features from the model
• Convergence to zero weights is suitable for natural language, as irrelevant tokens can be ignored completely
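The tokenization and weighting described in section 4.2.1 can be illustrated with a short sketch, assuming scikit-learn's TfidfVectorizer; the toy corpus and parameter choices are illustrative only and are not taken from the project.

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["delivery was too slow", "free delivery would be great",
          "cannot find the right size", "prices are higher than elsewhere"]

# stop-word removal + n-gram Bag of Words + TF-IDF weighting in one transformer
vectorizers = {
    "unigram":        TfidfVectorizer(stop_words="english", ngram_range=(1, 1)),
    "bigram":         TfidfVectorizer(stop_words="english", ngram_range=(2, 2)),
    "unigram+bigram": TfidfVectorizer(stop_words="english", ngram_range=(1, 2)),
}

for name, vectorizer in vectorizers.items():
    X = vectorizer.fit_transform(corpus)          # sparse document-term matrix
    print(name, "feature space:", X.shape, "vocabulary size:", len(vectorizer.vocabulary_))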
• 66. Chapter 4. Topic Classification for Labelled Observations 53
4.2.3 Selecting Parameters
Certain parameters can be set in SVMs. They are:
• Multi-label or multi-class classification: multi-label classification assumes that each observation belongs to one or more labels (classes). Primarily, the One vs. Rest method is used for multi-label classification; a binary classifier is built for each label and the most confident classifications are chosen.
• Multi-class classification assumes that each observation belongs to only one of multiple labels (classes). Multi-class can be implemented using two main methods. A One vs. Rest classifier builds N binary classifiers for N classes/labels; in the multi-class scenario, the most confident label is chosen for the observation. A One vs. One classifier builds (N*(N-1))/2 classifiers, one for each pair of the N classes/labels. The Crammer and Singer approach (Crammer and Singer, 2001) is another strategy for multi-class SVM classification.
• In this work, the One vs. Rest classifier is used to conduct multi-label classification, the scenario where each observation can belong to one or more classes/labels.
• Regularization parameter: C is the main parameter of the primal form SVM. C controls the penalty for misclassification. A bigger C penalizes misclassification more heavily and can over-fit, while a smaller C allows more slack and can under-fit. The most preferred technique for choosing C is grid search.
• In the scenario where the dual form of the SVM is used, a new parameter comes into play, as a kernel function has to be chosen. The choice of kernel function is itself a parameter, and different factors are considered when choosing the right kernel for similarity.
• With the choice of different kernels, additional kernel-specific parameters are added to the model. For instance, the polynomial kernel adds parameters for the polynomial degree and the trade-off between lower and higher order terms.
4.2.4 Final Process Pipeline
Multiple experiments were run on topic classification using SVM as the primary learning algorithm. Three corpora were created from the dataset using the preprocessing pipelines specified in section 3.6 (specifications 100, 110 and 111). Briefly, the specifications are:
• 100 (Text Standardization)
• 110 (Text Standardization + Spell Correction)
• 67. Chapter 4. Topic Classification for Labelled Observations 54
• 111 (Text Standardization + Spell Correction + Stemming)
For each corpus, feature extraction was carried out using the 3 tokenization specifications described in section 4.2.1. 60% of the total observations were randomly selected as the training set and the remaining 40% was used as the test set. To ensure the statistical accuracy of the results, 5-fold cross validation was used during model training. The values 0.1, 0.5, 0.7, 1.0, 1.3 and 2.0 were used to tune the regularization parameter.
4.3 Results
Once topic classification is carried out, the results have to be assessed before selecting the best model for topic classification. The accuracy of the different models was assessed using misclassification error, which gives a more intuitive and realistic understanding of the results.
4.3.1 Evaluation metric
As the topic categorization problem is treated as a multi-label classification problem, the Hamming distance between the actual and predicted topic label matrices was used to assess misclassification error. Hamming loss (Tsoumakas and Katakis, 2007) computes the fraction of labels that were incorrectly classified. This is different from the Jaccard similarity score, which corresponds to subset accuracy: the labels predicted for an observation have to match the true labels exactly for the observation to count as correctly classified.
4.3.2 Unigram Model Results
The unigram-based classification model tokenizes the corpus on a word-by-word basis. Words are treated as independent features in this model. The primary motivation behind building this model is to analyze whether individual word features have an effect on the topic classification model. As mentioned above, the models were trained on corpora preprocessed under the 3 specifications of section 3.6. The results show that the regularization parameter C = 0.7 gives the best accuracy in 2 of the 3 classification models. The classification model trained with preprocessing specification 111 performs best.
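The end-to-end set-up described in sections 4.2.4 and 4.3.1 — TF-IDF unigram features, One-vs-Rest linear SVMs with L1 (Lasso) regularization, a 60/40 split and Hamming-loss-based accuracy — could be sketched as follows, assuming scikit-learn. The toy corpus, the fixed split and the simple loop over C are stand-ins for the 5-fold cross-validated grid search actually used; none of this is the project's code.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC
from sklearn.metrics import hamming_loss

docs = ["deliveri too slow", "price too high", "size run small", "navig confus",
        "free deliveri pleas", "cheaper price pleas", "size guid miss",
        "search and navig broken", "deliveri cost high and price high",
        "more size option", "deliveri late again", "price match competitor",
        "navig hard on mobil", "size chart wrong", "deliveri track miss",
        "drop the price", "improv navig menu", "petit size rang pleas",
        "deliveri slot full", "price alert featur"]
labels = [["Delivery"], ["Price"], ["Size"], ["Navigation"],
          ["Delivery"], ["Price"], ["Size"], ["Navigation"],
          ["Delivery", "Price"], ["Size"], ["Delivery"], ["Price"],
          ["Navigation"], ["Size"], ["Delivery"], ["Price"],
          ["Navigation"], ["Size"], ["Delivery"], ["Price"]]

Y = MultiLabelBinarizer().fit_transform(labels)          # multi-label indicator matrix
X = TfidfVectorizer(stop_words="english", ngram_range=(1, 1)).fit_transform(docs)
split = int(0.6 * len(docs))                             # 60% train / 40% test
X_tr, Y_tr, X_te, Y_te = X[:split], Y[:split], X[split:], Y[split:]

for C in (0.1, 0.5, 0.7, 1.0, 1.3, 2.0):                 # grid of C values used in the thesis
    clf = OneVsRestClassifier(LinearSVC(penalty="l1", dual=False, C=C))   # Lasso-regularized
    clf.fit(X_tr, Y_tr)
    accuracy = 1.0 - hamming_loss(Y_te, clf.predict(X_te))
    print(f"C={C}: Hamming-loss accuracy = {accuracy:.3f}")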
• 68. Chapter 4. Topic Classification for Labelled Observations 55
Figure 4.1: Best results for Unigram model in specification 111
Topic/Label  precision  recall  f1-score  support
Delivery     0.96  0.85  0.90  1106
Images       0.84  0.71  0.77  905
Latency      0.79  0.50  0.61  730
MoreFunc     0.70  0.40  0.51  2187
Navigation   0.79  0.62  0.69  3298
Price        0.91  0.83  0.87  1975
Products     0.69  0.21  0.32  780
Range        0.68  0.36  0.47  1298
Size         0.83  0.77  0.80  971
Stock        0.82  0.58  0.68  752
avg / total  0.80  0.59  0.67  14002
Table 4.1: Topic-wise classifier results breakdown (Unigram, specification 111)
From figure 4.1, it is evident that the best accuracy over the tuning parameter C is obtained when C = 0.7. It is also evident that all C values give very promising Hamming loss accuracy results with the standardization + spell correction + stemming specification. The 5-fold cross-validated average Hamming loss accuracy of this model is 93.9%. The per-label breakdown of the One-vs-Rest results for the overall classifier is given in table 4.1. From table 4.1, it is observable that Delivery and Price have performed extremely well in terms of both precision and recall, and the overall average precision of the individual topic classifiers is 80%. The evidence shows that almost all the topics perform fairly well with a linear separator. The support column also shows that the maximum number of support data points used in the SVM is 3298. For each One vs. Rest classifier, these support counts are very good values as they are around 10% of the training set
• 69. Chapter 4. Topic Classification for Labelled Observations 56
except for one occurrence. As the training set is 60% of the corpus, 10% of the training set is approximately 2,700 (45,000 x 60% x 10%).
4.3.3 Bigram Model Results
Bigram tokenization differs from unigram tokenization in that the bigram model tokenizes two consecutive words together as one token. This method preserves word order information. In some applications, word order features can be very important, as some words do not make sense individually; the phrase “Natural Language Processing” itself is a great example of this. The bigram features were also fit to the corpora processed with the preprocessing specifications outlined in section 3.6. Results show that bigram models perform best when the regularization parameter C = 1.0 in all 3 specifications. The results also show that the Hamming loss accuracy increases gradually when additional preprocessing steps are applied to the corpus. According to the results, specification 111 (Standardization + Spell Correction + Stemming) performs best among the bigram models.
Figure 4.2: Best results for Bigram model in specification 111
From the grid-search results in figure 4.2, it is evident that the model performs best when C = 1.0 with bigram features. The average accuracy using the Hamming loss misclassification error is 92.3% when C = 1.0. Table 4.2 below breaks down the One-vs-Rest results for the label-wise classifiers. From table 4.2, it is evident that the Delivery and Price classifiers have performed very well in the bigram feature space as well, but the recall values are not very promising in the bigram space in comparison to the unigram space.
• 70. Chapter 4. Topic Classification for Labelled Observations 57
Topic/Label  precision  recall  f1-score  support
Delivery     0.95  0.62  0.75  1130
Images       0.83  0.44  0.58  925
Latency      0.83  0.34  0.48  724
MoreFunc     0.68  0.26  0.37  2223
Navigation   0.79  0.41  0.54  3366
Price        0.91  0.57  0.70  1947
Products     0.67  0.14  0.23  759
Range        0.66  0.17  0.27  1317
Size         0.84  0.49  0.62  934
Stock        0.82  0.41  0.55  711
avg / total  0.79  0.39  0.52  14036
Table 4.2: Topic-wise classifier results breakdown (Bigram, specification 111)
The number of supports is also very satisfactory, as they are all around the 10% mark.
4.3.4 Unigram + Bigram Model Results
Unigram features identify the importance of individual word features, while the bigram model preserves the word order and emphasizes its importance in model building. Having both unigram and bigram features in the feature vector enables building the final model using both of the above-mentioned aspects of the corpus. The main downsides of this approach are:
• Feature extraction double counts words, once in unigram form and once in bigram form
• Unigram + bigram features lead to a very large vocabulary
From the results for unigram + bigram features, it is evident that the Hamming-loss-based accuracy gradually increases with increasing levels of preprocessing. Figure 4.3 shows how specification 111 performs best. When C = 0.7, the model outperforms the other parameter values in all preprocessing specifications. Table 4.3 below outlines the One-vs-Rest result breakdown for the label classifiers. From table 4.3, it is evident that the Delivery and Price classifiers perform fairly well in this feature space as well. Recall values are close to the performance in the unigram space. The number of supports is also very satisfactory, as they are all around the 10% mark.
• 71. Chapter 4. Topic Classification for Labelled Observations 58
Figure 4.3: Best results for Unigram + Bigram model in specification 111
Topic/Label  precision  recall  f1-score  support
Delivery     0.95  0.86  0.90  1087
Images       0.86  0.69  0.77  919
Latency      0.78  0.48  0.60  717
MoreFunc     0.70  0.39  0.50  2229
Navigation   0.79  0.60  0.68  3347
Price        0.91  0.81  0.86  1970
Products     0.69  0.18  0.28  779
Range        0.69  0.34  0.46  1305
Size         0.82  0.77  0.79  901
Stock        0.83  0.60  0.70  745
avg / total  0.80  0.58  0.66  13999
Table 4.3: Topic-wise classifier results breakdown (Unigram + Bigram, specification 111)
This feature space holds the least number of total supports, although it only beats the unigram space by 3 support data points.
4.4 Conclusion
The Support Vector Machine (SVM) algorithm is ideal for the topic classification phase, mainly due to its suitability to natural text data. The near-linearly separable nature of text and the suitability of feature selection techniques like Lasso for sparse feature vectors make SVM an ideal choice. Preprocessing also plays a vital role with text data. Therefore, 3 preprocessing specifications were used to generate 3 corpora from the same user feedback. Results consistently show that specification 111 (standardization + spell correction + stemming) has performed best in classification.
• 72. Chapter 4. Topic Classification for Labelled Observations 59
When comparing the 3 feature extraction techniques, it is evident that unigram features and unigram + bigram features perform reasonably better. The bigram-only feature space performs fairly poorly in comparison to the others; therefore, choosing bigram features is not ideal. In contrast to unigram features, unigram + bigram features take somewhat longer to train the model; the difference in training time is close to 0.5 seconds. As unigram tokenization leads to a smaller feature space that generates the best results, unigram features are the best choice for topic classification using SVMs. Grid search in the unigram feature space provides empirical evidence that 0.7 is the best choice for the regularization penalty C. Finally, the SVM model on unigram features of the corpus obtained using preprocessing specification 111 performs best for topic classification. The smaller vocabulary leads to a shorter training time. It achieves 93.9% Hamming-loss-based accuracy. The precision and recall on individual topics are also best in this model. The number of support vectors for each label is also very satisfactory in this model.
• 73. Chapter 5
Topic Detection Automation
5.1 Introduction
From the results of figure 3.2 in Section 3.2.2.1 and figure 3.7 in Section 3.3.1, it is evident that the largest fraction of observations does not belong to any of the pre-defined topic labels. As the results of the topic classification process are satisfactory in terms of the business requirement, it is worth attempting to build a topic detection framework that will enable the engine to detect emerging topics that are not already labeled in the dataset. This is useful for the organization as it gives an idea about the other topics customers provide feedback about, and it widens the quality of insight clients will gain from the feedback data. In the long run, topic categorization and detection will serve as a robust unified system that adapts to the dynamic feedback generation landscape.
The machine learning problem at hand in a latent topic detection process is an unsupervised learning problem. As detailed in section 2.4, it can be looked at as either a clustering or a dimension reduction problem. After careful analysis of the techniques available for topic detection, LDA is the ideal language model to use, due to the reasons detailed in section 2.4.4.4. The primary objective of the project is to automate the topic detection process as much as possible. As labeled data is already available in abundance from the MTurk-classified dataset, there is an opportunity to use this labeled dataset to explore a potential automation strategy for parameter tuning. The experiment designed attempts to use the labeled data to tune the ideal set of parameters that can be used to extract topics from the data at hand. It goes a step further to define a strategy to automatically identify strong/consistent topics within the dataset. The Experimentation Strategy section will give a better idea about the strategy devised. 60
• 74. Chapter 5. Topic Detection Automation 61
5.2 Experimentation Strategy
There are several considerations when using LDA for topic detection. Feature extraction is an important component of it; similar to the classification techniques, features are extracted in the form of tokens. Choosing parameters also plays a vital part in achieving good results with LDA. As LDA addresses an unsupervised learning problem, tuning parameters is not as straightforward as in a supervised learning problem, where quantitative evaluation techniques are readily available. In the case of topic detection, a fairly popular technique is to manually assess the topic concepts being detected.
5.2.1 First Phase
With labeled data available, there is an opportunity to use the statistical behavior of these labels to direct the parameter search in the unsupervised case. The Method section below discusses the techniques and rationale in detail. The labeled dataset has 11 topics/labels, including the “None of the above” observation set. This dataset was used in the first phase to tune the parameters to detect “some” of the labeled topics. This phase helps to understand the ideal LDA parameters appropriate for the statistical nature of the corpus at hand. The results of the topic inference are evaluated against the labeled dataset. The evaluation metric is described in section 5.4.1.
5.2.2 Second Phase
The second phase focuses on tuning the threshold parameters that are important for detecting new topics consistently. For this, 10 distinct datasets are generated from the primary dataset, where each dataset contains the “None of the above” observations and the observations belonging to one of the 10 remaining labels. This phase of the experiment trains LDA models on each of these mini corpora. The objective is to extract the labeled topic from each dataset.
5.2.3 Third Phase
Using the results from phase 2, a scoring function is built to measure the consistency of the strongest emerging topic. The same evaluation metric used in phase 2 is used to direct the experiment along the right path. Parameters such as the specificity threshold of the scoring function and the consistency evaluation technique have to be tuned.
• 75. Chapter 5. Topic Detection Automation 62
5.2.4 Fourth Phase
The fourth and final phase of the experiment uses the verified parameters to run topic detection on the “None of the above” observation set only. Results are manually assessed using human intelligence and reported.
5.3 Method
Before initiating the main experiment, the text has to be preprocessed. Due to the consistent results obtained in all classification experiments, preprocessing specification 111 outlined in section 3.6 is used for feedback preprocessing. After the text undergoes standardization, spell correction and stemming, the transformed dataset, with stop words removed, is used to train the LDA. The final dataset used for LDA is unlabeled, as the labels are discarded before training the model.
5.3.1 Phase 1: Tuning LDA parameters
The primary focus of Phase 1 was to tune the parameters of the Dirichlet priors for the dataset. The full dataset is used for this purpose. The parameters being tuned are:
Eta (η): the hyper-parameter of the Dirichlet prior that influences the sparsity of the topic-word distribution. The bigger the eta, the denser the distribution. A popular value for this hyper-parameter is 1/(number of LDA topics). In this experiment, a grid search was carried out over the range [0.001, 0.01, 0.03, 0.05, 0.08, 0.09, 0.095, 0.099, 0.1, 0.115, 0.12, 0.13, 0.2, 0.5, 1.0, 10.0].
Alpha (α): the hyper-parameter of the Dirichlet prior that influences the sparsity of the document-topic distribution. The bigger the alpha, the denser the distribution. A popular value for this hyper-parameter is 1/(number of LDA topics). In this experiment, a grid search was carried out over the range [0.001, 0.01, 0.03, 0.05, 0.08, 0.09, 0.095, 0.099, 0.1, 0.115, 0.12, 0.13, 0.2, 0.5, 1.0, 10.0].
Number of passes: the number of iterations the LDA will run. More iterations tune the probability distributions further, as each iteration step optimizes the distribution parameters. It is important to run enough passes to make sure that the final parameters have converged. Performance was evaluated for several numbers of passes, such as 50, 70, 100 and 200.
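A minimal sketch of this set-up, assuming the gensim library, is given below: one LDA model is trained per (alpha, eta) pair and per-document topic distributions are then inferred for comparison with the MTurk labels. The toy corpus, the reduced grids, the topic count and the pass count are illustrative only.

from gensim import corpora, models

texts = [doc.split() for doc in
         ["deliveri too slow", "free deliveri pleas", "price too high",
          "cheaper price pleas", "size run small", "more size option"]]
dictionary = corpora.Dictionary(texts)
bow_corpus = [dictionary.doc2bow(tokens) for tokens in texts]

for alpha in (0.01, 0.03, 0.1):          # thesis grid: 0.001 ... 10.0
    for eta in (0.1, 0.5, 1.0):
        lda = models.LdaModel(bow_corpus, id2word=dictionary, num_topics=3,
                              alpha=alpha, eta=eta, passes=50, random_state=0)
        # inferred topic distribution for every observation, used later for evaluation
        inference = [lda.get_document_topics(bow, minimum_probability=0.0)
                     for bow in bow_corpus]
        print(f"alpha={alpha}, eta={eta}:", lda.print_topics(num_words=3)[0])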
• 76. Chapter 5. Topic Detection Automation 63
In phase 1, a multilevel grid search is used to train multiple models in parallel. The grid search permutes the above parameters to find the set of parameters that regenerates the most of the expected results. Once the models are built, the inference part of LDA is used to infer topic distributions for individual observations. This topic inference is used to evaluate the performance of the model. In simpler words, the expected result is the model that regenerates the largest number of topics from the initially labeled dataset.
5.3.2 Phase 2: Simulating an emerging topic
The primary objective of the second phase of the experiment is to use the available labeled data to tune the threshold parameters for topic detection. The typical scenario is one where a new topic starts trending in the dynamic feedback collection system, and it is necessary to build a framework that automatically identifies this topic. According to the scenario outlined, the following can be assumed:
• The candidate data pool doesn’t belong to any of the pre-defined topics
• Usually, the candidate data pool doesn’t contain multiple strong topics trending at the same time
• Therefore the new emerging topic represents a fair fraction of the unclassified observations
The best approach to simulate this scenario using the data at hand is to treat each labeled topic as a new trending topic within the unclassified observations. In order to simulate this, 10 distinct datasets are generated. Each dataset contains the “None of the above” observations and the observations belonging to one of the remaining topics. The topic labels are discarded before training the models. Table 5.1 outlines the number of observations in each dataset. As the ideal parameters have already been found in phase 1, the chosen parameter combination can be used to train the LDA model on each dataset.
5.3.3 Phase 3: Developing a Topic Consistency scoring function
The performance of the model is assessed by how well the final LDA model identifies the labeled topic in each dataset. Using these results, the threshold parameters and the final topic consistency scoring function are derived.
• 77. Chapter 5. Topic Detection Automation 64
Topic/Label  Positive Documents  Negative Documents  Total Documents
Delivery    2863  2863  5726
Images      2292  2292  4584
Latency     1848  1848  3696
MoreFunc    4387  4387  8774
Navigation  7627  7627  15254
Price       5520  5520  11040
Products    1826  1826  3652
Range       3092  3092  6184
Size        2078  2078  4156
Stock       1463  1463  2926
Table 5.1: The number of observations in each dataset
The statistical behaviour of the results from phase 2 is used to develop a scoring function that enables detecting consistent topics. Furthermore, the inference probabilities are used to plot histograms to understand the inference score distribution. This score is used to build an ideal scoring function that can identify consistent topics. The results are evaluated using the precision and recall of the inferred documents against the labeled dataset.
5.3.4 Phase 4: Final topic detection
The final phase attempts to use the parameters and scoring function investigated in the preceding phases to detect emerging topics in the unclassified dataset. The “None of the above” observation set is modeled using LDA with the parameters chosen in phase 1. Once the topics are detected, the scoring function is used with the threshold parameters to detect the most consistent topic. The chosen topic and the observations inferred to belong to that topic are manually assessed.
5.4 Results
Once the topic detection experiment has been carried out in the phases described, the results are recorded and evaluated. The labeled dataset was used to evaluate the results of the first three phases, as they were conducted with the use of the labeled dataset. The final phase was a completely unsupervised phase that was qualitatively evaluated using human expert opinion.
• 78. Chapter 5. Topic Detection Automation 65
5.4.1 Evaluation Metric
As mentioned above, phases 1 and 2 of the experiment were designed around the labeled dataset. The objective of these phases is to tune the parameters of the model to regenerate the results of the labeled dataset, that is, to use machine learning to regenerate the results of the human intelligence task as far as possible. The best method to evaluate this is to compare with the labeled dataset itself. The final outcome of phase 1 is the same set of observations from the labeled dataset with topic inference probability values. If the inferred probability of an LDA-generated topic is high for a fair set of observations that belong to the same topic in the labeled dataset, it is highly likely that the LDA-generated topic resembles the manually labeled topic. The underlying assumptions for such a claim are:
• It is more likely for labeled topics to be detected, as the dataset contains observations that belong to pre-defined labels
• It is more likely for a labeled topic to emerge, as those observations are present in reasonable proportions in the dataset
• Given such a topic emerges as one of the LDA topics, the observations that belong to that label in the labeled dataset are very likely to get higher inference probability values
• Therefore most of the observations that belong to this particular topic will have a high probability for the same inferred topic and low probability values for other topics
• Similarly, observations that do not belong to that label in the labeled dataset will have low probability values for the particular inferred topic
• If an emerged LDA topic shows the above behaviour:
– It represents a topic that maps to one of the topics in the labeled dataset
– It shows specificity, as it only shows good inference probability for that particular emerged topic
– It is a potential candidate for a consistently emerged topic
In order to evaluate the above behaviour, the result set from the LDA is transformed. Firstly, the results dataset is partitioned into 11 groups of observations, where each group contains the observations that belong to the same topic/label in the labeled dataset. That is
• 79. Chapter 5. Topic Detection Automation 66
done using the labeling in the labeled dataset. Once this is done, the mean inferred probability is computed for each group of observations for each LDA topic. Given there are:
• N, the total number of observations
• L, the number of labels in the labeled dataset
• K, the number of LDA-inferred topics
this will generate an N × K inference matrix with p_{n,k} representing the probability of observation n belonging to LDA topic k. After partitioning this inference matrix into L partitions using the labeled dataset, the mean is computed for each LDA topic within each partition. The final outcome is an L × K matrix in which p_{l,k} represents the mean probability of the observations in group l belonging to LDA topic k.
Phase 2 uses the same evaluation metric outlined above. The only difference is that 2 labels are used instead of L:
1. None of the above: the label that contains observations that do not belong to any of the pre-defined topics
2. One of remaining: the label that contains observations belonging to the one remaining topic
The mean inference between observations in those two groups is assessed.
5.4.2 Phase 1 Results: All topics datasets
This matrix is plotted on a graph to investigate topic consistency. The following plots depict how the most consistent set of parameters emerges, replicating the result behaviour described above. The topic references for the plots are as follows: the X-axis in the figures represents the labels found in the labeled dataset, and the legend is given by the table below. The Y-axis, as the figures suggest, represents the mean “normalized” probability. As the topic inference module in LDA assesses the topic distribution for each observation independently, the probability values are independent. In this study, the probability values are therefore normalized in terms of the topic distribution per observation; that is, the sum of all normalized probability values for each observation adds up to 1.0.
• 80. Chapter 5. Topic Detection Automation 67
Ref Number  Topic/Label
0   Delivery and Delivery cost
1   Images
2   Page loading time, errors (Latency)
3   More site functionality
4   Navigation, Search and Design
5   None of the above
6   Price
7   Products
8   Product Range
9   Size
10  Stock Availability
Table 5.2: The Topic legend for X-axis in inference plots
Figure 5.1: Topic Inference comparison at 100 passes
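The L × K matrix plotted in figure 5.1 can be computed with a few lines of NumPy; the sketch below is illustrative, with assumed array names and toy sizes, and is not the project's actual code.

import numpy as np

# toy sizes: N observations, K LDA topics, L labels from the labeled dataset
N, K, L = 8, 4, 3
inference = np.random.default_rng(0).random((N, K))        # p[n, k]: raw inference scores
labels = np.array([0, 0, 1, 1, 1, 2, 2, 0])                # MTurk label index per observation

normalized = inference / inference.sum(axis=1, keepdims=True)    # each row sums to 1.0
mean_matrix = np.vstack([normalized[labels == l].mean(axis=0)    # p[l, k]
                         for l in range(L)])
print(mean_matrix.shape)                                   # (L, K): the matrix behind figure 5.1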
• 81. Chapter 5. Topic Detection Automation 68
From figure 5.1, it is evident that 3 topics are emerging from the dataset: topics 5, 7 and 12 respectively. Topic 5 seems to have good inference probabilities for Images, topic 7 aligns with Stock and topic 12 with Delivery. The figure below shows the inference performance when the same set of parameters was trained for 200 passes on the same dataset. More passes help to guarantee that the distributions converge.
Figure 5.2: Topic Inference comparison at 200 passes
Figure 5.2 presents the topic distributions when the model was trained for 200 passes. It is evident that 4 topics are emerging in this run: topics 6, 7, 8 and 9. The following figure details the resultant 17 topics detected with LDA for the analysis outlined in figure 5.2. By considering figures 5.2 and 5.3 together to map the results, there is strong evidence that the 4 emerging topics map very well to labels from the labeled dataset. Considering the outcomes of the other parameter combinations, it is evident that alpha = 0.03 and eta = 0.5 is the most suitable combination of parameters. The topics detected map as shown in the following table.
• 82. Chapter 5. Topic Detection Automation 69
Figure 5.3: 17 topics inferred in LDA analysis
Topic in LDA model  Topic/Label
6  Stock Availability
7  Size
8  Delivery and Delivery cost
9  Images
Table 5.3: Mapping of LDA topics to labeled topics
As seen in table 5.3, the parameters are effective, as the model is able to detect 4/10 topics within the dataset. This is very promising, as the objective of the parameter tuning experiment is to steer the unsupervised learning process towards predefined results. The above results show promise that this parameter combination is ideal for modelling the word and topic distributions of the dataset at hand.
5.4.3 Phase 2 Results: Individual Datasets
In this phase, 10 datasets are used to simulate the scenario where an individual topic is introduced to the dataset as a new emerging topic. An LDA model is trained on each dataset separately to quantitatively analyze the results. The model is trained with the parameter set deemed to be successful in phase 1. The computing power required for this phase is significantly smaller in comparison to phase 1, as the grid search step is not required in this phase.
• 83. Chapter 5. Topic Detection Automation 70
Dataset  Topic Rank  LDA topic Ref.  precision  recall
Delivery    1  3  0.89  0.57
Delivery    2  5  0.31  0.04
Images      1  1  0.86  0.55
Images      2  4  0.74  0.24
Latency     1  0  0.87  0.50
Latency     2  4  0.44  0.12
MoreFunc    1  2  0.67  0.22
MoreFunc    2  1  0.83  0.23
Navigation  1  1  0.67  0.30
Navigation  2  5  0.72  0.24
Price       1  2  0.19  0.05
Price       2  5  0.90  0.26
Products    1  3  0.50  0.14
Products    2  4  0.66  0.26
Range       1  1  0.89  0.51
Range       2  5  0.28  0.08
Size        1  5  0.80  0.61
Size        2  2  0.19  0.02
Stock       1  2  0.79  0.61
Stock       2  0  0.58  0.09
Table 5.4: Summary of performance statistics of emerging topics in different datasets
Therefore, the models were trained with 300 passes/iterations to guarantee convergence of the probabilities. A visual plot is the most intuitive form of evidence that this approach leads to good results. In the plots, the labeled topics are plotted in alphabetical order on the X-axis. The visual evidence is further backed by the topic word lists and the precision and recall statistics of the comparison with the labeled results. Table 5.4 summarizes the precision and recall statistics obtained for the two best emerging topics in each dataset. The columns are defined as follows: Dataset names the simulated dataset, Topic Rank orders the two strongest emerging topics, LDA topic Ref. gives the index of the inferred LDA topic, and precision and recall compare the documents inferred for that topic against the labeled documents. The full experiment results can be found in the Appendix. The following sections describe some of the good result sets obtained from the afore-mentioned experiment.
• 84. Chapter 5. Topic Detection Automation 71
5.4.3.1 Delivery
The following figure represents the mean normalized probability obtained from the Delivery dataset. This dataset contains observations that belong to the Delivery label and the None of the above label. Alphabetically ordered, the X-axis is mapped as: 0: Delivery, 1: None.
Figure 5.4: Topic Inference comparison at 300 passes in Delivery dataset
Figure 5.4 gives strong evidence that, amongst the inferred topics, topic 3 is strongly mapped to the Delivery topic. This is because it shows a very high mean inference probability for the document collection belonging to the Delivery label and a small mean probability for the other documents, and no other inferred topic shows this behaviour. The table below outlines the topic statistics for this model. Table 5.5 clearly shows that topic 3 has 89% precision and 57% recall, compared to the rest of the candidate LDA topics, which have roughly 30% precision and 4% recall. This gives stronger evidence that topic 3 maps directly to the topic labeled “Delivery” in the labeled dataset. Topic 3 detected using the LDA model is as follows:
• 85. Chapter 5. Topic Detection Automation 72
LDA topic ref.  precision  recall  Positive Score  Negative Score
0  0.29  0.03  0.14  0.18
1  0.03  0.01  0.16  0.17
2  0.37  0.06  0.16  0.17
3  0.89  0.57  0.22  0.11
4  0.32  0.04  0.15  0.18
5  0.31  0.04  0.17  0.18
Table 5.5: Summary statistics of the Delivery dataset
0.041*deliveri + 0.034*free + 0.022*ship + 0.016*store + 0.013*order + 0.011*pleas + 0.011*cheaper + 0.010*charg + 0.010*make + 0.009*deliv
The above topic clearly captures the keywords most common in observations related to deliveries. Therefore, one can be very confident that the topic detection model has worked very well for the Delivery dataset.
5.4.3.2 Images
The following figure represents the mean normalized probability obtained from the Images dataset. This dataset contains observations that belong to the Images label and the None of the above label. Alphabetically ordered, the X-axis is mapped as: 0: Images, 1: None.
Figure 5.5 gives strong evidence that topic 1 emerges strongly and maps to the Images topic in the labeled dataset. The behaviour is very similar to that of the Delivery topic presented in figure 5.4. This is strong evidence that the emerged topic 1 depicts the topic “Images”. Topic 1 detected using the LDA model is as follows:
0.018*pictur + 0.013*like + 0.013*shoe + 0.013*imag + 0.013*view + 0.013*product + 0.012*cloth + 0.012*color + 0.011*model + 0.011*item
The above topic clearly contains keywords such as picture, image, view and colour that are most common in observations related to Images. Therefore, one can be very confident that the topic detection model has performed well with the Images dataset.
• 86. Chapter 5. Topic Detection Automation 73
Figure 5.5: Topic Inference comparison at 300 passes in Images dataset
5.4.3.3 Stock
The following figure represents the mean normalized probability obtained from the Stock dataset. This dataset contains observations that belong to the Stock label and the None of the above label. Alphabetically ordered, the X-axis is mapped as: 0: None, 1: Stock.
Figure 5.6 looks reasonably different from figures 5.4 and 5.5. This is because the label None is plotted at 0 on the X-axis here. Observing the plot shows that topic 2 has a very high mean probability inferred for documents that belong to the class Stock in the labeled dataset, while the same topic shows a very small probability for the None group. All the other LDA topics show behaviour similar to figures 5.4 and 5.5. This is strong evidence that LDA topic 2 in the plot resembles the label “Stock”. The table below outlines the topic statistics for this model.
• 87. Chapter 5. Topic Detection Automation 74
Figure 5.6: Topic Inference comparison at 300 passes in Stock dataset
LDA topic ref.  precision  recall  Positive Score  Negative Score
0  0.58  0.09  0.18  0.18
1  0.22  0.02  0.15  0.18
2  0.79  0.61  0.25  0.09
3  0.18  0.02  0.16  0.19
4  0.36  0.01  0.09  0.19
5  0.06  0.01  0.16  0.16
Table 5.6: Summary statistics of the Stock dataset
5.4.3.4 Size
The following figure represents the mean normalized probability obtained from the Size dataset. This dataset contains observations that belong to the Size label and the None of the above label. Alphabetically ordered, the X-axis is mapped as: 0: None, 1: Size.
• 88. Chapter 5. Topic Detection Automation 75
Figure 5.7: Topic Inference comparison at 300 passes in Size dataset
Figure 5.7 looks reasonably different from figures 5.4 and 5.5, but similar to figure 5.6. This is because the label None is plotted at 0 on the X-axis here. Observing the plot shows that topic 5 has a very high mean probability inferred for documents that belong to the class Size in the labeled dataset, while the same topic shows a very small probability for the None group. All the other LDA topics show behaviour similar to figures 5.4 and 5.5. This is strong evidence that LDA topic 5 in the plot resembles the label “Size”.
One can see from phases 1 and 2 that the most consistent topic, the one that maps to a labeled topic in the labeled dataset, obtains high inference probabilities for the respective observations compared to the very low values for the rest of the records. Therefore, it is a fair attempt to use this behaviour to distinguish between the consistent topic and the rest: higher probability values give more confidence in topic consistency. The underlying assumptions for such a claim are:
• A consistent topic will have a reasonable quantity of observations that belong to it; therefore the number of documents that belong to the consistent topic is likely to be high
• Due to the unsupervised nature of the learning process, the inferred probabilities will not be extremely accurate in contrast to the labeled mapping
• 89. Chapter 5. Topic Detection Automation 76
• Observations that belong to the consistent topic are likely to have a high inference probability for that topic
• The inference probability for documents that do not belong to this topic is likely to be quite small
Due to these assumptions, the consistency scoring function should consider the following factors:
• The probabilities of the documents belonging to an LDA-detected topic
• Specificity, which is important in selecting observations for scoring
• The number of documents belonging to that topic
• The probabilities of documents that do not belong to that topic
By considering the above factors, the ultimate consistency scoring function was designed. In order to formulate the scoring function, the following structure has to be generated first: TOPIC SCORES, which contains the inferred topic distribution for each observation in the dataset. It:
• consists of N rows for the N observations
• consists of K columns for the K LDA topics
• where p_{n,k} represents the inference probability of observation n belonging to LDA topic k
The above phenomenon can be empirically observed by plotting histograms of the probability values for the topic datasets. From figures 5.8, 5.9, 5.10 and 5.11, one can clearly see that 0.9 is the ideal threshold for specifying positive and negative documents. Documents gaining an inference probability that exceeds 0.9 are considered to belong to an LDA topic, while documents gaining an inference probability lower than 0.1 are considered not to belong to that topic. Then the probabilities can be used to score the topics. From figures 5.8, 5.9, 5.10 and 5.11 it is evident
  • 90. Chapter 5. Topic Detection Automation 77 Figure 5.8: Topic Inference Histogram at 300 passes in Delivery dataset Figure 5.9: Topic Inference Histogram at 300 passes in Images dataset
  • 91. Chapter 5. Topic Detection Automation 78 Figure 5.10: Topic Inference Histogram at 300 passes in Size dataset Figure 5.11: Topic Inference Histogram at 300 passes in Stock dataset
• 92. Chapter 5. Topic Detection Automation 79
that the most consistent topic has a large cumulative frequency (area under the histogram) for documents that belong to the topic (positive documents). That means the most consistent topic manages to gain a high frequency of documents that obtain high probability values for that particular topic. Therefore, the sum of positive probabilities is a great indicator of topic consistency. In order to formulate a scoring function for topic consistency, the following data is used as input:
• DATASET: consists of all the documents and their MTurk label
• TOPIC DENSITY: contains the inference probability values for each topic for each document
– Consists of K columns, one for each LDA topic
– Consists of N rows, one for each document
A scoring algorithm was developed using this mechanism. The algorithm is outlined in pseudo code.
Algorithm 5.1:
select TrueDocs where Docs belong to the labeled topic in DATASET
foreach topic t in LDA model:
    select TopicVector of probabilities for topic t from TOPIC DENSITY
    select PositiveDocs where probability >= 0.9
    select NegativeDocs where probability <= 0.1
    select TruePositives where (PositiveDocs intersection TrueDocs)
    normalise PositiveDocs in terms of TopicVector
    normalise NegativeDocs in terms of TopicVector
    compute sum of scores for PositiveDocs
    transform NegativeDocs scores with a linear negative transformation
    compute sum of scores for NegativeDocs
    compute precision for topic t
    compute recall for topic t
• 93. Chapter 5. Topic Detection Automation 80
    store PositiveScore, NegativeScore for each topic t
normalise PositiveScore over all topics
select the topic with the highest PositiveScore
select the topic with the second highest PositiveScore
if the scores are more than 1 standard deviation apart:
    report the highest topic as the consistent topic
else:
    report both the highest and the second highest topics
As algorithm 5.1 depicts, the score of the positive documents was used to automate selecting the consistent topic. The experiment in phase 3 was run with the threshold value of 0.9 and with the scoring function focusing on the sum of scores. Frequency information is lost if the mean score is used, as the score gets normalized regardless of how many positive observations are present; the sum is suitable for this reason. Table 5.7 outlines the results for the 2 most consistent LDA topics in all 10 datasets.
Dataset  Topic Rank  LDA topic Ref.  Positive Score  Precision  Difference
Latency     1  0  0.23  86.94  ≤ 1 S.D.
Latency     2  3  0.20  16.67
Images      1  1  0.24  85.67  ≤ 1 S.D.
Images      2  3  0.21  6.36
Delivery    1  3  0.22  89.32  > 1 S.D.
Delivery    2  4  0.17  32.33
MoreFunc    1  2  0.19  66.74  ≤ 1 S.D.
MoreFunc    2  1  0.18  83.33
Products    1  3  0.19  50.10  ≤ 1 S.D.
Products    2  3  0.19  50.10
Size        1  5  0.24  80.54  > 1 S.D.
Size        2  2  0.18  19.47
Range       1  1  0.22  89.20  ≤ 1 S.D.
Range       2  4  0.20  17.40
Price       1  2  0.20  19.08  ≤ 1 S.D.
Price       2  4  0.19  26.90
Navigation  1  1  0.22  67.46  > 1 S.D.
Navigation  2  4  0.18  49.43
Stock       1  2  0.25  79.05  > 1 S.D.
Stock       2  0  0.18  57.39
Table 5.7: Summary of score statistics of emerging topics in different datasets
Once topic automation was run on the 10 different datasets, the labeled topic was automatically detected in 6/10 datasets, and the labeled topic was selected within the 2 most consistent topics in all 10 datasets. One can conclude that these results are very satisfactory for using information from the same dataset to evaluate topic consistency. The ideal threshold for specificity is 0.9, and the sum of scores is a good metric to assess topic consistency as it represents both the frequency and the score. A Python sketch of algorithm 5.1 is given below.
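The sketch below is one possible reading of the algorithm 5.1 pseudo code, assuming NumPy; it is not the project's original implementation, and the names (topic_density for the N × K inference matrix, true_docs for the observations carrying the labeled topic) are illustrative.

import numpy as np

def score_topics(topic_density, true_docs, pos_thresh=0.9, neg_thresh=0.1):
    n_topics = topic_density.shape[1]
    pos_scores, neg_scores = np.zeros(n_topics), np.zeros(n_topics)
    for t in range(n_topics):
        p = topic_density[:, t]
        positive, negative = p >= pos_thresh, p <= neg_thresh
        true_pos = positive & true_docs
        pos_scores[t] = p[positive].sum()             # sum keeps frequency information
        neg_scores[t] = (1.0 - p[negative]).sum()     # linear negative transformation
        precision = true_pos.sum() / max(positive.sum(), 1)
        recall = true_pos.sum() / max(true_docs.sum(), 1)
        print(f"topic {t}: precision={precision:.2f} recall={recall:.2f}")
    pos_scores /= pos_scores.sum()                    # normalise over all topics
    best, second = np.argsort(pos_scores)[::-1][:2]
    if pos_scores[best] - pos_scores[second] > pos_scores.std():
        return [int(best)]                            # clearly consistent topic
    return [int(best), int(second)]                   # otherwise report the two best topics

# toy usage: 6 observations, 3 LDA topics, first three observations carry the label
density = np.array([[0.95, 0.03, 0.02], [0.92, 0.05, 0.03], [0.91, 0.05, 0.04],
                    [0.05, 0.90, 0.05], [0.30, 0.40, 0.30], [0.05, 0.05, 0.90]])
print(score_topics(density, np.array([True, True, True, False, False, False])))

On this toy input, the normalized positive-score gap between the two best topics exceeds one standard deviation, so only the single most consistent topic is reported.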
• 94. Chapter 5. Topic Detection Automation 81
5.4.4 Phase 4 results: None of the above dataset
When LDA is run on the “None of the above” dataset, topics are derived and the detected topics are manually evaluated. The score statistics of the detected topics are outlined in the table below.
LDA topic ref.  Positive Score  Negative Score
0  0.18  0.17
1  0.15  0.18
2  0.20  0.16
3  0.18  0.14
4  0.16  0.17
5  0.12  0.18
Table 5.8: Summary statistics of the None of the Above dataset
It is clearly evident from table 5.8 that topic 2 is the most consistent topic in the topic pool. Topic 2 detected using the LDA model is as follows:
0.036*model + 0.029*cloth + 0.014*fine + 0.014*look + 0.013*wear + 0.013*like + 0.012*item + 0.010*ok + 0.010*store + 0.009*better
The top 10 observations inferred to fall into the above topic are as follows:
1. improv have a mini clip of a model wear the item of cloth on a runway so we can see how it look on also the way the item move
2. a pictur of a model wear each piec of cloth it s easier to judg how i d look on you and what size you should get
3. more detail about the product mayb a video of model wear so we can understand the textur and how the cloth sit on a bodi
• 95. Chapter 5. Topic Detection Automation 82
4. i think you should add some model to display the real sight of the cloth wear on the bodi
5. would like to see the cloth on real peopl to get a better idea of how they look netaport is a fab exampl of thi
6. could mayb use real life model to model cloth to get a better idea of how garment would look
7. i like the fact that you can read other peopl s review about the cloth and there are lot of differ imag of the model includ a video
8. put video of model wear the cloth so the buyer know how it would look like in real
9. improv get new male model who want to look at bali autonom with a beard on a fashion site and that other guy that remind me of that other annoy actor daniel day lewi find some good look chap that look like they re in echo the funnymen or felt or someth cheer from detroit
10. i think it d be great if you could have the cloth on real peopl and mayb have a catwalk like the asp websit
The most consistent topic found in phase 4 suggests that the topic is related to features. The top 10 observations provide further evidence to support this: most of the observations talk about adding a model and a real-life image of the product to enhance the experience. Technically this topic can be treated as a sub-topic under the More Functions label, but this decision is highly subjective. Another observation from these results is that there are very similar feedback entries within the top 10 observations. It is highly likely that these feedback entries were collected from the same customer during his/her web session. Given how the feedback collection system works (section 1.2.2), it is likely for a particular customer to be asked for feedback several times.
5.5 Conclusion
From the emerging topic detection phase, there are a few conclusions that can be drawn. Each phase in the multi-stage experiment focuses on achieving a particular goal:
1. Tune the LDA parameters for the dataset
• 96. Chapter 5. Topic Detection Automation 83
2. Tune the threshold parameters for the emerging topic simulation
3. Derive a suitable scoring function for topic consistency
4. Detect an emerging topic from the None of the above dataset
Phase 1 has been very successful, as the LDA model manages to detect 4/10 pre-labeled topics by tuning the alpha and eta parameters. Alpha = 0.03 and eta = 0.5 is the ideal combination of prior distribution hyper-parameters for the dataset at hand. Phases 2 and 3 also bring successful results, with 6/10 datasets managing to detect the most consistent emerging topic as the pre-labeled topic for that dataset. The threshold for selecting negative and positive examples proves to be effective when set to 0.9. The positive score based on the number of positive examples and the inference probability also performs well. In the final phase, the new topic “Features” emerges from the “None of the above” dataset, with observations talking about adding a real model and a catwalk strongly supporting the topic. Although the topic “Features” can be categorized under the already labeled “More Functions” label, “Features” can be treated as an independent label as well. This final point will be discussed further in the future work section.
Overall, one can conclude that the results from all phases of the latent topic detection experiments are promising and fruitful. These results give a promising avenue towards automated emerging topic detection. They also provide a good approach to devising self-sufficient topic consistency measurement metrics that are independent of external corpora.
• 97. Chapter 6
Conclusion and Future Directions
6.1 Introduction
After careful reference to the literature around the text analytics sphere, data was recorded, manually classified and transformed into carefully preprocessed corpora. Using these corpora, topic classification models and emerging topic detection pipelines were built and evaluated for different applications relating to the research problem at hand. After carrying out a full topic classification task and a 4-phased experiment on automating the emerging topic detection task, the main research findings are detailed below:
• Initial sanity checks and trust modeling heuristics, such as assessing worker reliability, can help clean up a crowd-sourced dataset.
• Text pre-processing plays a vital role in achieving the best results in topic classification.
• The Support Vector Machine (SVM) algorithm with Lasso regularization is a very good supervised learning technique for topic categorization.
• Latent Dirichlet Allocation (LDA) is a very effective probabilistic language model for detecting emerging topics in a text corpus.
• Text labeling can be utilized as an effective tool to drive parameter tuning in an unsupervised learning setting.
6.2 Trust modeling for cleaner data
It is very important to understand the statistical behavior of a dataset if crowd sourcing is used for classifying data. The possibility of the worker's motivation being centered on the 84
• 98. Chapter 6. Conclusion and Future Directions 85
monetary reward can bias the worker to do more work per unit time, compromising the accuracy of the resulting work. Therefore it is essential to devise a method to keep track of worker reliability. Initial sanity checks have shown that the topic distribution of the resultant dataset is satisfactory. The analyses do not indicate any biases towards the UI structure of the data collection process. The concordance factor based on the ensemble effect / majority voting is a very good tool for building heuristics around worker reliability. By having multiple workers label the same observations, it is possible to automatically determine the correct labels for observations. This information can then be used to assess the reliability of workers by comparing how different workers respond to these observations. By observing the results, one can conclude that the overall data collection process was well executed. More than 80% of the old feedback records concord perfectly (100%) with the new ones. The results also suggest the following facts about the data collection phase:
1. Workers who perform a small number of jobs usually underperform
2. Most workers achieve better than a 75% lifetime reliability score
3. Most workers tend to take about 5% of their lifetime to train for the job (Burnout)
4. Time of day has no effect on worker performance
Taking the above findings into consideration, the following conclusions can be considered justifiable in order to make the dataset cleaner:
1. Workers who haven’t completed more than 100 HITs are unreliable.
2. Removing the first 5% of the HITs from each worker will eliminate the worker training errors.
3. When multiple opinions are present for a single observation, the most concordant instance is most likely to be the correct one.
As the dataset is generated from customer feedback, the data is error prone. Therefore, one can conclude that text standardization and spell correction are necessary to restore standards and accuracy in the dataset. Reducing words to their root form will also increase accuracy by collapsing multiple forms of the same word to one unique form. As lemmatization requires more computing and memory resources, the simpler and more elegant stemming approach is ideal. It can also be concluded that the preprocessing sequence is very important for the right execution of the data enrichment process. The right sequence also helps avoid runtime exceptions due to logical errors.
• 99. Chapter 6. Conclusion and Future Directions 86 6.3 Topic classification with SVM SVM is a great algorithm for topic classification as text spaces are mostly linearly separable. The modularity and extensibility of SVMs complements their suitability to this project. After evaluating the results for topic classification, one can observe that specification 111 (text standardization + spell correction + word stemming) performs best in all experiments. Therefore it can be concluded that the above specification is the ideal preprocessing pipeline for topic classification. From the experimental results, it is also evident that unigram features perform best in topic classification. Lasso regularization is suitable for Bag of Words feature spaces as it shrinks most weights to exactly zero, which is analogous to keyword selection for topics. According to the results, the best model is the unigram feature model with pre-processing specification 111. The ideal regularization weight for this model is C = 0.7. The Hamming loss based accuracy for this model is 94%. The precision and recall on individual topics are also best in this model, averaging 80% precision and 60% recall. The linear space itself achieves satisfactory results in the classification task; therefore other kernel functions are unnecessary. The number of support vectors on each label is also very satisfactory in this model. Therefore one can confidently conclude that this is the ideal SVM model for topic classification in the dataset at hand. 6.4 Topic Detection with LDA Once classification of the topics is complete, emerging topic detection is an ideal extension to this project. This is mainly due to the large fraction of unclassified observations in this dataset (observations labeled “None of the above”, which do not belong to any of the predefined topics). From the empirical results, it is highly evident that using pre-labeled data to tune the model parameters is a very effective approach. Both the phase 1 and phase 2 experiments show that using pre-defined topics to steer an LDA model is a very effective technique for tuning the hyper-parameters of the model. The LDA model in phase 1 manages to detect 4/10 topics in the dataset. Phase 2 manages to detect the emerging topic in 6/10 datasets where the scenario was simulated. From both experiments, one can conclude that alpha = 0.03 and eta = 0.5 are a good combination of hyper-parameters for this corpus. These hyper-parameters are ideal for detecting topics whose qualities, such as word-to-topic weights, are similar to the labeled topics. The results also show that the model converges by 200 passes. With the analysis and results from the histograms, it is evident that there are unique characteristics in the inference histogram that can be utilized to derive the most consistent topic. Therefore, it is also possible to conclude that inference probability and the frequency of high-probability observations
• 100. Chapter 6. Conclusion and Future Directions 87 are important factors in evaluating the consistency of an LDA topic. When using the inference probabilities to score topics, an inference probability of 0.9 is the most suitable threshold for selecting observations that belong to a particular LDA topic. In the final phase of the experiment, the unclassified observations (“None of the above”) were used to train an LDA model with 6 topics. The model, accompanied by the developed scoring function, was able to detect the topic “Features”. The highest-scoring observations for this topic also provide further evidence that this topic talks about features of the website. Although one can objectively reason that the newly emerged topic falls under the already labeled topic “More Functions”, one can also argue otherwise; this is highly subjective and depends on the objective. However, a noteworthy observation from phase 2 of the latent topic detection experiments (single emerging topic simulation) is that all 6 topics that successfully emerged from the LDA simulations are very finely defined topics with a specific scope. Delivery, Images, Latency, Stock and Size refer to very specific and narrowly scoped topics. On the other hand, More Functions and Product are vaguer topics that represent more general concepts with a wider scope. Therefore, it is possible to conclude that although LDA accompanied with the scoring function helps to find consistently emerging topics, having more specific topics with a narrow scope helps to achieve better results. This behavior is also evident in the topic classification task: well-scoped topics tend to have better precision compared to others. Finally, looking at all the results and findings, one can confidently conclude that the experiments undertaken under this project were well executed and obtained very satisfactory results. 6.5 Potential Future work The work undertaken through this project has led to valuable insights within both the topic classification and topic detection tasks. The results have shown that the dataset consists of fairly linearly separable text spaces. The detection phase has uncovered a lot of insight into how a partially labeled dataset can be used to steer the unsupervised learning task at hand. It unveils techniques to utilize available labeling to automatically tune the parameters of the unsupervised learning task without aimlessly searching for ideal parameter sets.
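As a concrete illustration of the model concluded to be best in Section 6.3 (unigram bag-of-words features with an L1-regularized linear SVM at C = 0.7), a minimal scikit-learn sketch is given below. The toy documents, variable names and the one-vs-rest handling of multiple labels are illustrative assumptions rather than a reproduction of the exact pipeline built for the thesis.

```python
# Illustrative sketch: unigram bag-of-words + L1-regularized linear SVM (one-vs-rest).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

# Hypothetical pre-processed (specification 111) documents and their topic labels.
docs = ["delivery arrived late", "love the new size guide", "site very slow today"]
labels = [["Delivery"], ["Size"], ["Latency"]]

vectorizer = CountVectorizer(ngram_range=(1, 1))       # unigram features only
X = vectorizer.fit_transform(docs)
Y = MultiLabelBinarizer().fit_transform(labels)        # multi-label indicator matrix

# L1 (Lasso-style) regularization drives most feature weights to exactly zero,
# which acts like keyword selection per topic; C = 0.7 as reported in Section 6.3.
svm = OneVsRestClassifier(LinearSVC(penalty="l1", dual=False, C=0.7))
svm.fit(X, Y)

print(svm.predict(vectorizer.transform(["my order was delivered late"])))
```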
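Likewise, for Section 6.4, the sketch below shows the LDA configuration (alpha = 0.03, eta = 0.5, 200 passes) and the 0.9 inference-probability threshold used to assess topic consistency, using gensim. The toy corpus and the simple count-above-threshold score are illustrative assumptions standing in for the full scoring function developed in Chapter 5.

```python
# Illustrative sketch: LDA with the hyper-parameters reported in Section 6.4 and a
# simple consistency score based on the 0.9 inference-probability threshold.
from gensim import corpora, models

# Hypothetical tokenised, pre-processed documents.
texts = [["delivery", "late", "order"], ["size", "guide", "love"], ["site", "slow", "page"]]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

lda = models.LdaModel(
    corpus, id2word=dictionary, num_topics=6,
    alpha=0.03, eta=0.5, passes=200,          # values tuned using the labeled data
)

# Score each topic by how many documents it explains with probability > 0.9.
threshold, counts = 0.9, [0] * lda.num_topics
for bow in corpus:
    for topic_id, prob in lda.get_document_topics(bow, minimum_probability=0.0):
        if prob > threshold:
            counts[topic_id] += 1
most_consistent = max(range(lda.num_topics), key=lambda t: counts[t])
print(counts, most_consistent)
```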
• 101. Chapter 6. Conclusion and Future Directions 88 6.5.1 Topic Classification There is potential work in classification that can complement the topic categorization and add more value to the feedback content. Sentiment analysis is a sensible future avenue for enriching the content. As a company focused on market analytics and conversion optimization, rather than just knowing what topics their customers are talking about, it is better to understand whether the feedback is positive or negative. For example, it adds more value to their services to inform their clients that customers are expressing negative feedback about Delivery services than to say they are talking about Delivery services; it gives a more actionable and descriptive picture. Sentiment analysis is a very rigorous research area that is being extensively investigated at present (Wang and Manning, 2012; Glorot et al., 2011). Sentiment analysis also poses a lot of complex research questions compared to topic categorization, as word sequences and other grammatical features may prove significant. Therefore there is also potential to devise String Kernels (Lodhi et al., 2002) to capture these features. The heuristics utilized in evaluating worker reliability can also be used in assessing the labeled dataset required for supervised sentiment analysis. Before moving towards topic detection, a noteworthy observation one can see in both the topic classification and topic detection tasks is that the models tend not to perform very well with vague topics compared to more specific topics. This poses the interesting question of whether vague topics have to be refined further. This is another worthwhile question that has to be investigated to improve the performance of the model. As the topic classification task is a multi-label classification problem, it is also possible to evaluate the feasibility of using structured topic models (Taskar et al., 2003). This will enable building relationships between topics to get more insight into co-occurring topics in the dataset. 6.5.2 Topic Detection In terms of topic detection, there is a spectrum of creative work that can add more value and improve the topic detection process. By taking the vagueness problem into consideration, hierarchical topic models (Blei et al., 2004) can be considered for the latent topic detection task. This will enable building more precise topic models that can break vague topics down into more specific topic concepts that are structured hierarchically. The topic detection results also suggest that there is a strong presence of detected topics representing sentiments. These topics emphasize keywords
• 102. Chapter 6. Conclusion and Future Directions 89 such as Love, Like, Good, Very, Everything, etc. Some evidence for this observation is found in: 1. Topic 16 in figure 5.3 2. Topic 0 in the Delivery dataset: Appendix 3. Topic 2 in the Latency dataset: Appendix 4. Topic 2 in the Navigation dataset, etc. It would be interesting to see how including words widely related to sentiments in the stop-word list would impact the outcome of the topic detection process. By doing this, the keywords that mainly associate with sentiments will be removed before the topic detection process starts. 6.5.3 Crowdsourcing ++ Throughout the thesis, crowdsourcing has been used very effectively, with techniques and heuristics to refine the results. A wonderful feature that is very useful in crowdsourced information is the ensemble effect, which enables finding the right answer via majority voting. In taking this work forward, this effect is a great tool to leverage in order to extend the topic detection framework. 6.5.3.1 Using crowdsourcing to label the emerging topics automatically Once a new topic has emerged from the dataset, majority voting can be used to choose a label for the new topic. As seen in section 5.4.4, the top-scoring observations for the emerging topic relate to an abstract concept that can be described in one word. A good method to get crowd opinion on the topic label is listed below: 1. Compute inference for the most consistent topic within unclassified documents 2. Select the documents that exceed the inference probability P of belonging to that topic 3. Select N workers 4. For each worker, (a) Randomly select K observations from the high-inference observation set
• 103. Chapter 6. Conclusion and Future Directions 90 (b) Present them to the worker via an MTurk HIT job (c) Ask the worker to respond by submitting one word that explains the common concept within the K observations An example scenario for this process is as follows: from the dataset, select the observations that score an inference probability > 0.9 for the consistent topic. Select 10 workers and give them 3 randomly selected observations from the above set. Then ask them to submit a word that describes all 3 observations. If this is done enough times, one word amongst the multiple responses will emerge as the label for the topic. It is possible to investigate how to select the parameters P, N and K to automate this process. This study would automate the topic labeling process with minimal cost. 6.5.3.2 Automatically building the decision boundaries for the new topic Once the topic label is selected, the next objective is automatically classifying documents under the new label. A decision boundary is needed to do this classification. As there are no pre-labeled observations to train a classifier, the labeled data has to be generated. Automating this process will greatly reduce the human effort involved. From the results outlined in section 5.4.3, the histograms in figures 5.8-5.11 show how the inference probabilities are distributed in the consistent topic. These results suggest a potential technique to automate the new label classification process. The process is outlined below: 1. Compute inference for the most consistent topic within unclassified documents 2. Select the documents that exceed the inference probability P of belonging to that topic as positive documents 3. Select the documents that have a lower inference probability than (1−P) as negative documents 4. Use a semi-supervised technique such as Semi-Supervised Latent Dirichlet Allocation (ssLDA) or Label Propagation to infer the labels of the documents in between (a minimal sketch of this step is given after Figure 6.1). From the results of the semi-supervised method, the observations that are close to the decision boundary can be identified. One can use crowdsourcing to classify these observations to gain more confidence in the decision boundary. The ideal process is outlined below:
• 104. Chapter 6. Conclusion and Future Directions 91 1. Compute confidence for each observation 2. Select the observations with the lowest confidence probabilities 3. For each observation: (a) Select a worker (b) Present him/her with the observation (c) Present the newly labeled topic as one of the choices (d) Present “None of the above” as one of the choices (e) List some of the remaining labels as alternate choices (f) Ask the worker to select the most appropriate label for the observation This process will allow the system to automatically classify doubtful observations near the decision boundary to get more information. The system can then automatically update the model with the new information. If collecting feedback from one worker is unreliable, it is possible to use the same majority voting technique by collecting multiple opinions. Research and investigation are necessary to understand how to implement such an automated system successfully. If realized, such a system will immensely complement the text analytics engine that has been developed throughout this thesis. If successfully implemented, the complete system will not require any human intervention at all (except for sanity checks that have to be carried out from time to time). Figure 6.1 shows how the whole ecosystem would work together as a cycle. The steps labeled green have already been developed through the thesis. The steps labeled black are the potential future work that will complete the self-managed text analytics engine. The potential engine will automatically adapt to newly emerging topics and grow continuously with no human intervention.
  • 105. Chapter 6. Conclusion and Future Directions 92 Figure 6.1: Final System with future developments
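As a starting point for the semi-supervised step proposed in Section 6.5.3.2, the sketch below seeds positive and negative documents from the LDA inference probabilities and propagates labels to the documents in between using scikit-learn's label spreading. The document list, probability values, and the choice of label spreading rather than ssLDA are illustrative assumptions, not the method actually evaluated in this thesis.

```python
# Illustrative sketch: seed positive/negative documents from LDA inference
# probabilities, then propagate labels to the in-between documents.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.semi_supervised import LabelSpreading

# Hypothetical inputs: raw documents and their inference probability for the
# most consistent LDA topic (as produced by the scoring step in Chapter 5).
docs = ["great new filter options", "checkout keeps crashing", "nice colours overall"]
topic_prob = np.array([0.95, 0.04, 0.55])
P = 0.9

# Seed labels: 1 = belongs to the new topic, 0 = does not, -1 = unknown.
y = np.full(len(docs), -1)
y[topic_prob > P] = 1
y[topic_prob < (1 - P)] = 0

X = TfidfVectorizer().fit_transform(docs).toarray()
model = LabelSpreading(kernel="knn", n_neighbors=2).fit(X, y)

# Low-confidence documents near the boundary become candidates for crowdsourcing.
confidence = model.label_distributions_.max(axis=1)
print(model.transduction_, confidence)
```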
• 106. Chapter 7 A change of perspective. . . CISCO 7.1 Introduction Being offered the chance to join Cisco Systems on a 12-month internship is surely a life-changing event. Having it offered at their headquarters in San Jose, California is a dream almost too good to be true. UCL students are privileged to have this opportunity and pursue an international internship with Cisco Systems in Silicon Valley for one calendar year. Through this year, they are assigned to one of the teams in Cisco and challenged as a normal employee, to give them first-hand experience of working in a large corporation. An offering like this is the perfect opportunity for a graduate student like me to jumpstart my career. 7.2 Goals and Objectives Having finished the taught modules of my masters course in Computational Statistics and Machine Learning, my primary objectives for the Cisco internship were not solely limited to gaining more practical experience in machine learning and data science. There is a massive data center startup ecosystem in London and UCL graduates are always in demand. Having worked in several research internships while in London, I also had a thirst to gain the full experience of playing the role of a professional. My primary objectives for the internship were the following: • Career development • Industry experience • Experience organizational culture 93
• 107. Chapter 7. A change of perspective. . . CISCO 94 • Networking 7.2.1 Career Development For a recent graduate who has to prepare for a career path, an organization like Cisco is a perfect place to get that training. The opportunity to work and train with experienced and talented teams gives you a very unique training that is very hard to get from a small or medium-sized organization. A company like Cisco gives you a very good framework to engineer within: well-defined processes and management procedures that are vital to address the complexity of the work. There are pre-defined specifications for reports, processes, product life cycles and agile standards that are practiced every day in the office. This experience is very important for growing as a team player and improving your development process. The work is also of a completely different scale from my previous experiences with budding startups. To gain experience and train in this new domain is one of the main reasons I chose Cisco. 7.2.2 Industry Expertise Cisco is a key player in the IT industry. Cisco is the trendsetter and global leader in network infrastructure. Having the chance to work in such a company and play a vital part in realizing their next generation products is an opportunity that gives me heaps of experience. I was chosen to work in a Research and Advanced Development team to do research on the Internet of Things (IoT). The Internet of Things being a very recent and booming field, I saw this as an opportunity to build unique expertise at such an early stage of my career. Gaining domain expertise in using machine learning in IoT and cyber security as a whole was one of the key objectives of my internship at Cisco. 7.2.3 Learn from the organizational culture Having worked in several startups and research institutes before, I was very curious to experience and understand the corporate culture of a large corporation like Cisco. Companies like Cisco have to deal with a great amount of complexity when engineering products due to the many moving parts that they have to consider when changing products and services. Due to the scale of work and the human force involved, a more hierarchical organizational structure is also evident in large corporations. I was always curious to understand and work in such a setting to get a better understanding of handling large-scale projects with high impact and broader scope. One needs to develop
• 108. Chapter 7. A change of perspective. . . CISCO 95 work ethics and discipline to coexist in a complex system like that. Another primary objective of my internship was to develop the work ethic, discipline and professionalism that will help me grow as a professional in the industry. 7.2.4 Networking Cisco, in Silicon Valley, is a great place to meet and associate with like-minded people. Cisco itself gives me exposure to many engineers, data scientists, managers and scientists who already actively take part in developing next generation products and services. In addition to this, the Silicon Valley culture enables networking with various industry people via the wide range of meetups, hackathons, tradeshows, conferences and various other social events hosted by the mass of tech companies in the area. Social interaction between engineers in the area is very rigorous, and they all share similar technological and engineering interests. 7.3 Background Context At Cisco, I was assigned to one of the Chief Technology Officers’ teams. I was involved in the work for a whole calendar year under the Cisco International Internship Program (CIIP). 7.3.1 Cisco Cisco Systems, Inc. is a multinational organization headquartered in San Jose, California. Cisco primarily makes network infrastructure equipment and provides services around that domain. The main services provided by Cisco are based around secure collaboration, data centre management, cybersecurity and communication. Cisco’s business model is mainly focused on large-scale enterprise customers and large organizations such as governments. Cisco operates numerous engineering offices, with a presence in all corners of the globe. 7.3.2 CIIP The Cisco International Internship Program (CIIP) is a one-year-long internship program where students from 12 international partner universities are brought to the USA to live and work. Students are selected for the program via interviews. The candidates come from all academic levels, from undergraduate to doctoral.
• 109. Chapter 7. A change of perspective. . . CISCO 96 During the internship in the USA, Cisco takes care of the flights, accommodation, wages and other formalities such as bank accounts and visa proceedings. Interns are given an initial induction and then placed in their groups to work for one year as part of that functional unit. The most amazing part of this internship is that it provides a one-year commitment to the team, which enables the team to utilize the intern in long-term, high-impact projects. The teams also see the intern more as a team member than as an intern, which challenges the intern to live up to industry expectations. Apart from the technical assistance, the program also contributes to giving the international interns a fully-fledged cultural experience in the USA by organizing different events from time to time. The nature of the system allows interns to deliver high-impact outcomes (patents, research papers, etc.), which positions CIIP amongst the most prestigious internship programs within Cisco. 7.4 My role and responsibilities At Cisco, I was assigned to the Office of the CTO: Security Business Group (OCTO-SBG) as a Research and Advanced Development Intern. The OCTO-SBG team is headed by the CTO of the Security organization at Cisco. This team mainly focuses on driving thought leadership and strategic development in the Security organization. The main outcomes of the team are Proof of Concept (PoC) prototypes that demonstrate the technical feasibility of new ideas. These PoCs are then reviewed by other vice presidents in the organization, who sponsor them from their budgets for productization. The team consists of • Cisco Fellows and Distinguished Engineers who are involved in thought leadership within the organization • Research and Advanced Development Engineers who help the Distinguished Engineers transform their Proof of Concept ideas into working prototypes • A Chief of Staff who manages the staff of the CTO (which is the team) Throughout the internship I worked with one of the Distinguished Engineers in the team to develop data sensors and analytics for Internet of Things (IoT) traffic. 7.4.1 My project Information security and threat defense have come a long way since the Internet’s first launch in the late 1960s. The number of wireless devices themselves (not including mobile
  • 110. Chapter 7. A change of perspective. . . CISCO 97 Figure 7.1: The structure of OCTO-SBG
• 111. Chapter 7. A change of perspective. . . CISCO 98 phones) which operate without human intervention (such as weather centers, smart meters, Point of Sale units, etc.) is expected to grow to 1.5 billion by 2014 (Lien et al., 2011). With the miniaturization of devices, the increase in computational power and the advances in efficient energy consumption, this trend tends to continue towards the Internet of Things (IoT). With this new trend towards leveraging smart access and control of smaller network units such as mobile devices and sensor networks, industrial and technological sectors are also researching heavily into incorporating the opportunities opened up by IoT to improve their business processes and maximize performance. With such technology advances, there is an explosion of devices coming onto the Internet, as “everything” from medical wearables to power plant infrastructure is now on the network. For this reason, there is a strong need to address the security and scalability of both the devices and the Internet. In the evolving realm of IoT, the enterprise is constantly pushing the connectivity domain of the network far beyond conventional IP-based networks to gain control of and visibility into the finer details of IoT devices and sensor networks. The manufacturing sector is one of the domains that is reaping a very high return on investment through IoT. The Cisco Internet of Everything (IoE) value index finds the leading nations in IoE to have the longest track record in Machine-to-Machine (M2M) and mobility innovations. 7.4.1.1 Problem Background Although extensive research and development is underway for security and threat defense within the enterprise and the general Internet, the evolution of IoT creates a whole new domain of security-related problems. As much as IoT facilitates new avenues for communication and control of machines and sensors, it also enables new forms of attacks. As such, the need for securing these communication channels and monitoring them is heightened to allay these potential new attack vectors (Vyncke, 2013). Despite the amount of extensive research going into securing wireless IoT networks (Booysen et al., 2012), there is a strong need to give more attention to securing M2M and Human-to-Machine (H2M) wired networks. The main goal of M2M communications is to enable the sharing of information between electronic systems autonomously (Niyato et al., 2011). A major challenge in the IoT domain, in contrast to the Internet, is the flexibility of the tools available within this space. The wide array of communication protocols that extend beyond the TCP/IP stack within the industrial domain makes it difficult for currently available threat defense mechanisms to be applied in securing industrial networks.
• 112. Chapter 7. A change of perspective. . . CISCO 99 Numerous feature engineering tools are being developed for threat defense in the IP space for open source tools such as Snort. There is a high demand and necessity for extensive feature engineering for industrial protocols such as SCADA, and for building threat defense models that can analyze and monitor network traffic in IoT-enabled networks. The merging of IP-based networks with non-IP networks also creates an opportunity to understand the relationships between the two levels of the network and to validate using those relationships to predict network behavior. With data in abundance, there is a strong need to create a set of rules that can be used to • gain more visibility into IoT networks in the short run • conduct anomaly detection and threat defense in the long run 7.4.1.2 The goal of the research project The primary goal of this research is to understand the statistical behaviors of IoT traffic and build a threat defense framework around the traffic patterns of M2M “things”. The objectives of this research are as follows: • Understand the data requirements of the problem • Build appropriate sensors to capture the required data • Come up with creative ways to visualize the M2M data through the information extracted • Apply machine learning techniques to gain more visibility into the network Initially, the research is launched with a full review of all the literature about threat defense in IoT that has been published since IoT gained attention. All the threat models and solutions that have been suggested can be analyzed and reviewed to satisfy objective 1. The next step is to gain an understanding of the Common Industrial Protocol (CIP), the primary network protocol in this research. The appropriate packet capture data required for the study (objectives 2 through 4) can be collected using network protocol analyzer software such as Wireshark®1 in an IoT network setting. When the network involves an IT component as well, the analyzer software will also allow capturing human application protocols such as SMTP (email). All packets can be captured in .pcap format, which can then be converted to CSV and other useful formats using a tailor-made data sensor built in C++. 1 https://www.wireshark.org
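The actual sensor was written in C++ against libpcap and its CIP-specific details are confidential; purely as an illustration of the generic packet-capture-to-CSV step, a minimal stand-in using the Python scapy library might look like the following (the file names and fields are assumptions, not the real sensor).

```python
# Illustrative sketch only: reads a .pcap capture and flattens a few generic
# per-packet fields into a CSV file. The real sensor was a C++/libpcap tool
# extracting CIP-specific attributes, which cannot be reproduced here.
import csv
from scapy.all import rdpcap, IP, TCP, UDP

def pcap_to_csv(pcap_path, csv_path):
    packets = rdpcap(pcap_path)           # load the capture into memory
    with open(csv_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["time", "src", "dst", "proto", "sport", "dport", "length"])
        for pkt in packets:
            if IP not in pkt:
                continue                   # skip non-IP frames in this sketch
            layer = TCP if TCP in pkt else UDP if UDP in pkt else None
            writer.writerow([
                float(pkt.time),
                pkt[IP].src,
                pkt[IP].dst,
                pkt[IP].proto,
                pkt[layer].sport if layer else "",
                pkt[layer].dport if layer else "",
                len(pkt),
            ])

pcap_to_csv("factory_capture.pcap", "factory_capture.csv")
```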
• 113. Chapter 7. A change of perspective. . . CISCO 100 Figure 7.2: Goals of gaining insight into IoT traffic Once the sensor is built, data will be recorded and, as an initial step, visualization plots will be built to gain immediate visibility into the IoT network. Supervised learning is used to build a model to predict the device type of different devices connected to the network (water pumps, actuators, Programmable Logic Controllers, etc.). As this project mainly deals with the Internet of Things space, my mentor and I had to collaborate with several external partners of Cisco who directly vend products for enterprise manufacturing spaces. For this reason, the project was protected by several Non-Disclosure and Cisco Confidential agreements, and I am therefore not in a position to elaborate much on the details of the project. 7.4.1.3 Results After carefully studying the CIP protocol specification, I pointed out that there are some important attributes (features) in the protocol that allow us to understand the behaviors of devices. The following figure shows the packet dissection of the CIP protocol. We designed a data sensor that can capture these features from the network traffic data. A patent application has already been filed for this method. Once the design was finalized, I built the sensor using C++.
• 114. Chapter 7. A change of perspective. . . CISCO 101 Figure 7.3: CIP Packet The sensor was then used to validate some sample traffic captures we had in the labs. Using the summary reports I built out of the sensor, we developed a visualization suite that allows an analyst • to see the different network connections in the factory plant, • to see the major traffic trends, • to see the composition of the traffic. Figure 7.4: Visualization application Furthermore, I was involved in building a machine learning based classifier to identify device types from the attributes of the packets. We formulated a method to fuse the operational layer data with information layer data to derive labels for the device type.
• 115. Chapter 7. A change of perspective. . . CISCO 102 Then we built a classifier that looks at the operational layer data only (which is visible to the network) and classifies devices. I am not allowed to elaborate on the technical details of the models. 7.4.1.4 Project outcomes The following are the outcomes of the project: Patent application: Method for providing IoE aware Netflow Patent application: Method for creating IoE behavioral models using Supervised Learning techniques Internal Whitepaper: IoT Sensing and Device Behaviour Modelling Technical Specification: Device classification in Indus. networks using Machine Learning Architectural Specification: Architecture for IoT aggregated Netflow for CIP protocol Report: IoT sensing App Development Report 7.5 Non-technical aspects Apart from the technical experience I received at Cisco, working in Silicon Valley has given me a brand new attitude towards the data science and software industry. Silicon Valley operates on a completely different level to what I have witnessed in my experience in Sri Lanka and my brief work experience in London. Although there is a mass of technical vectors in California, the cultural and social experience is also noteworthy. American culture is quite unique in comparison to European and Asian cultures. It is a very diverse culture where everybody has a place to rest. People are very liberal and welcoming. Apart from the openness of people and their extraordinary admiration of freedom, unique sports such as American Football and ice hockey also complement the unique experience. The support and friendliness of people is also very heartwarming. Living in California with 70 more international interns who crave to experience the
• 116. Chapter 7. A change of perspective. . . CISCO 103 foreign culture unveils a great opportunity to explore America while making new friends. You get the chance to explore your friends’ cultures while you explore America together. Silicon Valley is the heart of technology in the USA. The majority of the community here is involved in the IT industry. There are many social vectors that push your knowledge and experience in Silicon Valley. Some examples are: • Universities • Meetups • Companies • Hackathons 7.5.1 Universities The primary workforce suppliers to Silicon Valley are located right in the valley itself. Stanford University is located in Palo Alto, a town considered to be part of Silicon Valley itself. The University of California, Berkeley (UC Berkeley) is right across the Millennium bridge. In terms of machine learning, these two universities have some of the most renowned machine learning faculty in the world. Cisco itself builds research partnerships with these institutions, and these partnerships enable Cisco to gain knowledge of the latest research landscape and vice versa. 7.5.2 Meetups Similar to London, meetups are very popular in Silicon Valley. There is a very dynamic and ambitious crowd constantly pushing the community to learn and educate each other by sharing their experiences with technologies. There are numerous meetups related to machine learning itself that take place on a weekly basis. Some of them are: • SF Data Science Meetup • Bay Area Machine Learning Meetup • NLP Journal club • Spark Meetup • Bay Area NLP Meetups
• 117. Chapter 7. A change of perspective. . . CISCO 104 • Bay Area Big Data Meetup Apart from machine learning and big data, there are other meetups that are very useful. Some of them are: • Cyber security Meetup • Bay Area IoT Meetup • Quantum computing Meetup This social setting is complemented with community spaces where engineering and building new things are encouraged. The Hacker Dojo centre is a great example of this. Another amazing fact about these meetups is that you get to meet brilliant minds. Meeting Professor Trevor Hastie at a meetup about Gradient Boosting Machines and having the privilege to ask him questions personally is one of the most memorable moments that would not have been realized if not for the international internship. 7.5.3 Companies Silicon Valley is one of the most active tech startup regions. Most of the biggest players in the tech industry are headquartered here. For this reason, there are a lot of promotions and events organized by these different companies to grow their image and attract talent. There are events organized for young tech enthusiasts and college graduates from time to time where these companies share the sophistication of the work they do. They educate young engineers to inspire them to work with them and to build up their image. These events are very exciting and rich in knowledge at the same time. You get the opportunity to talk and discuss technical details with the very engineers who built those products. It provided me the opportunity to understand how some companies use large-scale data processing and machine learning frameworks such as Hadoop and Spark to cope with the scale of work they do. It has personally swayed me to drive my research interests towards large-scale machine learning and large-scale natural language processing. I have seen companies such as 0xdata and Databricks use distributed frameworks such as H2O and Spark to build systems that can work with terabytes of data on a daily basis. These personal experiences have triggered a personal interest in growing into an expert in large-scale machine learning.
• 118. Chapter 7. A change of perspective. . . CISCO 105 7.5.4 Hackathons Hackathons contribute positively towards shaping an early-stage professional. I have developed a lot of skills relating to, but not limited to, managing strict timelines, balancing quality and quantity over time, teamwork, and learning new things. They are a great opportunity to be creative and think out of the box with the limited resources provided. Hackathons also provide an excellent setting to learn and try new technologies. Apart from the technological uplift, you get the opportunity to meet new people with similar interests and see other people’s work. You also get the opportunity to show your creations to industry experts, entrepreneurs and investors to get their valuable feedback. 7.6 Benefits I have gained heaps of experience and expertise in both software engineering and machine learning by pursuing this project; the benefits are both technical and non-technical. 7.6.1 Technical Benefits The Cisco International Internship Program has helped me grow as a technologist in several ways. Some of the benefits are listed below: 7.6.1.1 Machine Learning and Large Scale data processing Through the work I carried out at Cisco, I have learned to think outside the conventional frame of machine learning and to be creative to get the work done. For instance, initial labeling of the data shouldn’t necessarily be a human intelligence task; the device label classification is a good example of this. Another valuable experience I got from my work at Cisco was working with large-scale data processing. Cisco has the capacity to capture and use very large datasets that are very difficult to manage on a personal computer. The datasets can span terabytes, and sequential processing of this data would take weeks to finish. At this scale, a large-scale distributed paradigm is very effective. Having to work with very large datasets, I gained first-hand experience in using Apache Spark and Apache Hadoop for distributed data processing.
• 119. Chapter 7. A change of perspective. . . CISCO 106 7.6.1.2 Training to manage a large pool of computing resources Due to the massive scale of the work, a large pool of resources is required to handle the datasets and the computations needed to process them. I trained myself to work with Cisco Unified Computing System (UCS) blade servers that can spin up around 20-40 virtual machines at the same time. I had the privilege of managing my own UCS server to • Spin up virtual machines for – Web servers – Computing servers – Installing Cisco products used in Proof of Concept demos – Building virtual clusters • Build virtual clusters to carry out distributed computing I have also learned how to manage and maintain my computing resources with minimal support from the lab administrator. 7.6.1.3 Building Data Sensors Data doesn’t always come in the most usable form. The ability to do background research and build the pre-processing tools is a good skill to have. I have learned to build data sensors that can analyze raw packet capture data and output records that contain sensible information that can be used for model building. Working with network-traffic-based anomaly detection at Cisco, I mastered the skill of using external libraries to build the right sensors that can transform raw data into more meaningful record sets. During the process I learned a fair amount of C++ and also about libpcap, the library used to read raw packet data. From this experience, I learned that being a data scientist doesn’t only demand domain knowledge, but also going the extra mile to carry out the engineering involved before the model building process starts. It has also given me experience and exposure to grow as a technologist with a stronger grasp of the engineering aspect as well. 7.6.1.4 Learning about Networking, Cyber-security and the Internet of Things Cisco Systems mainly deals with networking infrastructure. The OCTO-SBG team that I was placed in focused on cyber-security. My project was mainly focused on building machine
• 120. Chapter 7. A change of perspective. . . CISCO 107 learning and analytics around enterprise IoT networks. Having to work in this domain, I had the opportunity to learn core networking concepts. I also had to learn about security, including anomaly detection, classification and non-machine-learning knowledge relating to types of threats, encryption, visibility, etc. Having a high focus on enterprise IoT networks such as manufacturing lines, I had the opportunity to understand the protocol specifications in that space and to start building data products around them from scratch, beginning with the sensors to capture the right data. Through this journey, I have filed 2 patent applications for the data sensor and the machine learning techniques used for analytics. I have also become one of the subject experts on the Common Industrial Protocol (CIP). 7.6.1.5 Building data visualization techniques During my project on gaining visibility into IoT traffic, I was also involved in building an application to visualize enterprise traffic. I learned to use the D3 JavaScript library to visualize the data I generated from packet captures. I had to read about and understand the human element involved in the process. The most important part of security analytics is to present results in an analyst-intuitive way. Therefore, I had to learn how to present the extracted information in the simplest, yet most information-rich, way. 7.6.2 Non-Technical benefits I also gained a heap of non-technical benefits during my project at Cisco. Some of them can be described as follows: 7.6.2.1 Work Ethic and Discipline By working in a company like Cisco, I gained first-hand experience in how to operate in a large corporation. Being in a setting where you are minimally managed, I got the opportunity to train myself to be professional and deliver what is expected of me with minimal supervision. Working in a very independent setting where employees can work from a remote office or from home helped me improve my work discipline for such a setting. Having a personal mentor and weekly meetings with my manager helped me to always discuss my matters and progress under their guidance.
• 121. Chapter 7. A change of perspective. . . CISCO 108 7.6.2.2 Personal Skills After moving to San Jose, I have had the privilege to associate with and get to know a lot of brilliant minds, both professionally and personally. I have had the opportunity to associate with them to share our technological, cultural and personal interests and complement our knowledge. Getting to spend a year with 70 exciting interns from all over the world, coming from around 15 different nationalities, has itself been an unforgettable experience. Furthermore, the team and fellow colleagues have grown to be my personal friends. I have also had the privilege of working with several Distinguished Engineers who constantly keep in touch with me. I am confident that the personal and professional relations I have grown through this year will help me both personally and professionally in the years to come. I have also gained a lot of insight into the latest trends in Silicon Valley, and it has had an immense impact on my career prospects. I have had great exposure and insight into large-scale machine learning and have grown a new interest in enhancing my skillset along this line. I have also played a lead role in shaping IoT-related security products. As Internet of Things networks normally capture a massive amount of data, I am confident my grown interest in large-scale machine learning will complement more work in IoT. 7.7 Conclusion Cisco Systems is a great choice for an international internship as it gives very valuable experience to a fresh graduate by exposing them to the large-corporation work setting. There is a lot of room to improve your software engineering process and machine learning skills. Cisco is a place where you can grow as a professional, as you can train to discipline yourself with the well-defined process cycles and procedures that are put in place to manage the complexity of work. Being an intern at Cisco, I got the opportunity to train myself in many technical aspects apart from machine learning, as I had to take part in building the data sensors and the visualisation application. Cisco benefited me as I got the opportunity to use high-performance computing resources and manage them for my work. Apart from Cisco, this internship has given me a huge career boost with the knowledge and exposure I gained by visiting meetups, tradeshows and hackathons in Silicon Valley. I have also made a lot of friends and professional connections that will benefit me in doing great things in future.
  • 122. Appendix A Appendix The Test results in Phase 2 of the LDA experiment Source Code 109
• 123. Bibliography R. Meuth, P. Robinette, and D.C. Wunsch. Computational intelligence meets the netflix prize. In Neural Networks, 2008. IJCNN 2008. (IEEE World Congress on Computational Intelligence). IEEE International Joint Conference on, pages 686–691, June 2008. doi: 10.1109/IJCNN.2008.4633869. J.R. Spiegel, M.T. McKenna, G.S. Lakshman, and P.G. Nordstrom. Method and system for anticipatory package shipping, December 24 2013. URL http://www.google.com/patents/US8615473. US Patent 8,615,473. K. K. Ladha. The condorcet jury theorem, free speech, and correlated votes. American Journal of Political Science, 36(3):617–634, August 1992. URL http://www.jstor.org/discover/10.2307/2111584?uid=3738456&uid=2&uid=4&sid=21104114670901. A. Anastasi and S. Urbina. Psychological Testing. Prentice Hall, Upper Saddle River, NJ, USA, 7 edition, 1997. V. C. Raykar, S. Yu, Zhao L. H., G. H. Valadez, C. Florin, L. Bogoni, and L. Moy. Learning from crowds. Journal of Machine Learning Research, 11:1297–1322, April 2010. URL http://www.umiacs.umd.edu/labs/cvl/pirl/vikas/publications/raykar_JMLR_2010_crowds.pdf. Y. Bachrach, T. Minka, J. Guiver, and T. Graepel. How to grade a test without knowing the answers - a bayesian graphical model for adaptive crowdsourcing and aptitude testing. In 29th International Conference on Machine Learning, Edinburgh, Scotland, UK, 2012. ICML. URL http://research.microsoft.com/apps/pubs/default.aspx?id=164692. M. Danilevsky. Beyond bag-of-words: N-gram topic models. J. B. Lovins. Development of a stemming algorithm. Translation and Computational Linguistics, 11(1):22–31, 1968. C. D. Paice. Another stemmer. SIGIR Forum, 24(3):56–61, 1990. 110
• 124. Bibliography 111 M. F. Porter. An algorithm for suffix stripping. Program, 14(3):130–137, 1980. xapian.org. Stemming algorithms. http://xapian.org/docs/stemming.html. A. Rajaraman and J.D. Ullman. Data mining. In Mining of Massive Datasets, pages 1–17. doi:10.1017/CBO9781139058452.002. ISBN 9781139058452. F. Xia, T. Jicun, and L. Zhihui. A text categorization method based on local document frequency. In Fuzzy Systems and Knowledge Discovery, 2009. FSKD ’09. Sixth International Conference on, volume 7, pages 468–471, Tianjin, China, August 2009. IEEE. A. N. K. Zaman, P. Matsakis, and C. Brown. Evaluation of stop word lists in text retrieval using latent semantic indexing. In Digital Information Management (ICDIM), 2011 Sixth International Conference on, pages 133–136, Melbourne, QLD, Australia, September 2011. IEEE. D. E. Knuth. Seminumerical algorithms. In Art of Programming Vol. 2, page 694. Addison Wesley, 3 edition, 1998. Knuth also lists other names that were proposed for multisets, such as list, bunch, bag, heap, sample, weighted set, collection, and suite. Martin Rehák, Michal Pechoucek, Pavel Celeda, Jiri Novotný, and Pavel Minarík. CAMNEP: agent-based network intrusion detection system. In Michael Berger, Bernard Burg, and Satoshi Nishiyama, editors, AAMAS (Industry Track), pages 133–136. IFAAMAS, 2008. ISBN 978-0-9817381-3-0. URL http://doi.acm.org/10.1145/1402795.1402820. Wikipedia. tf-idf. URL http://en.wikipedia.org/wiki/Tf%E2%80%93idf. Retrieved on 12 February 2014. J. Ramos. Using tf-idf to determine word relevance in document queries. URL https://www.cs.rutgers.edu/~mlittman/courses/ml03/iCML03/papers/ramos.pdf. Retrieved on 15 February 2014. D. M. Blei and J. D. Lafferty. Topic models. URL https://www.cs.princeton.edu/~blei/papers/BleiLafferty2009.pdf. Retrieved on 15 February 2014. Sida Wang and Christopher D. Manning. Baselines and bigrams: Simple, good sentiment and topic classification. In ACL (2), pages 90–94. The Association for Computational Linguistics, 2012. ISBN 978-1-937284-25-1. URL http://www.aclweb.org/anthology/P12-2018. Jun Zhu, Amr Ahmed, and Eric P. Xing. MedLDA: maximum margin supervised topic models for regression and classification. In ICML, volume 382 of ACM International
  • 125. Bibliography 112 Conference Proceeding Series, page 158. ACM, 2009. ISBN 978-1-60558-516-1. URL http://doi.acm.org/10.1145/1553374.1553535. B. Taskar, C. Guestrin, and D. Koller. Max-margin markov networks. In Advances in Neural Information Processing Systems, volume 16, 2003. J. Zhu, A. Ahmed, and E. P. Xing. Med lda : Maximum margin supervised topic models. Journal of Machine Learning Research, 13:2237–2278, 2012. C. Cortes and V. Vapnik. Support-vector networks. Machine Learning, 20(3):273–297, September 1995. Boser, Guyon, and Vapnik. A training algorithm for optimal margin classifiers. In COLT: Proceedings of the Workshop on Computational Learning Theory, Morgan Kaufmann Publishers, 1992. Nello Cristianini and John Shawe-Taylor. An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods. Cambridge University Press, Cambridge, U.K., 2000. Thorsten Joachims. Text categorization with support vector machines: Learning with many relevant features. Technical Report LS VIII-Report, Universit¨at Dortmund, Dortmund, Germany, 1997. J. Kivinen, M. Warmuth, and P. Auer. Linear vs. logarithmic mistake bounds when few input variables are relevant. In The perceptron algorithm vs. winnow, 1995. Conference on Computational Learning Theory. Huma Lodhi, Craig Saunders, John Shawe-Taylor, Nello Cristianini, and Chris Watkins. Text classification using string kernels. Journal of Machine Learning Research, 2:419– 444, 2002. Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, and Richard Harshman. Indexing by latent semantic analysis. JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE, 41(6):391–407, 1990. Thomas Hofmann. Probabilistic latent semantic indexing. pages 50–57, 1993. Ulrike Von Luxburg. A tutorial on spectral clustering. Statistics and computing, 17(4): 395–416, 2007. J. Shi and J. Malik. Normalized cuts and image segmentation. In Transactions on Pattern Analysis and Machine Intelligence, volume 22, pages 888–905. IEEE, 2000.
  • 126. Bibliography 113 A. Ng, M. Jordan, Y. Weiss, T. Dietterich, S. Becker, and Z Ghahramani. On spectral clustering: analysis and an algorithm. Advances in Neural Information Processing Systems, 14:849–856, 2002. D. M. Blei and J. D. Lafferty. Correlated topic models. Advances in Neural Information Processing Systems, 18, 2006. D. M. Blei, M. I. Jordan, T. L. Griffiths, and J. B. Tenenbaum. Hierarchical topic models and the nested chinese restaurant process. In Advances in Neural Information Processing Systems, volume 16. MIT Press, 2004. David M. Blei. Probabilistic topic models. Commun. ACM, 55(4):77–84, 2012. Hanna M Wallach, Iain Murray, Ruslan Salakhutdinov, and David Mimno. Evaluation methods for topic models. 2009. David Newman, Jey Han Lau, Karl Grieser, and Timothy Baldwin. Automatic evalua- tion of topic coherence. In HLT-NAACL, pages 100–108. The Association for Com- putational Linguistics, 2010. ISBN 978-1-932432-65-7. 2012. URL https://www.zooniverse.org/. retrived on 19 November 2013. Berinsky, J. Adam, Huber, A. Gregory, Lenz, and S. Gabriel. Evaluating online labor markets for experimental research: Amazon.com’s mechanical turk. URL JournalistsResource.org. retrived on 18 June 2012. Paolacci, Gabriele, Chandler, Jesse, Ipeirotis, and Panos. Running experiments on amazon mechanical turk. 2010. Buhrmester, Michael, Kwang, Tracy, Gosling, and Sam. Amazon’s mechanical turk a new source of inexpensive, yet high-quality, data? Perspectives on Psychological Science, 6(1):3–5, January 2011. John Joseph Horton and Lydia B. Chilton. The labor economics of paid crowdsourcing. CoRR, abs/1001.0627, 2010. Mark Summerfield. Rapid GUI programming with Python and Qt: the definitive guide to PyQt programming. Prentice Hall open source software development series. Prentice- Hall, 2008. V. R. Guido. Setl (was: Lukewarm about range literals). URL https://mail.python. org/pipermail/python-dev/2000-August/008881.html. Retrieved 13 March 2011. T. Peters. Pep 20- the zen of python, 2008. W. McKinney. Python for Data Analysis. O’Reilly Media, Inc, 2013.
• 127. Bibliography 114 E. Bressert. SciPy and NumPy: an overview for developers. O’Reilly Media, Inc, 2012. Lars Buitinck, Gilles Louppe, Mathieu Blondel, Fabian Pedregosa, Andreas Mueller, Olivier Grisel, Vlad Niculae, Peter Prettenhofer, Alexandre Gramfort, Jaques Grobler, Robert Layton, Jake VanderPlas, Arnaud Joly, Brian Holt, and Gaël Varoquaux. API design for machine learning software: experiences from the scikit-learn project. CoRR, 2013. Steven Bird, Ewan Klein, Edward Loper, and Jason Baldridge. Multidisciplinary instruction with the natural language toolkit. January 01 2008. Steven Bird, Ewan Klein, and Edward Loper. Natural Language Processing with Python. O’Reilly & Associates, Inc., 2009. Jacob Perkins. Python Text Processing with NLTK 2.0 Cookbook. Packt Publishing, 2010. Radim Řehůřek and Petr Sojka. Software framework for topic modelling with large corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pages 45–50, Valletta, Malta, May 2010. ELRA. Matthew D. Hoffman, David M. Blei, and Francis R. Bach. Online learning for latent dirichlet allocation. In John D. Lafferty, Christopher K. I. Williams, John Shawe-Taylor, Richard S. Zemel, and Aron Culotta, editors, NIPS, pages 856–864. Curran Associates, Inc, 2010. Amazon. Understanding hit types, 2014. URL http://docs.aws.amazon.com/AWSMechTurk/latest/AWSMechanicalTurkRequester/Concepts_HITTypesArticle.html. Retrieved 13 March 2011. AbiWord. Dictionaries, 2005. URL http://www.abisource.com/~fjf/InvisibleAnt/Dictionaries.html. Retrieved 15 March 2011. Andrew Y. Ng. Feature selection, l1 vs. l2 regularization, and rotational invariance, 2004. URL http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.81.145; http://www.robotics.stanford.edu/~ang/papers/icml04-l1l2.pdf. Koby Crammer and Yoram Singer. On the algorithmic implementation of multiclass kernel-based vector machines. Journal of Machine Learning Research, 2:265–292, 2001. Grigorios Tsoumakas and Ioannis Katakis. Multi-label classification: An overview. IJDWM, 3(3):1–13, 2007. URL http://dx.doi.org/10.4018/jdwm.2007070101.
  • 128. Bibliography 115 Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Domain adaptation for large- scale sentiment classification: A deep learning approach, 2011. URL http://hal. archives-ouvertes.fr/hal-00752091. Shao-Yu Lien, Kwang-Cheng Chen, and Yonghua Lin. Toward ubiquitous massive ac- cesses in 3GPP machine-to-machine communications. IEEE Communications Maga- zine, 49(4):66–74, 2011. URL http://dx.doi.org/10.1109/MCOM.2011.5741148. Marthinus J. Booysen, John S. Gilmore, Sherali Zeadally, and Gert-Jan van Rooyen. Machine-to-machine (M2M) communications in vehicular networks. TIIS, 6(2):529– 546, 2012. URL http://dx.doi.org/10.3837/tiis.2012.02.005. Dusit Niyato, Xiao Lu, and Ping Wang. Machine-to-machine communications for home energy management system in smart grid. IEEE Communications Magazine, 49(4): 53–59, 2011.