Sales Prediction
Analyzing Twitter Data
Abstract:
The Chevrolet Camaro is an American automobile manufactured by Chevrolet; the sixth-generation car was launched
on February 27, 2016 in the United States and is yet to be launched in India. According to a recent report by SpeedLux, the 2016
Chevrolet Camaro is listed as the 4th most searched vehicle on Google this past year. Compared with last year's
line of Camaros from GM, this year's model is shorter, narrower, lower and lighter. Car Scoops believes consumers
will like these sportier physical changes.
So, it has been quite some time since this car hit the road with its enhanced specifications and new design. The car has
been the talk of the town since release and has received a lot of feedback from users via different social media sources. As I
mentioned in my previous deliverable, I chose to proceed with Twitter reviews and feedback for detecting the
sentiment of the posts, as this medium is widely used to share opinions. R programming has been used for the
sentiment analysis.
Introduction:
The core idea of this paper is to detect and understand how the audience responded to this sixth-generation
car by performing sentiment analysis on data captured from tweets. As previously mentioned,
sentiment analysis is the process of determining the emotional tone behind a series of words, used to gain an
understanding of the opinions and emotions expressed on an online platform. Sentiment analysis is used for
social media monitoring, tracking of product reviews, analyzing survey responses and in business analytics. The
ability to extract insights from social data is a practice that is being widely adopted by organizations across the
world. Machine learning makes sentiment analysis more convenient. I chose R for the
sentiment analysis as it has the sentiment and RTextTools packages and the more general text mining package (tm), which
come in handy in detailed analysis. Text analysis in R is well established, and the tm package plays a major role in the
analysis: it is a framework for text mining applications within R, and it handles text cleaning
(stemming, stop-word removal, etc.) and transforming texts into a document-term matrix (DTM).
Data Analysis:
Before describing the steps involved in the analysis, below are the important packages which are essential in
the data analysis process:
 twitteR : Provides an interface to the Twitter web API
 ROAuth: Provides an interface to the OAuth 1.0 specification allowing users to authenticate via OAuth to the
server of their choice.
 plyr : Provides tools for Splitting, Applying and Combining Data. A set of tools that solves a common set of
problems: you need to break a big problem down into manageable pieces, operate on each piece and then put
all the pieces back together.
 stringr: Simple, consistent wrappers for common string operations. A consistent, simple and easy-to-use set
of wrappers around the 'stringi' package.
 ggplot2 : Create Elegant Data Visualizations Using the Grammar of Graphics. A system for 'declaratively'
creating graphics, based on ``The Grammar of Graphics''. You provide the data, tell 'ggplot2' how to map
variables to aesthetics, what graphical primitives to use, and it takes care of the details.
 httr: Tools for working with URLs and HTTP, organized by HTTP verbs (GET(),
POST(), etc.). Configuration functions make it easy to control additional request components (authenticate(),
add_headers() and so on).
 wordcloud: Plots a cloud of words, and can compare the frequencies of words across documents.
 sentiment: Sentiment classification tools; provides the classify_emotion() and classify_polarity() functions used
below (not to be confused with sentimentr, which calculates text polarity at the sentence level and optionally
aggregates by rows or grouping variables).
 SnowballC: An R interface to the C libstemmer library that implements Porter's word stemming algorithm for
collapsing words to a common root to aid comparison of vocabulary.
 tm: The tm package offers functionality for managing text documents: it abstracts the process of document
manipulation and eases the handling of heterogeneous text formats in R. The package has integrated database
back-end support to minimize memory demands, and advanced metadata management is implemented for
collections of text documents to ease working with large, metadata-enriched document sets. tm
provides easy access to preprocessing and manipulation mechanisms such as whitespace removal, stemming,
or stopword deletion. Further, a generic filter architecture is available to filter documents by certain criteria
or perform full-text search. The package supports exporting document collections to term-document
matrices.
 tmap: Thematic maps are geographical maps in which spatial data distributions are visualized. This package
offers a flexible, layer-based, easy-to-use approach to creating thematic maps, such as choropleths and
bubble maps.
 RColorBrewer: Provides color schemes for maps (and other graphics)
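Most of the packages above can be installed from CRAN in one call. A sketch follows; note that the sentiment package used in sections 2 and 3 was removed from CRAN, so the usual workaround is installing it (and its dependency Rstem) from the CRAN archive. The archive URLs below are the commonly cited locations and may change.

```r
# Install the CRAN packages in one call.
install.packages(c("twitteR", "ROAuth", "plyr", "dplyr", "stringr",
                   "ggplot2", "httr", "wordcloud", "SnowballC",
                   "tm", "tmap", "RColorBrewer"))

# The 'sentiment' package (classify_emotion / classify_polarity) is archived;
# install it and its dependency 'Rstem' from the CRAN archive as source.
install.packages("http://cran.r-project.org/src/contrib/Archive/Rstem/Rstem_0.4-1.tar.gz",
                 repos = NULL, type = "source")
install.packages("http://cran.r-project.org/src/contrib/Archive/sentiment/sentiment_0.2.tar.gz",
                 repos = NULL, type = "source")
```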
1. Building word cloud using R
A word cloud is a text mining visualization that highlights the most frequently used keywords in a
body of text. It is also referred to as a text cloud or tag cloud. Building a word cloud is a powerful companion
to text mining: word clouds add simplicity and clarity, are easy to understand and share, and are impactful.
Word clouds are more visually engaging than tabular data. The size of each word in the picture indicates the
frequency of its occurrence in the entire text. To build the word cloud, we follow the steps below:
Step 1: Setting up the working Directory in RStudio
setwd("C:/Users/nagar/Desktop/Extras/R")
Step 2: Installing and loading the necessary packages.
Below is the list of packages required for building the word cloud. The functionality of each of these packages has
been explained under Necessary packages.
library(twitteR)
library(ROAuth)
library(plyr)
library(dplyr)
library(stringr)
library(ggplot2)
library(httr)
library(wordcloud)
library(sentiment)
library(tm)           # Corpus(), tm_map() used below
library(SnowballC)    # stemDocument()
library(RColorBrewer) # brewer.pal()
Step 3: Create a corpus from the collection of text files.
mydata <- read.csv("C:/Users/nagar/Desktop/Extras/R/camaro.csv")
mycorpus <- Corpus(VectorSource(mydata$text))
After reading the data from the CSV file into the variable mydata with read.csv, a corpus is created. The text
is loaded using the Corpus() function from the text mining (tm) package. A corpus is a collection of documents;
with VectorSource, each element of mydata$text (each tweet) becomes one document.
Step 4: Create structured data from the text file
mycorpus <- tm_map(mycorpus, content_transformer(tolower)) # wrap base functions in content_transformer() for tm >= 0.6
mycorpus <- tm_map(mycorpus, removePunctuation)
mycorpus <- tm_map(mycorpus, removeNumbers)
mycorpus <- tm_map(mycorpus, removeWords, stopwords(kind = "en"))
mycorpus <- tm_map(mycorpus, stripWhitespace)
mycorpus <- tm_map(mycorpus, stemDocument)
pal <- brewer.pal(8, "Dark2") # colour palette used later by the word cloud
The tm_map() function is used to remove unnecessary white space, convert the text to lower case, and remove
common stopwords like "the" and "we". The information value of stopwords is near zero because they
are so common in a language, so removing them is useful before further analysis. For stopwords,
the supported languages are danish, dutch, english, finnish, french, german, hungarian, italian, norwegian,
portuguese, russian, spanish and swedish. Language names are case sensitive.
Here, I have also removed numbers and punctuation with the removeNumbers and removePunctuation transformations.
Another important preprocessing step is text stemming, which reduces words to their root form. In
other words, this process removes suffixes from words to simplify them and recover a common origin. For
example, a stemming process reduces the words "moving" and "moved" to the root word "move".
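The effect of Porter stemming can be checked directly with SnowballC's wordStem() function, a quick illustration:

```r
library(SnowballC)  # Porter stemmer via the C libstemmer library

# "moving" and "moved" collapse to the common root "move";
# some forms (e.g. "movement") may be kept by the Porter rules.
wordStem(c("moving", "moved", "movement"), language = "english")
```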
Step 5: Making the word cloud using the structured form of the data.
wordcloud(mycorpus, min.freq = 3, max.words = Inf,
random.order = FALSE, colors = pal) # note: colors, not Color; image size is set on the graphics device, not here
Arguments of the word cloud generator function:
words : the words to be plotted
freq : their frequencies
min.freq : words with frequency below min.freq will not be plotted
max.words : maximum number of words to be plotted
random.order : plot words in random order; if FALSE, they are plotted in decreasing frequency
rot.per : proportion of words with 90-degree rotation (vertical text)
colors : colors for words from least to most frequent; use, for example, colors = "black" for a single color
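Width and height are not wordcloud() arguments; they belong to the graphics device. To save the cloud as a fixed-size image, the call can be wrapped in a png() device (a sketch; the output filename is hypothetical):

```r
# Open a 1000 x 1000 pixel PNG device, draw the cloud, then close the device.
png("camaro_wordcloud.png", width = 1000, height = 1000)
wordcloud(mycorpus, min.freq = 3, max.words = Inf,
          random.order = FALSE, colors = pal)
dev.off()
```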
Word Cloud on Camaro tweets:
2. Classify Emotion and publish graph:
Classification of emotion in R is achieved using the function classify_emotion. This
function analyzes text and classifies it into different types of emotion: anger, disgust, fear, joy,
sadness, and surprise. For this, I am using a naive Bayes classifier trained on Carlo Strapparava and Alessandro
Valitutti's emotions lexicon.
Below is the detailed description of the steps involved in Emotion Classification:
Step 1: Setting up the working Directory in RStudio
setwd("C:/Users/nagar/Desktop/Extras/R")
Step 2: Installing and loading the necessary packages.
Below is the list of packages required for Emotion classification. The functionality of each of these packages has
been explained under Necessary packages.
library(twitteR)
library(ROAuth)
library(plyr)
library(dplyr)
library(stringr)
library(ggplot2)
library(httr)
library(wordcloud)
library(sentiment)
Step 3: Prepare the text for sentiment analysis.
Before starting the sentiment analysis, we need to have our data file ready in the working directory. Alternatively,
we can connect directly to Twitter from R and use OAuth to authorize our
credentials before proceeding with the analysis; this requires a Twitter account, a developer account,
and all the authentication keys needed to establish the connection. Here, I
have downloaded the CSV data file and saved it in the working directory set in the step above, then loaded the
data file into a variable for further processing.
data <- read.csv("C:/Users/nagar/Desktop/Extras/R/Book1.csv")
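The live-connection alternative described above can be sketched with twitteR, assuming an app has been created in the Twitter developer portal; the key and secret values below are placeholders, and the search query is illustrative:

```r
library(twitteR)

# Placeholder credentials from a Twitter developer app: replace with your own.
setup_twitter_oauth(consumer_key    = "YOUR_CONSUMER_KEY",
                    consumer_secret = "YOUR_CONSUMER_SECRET",
                    access_token    = "YOUR_ACCESS_TOKEN",
                    access_secret   = "YOUR_ACCESS_SECRET")

# Pull recent English-language tweets mentioning the car.
tweets <- searchTwitter("Chevrolet Camaro", n = 1500, lang = "en")
data   <- twListToDF(tweets)  # one row per tweet, same shape as the CSV
```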
Step 4: Perform Sentiment Analysis.
class_emo = classify_emotion(data$text, algorithm = "bayes", prior = 1.0) # classify the text column, not the whole data frame
Of the three parameters in the call above, the first is the vector of texts being
classified (the text column of the data file), and the second is the algorithm, in our case "bayes":
a string indicating whether to use the naive Bayes algorithm or a simple voter algorithm. The third parameter is
a numeric giving the prior probability to use for the naive Bayes classifier.
emotion = class_emo[,7]
classify_emotion returns a matrix with one row per document and seven columns: a score for each of the
emotions anger, disgust, fear, joy, sadness and surprise, plus a best-fit column. The seventh column, extracted
above, is the best-fit emotion.
emotion[is.na(emotion)] = "unknown"
Lastly, NA values are substituted with "unknown" in this step.
Step 5: Create and Sort the data frame.
sent_df = data.frame(text = data, emotion = emotion, stringsAsFactors = FALSE)
The function data.frame() creates data frames: tightly coupled collections of variables which share many of the
properties of matrices and of lists, used as the fundamental data structure by most of R's modeling software. The
first argument is our data file and the second is the emotion classified
in the steps above. stringsAsFactors = FALSE prevents character columns from being converted
to factors, so the tweet text stays as plain strings. The data frame is
stored in the sent_df variable, which is sorted by the command below.
sent_df = within(sent_df, emotion <- factor(emotion, levels = names(sort(table(emotion), decreasing = TRUE))))
Here, we use factor function to create a factor. The only required argument to factor is a vector of values which will
be returned as a vector of factor values. To change the order in which the levels will be displayed from their default
sorted order, the levels= argument can be given a vector of all the possible values of the variable in the order you
desire. The sorted data is again stored into the variable sent_df.
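The reordering trick can be seen on a toy vector (the values are made up for illustration):

```r
x <- c("joy", "joy", "anger", "sadness", "joy", "anger")

# table() counts occurrences, sort(..., decreasing = TRUE) orders the counts,
# and names() extracts the category labels in that order.
lv <- names(sort(table(x), decreasing = TRUE))
factor(x, levels = lv)
# Levels: joy anger sadness  (most frequent category first)
```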
Step 6: Plotting the Statistics.
ggplot(sent_df, aes(x = emotion)) +
  geom_bar(aes(y = ..count.., fill = emotion)) +
  scale_fill_brewer(palette = "Dark2") +
  labs(x = "emotion categories", y = "number of tweets", title = "classification based on emotion")
ggplot is one of the most versatile plotting functions in R and enables a very wide range of useful plots. The aes
function takes name-value pairs mapping variables to aesthetics; in the current case the x aesthetic is emotion. The bar geom is
used to produce 1d area plots: bar charts for categorical x and histograms for continuous x, as shown below. The color
selection can be changed with one of the scale functions, such as scale_fill_brewer. The x and y labels specify the text placed
on the horizontal and vertical axes.
Below is the Emotion classification plot for the Camaro tweets:
3. Classify Polarity and publish graph:
A fundamental task in sentiment analysis is polarity detection: the classification of the polarity of a given
text, whether the opinion expressed is positive, negative or neutral. This approach uses a supervised learning
algorithm to build a classifier that will detect polarity of textual data and classify it as either positive or negative. It
uses an opinionated dataset to train the classifier, data processing techniques to pre-process the textual data and
simple rules for categorizing text as positive or negative.
I am using the naive Bayes classifier to classify the sentences as positive or negative. As the
name suggests, this works by implementing the naive Bayes algorithm: it estimates
whether a sentence is positive or negative by counting how many of its words fall in each category and relating these counts
to the probabilities of those words appearing in positive and negative sentences.
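The word-count intuition can be sketched in a few lines of R. This is a toy illustration with made-up word counts and uniform class priors, not the classifier the sentiment package actually ships:

```r
# Toy training data: word frequencies observed in positive and negative tweets.
pos_counts <- c(love = 4, great = 3, fast = 2, slow = 0)
neg_counts <- c(love = 0, great = 1, fast = 1, slow = 3)

# Laplace-smoothed word likelihoods P(word | class).
p_word_pos <- (pos_counts + 1) / sum(pos_counts + 1)
p_word_neg <- (neg_counts + 1) / sum(neg_counts + 1)

# Score a sentence by multiplying the likelihoods of its words
# (uniform priors, so they cancel out of the comparison).
sentence  <- c("love", "fast")
score_pos <- prod(p_word_pos[sentence])
score_neg <- prod(p_word_neg[sentence])
ifelse(score_pos > score_neg, "positive", "negative")
# -> "positive": "love" and "fast" are far more likely under the positive counts
```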
The steps involved in polarity classification are similar to those for emotion classification, except that the classify_polarity
function is used to classify the text as positive or negative, and the data frame creation and sorting use the
polarity result in this case. Below are the steps involved in polarity classification.
Step 1: Setting up the working Directory in RStudio
setwd("C:/Users/nagar/Desktop/Extras/R")
Step 2: Installing and loading the necessary packages.
Below is the list of packages required for polarity classification. The functionality of each of these packages has
been explained under Necessary packages.
library(twitteR)
library(ROAuth)
library(plyr)
library(dplyr)
library(stringr)
library(ggplot2)
library(httr)
library(wordcloud)
library(sentiment)
Step 3: Prepare the text for sentiment analysis.
Before starting the sentiment analysis, we need to have our data file ready in the working directory. Alternatively,
we can connect directly to Twitter from R and use OAuth to authorize our
credentials before proceeding with the analysis; this requires a Twitter account, a developer account,
and all the authentication keys needed to establish the connection. Here, I
have downloaded the CSV data file and saved it in the working directory set in the step above, then loaded the
data file into a variable for further processing.
data <- read.csv("C:/Users/nagar/Desktop/Extras/R/Book1.csv")
Step 4: Perform Sentiment Analysis.
class_pol = classify_polarity(data$text, algorithm = "bayes") # classify the text column, not the whole data frame
In contrast to the classification of emotions, the classify_polarity function allows us to classify text as positive
or negative. In this case, the classification can be done using a naive Bayes algorithm trained on Janyce Wiebe's
subjectivity lexicon, or by a simple voter algorithm.
polarity = class_pol[,4]
classify_polarity returns a matrix with one row per document and four columns: positive and negative scores, their
ratio, and a best-fit column. The fourth column, extracted above, is the best-fit polarity.
Step 5: Create and Sort the data frame.
sent_df = data.frame(text = data, polarity = polarity, stringsAsFactors = FALSE)
The function data.frame() creates data frames: tightly coupled collections of variables which share many of the
properties of matrices and of lists, used as the fundamental data structure by most of R's modeling software. The
first argument is our data file and the second is the polarity classified
in the steps above. stringsAsFactors = FALSE prevents character columns from being converted to factors, so the
tweet text stays as plain strings. The data frame is stored in the sent_df
variable, which is sorted by the command below.
sent_df1 = within(sent_df, polarity <- factor(polarity, levels = names(sort(table(polarity), decreasing = TRUE))))
Here, we use the factor function to create a factor. The only required argument to factor is a vector of values, which will
be returned as a vector of factor values. To change the order in which the levels will be displayed from their default
sorted order, the levels= argument can be given a vector of all the possible values of the variable in the order you
desire. The sorted data is stored in the variable sent_df1.
Step 6: Plotting the Statistics.
ggplot(sent_df1, aes(x = polarity)) +
  geom_bar(aes(y = ..count.., fill = polarity)) +
  scale_fill_brewer(palette = "Dark2") +
  labs(x = "polarity categories", y = "number of tweets", title = "classification based on polarity")
ggplot is one of the most versatile plotting functions in R and enables a very wide range of useful plots. The aes
function takes name-value pairs mapping variables to aesthetics; in the current case the x aesthetic is polarity. The bar geom is
used to produce 1d area plots: bar charts for categorical x and histograms for continuous x, as shown below. The color
selection can be changed with one of the scale functions, such as scale_fill_brewer. The x and y labels specify the text placed
on the horizontal and vertical axes.
Polarity graph of Camaro tweets:
Conclusions:
The insight from the word cloud of the Chevrolet Camaro data is that the highest-frequency
talk is about the engine, seat, speed, power, drive and a few other aspects captured
in the cloud. A visual representation such as a word cloud tends to have an impact and generates interest
amongst the audience. For further analysis, it may stimulate more questions than it answers, but that is a good
entry point for discussion. Based on the emotion classification, we see that the joy category stands out by a big
margin, followed by anger, surprise and sadness in relatively smaller proportions. Joy adds to the positive outlook of the car
in the market; further analysis of this data can be carried out using machine learning techniques. Based on the polarity
classification, the highest proportion falls under the neutral category. It has been shown that specific classifiers such as
maximum entropy and SVMs can benefit from the introduction of a neutral class and improve the overall
accuracy of the classification. The other approach in the current case is estimating a probability distribution over
all categories. Since the data is clearly clustered into neutral, negative and positive language, it makes sense to
filter the neutral language out and focus on the polarity between positive and negative sentiments. Open source
software tools deploy machine learning, statistics, and natural language processing techniques to automate
sentiment analysis on large collections of data similar to the Camaro review data.
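Filtering the neutral class out and re-examining the positive/negative split is straightforward with the polarity data frame built in section 3 (sent_df1):

```r
# Drop neutral tweets and recompute the positive/negative proportions.
polar_only <- subset(sent_df1, polarity != "neutral")
prop.table(table(droplevels(polar_only$polarity)))
```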
As I mentioned in my previous deliverables, it was indeed a great learning process from an R programming
perspective. I am glad that I could pull together all the important aspects of sentiment analysis from several
different areas into one unified program. I believe sentiment analysis is an evolving field with a variety of
applications. Although sentiment analysis tasks are challenging due to their natural language processing
origins, much progress has been made over the last few years due to the high demand for it. Sentiment analysis
within microblogging has shown that Twitter can be seen as a valid online indicator of political sentiment: the
political sentiment of tweets corresponds closely to parties' and politicians' political positions, indicating
that the content of Twitter messages plausibly reflects the offline political landscape.
References:
 Sentiment Analysis and Opinion Mining by Bing Liu
(http://www.cs.uic.edu/~liub/FBS/SentimentAnalysisandOpinionMining.html)
 Sentiment Analysis by Professor Dan Jurafsky (https://web.stanford.edu/class/cs124/lec/sentiment.pdf)
 S. Blair-Goldensohn, Hannan, McDonald, Neylon, Reis and Reynar (2008), "Building a Sentiment
Summarizer for Local Service Reviews" (http://www.ryanmcd.com/papers/local_service_summ.pdf)
 S. Asur et al., "Predicting the Future With Social Media", arXiv:1003.5699
 R. Sharda et al., "Forecasting Box-Office Receipts of Motion Pictures Using Neural Networks", CiteSeerX
2002
 http://www.businessinsider.com/apple-and-samsung-just-revealed-their-exact-us-sales-figures-for-the-first-ever-time2012-8
 https://www.coursera.org/learn/r-programming
 http://www.bigdatanews.com/profiles/blogs/learn-everything-about-sentiment-analysis-using-r
 Koweika A., Gupta A., Sondhi K. (2013), "Sentiment analysis for social media", International Journal of
Advanced Research in Computer Science and Software Engineering
 Younggue B., Hongchul L. (2012), "Sentiment analysis of Twitter audience: Measuring the positive or
negative influence", Journal of the American Society for Information Science and Technology
 http://stackoverflow.com/questions/10233087/sentiment-analysis-using-r
 https://rpubs.com/cen0te/naivebayes-sentimentpolarity
 https://www.youtube.com/watch?v=oXZThwEF4r0
More Related Content

What's hot (16)

PDF
18 Khan Precis
Imran Khan
 
PDF
Malware analysis
Roberto Falconi
 
PPTX
Natural Language Processing in R (rNLP)
fridolin.wild
 
PDF
Semi-Automated Assistance for Conceiving Chatbots
Jean-Léon BOURAOUI - MBA, PhD
 
PPT
Chapter7
lopjuan
 
PPT
mailfilter.ppt
butest
 
PPTX
Text analytics in Python and R with examples from Tobacco Control
Ben Healey
 
PDF
Venice boats classification
Roberto Falconi
 
PPTX
Text Mining Infrastructure in R
Ashraf Uddin
 
PDF
C0211822
inventionjournals
 
PDF
Inferring the Socioeconomic Status of Social Media Users based on Behaviour a...
Vasileios Lampos
 
PDF
DS4G
Rich Heimann
 
PPTX
hands on: Text Mining With R
Jahnab Kumar Deka
 
PDF
Pbcbt an improvement of ntbcbt algorithm
ijp2p
 
PPTX
Deep learning based recommender systems (lab seminar paper review)
hyunsung lee
 
PDF
Data Tactics Data Science Brown Bag (April 2014)
Rich Heimann
 
18 Khan Precis
Imran Khan
 
Malware analysis
Roberto Falconi
 
Natural Language Processing in R (rNLP)
fridolin.wild
 
Semi-Automated Assistance for Conceiving Chatbots
Jean-Léon BOURAOUI - MBA, PhD
 
Chapter7
lopjuan
 
mailfilter.ppt
butest
 
Text analytics in Python and R with examples from Tobacco Control
Ben Healey
 
Venice boats classification
Roberto Falconi
 
Text Mining Infrastructure in R
Ashraf Uddin
 
Inferring the Socioeconomic Status of Social Media Users based on Behaviour a...
Vasileios Lampos
 
hands on: Text Mining With R
Jahnab Kumar Deka
 
Pbcbt an improvement of ntbcbt algorithm
ijp2p
 
Deep learning based recommender systems (lab seminar paper review)
hyunsung lee
 
Data Tactics Data Science Brown Bag (April 2014)
Rich Heimann
 

Similar to Sales_Prediction_Technique using R Programming (20)

PDF
Data Science - Part II - Working with R & R studio
Derek Kane
 
DOCX
Twitter analysis by Kaify Rais
Ajay Ohri
 
PPTX
Chapter 2 Flutter Basics Lecture 1.pptx
farxaanfarsamo
 
PPTX
BUSINESS ANALYTICS WITH R SOFTWARE DIAST
HaritikaChhatwal1
 
PPTX
R_L1-Aug-2022.pptx
ShantilalBhayal1
 
PPTX
Unit 3
Piyush Rochwani
 
DOCX
SURE Research Report
Alex Sumner
 
PDF
RDataMining slides-text-mining-with-r
Yanchang Zhao
 
DOCX
employee turnover prediction document.docx
rohithprabhas1
 
PDF
PRELIM-Lesson-2.pdf
jaymaraltamera
 
PPTX
Twitter_Sentiment_analysis.pptx
JOELFRANKLIN13
 
PDF
SessionTen_CaseStudies
Hellen Gakuruh
 
PPTX
Text Analytics
Ajay Ram
 
PDF
Language-agnostic data analysis workflows and reproducible research
Andrew Lowe
 
PDF
IRJET- Sentiment Analysis of Election Result based on Twitter Data using R
IRJET Journal
 
PDF
Overview of Movie Recommendation System using Machine learning by R programmi...
IRJET Journal
 
PPTX
Data Science With R Programming Unit - II Part-1.pptx
narasimharaju03
 
PPTX
Data science with R Unit - II Part-1.pptx
narasimharaju03
 
PDF
Building API Powered Chatbot & Application using AI SDK (1).pdf
diliphembram121
 
PDF
C interview-questions-techpreparation
Kushaal Singla
 
Data Science - Part II - Working with R & R studio
Derek Kane
 
Twitter analysis by Kaify Rais
Ajay Ohri
 
Chapter 2 Flutter Basics Lecture 1.pptx
farxaanfarsamo
 
BUSINESS ANALYTICS WITH R SOFTWARE DIAST
HaritikaChhatwal1
 
R_L1-Aug-2022.pptx
ShantilalBhayal1
 
SURE Research Report
Alex Sumner
 
RDataMining slides-text-mining-with-r
Yanchang Zhao
 
employee turnover prediction document.docx
rohithprabhas1
 
PRELIM-Lesson-2.pdf
jaymaraltamera
 
Twitter_Sentiment_analysis.pptx
JOELFRANKLIN13
 
SessionTen_CaseStudies
Hellen Gakuruh
 
Text Analytics
Ajay Ram
 
Language-agnostic data analysis workflows and reproducible research
Andrew Lowe
 
IRJET- Sentiment Analysis of Election Result based on Twitter Data using R
IRJET Journal
 
Overview of Movie Recommendation System using Machine learning by R programmi...
IRJET Journal
 
Data Science With R Programming Unit - II Part-1.pptx
narasimharaju03
 
Data science with R Unit - II Part-1.pptx
narasimharaju03
 
Building API Powered Chatbot & Application using AI SDK (1).pdf
diliphembram121
 
C interview-questions-techpreparation
Kushaal Singla
 
Ad

Sales_Prediction_Technique using R Programming

  • 2. Abstract: The Chevrolet Camaro is an American automobile manufactured by Chevrolet, Sixth generation car launched on Feb 27 2016 in United States, yet to be launched in India. According to a recent report by SpeedLux, the 2016 Chevrolet Camaro is listed as the 4th most searched vehicle on Google this past year. Compared with last year's line of Camaros from GM, this year's model is shorter, narrower, lower and lighter. Car Scoops believes consumers will like these sportier physical changes. So, it’s been quite some time this car has hit the road with its enhanced specifications and new design. The car has been talk of the town since release and received lot of feedback from users via different social media sources. As I mentioned in my previous deliverable, I choose to proceed with Twitter reviews and feedbacks for detecting the sentiments of the posts as this means is highly used to share opinions. R programming has been used for sentiment analysis. Introduction: The core idea of this paper is detecting and understanding how the audience responded to this sixth generation car and do the sentiment analysis on the data captured part of tweets. As previously mentioned, sentiment analysis is the process of determining the emotional tone behind a series of words used to gain an understanding of the opinions and emotions expressed within an online platform. Sentiment analysis is used for social media monitoring, tracking of products reviews, analyzing survey responses and in business analytics. The ability to extract insights from social data is a practice that is being widely adopted by organizations across the world. Machine learning makes sentiment analysis more convenient. I choose R programming to do the sentiment analysis as it has sentiment R, RTextTools packages and the more general text mining package which come handy in detailed analysis. Text analysis in R has been well recognized. tm package plays a bigger role in the
  • 3. analysis. tm package is a framework for text mining applications within R. It did a good job for text cleaning (stemming, delete the stop words etc.) and transforming texts to document-term matrix (dtm). Data Analysis: Before describing the steps involved in the analysis, below are the important packages which are essential in the data analysis process:  twitteR : Provides an interface to the Twitter web API  ROAuth: Provides an interface to the OAuth 1.0 specification allowing users to authenticate via OAuth to the server of their choice.  plyr : Provides tools for Splitting, Applying and Combining Data. A set of tools that solves a common set of problems: you need to break a big problem down into manageable pieces, operate on each piece and then put all the pieces back together.  Stringr: Simple, Consistent Wrappers for Common String Operations. A consistent, simple and easy to use set of wrappers around the fantastic 'stringr' package.  ggplot2 : Create Elegant Data Visualizations Using the Grammar of Graphics. A system for 'declaratively' creating graphics, based on ``The Grammar of Graphics''. You provide the data, tell 'ggplot2' how to map variables to aesthetics, what graphical primitives to use, and it takes care of the details.  Httr: Tools for Working with URLs and HTTP. Useful tools for working with HTTP organized by HTTP verbs (GET(), POST(), etc.). Configuration functions make it easy to control additional request components (authenticate(), add_headers() and so on).  Wordcloud: Plot a cloud of words shared comparing the frequencies of words across documents.  Sentimentr: Calculate Text Polarity Sentiment t at the sentence level and optionally aggregate by rows or grouping variable(s).  SnowballC: An R interface to the C libstemmer library that implements Porter's word stemming algorithm for collapsing words to a common root to aid comparison of vocabulary. 
 Tm: The tm package offers functionality for managing text documents, abstracts the process of document manipulation and eases the usage of heterogeneous text formats in R. The package has integrated database back-end support to minimize memory demands. An advanced meta data management is implemented for
  • 4. collections of text documents to alleviate the usage of large and with meta data enriched document sets. tm provides easy access to preprocessing and manipulation mechanisms such as whitespace removal, stemming, or stopword deletion. Further a generic filter architecture is available to filter documents for certain criteria, or perform full text search. The package supports the export from document collections to term-document matrices.  Tmap: Thematic maps are geographical maps in which spatial data distributions are visualized. This package offers a flexible, layer-based, and easy to use approach to create thematic maps, such as choropleths and bubble maps.  RColorBrewer: Provides color schemes for maps (and other graphics) 1. Building word cloud using R A word cloud is a text mining method that allows us to highlight the most frequently used keywords in a paragraph of texts. It is also referred to as a text cloud or tag cloud. Building word cloud is a powerful method for text mining and, it adds simplicity and clarity. They are easy to understand, to be shared and are impactful. Word clouds are visually engaging than a table data. The height of each word in this picture is an indication of frequency of occurrence of the word in the entire text. For word cloud formation, we follow the below steps: Step 1: Setting up the working Directory in RStudio setwd("C:/Users/nagar/Desktop/Extras/R") Step 2: Installing and loading the necessary packages. Below is the list of packages required for Emotion classification. The functionality of each of these packages has been explained under Necessary packages. library(twitteR) library(ROAuth) library(plyr) library(dplyr) library(stringr) library(ggplot2) library(httr) library(wordcloud) library(sentiment) Step 3: Create a corpus from the collection of text files. mydata <- read.csv("C:/Users/nagar/Desktop/Extras/R/camaro.csv")
mycorpus <- Corpus(VectorSource(mydata$text))

After reading the data from the CSV file into the variable mydata with read.csv, a corpus is created. The text is loaded using the Corpus() function from the text mining (tm) package. A corpus is a collection of documents; with VectorSource, each tweet becomes one document.

Step 4: Create structured data from the text file

mycorpus <- tm_map(mycorpus, content_transformer(tolower))
mycorpus <- tm_map(mycorpus, removePunctuation)
mycorpus <- tm_map(mycorpus, removeNumbers)
mycorpus <- tm_map(mycorpus, removeWords, stopwords(kind = "en"))
mycorpus <- tm_map(mycorpus, stripWhitespace)
mycorpus <- tm_map(mycorpus, stemDocument)
pal <- brewer.pal(8, "Dark2")

The tm_map() function is used to convert the text to lower case (base R functions such as tolower must be wrapped in content_transformer() in current versions of tm), to remove punctuation and numbers, to strip extra whitespace, and to remove common stopwords like "the" and "we". The information value of stopwords is near zero because they are so common in a language, so removing them is useful before further analysis. For stopwords, the supported languages are Danish, Dutch, English, Finnish, French, German, Hungarian, Italian, Norwegian, Portuguese, Russian, Spanish and Swedish; language names are case sensitive. Another important preprocessing step is stemming, which reduces words to their root form. In other words, this process removes suffixes from words so that words with a common origin are counted together. For example, stemming reduces the words "moving", "moved" and "movement" to the root word "move" (stemDocument relies on the SnowballC package).

Step 5: Making the word cloud using the structured form of the data.
wordcloud(mycorpus, min.freq = 3, max.words = Inf, random.order = FALSE, colors = pal)

Arguments of the word cloud generator function:
words: the words to be plotted
freq: their frequencies
min.freq: words with frequency below min.freq will not be plotted
max.words: maximum number of words to be plotted
random.order: plot words in random order; if FALSE, they are plotted in decreasing frequency
rot.per: proportion of words with 90-degree rotation (vertical text)
colors: colors for words from least to most frequent; use, for example, colors = "black" for a single color

Word cloud on Camaro tweets:

2. Classify Emotion and publish graph:

Classification of emotion in R is achieved with the classify_emotion function. This function analyzes text and classifies it into different types of emotion: anger, disgust, fear, joy, sadness, and surprise. For this, I am using a naive Bayes classifier trained on Carlo Strapparava and Alessandro Valitutti's emotions lexicon. Below is a detailed description of the steps involved in emotion classification:

Step 1: Setting up the working directory in RStudio

setwd("C:/Users/nagar/Desktop/Extras/R")

Step 2: Installing and loading the necessary packages. Below is the list of packages required for emotion classification. The functionality of each of these packages has been explained under Necessary packages.

library(twitteR)
library(ROAuth)
library(plyr)
library(dplyr)
library(stringr)
library(ggplot2)
library(httr)
library(wordcloud)
library(sentiment)

Step 3: Prepare the text for sentiment analysis.

Before starting the sentiment analysis, we need to keep our data file ready in the working directory. As an alternative, we can also connect directly to Twitter from R and use OAuth to authorize our credentials before proceeding with the analysis; the prerequisite for this is a Twitter account, a developer account, and the authentication keys needed to establish the connection. Here, I have downloaded the CSV data file and saved it in the working directory mentioned in the step above. The data file is loaded into a variable for further processing:

data <- read.csv("C:/Users/nagar/Desktop/Extras/R/Book1.csv")

Step 4: Perform sentiment analysis.

class_emo = classify_emotion(data$text, algorithm = "bayes", prior = 1.0)

Of the three parameters in the call above, the first is the text being classified (the text column of the data file, assuming the tweets are stored in a column named text); the second is the algorithm, a string indicating whether to use the naive Bayes algorithm or a simple voter algorithm, in our case "bayes"; the third is a numeric specifying the prior probability for the naive Bayes classifier.

emotion = class_emo[,7]

The emotion best fit is extracted with the line above. classify_emotion returns an object with seven columns and one row per document: a score for each of the six emotions (anger, disgust, fear, joy, sadness, surprise) plus the best fit.

emotion[is.na(emotion)] = "unknown"

Lastly, NAs are substituted with "unknown" in this step.

Step 5: Create and sort the data frame.
sent_df = data.frame(text=data$text, emotion=emotion, stringsAsFactors = FALSE)

The data.frame() function creates data frames: tightly coupled collections of variables that share many of the properties of matrices and of lists, used as the fundamental data structure by most of R's modeling software. The first argument here is the text from our data file and the second is the emotion analyzed in the steps above. stringsAsFactors = FALSE keeps the text columns as character vectors instead of converting them to factors. The data frame is stored in the variable sent_df, which is sorted by the command below.

sent_df = within(sent_df, emotion <- factor(emotion, levels = names(sort(table(emotion), decreasing = TRUE))))

Here we use the factor function to create a factor. The only required argument to factor is a vector of values, which will be returned as a vector of factor values. To change the order in which the levels are displayed from their default sorted order, the levels= argument can be given a vector of all possible values of the variable in the desired order; here the levels are ordered by decreasing frequency. The sorted data is stored back into sent_df.

Step 6: Plotting the statistics.

ggplot(sent_df, aes(x=emotion)) + geom_bar(aes(y=..count.., fill=emotion)) + scale_fill_brewer(palette = "Dark2") + labs(x="emotion categories", y="number of tweets", title="classification based on emotion")

ggplot is one of the most useful plotting functions in R, enabling a very wide range of plots. The aes() function takes name-value pairs mapping variables to aesthetics; in the current case, emotion is mapped to the x axis and to the fill color. The bar geom produces bar charts for categorical x. The fill colors can be changed with one of the scale functions, such as scale_fill_brewer, and labs() sets the axis and title labels. Below is the emotion classification plot for the Camaro tweets:
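The lexicon lookup behind classify_emotion can be illustrated with a minimal sketch in base R. The three-emotion lexicon below is a hypothetical toy, not the Strapparava-Valitutti lexicon that the sentiment package actually ships with, and classify_emotion_toy is an illustrative helper, not a package function:

```r
# Toy lexicon (hypothetical; the sentiment package uses the much larger
# Strapparava-Valitutti emotions lexicon).
lexicon <- list(
  joy     = c("love", "great", "awesome"),
  anger   = c("hate", "terrible", "awful"),
  sadness = c("sad", "disappointed", "miss")
)

# Score one tweet: strip punctuation, lower-case, count lexicon hits per
# emotion, and return the best fit, or "unknown" when nothing matches.
classify_emotion_toy <- function(text) {
  words  <- strsplit(tolower(gsub("[[:punct:]]", "", text)), "\\s+")[[1]]
  scores <- sapply(lexicon, function(w) sum(words %in% w))
  if (all(scores == 0)) "unknown" else names(which.max(scores))
}

classify_emotion_toy("I love the new Camaro, awesome engine")  # "joy"
classify_emotion_toy("Not sure about the price yet")           # "unknown"
```

The real classifier weights these counts by the naive Bayes prior and word probabilities rather than taking a raw maximum, but the lexicon-matching step is the same in spirit.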
3. Classify Polarity and publish graph:

A fundamental task in sentiment analysis is polarity detection: classifying the polarity of a given text as positive, negative or neutral. This approach uses a supervised learning algorithm to build a classifier that detects the polarity of textual data and classifies it as either positive or negative. It uses an opinionated dataset to train the classifier, data processing techniques to preprocess the textual data, and simple rules for categorizing text as positive or negative. I am using the naive Bayes classifier to classify the sentences as positive or negative. As the name suggests, this works by implementing a naive Bayes algorithm: it guesses whether a sentence is positive or negative by counting how many words the sentence has in each category and relating those counts to the probabilities of such words appearing in positive and negative sentences. The steps involved in polarity classification are similar to those for emotion classification, except that the classify_polarity function is used to classify positive and negative text, and the data frame is created and sorted on the polarity column instead. Below are the steps involved in polarity classification.

Step 1: Setting up the working directory in RStudio

setwd("C:/Users/nagar/Desktop/Extras/R")

Step 2: Installing and loading the necessary packages. Below is the list of packages required for polarity classification. The functionality of each of these packages has been explained under Necessary packages.

library(twitteR)
library(ROAuth)
library(plyr)
library(dplyr)
library(stringr)
library(ggplot2)
library(httr)
library(wordcloud)
library(sentiment)

Step 3: Prepare the text for sentiment analysis.

Before starting the sentiment analysis, we need to keep our data file ready in the working directory. As an alternative, we can also connect directly to Twitter from R and use OAuth to authorize our credentials before proceeding with the analysis; the prerequisite for this is a Twitter account, a developer account, and the authentication keys needed to establish the connection. Here, I have downloaded the CSV data file and saved it in the working directory mentioned in the step above. The data file is loaded into a variable for further processing.
data <- read.csv("C:/Users/nagar/Desktop/Extras/R/Book1.csv")

Step 4: Perform sentiment analysis.

class_pol = classify_polarity(data$text, algorithm = "bayes")

In contrast to the classification of emotions, the classify_polarity function classifies text as positive or negative. The classification can be done with a naive Bayes algorithm trained on Janyce Wiebe's subjectivity lexicon, or with a simple voter algorithm.

polarity = class_pol[,4]

The polarity best fit is extracted with the line above.

Step 5: Create and sort the data frame.

sent_df = data.frame(text=data$text, polarity=polarity, stringsAsFactors = FALSE)

As in the emotion classification, data.frame() builds a data frame from the text in our data file and the polarity analyzed in the steps above, with stringsAsFactors = FALSE keeping the text as character strings rather than factors. The data frame stored in sent_df is sorted by the command below.

sent_df1 = within(sent_df, polarity <- factor(polarity, levels = names(sort(table(polarity), decreasing = TRUE))))

Here, the factor function reorders the polarity levels by decreasing frequency, and the sorted data is stored in the variable sent_df1.

Step 6: Plotting the statistics.
ggplot(sent_df1, aes(x=polarity)) + geom_bar(aes(y=..count.., fill=polarity)) + scale_fill_brewer(palette = "Dark2") + labs(x="polarity categories", y="number of tweets", title="classification based on polarity")

The plotting call mirrors the emotion plot: aes() maps polarity to the x axis and to the fill color, the bar geom counts tweets per category, scale_fill_brewer selects the colors, and labs() sets the axis and title labels. Polarity graph of Camaro tweets:

Conclusions:

The insight from the word cloud of the Chevrolet Camaro data is that the highest frequency of talk concerns the engine, seat, speed, power, drive and a few other aspects captured in the cloud. A visual representation such as a word cloud tends to have an impact and generates interest amongst the audience. For further analysis, it may stimulate more questions than it answers, but that is a good entry point for discussion.

Based on the emotion classification, the joy category stands out by a big margin, followed by anger, surprise and sadness in relatively smaller proportions. Joy adds to the positive outlook for the car in the market; further analysis of this data can be carried out using machine learning techniques.

Based on the polarity classification, the highest proportion falls under the neutral category. It has been shown that classifiers such as Max Entropy and SVMs can benefit from the introduction of a neutral class and improve the overall accuracy of the classification. Another approach in this case is estimating a probability distribution over all categories. Since the data is clearly clustered into neutral, negative and positive language, it makes sense to filter the neutral language out and focus on the polarity between positive and negative sentiments. Open source software tools deploy machine learning, statistics, and natural language processing techniques to automate sentiment analysis on large collections of data similar to the Camaro review data.
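Filtering the neutral class out and comparing the remaining positive and negative shares takes only a few lines of base R. The polarity vector below is made-up example output standing in for the best-fit column that classify_polarity produces:

```r
# Made-up best-fit labels standing in for classify_polarity output.
polarity <- c("positive", "neutral", "negative", "neutral",
              "positive", "positive", "neutral", "negative")

# Drop the neutral class and compare the remaining positive/negative split.
polar_only <- polarity[polarity != "neutral"]
prop.table(table(polar_only))
# negative positive
#      0.4      0.6
```

On the real Camaro data the same two lines, applied to the polarity column of sent_df, give the positive-versus-negative balance once the dominant neutral class is set aside.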
As I mentioned in my previous deliverables, this was a great learning process from an R programming perspective. I am glad that I could pull together all the important aspects of sentiment analysis from several different areas into one unified program. I believe sentiment analysis is an evolving field with a wide variety of applications. Although sentiment analysis tasks are challenging due to their natural language processing origins, much progress has been made over the last few years thanks to the high demand for it. Sentiment analysis of microblogging has shown that Twitter can be seen as a valid online indicator of political sentiment: the political sentiment of tweets corresponds closely to parties' and politicians' political positions, indicating that the content of Twitter messages plausibly reflects the offline political landscape.

References:
 Sentiment Analysis and Opinion Mining by Bing Liu (http://www.cs.uic.edu/~liub/FBS/SentimentAnalysisandOpinionMining.html)
 Sentiment Analysis by Professor Dan Jurafsky (https://web.standford.edu/class/cs124/lec/sentiment.pdf)
 S. Blair-Goldensohn, Hannan, McDonald, Neylon, Reis and Reynar (2008), "Building a Sentiment Summarizer for Local Service Reviews" (http://www.ryanmcd.com/papers/local_service_summ.pdf)
 S. Asur et al., "Predicting the Future With Social Media", arXiv:1003.5699.
 R. Sharda et al., "Forecasting Box-Office Receipts of Motion Pictures Using Neural Networks", CiteSeerX, 2002.
 http://www.businessinsider.com/apple-and-samsung-just-revealed-their-exact-us-sales-figuresfor-the-first-ever-time2012-8
 https://www.coursera.org/learn/r-programming
 http://www.bigdatanews.com/profiles/blogs/learn-everything-about-sentiment-analysis-using-r
 Koweika A., Gupta A., Sondhi K. (2013), "Sentiment analysis for social media", International Journal of Advanced Research in Computer Science and Software Engineering.
 Younggue B., Hongchul L. (2012), "Sentiment analysis of Twitter audience: Measuring the positive or negative influence", Journal of the American Society for Information Science and Technology.
 http://stackoverflow.com/questions/10233087/sentiment-analysis-using-r
 https://rpubs.com/cen0te/naivebayes-sentimentpolarity
 https://www.youtube.com/watch?v=oXZThwEF4r0