Introduction to R for Data Science
Lecturers
dipl. ing Branko Kovač
Data Analyst at CUBE / Data Science Mentor at Springboard
Data Science Serbia
branko.kovac@gmail.com
dr Goran S. Milovanović
Data Scientist at DiploFoundation
Data Science Serbia
goran.s.milovanovic@gmail.com
goranm@diplomacy.edu
Text-Mining in R + Binomial Logistic Regression
• Part 1: Basics of Text-mining in R with the tm package
• Web-scraping w. tm.plugin.webmining
• tm package corpora structures
• Metadata and content
• Text transformations
• Term-Document Matrix extraction with tf-idf weighting
• Part 2: Introduction to Binomial Logistic Regression
• Binomial Logistic Regression from Generalized Linear Models with glm() in R
• Basic model assessment
Intro to R for Data Science
Session 8: Intro to text-mining in R + Binomial Logistic Regression
########################################################
# Introduction to R for Data Science
# SESSION 8 :: 16 June, 2016
# Binomial Logistic Regression + Intro to Text Mining in R
# Data Science Community Serbia + Startit
# :: Goran S. Milovanović and Branko Kovač ::
########################################################
# libraries
library(tm)
library(tm.plugin.webmining)
# source: Google Finance
# search queries: .com vs. hardware companies
searchQueries <-
list(c('NASDAQ:GOOGL','NASDAQ:AMZN','NASDAQ:JD','NASDAQ:FB','NYSE:BABA'),
c('NYSE:HPQ','NASDAQ:AAPL','KRX:005930','TPE:2354','NYSE:IBM'));
Intro to Text-Mining in R: tm + tm.plugin.webmining packages
Intro to R for Data Science
Session 8: Intro to text-mining in R + Binomial Logistic Regression
########################################################
# Introduction to R for Data Science
# SESSION 8 :: 16 June, 2016
# Binomial Logistic Regression + Intro to Text Mining in R
# Data Science Community Serbia + Startit
# :: Goran S. Milovanović and Branko Kovač ::
########################################################
# retrieve w. tm.plugin.webmining
# retrieve news for dotCom
dotCom <- lapply(searchQueries[[1]], function(x) {
WebCorpus(GoogleFinanceSource(x))
});
# now retrieve news for hardware companies
hardware <- lapply(searchQueries[[2]], function(x) {
WebCorpus(GoogleFinanceSource(x))
});
Retrieving GoogleFinance from tm.plugin.webmining
Intro to R for Data Science
Session 8: Intro to text-mining in R + Binomial Logistic Regression
# source: Google News
searchQueriesNews <- list(c("Google", "Amazon", "JD.com", "Facebook", "Alibaba"),
c("Hewlett Packard", "Apple", "Samsung", "Foxconn", "IBM"))
# retrieve news for dotCom companies
dotComNews <- lapply(searchQueriesNews[[1]], function(x) {
googleNewsSRC <- GoogleNewsSource(x,
params = list(hl = "en",
q = x,
ie = "utf-8",
num = 30,
output = "rss"))
WebCorpus(googleNewsSRC)
});
Retrieving GoogleNews from tm.plugin.webmining
Intro to R for Data Science
Session 8: Intro to text-mining in R + Binomial Logistic Regression
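Note: the hardwareNews corpus used below is retrieved on a slide not reproduced here; presumably it mirrors the dotComNews call. A minimal sketch, assuming the same GoogleNewsSource parameters:
# retrieve news for hardware companies (sketch; mirrors dotComNews above)
hardwareNews <- lapply(searchQueriesNews[[2]], function(x) {
  googleNewsSRC <- GoogleNewsSource(x,
                                    params = list(hl = "en",
                                                  q = x,
                                                  ie = "utf-8",
                                                  num = 30,
                                                  output = "rss"))
  WebCorpus(googleNewsSRC)
});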
# tm corpora are lists
# a tm corpus (WebCorpus, more specifically)
dotCom[[1]]
# a PlainTextDocument in the first dotCom corpus
class(dotCom[[1]][[1]])
dotCom[[1]][[1]]
# document metadata
dotCom[[1]][[1]]$meta
# document content
dotCom[[1]][[1]]$content
# let's add another tag to the document metadata structure: dotCom
dotCom <- lapply(dotCom, function(x) {
x <- tm_map(x, function(doc) { # tm_map works over tm corpora, similarly to lapply
meta(doc, "category") <- "dotCom"
return(doc)
})
})
dotCom[[1]][[1]]$meta
tm corpora: metadata and content
Intro to R for Data Science
Session 8: Intro to text-mining in R + Binomial Logistic Regression
# add a category tag to the document metadata structure: dotComNews
dotComNews <- lapply(dotComNews, function(x) {
x <- tm_map(x, function(doc) {
meta(doc, "category") <- "dotCom"
return(doc)
})
})
dotComNews[[1]][[1]]$meta
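The analogous tagging of the hardware corpora is not shown on this slide; presumably it repeats the same pattern with the "hardware" label. A sketch (tagCategory is our illustrative helper, not a {tm} function):
# tag the hardware corpora with category = "hardware" (sketch)
tagCategory <- function(corpus, label) {
  tm_map(corpus, function(doc) {
    meta(doc, "category") <- label
    return(doc)
  })
}
hardware <- lapply(hardware, tagCategory, label = "hardware")
hardwareNews <- lapply(hardwareNews, tagCategory, label = "hardware")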
Combining tm corpora
Intro to R for Data Science
Session 8: Intro to text-mining in R + Binomial Logistic Regression
# combine corpora w. do.call() and c()
dotCom <- do.call(c, dotCom) # do.call comes in handy; similar to lapply
# Learn more about do.call: http://www.stat.berkeley.edu/~s133/Docall.html
hardware <- do.call(c, hardware)
dotComNews <- do.call(c, dotComNews)
hardwareNews <- do.call(c, hardwareNews)
workCorpus <- c(dotCom, dotComNews, hardware, hardwareNews)
#### Part II Text Preprocessing
### text pre-processing
workCorpus[[1]]$content # we need to clean up the docs
# reminder: https://stat.ethz.ch/R-manual/R-devel/library/base/html/regex.html
# Regex in R
removeSpecial <- function(x) {
# replacing w. space might turn out to be handy
x$content <- gsub("\t|\r|\n", " ", x$content) # tabs, carriage returns, newlines
return(x)
}
# example:
cleanDoc <- removeSpecial(workCorpus[[1]])
cleanDoc$content
# removeSpecial with tm_map {tm}
workCorpus <- tm_map(workCorpus, removeSpecial)
workCorpus[[1]]$content
Text Preprocessing w. tm
Intro to R for Data Science
Session 8: Intro to text-mining in R + Binomial Logistic Regression
#### {tm} by the book text pre-processing
# remove punctuation
workCorpus <- tm_map(workCorpus, removePunctuation) # built in
workCorpus[[1]]$content
# remove numbers
workCorpus <- tm_map(workCorpus, removeNumbers) # built in
workCorpus[[1]]$content
# all characters tolower
toLower <- function(x) {
x$content <- tolower(x$content)
return(x)
}
workCorpus <- tm_map(workCorpus, toLower)
workCorpus[[1]]$content
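Note: instead of a hand-rolled toLower, {tm} also ships content_transformer() for wrapping base functions; an equivalent call would be:
# equivalent, using tm's content_transformer wrapper
workCorpus <- tm_map(workCorpus, content_transformer(tolower))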
# remove stop words
workCorpus <- tm_map(workCorpus, removeWords, stopwords("english")) # built in
workCorpus[[1]]$content
Typical preprocessing steps
Intro to R for Data Science
Session 8: Intro to text-mining in R + Binomial Logistic Regression
# stemming
# see: http://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html
library(SnowballC) # nice stemming algorithm (Porter, 1980)
# see: https://cran.r-project.org/web/packages/SnowballC/index.html
workCorpus <- tm_map(workCorpus, stemDocument)
workCorpus[[1]]$content
# remove potentially dangerous predictors: company names
# Why? Why? (In a nutshell: we are trying to discover something here,
# not to 'confirm' our potentially redundant knowledge)
predWords <- tolower(c("Alphabet", "Google", "Amazon", "JD.com", "Facebook", "Alibaba",
"HP", "Hewlett Packard", "Apple", "Samsung", "Foxconn", "IBM"))
predWordsRegex <- paste(stemDocument(predWords),collapse="|");
replacePreds <- function(x) {
x$content <- gsub(predWordsRegex,
" ",
x$content)
return(x)
}
workCorpus <- tm_map(workCorpus, replacePreds)
workCorpus[[1]]$content
Stemming
Intro to R for Data Science
Session 8: Intro to text-mining in R + Binomial Logistic Regression
#### Part III Feature Selection: Term-Document Frequency Matrix (TDM)
# TDM
# rows = terms; columns = docs
# we will use the tf-idf weighting
# see: https://en.wikipedia.org/wiki/Tf%E2%80%93idf
dT <- TermDocumentMatrix(workCorpus,
control = list(tolower = FALSE,
wordLengths = c(3, Inf),
weighting = weightTfIdf)
)
# dT is a *sparse matrix*; maybe you want to learn more about the R {slam} package
# https://cran.r-project.org/web/packages/slam/index.html
class(dT)
dT$i # rows w. non-zero entries
dT$j # columns w. non-zero entries
dT$v # non-zero entry in [i,j]
dT$nrow # "true" row dimension
dT$ncol # "true" column dimension
dT$dimnames$Terms # self-explanatory
dT$dimnames$Docs # self-explanatory
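For a quick dense view of a corner of this sparse matrix, {tm}'s inspect() accepts sub-matrices; for example:
inspect(dT[1:5, 1:5]) # first 5 terms x first 5 docs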
Term-Document Matrix from tm: tf-idf weighting
Intro to R for Data Science
Session 8: Intro to text-mining in R + Binomial Logistic Regression
# remove sparse terms
docTerm <- removeSparseTerms(dT, sparse = .75) # sparse is in (0,1]
dim(docTerm)
docTerm$dimnames$Terms
# see: [1] http://www.inside-r.org/packages/cran/tm/docs/removeSparseTerms
# A term-document matrix where those terms from x are removed which have
# at least a sparse percentage of empty (i.e., terms occurring 0 times in a document)
# elements. [1]
# NOTE: **very important**; in a real-world application
# you would probably need to run many models with features obtained from various levels of sparsity
# and perform model selection.
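As a first step toward that model selection, one could sweep the sparse threshold and count the surviving terms; a minimal sketch:
# number of retained terms at several sparsity thresholds (sketch)
sapply(c(.5, .75, .9, .99), function(s) {
  dim(removeSparseTerms(dT, sparse = s))[1]
})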
Sparse terms
Intro to R for Data Science
Session 8: Intro to text-mining in R + Binomial Logistic Regression
# as.matrix
docTerm <- as.matrix(docTerm)
colnames(docTerm) <- paste0("d", 1:ncol(docTerm))
# pick some words w
w <- sample(1:nrow(docTerm), 4, replace = F)
par(mfcol = c(2,2))
for (i in 1:length(w)) {
  hist(docTerm[w[i],],
       main = paste0("Distribution of '", rownames(docTerm)[w[i]], "'"),
       cex.main = .85,
       xlab = "Tf-Idf",
       ylab = "Count"
  )
} # quite interesting, isn't it? - How do you perform regression with these?
The distribution of tf-idf scores
Intro to R for Data Science
Session 8: Intro to text-mining in R + Binomial Logistic Regression
# How well can the selected words predict the document category?
# How to relate continuous, non-normally distributed predictors to a categorical outcome?
# Idea: Logistic function
par(mfcol=c(1,1))
logistic <- function(t) {exp(t)/(1+exp(t))}
curve(logistic, from = -10, to = 10, n = 1000,
main="Logistic Function",
cex.main = .85,
xlab = "t",
ylab = "Logistic(t)",
cex.lab = .85)
# and then let t be b0 + b1*x1 + b2*x2 + ... + bn*xn
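The logistic function is the inverse of the logit (log-odds) used for the coefficients later on; a quick round-trip makes the link explicit:
logit <- function(p) {log(p/(1-p))} # log-odds
logistic(logit(.8)) # recovers .8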
Logistic function
Intro to R for Data Science
Session 8: Intro to text-mining in R + Binomial Logistic Regression
# prepare data set
dataSet <- data.frame(t(docTerm))
dataSet$Category <- as.character(meta(workCorpus, tag="category"))
table(dataSet$Category)
# does every word play at least some role in each category?
w1 <- which(colSums(dataSet[which(dataSet$Category == "dotCom"), 1:(dim(dataSet)[2]-1)])==0)
colnames(dataSet)[w1]
w2 <- which(colSums(dataSet[which(dataSet$Category == "hardware"), 1:(dim(dataSet)[2]-1)])==0)
colnames(dataSet)[w2]
# recode category
library(plyr)
dataSet$Category <- as.numeric(revalue(dataSet$Category, c("dotCom"=1, "hardware"=0)))
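An equivalent base-R recoding, avoiding the {plyr} dependency, would be:
# equivalent without plyr
dataSet$Category <- ifelse(dataSet$Category == "dotCom", 1, 0)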
Prepare Binomial Logistic Model
Intro to R for Data Science
Session 8: Intro to text-mining in R + Binomial Logistic Regression
#### Logistic Regression
# Binomial Logistic Regression: use glm w. logit link (link='logit' is default)
bLRmodel <- glm(Category ~.,
family=binomial(link='logit'),
control = list(maxit = 500),
data=dataSet)
sumLR <- summary(bLRmodel)
sumLR
# Coefficients
# NOTE: coefficients here relate a unit change in the predictor to the logit[P(Outcome)]
# logit(p) = log(p/(1-p)) - also known as log-odds
sumLR$coefficients
class(sumLR$coefficients)
coefLR <- as.data.frame(sumLR$coefficients)
# Wald statistics significant? (this Wald z is normally distributed)
coefLR <- coefLR[order(-coefLR$Estimate), ]
w <- which((coefLR$`Pr(>|z|)` < .05)&(!(rownames(coefLR) == "(Intercept)")))
# which predictors worked?
rownames(coefLR)[w]
# NOTE: Wald statistic (z) is dangerous: as the coefficient gets higher, its standard error
# inflates thus underestimating z; Beware of z...
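Given that caveat, a common alternative is a per-term likelihood-ratio test via stats::drop1(); note it refits the model once per term, so it is slow with this many predictors:
# likelihood-ratio test per predictor (slow here; illustrative)
drop1(bLRmodel, test = "LRT")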
Binomial Logistic Regression
Intro to R for Data Science
Session 8: Intro to text-mining in R + Binomial Logistic Regression
# plot coefficients {ggplot2}
library(ggplot2)
plotFrame <- coefLR[w,]
plotFrame$Estimate <- round(plotFrame$Estimate,2)
plotFrame$Features <- rownames(plotFrame)
plotFrame <- plotFrame[order(-plotFrame$Estimate), ]
plotFrame$Features <- factor(plotFrame$Features,
                             levels = plotFrame$Features,
                             ordered = T)
ggplot(data = plotFrame, aes(x = Features, y = Estimate)) +
  geom_line(group = 1) +
  geom_point(color = "red", size = 2.5) +
  geom_point(color = "white", size = 2) +
  xlab("Features") + ylab("Regression Coefficients") +
  ggtitle("Logistic Regression: Coefficients (sig. Wald test)") +
  theme(axis.text.x = element_text(angle = 90))
Coefficients
Intro to R for Data Science
Session 8: Intro to text-mining in R + Binomial Logistic Regression
# fitted probabilities
fitted(bLRmodel)
hist(fitted(bLRmodel),50)
plot(density(fitted(bLRmodel)),
main = "Predicted Probabilities: Density")
polygon(density(fitted(bLRmodel)),
col="red",
border="black")
Fitted probabilities
Intro to R for Data Science
Session 8: Intro to text-mining in R + Binomial Logistic Regression
# coefficients related to odds (and not log-odds): simply exponentiate
# NOTE: w indexes the *sorted* coefLR, so select by name, not by position
exp(bLRmodel$coefficients[rownames(coefLR)[w]])
# check max
max(exp(bLRmodel$coefficients[rownames(coefLR)[w]])) # huge? why? - think!
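To attach rough uncertainty to those odds ratios, one can exponentiate Wald-type intervals from confint.default() (Wald, not profile-likelihood, intervals); a sketch:
# odds ratios w. Wald 95% confidence intervals for the significant predictors
exp(cbind(OR = coef(bLRmodel), confint.default(bLRmodel)))[rownames(coefLR)[w], ]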
#### Reminder: Maximum Likelihood Estimation
normData <- rnorm(10000, mean = 5.75, sd = 1.25)
normLogLike <- function(params,x) {
mean <- params[1]
sd <- abs(params[2]) # dnorm generates NaNs if sd < 0
dens <- dnorm(x,mean,sd)
w <- which(dens==0)
dens[w] <- .Machine$double.xmin
return(-(sum(log(dens)))) # negative logLike, for minimization w. optim()
}
# ML estimation
# random initial values
startMean <- runif(1,-100,100)
startSd <- runif(1,-100,100)
mlFit <- optim(c(startMean, startSd),
fn = normLogLike,
x=normData,
control = list(maxit = 50000))
# ML estimates
mlFit$par
# cmp. true parameters: mean = 5.75, sd = 1.25
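For a normal sample the ML estimates also have a closed form (the sample mean, and the sd with an n denominator), which gives a quick sanity check on optim():
# closed-form ML estimates for comparison
c(mean(normData), sqrt(mean((normData - mean(normData))^2)))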
Reminder: Maximum-Likelihood Estimation
Intro to R for Data Science
Session 8: Intro to text-mining in R + Binomial Logistic Regression
# model Chi-Square
CSq <- bLRmodel$null.deviance - bLRmodel$deviance # this difference ~ Chi-Square Distribution
CSq
dfCSq <- bLRmodel$df.null - bLRmodel$df.residual # null - residual (model) degrees of freedom
dfCSq
# Chi-Square significance test in R:
pCSq = 1-pchisq(CSq, dfCSq) # 1 - c.d.f. = P(Chi-Square larger than this obtained by chance)
pCSq
# AIC = Akaike information criterion (-2*LogLikelihood+2*k, k = num.parameters)
bLRmodel$aic
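A related one-number summary is McFadden's pseudo-R², built from the same two deviances:
# McFadden's pseudo R-squared
1 - bLRmodel$deviance/bLRmodel$null.deviance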
Χ² and Akaike Information Criterion
Intro to R for Data Science
Session 8: Intro to text-mining in R + Binomial Logistic Regression
#### w. training vs. test data set
# split into test and training
dim(dataSet)
choice <- sample(1:475,250,replace = F)
test <- which(!(c(1:475) %in% choice))
trainData <- dataSet[choice,]
newData <- dataSet[test,]
# check!
sum(dataSet$Category[choice])/length(choice) # proportion of dotCom in training
sum(dataSet$Category[test])/length(test) # proportion of dotCom in test
# Binomial Logistic Regression: use glm w. logit link
bLRmodel <- glm(Category ~.,
family=binomial(link='logit'),
control = list(maxit = 500),
data=trainData)
sumLR <- summary(bLRmodel)
sumLR
Training and test data
Intro to R for Data Science
Session 8: Intro to text-mining in R + Binomial Logistic Regression
# fitted probabilities
fitted(bLRmodel)
hist(fitted(bLRmodel),50)
plot(density(fitted(bLRmodel)),
main = "Predicted Probabilities: Density")
polygon(density(fitted(bLRmodel)),
col="red",
border="black")
# Prediction from the model
predictions <- predict(bLRmodel,
newdata=newData,
type='response')
predictions <- ifelse(predictions >= 0.5,1,0)
trueCategory <- newData$Category
meanClasError <- mean(predictions != trueCategory)
accuracy <- 1-meanClasError
accuracy # probably rather poor..? - Why? - Think!
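Accuracy alone hides where the model fails; a confusion matrix splits the errors by class:
# confusion matrix on the test set
table(predicted = predictions, actual = trueCategory)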
Predictions from the Binomial Logistic Regression Model
Intro to R for Data Science
Session 8: Intro to text-mining in R + Binomial Logistic Regression
Try to train a binomial regression model many times by randomly assigning
documents to the training and test data set. What happens? Why?
*Look* at your data set and *think* about it before actually modeling it.
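A minimal sketch of that experiment, assuming the dataSet built above (expect glm() warnings about separation on some splits):
# repeat the random split and look at the spread of test accuracy (sketch)
accs <- replicate(50, {
  choice <- sample(1:dim(dataSet)[1], 250, replace = F)
  fit <- glm(Category ~ ., family = binomial(link = 'logit'),
             control = list(maxit = 500), data = dataSet[choice, ])
  preds <- ifelse(predict(fit, newdata = dataSet[-choice, ], type = 'response') >= 0.5, 1, 0)
  mean(preds == dataSet$Category[-choice])
})
hist(accs, 20, main = "Test accuracy across random splits")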