Introduction to R for Data Science
Lecturers
dipl. ing Branko Kovač
Data Analyst at CUBE/Data Science Mentor
at Springboard
Data Science Serbia
branko.kovac@gmail.com
dr Goran S. Milovanović
Data Scientist at DiploFoundation
Data Science Serbia
goran.s.milovanovic@gmail.com
goranm@diplomacy.edu
Text-Mining in R + Binomial Logistic Regression
• Part 1: Basics of Text-mining in R with the tm package
• Web-scraping w. tm.plugin.webmining
• tm package corpora structures
• Metadata and content
• Text transformations
• Term-Document Matrix extraction with tf-idf weighting
• Part 2: Introduction to Binomial Logistic Regression
• Binomial Logistic Regression from Generalized Linear Models with glm() in R
• Basic model assessment
Intro to R for Data Science
Session 8: Intro to text-mining in R + Binomial Logistic Regression
########################################################
# Introduction to R for Data Science
# SESSION 8 :: 16 June, 2016
# Binomial Logistic Regression + Intro to Text Mining in R
# Data Science Community Serbia + Startit
# :: Goran S. Milovanović and Branko Kovač ::
########################################################
# libraries
library(tm)
library(tm.plugin.webmining)
# source: Google Finance
# search queries: .com vs. hardware companies
searchQueries <-
list(c('NASDAQ:GOOGL','NASDAQ:AMZN','NASDAQ:JD','NASDAQ:FB','NYSE:BABA'),
c('NYSE:HPQ','NASDAQ:AAPL','KRX:005930','TPE:2354','NYSE:IBM'));
Intro to Text-Mining in R: tm + tm.plugin.webmining packages
# retrieve w. tm.plugin.webmining
# retrieve news for dotCom
dotCom <- lapply(searchQueries[[1]], function(x) {
WebCorpus(GoogleFinanceSource(x))
});
# now retrieve news for hardware companies
hardware <- lapply(searchQueries[[2]], function(x) {
WebCorpus(GoogleFinanceSource(x))
});
Retrieving GoogleFinance from tm.plugin.webmining
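# Note: live web retrieval can fail intermittently; a defensive variant of
# the calls above (a sketch, not on the original slides):
dotCom <- lapply(searchQueries[[1]], function(x) {
tryCatch(WebCorpus(GoogleFinanceSource(x)),
error = function(e) NULL)
})
dotCom <- Filter(Negate(is.null), dotCom) # drop the queries that failed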
# source: Google News
searchQueriesNews <- list(c("Google", "Amazon", "JD.com", "Facebook", "Alibaba"),
c("Hewlett Packard", "Apple", "Samsung", "Foxconn", "IBM"))
# retrieve news for dotCom companies
dotComNews <- lapply(searchQueriesNews[[1]], function(x) {
googleNewsSRC <- GoogleNewsSource(x,
params = list(hl = "en",
q = x,
ie = "utf-8",
num = 30,
output = "rss"))
WebCorpus(googleNewsSRC)
});
Retrieving GoogleNews from tm.plugin.webmining
# tm corpora are lists
# a tm corpus (WebCorpus, more specifically)
dotCom[[1]]
# a PlainTextDocument in the first dotCom corpus
class(dotCom[[1]][[1]])
dotCom[[1]][[1]]
# document metadata
dotCom[[1]][[1]]$meta
# document content
dotCom[[1]][[1]]$content
# let's add another tag to the document metadata structure: dotCom
dotCom <- lapply(dotCom, function(x) {
x <- tm_map(x, function(doc) { # tm_map works over tm corpora, similarly to lapply
meta(doc, "category") <- "dotCom"
return(doc)
})
})
dotCom[[1]][[1]]$meta
tm corpora: metadata and content
# add a category tag to the document metadata structure: dotComNews
dotComNews <- lapply(dotComNews, function(x) {
x <- tm_map(x, function(doc) {
meta(doc, "category") <- "dotCom"
return(doc)
})
})
dotComNews[[1]][[1]]$meta
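# The slides do not show the matching steps for the hardware side; by
# symmetry with the dotCom code above, they would look like this (a sketch):
hardwareNews <- lapply(searchQueriesNews[[2]], function(x) {
WebCorpus(GoogleNewsSource(x,
params = list(hl = "en", q = x, ie = "utf-8",
num = 30, output = "rss")))
});
tagHardware <- function(x) {
tm_map(x, function(doc) {
meta(doc, "category") <- "hardware"
return(doc)
})
}
hardware <- lapply(hardware, tagHardware)
hardwareNews <- lapply(hardwareNews, tagHardware)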
Combining tm corpora
# combine corpora w. do.call() and c()
dotCom <- do.call(c, dotCom) # do.call comes in handy; similar to lapply
# Learn more about do.call: http://www.stat.berkeley.edu/~s133/Docall.html
hardware <- do.call(c, hardware)
dotComNews <- do.call(c, dotComNews)
hardwareNews <- do.call(c, hardwareNews)
workCorpus <- c(dotCom, dotComNews, hardware, hardwareNews)
#### Part II Text Preprocessing
### text pre-processing
workCorpus[[1]]$content # we need to clean up the docs
# reminder: https://stat.ethz.ch/R-manual/R-devel/library/base/html/regex.html
# Regex in R
removeSpecial <- function(x) {
# replacing control characters (\t, \r, \n) w. space might turn out to be handy
x$content <- gsub("\t|\r|\n", " ", x$content)
return(x)
}
# example:
cleanDoc <- removeSpecial(workCorpus[[1]])
cleanDoc$content
# removeSpecial with tm_map {tm}
workCorpus <- tm_map(workCorpus, removeSpecial)
workCorpus[[1]]$content
Text Preprocessing w. tm
#### {tm} by the book text pre-processing
# remove punctuation
workCorpus <- tm_map(workCorpus, removePunctuation) # built in
workCorpus[[1]]$content
# remove numbers
workCorpus <- tm_map(workCorpus, removeNumbers) # built in
workCorpus[[1]]$content
# all characters tolower
toLower <- function(x) {
x$content <- tolower(x$content)
return(x)
}
workCorpus <- tm_map(workCorpus, toLower)
workCorpus[[1]]$content
# remove stop words
workCorpus <- tm_map(workCorpus, removeWords, stopwords("english")) # built in
workCorpus[[1]]$content
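# removing words leaves runs of blanks behind; collapsing them with the
# built-in stripWhitespace is a common (optional) extra step:
workCorpus <- tm_map(workCorpus, stripWhitespace)
workCorpus[[1]]$content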
Typical preprocessing steps
# stemming
# see: http://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html
library(SnowballC) # nice stemming algorithm (Porter, 1980)
# see: https://cran.r-project.org/web/packages/SnowballC/index.html
workCorpus <- tm_map(workCorpus, stemDocument)
workCorpus[[1]]$content
# remove potentially dangerous predictors: company names
# Why? Why? (In a nutshell: we are trying to discover something here,
# not to 'confirm' our potentially redundant knowledge)
predWords <- tolower(c("Alphabet", "Google", "Amazon", "JD.com", "Facebook", "Alibaba",
"HP", "Hewlett Packard", "Apple", "Samsung", "Foxconn", "IBM"))
predWordsRegex <- paste(stemDocument(predWords),collapse="|");
replacePreds <- function(x) {
x$content <- gsub(predWordsRegex,
" ",
x$content)
return(x)
}
workCorpus <- tm_map(workCorpus, replacePreds)
workCorpus[[1]]$content
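# Caveat (not in the original code): the plain alternation above also matches
# inside longer stems (e.g. 'appl' inside 'applic'); anchoring the pattern on
# word boundaries would be safer:
# predWordsRegex <- paste0("\\b(", paste(stemDocument(predWords), collapse = "|"), ")\\b")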
Stemming
#### Part III Feature Selection: Term-Document Frequency Matrix (TDM)
# TDM
# rows = terms; columns = docs
# we will use the tf-idf weighting
# see: https://en.wikipedia.org/wiki/Tf%E2%80%93idf
dT <- TermDocumentMatrix(workCorpus,
control = list(tolower = FALSE,
wordLengths = c(3, Inf),
weighting = weightTfIdf)
)
# dT is a *sparse matrix*; maybe you want to learn more about the R {slam} package
# https://cran.r-project.org/web/packages/slam/index.html
class(dT)
dT$i # rows w. non-zero entries
dT$j # columns w. non-zero entries
dT$v # non-zero entry in [i,j]
dT$nrow # "true" row dimension
dT$ncol # "true" column dimension
dT$dimnames$Terms # self-explanatory
dT$dimnames$Docs # self-explanatory
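# For intuition, here is tf-idf computed by hand on a toy count matrix
# ('counts' is hypothetical; {tm}'s weightTfIdf follows the same idea, with
# log2 and length-normalized term frequencies):
counts <- matrix(c(2, 0, 1, 3, 1, 0), nrow = 3,
dimnames = list(c("cloud", "chip", "stock"), c("d1", "d2")))
tf <- t(t(counts) / colSums(counts)) # term frequency, normalized per document
idf <- log2(ncol(counts) / rowSums(counts > 0)) # inverse document frequency
tfIdfByHand <- tf * idf # each term's row scaled by that term's idf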
Term-Document Matrix from tm: tf-idf weighting
# remove sparse terms
docTerm <- removeSparseTerms(dT, sparse = .75) # sparse is in (0, 1)
dim(docTerm)
docTerm$dimnames$Terms
# see: [1] http://www.inside-r.org/packages/cran/tm/docs/removeSparseTerms
# A term-document matrix where those terms from x are removed which have
# at least a sparse percentage of empty (i.e., terms occurring 0 times in a document)
# elements. [1]
# NOTE: **very important**; in a real-world application you would probably
# need to run many models with features obtained from various levels of
# sparsity and perform model selection.
Sparse terms
# as.matrix
docTerm <- as.matrix(docTerm)
colnames(docTerm) <- paste0("d", seq_len(ncol(docTerm)))
# pick some words at random (rows of docTerm are terms)
w <- sample(1:nrow(docTerm), 4, replace = F)
par(mfcol = c(2,2))
for (i in 1:length(w)) {
  hist(docTerm[w[i], ],
       main = paste0("Distribution of '", rownames(docTerm)[w[i]], "'"),
       cex.main = .85,
       xlab = "Tf-Idf",
       ylab = "Count"
  )
} # quite interesting, isn't it? - How do you perform regression with these?
The distribution of tf-idf scores
# How well can the selected words predict the document category?
# How to relate continuous, non-normally distributed predictors to a categorical outcome?
# Idea: Logistic function
par(mfcol = c(1,1))
logistic <- function(t) {exp(t)/(1+exp(t))}
curve(logistic, from = -10, to = 10, n = 1000,
      main = "Logistic Function",
      cex.main = .85,
      xlab = "t",
      ylab = "Logistic(t)",
      cex.lab = .85)
# and then let t be b0 + b1*x1 + b2*x2 + ... + bn*xn
Logistic function
# prepare data set
dataSet <- data.frame(t(docTerm))
dataSet$Category <- as.character(meta(workCorpus, tag="category"))
table(dataSet$Category)
# does every word play at least some role in each category?
w1 <- which(colSums(dataSet[which(dataSet$Category == "dotCom"), 1:(dim(dataSet)[2]-1)])==0)
colnames(dataSet)[w1]
w2 <- which(colSums(dataSet[which(dataSet$Category == "hardware"), 1:(dim(dataSet)[2]-1)])==0)
colnames(dataSet)[w2]
# recode category
library(plyr)
dataSet$Category <- as.numeric(revalue(dataSet$Category, c("dotCom"=1, "hardware"=0)))
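# an equivalent base-R recode (in place of the revalue() call above):
# dataSet$Category <- ifelse(dataSet$Category == "dotCom", 1, 0)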
Prepare Binomial Logistic Model
#### Logistic Regression
# Binomial Logistic Regression: use glm w. logit link (link='logit' is default)
bLRmodel <- glm(Category ~.,
family=binomial(link='logit'),
control = list(maxit = 500),
data=dataSet)
sumLR <- summary(bLRmodel)
sumLR
# Coefficients
# NOTE: coefficients here relate a unit change in the predictor to the logit[P(Outcome)]
# logit(p) = log(p/(1-p)) - also known as log-odds
sumLR$coefficients
class(sumLR$coefficients)
coefLR <- as.data.frame(sumLR$coefficients)
# Wald statistics significant? (this Wald z is normally distributed)
coefLR <- coefLR[order(-coefLR$Estimate), ]
w <- which((coefLR$`Pr(>|z|)` < .05)&(!(rownames(coefLR) == "(Intercept)")))
# which predictors worked?
rownames(coefLR)[w]
# NOTE: the Wald statistic (z) is dangerous: as the coefficient gets higher, its standard
# error inflates, thus underestimating z; beware of z...
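# A per-predictor likelihood-ratio test is a more reliable alternative to
# the Wald z, at the cost of refitting the model once per feature
# (can be slow with this many terms):
drop1(bLRmodel, test = "LRT")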
Binomial Logistic Regression
# plot coefficients {ggplot2}
library(ggplot2)
plotFrame <- coefLR[w, ]
plotFrame$Estimate <- round(plotFrame$Estimate, 2)
plotFrame$Features <- rownames(plotFrame)
plotFrame <- plotFrame[order(-plotFrame$Estimate), ]
plotFrame$Features <- factor(plotFrame$Features,
                             levels = plotFrame$Features,
                             ordered = T)
ggplot(data = plotFrame, aes(x = Features, y = Estimate)) +
  geom_line(group = 1) +
  geom_point(color = "red", size = 2.5) +
  geom_point(color = "white", size = 2) +
  xlab("Features") + ylab("Regression Coefficients") +
  ggtitle("Logistic Regression: Coefficients (sig. Wald test)") +
  theme(axis.text.x = element_text(angle = 90))
Coefficients
# fitted probabilities
fitted(bLRmodel)
hist(fitted(bLRmodel),50)
plot(density(fitted(bLRmodel)),
main = "Predicted Probabilities: Density")
polygon(density(fitted(bLRmodel)),
col="red",
border="black")
Fitted probabilities
# coefficients related to odds (and not log-odds): simply
exp(bLRmodel$coefficients)[w]
# check max
max(exp(bLRmodel$coefficients)[w]) # huge? why? - think!
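# interpretation: a one-unit increase in a term's tf-idf multiplies the odds
# of 'dotCom' by exp(b); tf-idf values here live on a tiny scale, so fitted
# coefficients (and their exponents) can get enormous; e.g.
exp(3) # a coefficient of 3 already multiplies the odds ~20-fold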
#### Reminder: Maximum Likelihood Estimation
normData <- rnorm(10000, mean = 5.75, sd = 1.25)
normLogLike <- function(params,x) {
mean <- params[1]
sd <- abs(params[2]) # dnorm generates NaNs if sd < 0
dens <- dnorm(x,mean,sd)
w <- which(dens==0)
dens[w] <- .Machine$double.xmin
return(-(sum(log(dens)))) # negative logLike, for minimization w. optim()
}
# ML estimation
# random initial values
startMean <- runif(1,-100,100)
startSd <- runif(1,-100,100)
mlFit <- optim(c(startMean, startSd),
fn = normLogLike,
x=normData,
control = list(maxit = 50000))
# ML estimates
mlFit$par
# cmp. true parameters: mean = 5.75, sd = 1.25
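# sanity check: for a normal, the ML estimates also have closed forms -
# the sample mean and the (biased, divide-by-n) standard deviation:
c(mean(normData), sqrt(mean((normData - mean(normData))^2)))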
Reminder: Maximum-Likelihood Estimation
# model Chi-Square
CSq <- bLRmodel$null.deviance - bLRmodel$deviance # this difference ~ Chi-Square Distribution
CSq
dfCSq <- bLRmodel$df.null - bLRmodel$df.residual # null - residual (model) degrees of freedom
dfCSq
# Chi-Square significance test in R:
pCSq = 1-pchisq(CSq, dfCSq) # 1 - c.d.f. = P(Chi-Square larger than this obtained by chance)
pCSq
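# numerically safer equivalent of 1 - pchisq(...):
pCSq <- pchisq(CSq, dfCSq, lower.tail = FALSE)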
# AIC = Akaike information criterion (-2*LogLikelihood+2*k, k = num.parameters)
bLRmodel$aic
Χ² and Akaike Information Criterion
#### w. training vs. test data set
# split into test and training
dim(dataSet)
choice <- sample(1:475,250,replace = F)
test <- which(!(c(1:475) %in% choice))
trainData <- dataSet[choice,]
newData <- dataSet[test,]
# check!
sum(dataSet$Category[choice])/length(choice) # proportion of dotCom in training
sum(dataSet$Category[test])/length(test) # proportion of dotCom in test
# Binomial Logistic Regression: use glm w. logit link
bLRmodel <- glm(Category ~.,
family=binomial(link='logit'),
control = list(maxit = 500),
data=trainData)
sumLR <- summary(bLRmodel)
sumLR
Training and test data
# fitted probabilities
fitted(bLRmodel)
hist(fitted(bLRmodel),50)
plot(density(fitted(bLRmodel)),
main = "Predicted Probabilities: Density")
polygon(density(fitted(bLRmodel)),
col="red",
border="black")
# Prediction from the model
predictions <- predict(bLRmodel,
newdata=newData,
type='response')
predictions <- ifelse(predictions >= 0.5,1,0)
trueCategory <- newData$Category
meanClasError <- mean(predictions != trueCategory)
accuracy <- 1 - meanClasError
accuracy # probably rather poor..? - Why? - Think!
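# a confusion matrix is more informative than accuracy alone:
table(predicted = predictions, actual = trueCategory)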
Predictions from the Binomial Logistic Regression Model
Try to train a binomial regression model many times by randomly
assigning documents to the training and test data set. What happens? Why?
*Look* at your data set and *think* about it before actually modeling it.
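A sketch of that experiment (assuming the dataSet built above; accuracy can vary a lot across random splits):
accs <- replicate(20, {
idx <- sample(nrow(dataSet), 250)
fit <- suppressWarnings(glm(Category ~ .,
family = binomial(link = 'logit'),
control = list(maxit = 500),
data = dataSet[idx, ]))
preds <- ifelse(predict(fit, newdata = dataSet[-idx, ], type = 'response') >= 0.5, 1, 0)
mean(preds == dataSet$Category[-idx])
})
summary(accs) # the spread across splits is the point of the exercise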