Introduction to the tm package
What is Text Mining?
Text mining is the process of exploring and analyzing large amounts
of unstructured data to identify concepts, patterns, topics, keywords,
and other attributes.
Common challenges of text mining:
 Each word and phrase is a potential dimension, so the feature space
becomes extremely high-dimensional.
 The data are unstructured, unlike other data mining settings where
data come in a structured tabular format.
 Words are not statistically independent of one another.
 Ambiguity: “the quality of being open to more than one
interpretation; inexactness.”
Rupak Roy
Text mining applications
• Customer Relationship management (CRM)
• Market Analysis
• NLP (natural language processing)
• Personalization in E-Commerce
• Natural language processing (NLP) is a field of AI and a
component of text mining that performs linguistic analysis,
helping machines understand and analyze the languages that
humans use naturally. NLP draws on a variety of methods to
resolve the ambiguities in human language, such as automatic
summarization, part-of-speech tagging, entity extraction and
relation extraction, as well as disambiguation and natural language
understanding and recognition.
Modeling Techniques
• Supervised Learning
• Unsupervised learning
Supervised Learning: we use labeled data to train a model to
classify new data; that is, we direct (train) the ML model with
labeled examples.
For example, sentiment analysis using classification methods such as SVM.
Unsupervised Learning: the reverse of supervised learning. It does not
require labeled data to train and validate the model; instead, it uses
the available unlabeled data to discover structure in the problem.
For example: clustering, topic modeling.
The tm (text mining) package in R
tm is an R package for preprocessing text data. A typical workflow:
1. Remove unnecessary lines, then convert the text to a corpus (a
structured collection of texts).
2. Read and inspect the corpus, then create a TDM (term-document
matrix).
Corpus: a corpus, or text corpus, is a large and structured set of texts.
a) In a corpus we parse the data to extract words, remove
punctuation and extra spaces, and convert everything to a uniform
case (e.g. lower case).
b) Then remove words that have no meaning by themselves, such as
was, as, a, it, etc., also called stop words.
c) Finally, apply stemming, the process of reducing derived words
to their word stem, base, or root form. E.g. Consult,
Consulting, Consultation, Consultants = Consult (same root)
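The steps above can be sketched in R with tm. This is a minimal sketch assuming the tm and SnowballC packages are installed; the toy sentences are made up for illustration:

```r
library(tm)
library(SnowballC)  # supplies the stemmer used by stemDocument()

docs <- c("Consultants were consulting on the consultation.",
          "I like Data Science!")
corp <- Corpus(VectorSource(docs))

corp <- tm_map(corp, content_transformer(tolower))       # uniform case
corp <- tm_map(corp, removePunctuation)                  # strip punctuation
corp <- tm_map(corp, removeWords, stopwords("english"))  # drop stop words
corp <- tm_map(corp, stemDocument)                       # reduce words to stems
corp <- tm_map(corp, stripWhitespace)                    # collapse extra spaces

inspect(corp)  # the derived forms of "consult" collapse to the same stem
```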
Term Document Matrix
A Term Document Matrix (TDM) is a matrix that describes the frequency of
terms that occur in a collection of documents.
Rows = terms, columns = documents.
(A Document Term Matrix is the transpose: rows = documents, columns = terms.)
A common function for cleaning the data (corpus), e.g. removing
whitespace, punctuation, and numbers, is the tm_map() function from
the tm package.
Example: D1 = “i like data science”, D2 = “i hate data science”.
Document Term Matrix (documents as rows):
     i   like   hate   data science
D1   1    1      0         1
D2   1    0      1         1
Term Document Matrix (terms as rows):
               D1   D2
i              1    1
like           1    0
hate           0    1
data science   1    1
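The tiny two-document example can be rebuilt directly in R. A sketch assuming the tm package; note that tm's default tokenizer splits on whitespace, so "data" and "science" come out as separate terms:

```r
library(tm)

docs <- c("i like data science",
          "i hate data science")
corp <- Corpus(VectorSource(docs))

# Keep one-letter words such as "i" (tm drops words under 3 chars by default)
tdm <- TermDocumentMatrix(corp, control = list(wordLengths = c(1, Inf)))
as.matrix(tdm)  # rows = terms, columns = documents
```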
Term Document Matrix
Now, what can we do with a Term Document Matrix (TDM)?
* We can easily find the frequent terms in the documents, which
helps identify keywords; for example, this is useful for
understanding search keywords such as Google queries.
* We can find associations between words, i.e. which words are
correlated or similar, and how they relate to each other.
* We can group words with the same or similar behavior using clustering
techniques.
* Sentiment Analysis: the automated process of determining whether an
opinion about a given subject, expressed in written or spoken language,
is negative, positive, or neutral, helping a business understand the
social sentiment around its brand, product, or service.
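The grouping idea above can be sketched with hierarchical clustering on a term-document matrix. A sketch on a made-up toy corpus, assuming the tm package:

```r
library(tm)

# Toy corpus: two "animal" documents and two "finance" documents
docs <- c("cats and dogs", "dogs chase cats",
          "stocks and bonds", "bonds and stocks rise")
tdm <- TermDocumentMatrix(Corpus(VectorSource(docs)))

m   <- as.matrix(tdm)              # rows = terms, columns = documents
d   <- dist(m)                     # distances between term vectors
fit <- hclust(d, method = "ward.D2")  # hierarchical clustering of terms
plot(fit)  # dendrogram: terms that co-occur end up in the same branch
```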
Example
#load the data
>star_wars_EPV<-read.csv("SW_EpisodeV.txt",h=TRUE,sep = " ")
>View(star_wars_EPV)
>str(star_wars_EPV)
>names(star_wars_EPV)
#Keep only the second column (the dialogue) as a data frame
>dialogue<-data.frame(star_wars_EPV$dialogue)
#Renaming the column
>names(dialogue)<-"dialogue"
>str(dialogue)
Example
#Data preprocessing using the tm package
>library(tm)
#build text corpus
>dialogue.corpus<-Corpus(VectorSource(dialogue$dialogue))
>summary(dialogue.corpus)
>inspect(dialogue.corpus[1:5]) #Inspecting elements in Corpus
#Clean the data
#Converting to lower case
>dialogue.corpus<-tm_map(dialogue.corpus,content_transformer(tolower))
#Removing extra white space
>dialogue.corpus<-tm_map(dialogue.corpus,stripWhitespace)
#Removing punctuations
>dialogue.corpus<-tm_map(dialogue.corpus,removePunctuation)
#Removing numbers
>dialogue.corpus<-tm_map(dialogue.corpus,removeNumbers)
Example
#Create a list of stop words, i.e. words that carry no meaning by themselves
>my_stopwords<-c(stopwords('english'),'@','http','url','www')
#Note: removeWords matches whole words literally (patterns like 'http*' are not expanded)
#Remove the stop words
>dialogue.corpus<-tm_map(dialogue.corpus,removeWords,my_stopwords)
#Build term document matrix
>dialogue.tdm<-TermDocumentMatrix(dialogue.corpus)
>dialogue.tdm
>dim(dialogue.tdm) #Dimensions of term document matrix
>inspect(dialogue.tdm[1:10,1:10])
#Remove sparse terms (words that occur infrequently)
#0.97 removes terms that are absent from more than 97% of the documents
>dialogue.imp<-removeSparseTerms(dialogue.tdm,0.97)
Example
#Find words and their frequencies
>temp<-as.matrix(dialogue.imp) #convert the TDM into a plain matrix
>wordFreq<-data.frame(apply(temp, 1, sum))
>wordFreq<-data.frame(ST = row.names(wordFreq), Freq = wordFreq[,1])
>head(wordFreq)
>wordFreq<-wordFreq[order(wordFreq$Freq, decreasing = T), ]
>View(wordFreq)
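The frequencies computed above can also be visualized quickly. A sketch using base R graphics; it assumes wordFreq exists as built on this slide (already sorted in decreasing order):

```r
# Plot the 10 most frequent terms as a bar chart
top10 <- head(wordFreq, 10)
barplot(top10$Freq,
        names.arg = top10$ST,
        las = 2,                 # rotate term labels for readability
        main = "Top 10 terms",
        ylab = "Frequency")
```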
Example
##Basic Analysis
#Finding the most frequent terms/words
findFreqTerms(dialogue.tdm,10) #Occurring minimum of 10 times
findFreqTerms(dialogue.tdm,30) #Occurring minimum of 30 times
findFreqTerms(dialogue.tdm,50) #Occurring minimum of 50 times
findFreqTerms(dialogue.tdm,70) #Occurring minimum of 70 times
#Finding associations between terms/words (third argument = minimum correlation)
findAssocs(dialogue.tdm,"dont",0.3)
findAssocs(dialogue.tdm,"get",0.2)
findAssocs(dialogue.tdm,"right",0.2)
findAssocs(dialogue.tdm,"will",0.3)
findAssocs(dialogue.tdm,"know",0.3)
findAssocs(dialogue.tdm,"good",0.3)
Building Word Cloud
#Visualization using WordCloud
>library("wordcloud")
>library("RColorBrewer")
#Word Cloud takes the text corpus, not the term document matrix
#How to choose colors?
?brewer.pal #Help page: lists the groups of palette colors
display.brewer.all() #Gives you a chart of all palettes
display.brewer.pal(8,"Dark2")
display.brewer.pal(8,"Purples")
display.brewer.pal(3,"Oranges")
set8<-brewer.pal(8,"Dark2")
Building Word Cloud
#plot the word cloud
wordcloud(dialogue.corpus,min.freq=10,
max.words=60,
random.order=T,colors=set8)
wordcloud(dialogue.corpus,min.freq=10,max.words=60,
random.order=T,
colors=set8,vfont=c("script","plain"))
Next
We will learn how to use regular expression tools to find and replace
text.