Class Outline
• Introduction: Unstructured Data Analysis
• Word-level Analysis
– Vector Space Model
– TF-IDF

• Beyond Word-level Analysis: Natural
Language Processing (NLP)
• Text Mining Demonstration in R: Mining
Twitter Data
Background: Text Mining – New MR Tool!
• Text data is everywhere – books, news, articles, financial analysis,
blogs, social networking, etc.
• According to estimates, 80% of the world’s data is in “unstructured text
format”
• We need methods to extract, summarize, and analyze useful
information from unstructured/text data
• Text mining seeks to automatically discover useful knowledge from
this massive amount of data
• Text mining is an active research area in both industry and
academia
What is Text Mining?
• Use of computational techniques to extract high quality
information from text

• Extract and discover knowledge hidden in text automatically

• KDD definition: “discovery by computer of new previously unknown
information, by automatically extracting information from a usually
large amount of different unstructured textual resources”
Text Mining Tasks
• 1. Document Categorization (Supervised Learning)
• 2. Document Clustering/Organization (Unsupervised Learning)
• 3. Summarization (key words, indices, etc)
• 4. Visualization (word cloud, maps)
• 5. Numeric prediction (stock market prediction based on news text)
Features of Text Data
• High dimensionality
• Large number of features
• Multiple ways to represent the same concept
• Highly redundant data
• Unstructured data
• Easy for humans, hard for machines
• Abstract ideas hard to represent
• Huge amount of data to be processed
– Automation is required
Acquiring Texts
• Existing digital corpora: e.g. XML (high quality text and metadata)
– http://www.hathitrust.org/htrc

• Other digital sources (e.g. Web, Twitter, Amazon consumer reviews)
– Through API: e.g. tweets
– Websites without APIs can be “scraped”
– Generally requires custom programming (Perl, Python, etc) or software tools
(e.g. Web extractor pro)

• Undigitized text
– Scanned and subjected to Optical Character Recognition (OCR)
– Time and labor intensive
– Error-prone
Word-level Analysis: Vector Space Model
• Documents are treated as a “bag” of words or terms
• Any document can be represented as a vector: a list of terms and
their associated weights
– D= {(t1,w1),(t2,w2),…………,(tn,wn )}
– ti: i-th term
– wi: weight for the i-th term

• Weight is a measure of the importance of a term in terms of
information content
Vector Space Model: Bag of Words Representation
• Each document: Sparse high-dimensional vector!
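
A minimal Python sketch of the bag-of-words idea, using only the standard library. The two sample sentences, the tokenizing rule, and the function names are illustrative assumptions, not part of the original slides. Each document is stored sparsely as a term -> count mapping and expanded to a dense vector only when a shared vocabulary is needed.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Return a sparse term -> count mapping for one document."""
    # Assumption: a term is a run of lowercase letters or apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)


# Hypothetical two-document corpus for illustration.
docs = {
    "d1": "The cat chased the dog.",
    "d2": "The dog chased the mouse.",
}
bags = {doc_id: bag_of_words(text) for doc_id, text in docs.items()}

# The shared vocabulary fixes the dimensions of every document vector.
vocabulary = sorted(set().union(*bags.values()))


def to_dense(bag, vocabulary):
    """Expand a sparse bag into a dense count vector (0 for absent terms)."""
    return [bag.get(term, 0) for term in vocabulary]


vectors = {doc_id: to_dense(bag, vocabulary) for doc_id, bag in bags.items()}

print(vocabulary)
for doc_id, vector in vectors.items():
    print(doc_id, vector)
```

With a realistic vocabulary of tens of thousands of terms, almost every entry of a dense vector is zero, which is why the sparse mapping is the practical representation.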
TF-IDF: Definition
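
The formula on this slide did not survive extraction. One definition that is consistent with the worked example on the next slide (raw count for TF, base-10 logarithm for IDF) is sketched here. The symbols N (total number of documents) and n_t (number of documents containing term t) are assumed notation, and other TF-IDF variants exist.

```latex
\begin{align*}
\mathrm{TF}(t, d) &= \text{number of times term } t \text{ appears in document } d \\
\mathrm{IDF}(t) &= \log_{10}\frac{N}{n_t} \\
\text{TF-IDF}(t, d) &= \mathrm{TF}(t, d) \times \mathrm{IDF}(t)
\end{align*}
```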
TF-IDF: Example
• TF: Consider a document containing 100 words wherein the word cow
appears 3 times. Following the previously defined formulas, what is
the term frequency (TF) for cow?
– TF(cow,d1) = 3.

• IDF: Now assume we have 10 million documents and cow appears in
one thousand of these. What is the inverse document frequency of
the term, cow?
– IDF(cow) = log10(10,000,000/1,000) = log10(10,000) = 4

• TF-IDF score?
– TF-IDF = 3 x 4 = 12 (Product of TF and IDF)
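
The arithmetic above can be checked with a short Python sketch. It assumes, as the example does, that TF is the raw count of the term and that the logarithm is base 10. The function names and the filler tokens are hypothetical.

```python
import math


def term_frequency(term, tokens):
    """Raw count of `term` in a tokenized document."""
    return tokens.count(term)


def inverse_document_frequency(n_documents, n_documents_with_term):
    """Base-10 log of (total documents / documents containing the term)."""
    return math.log10(n_documents / n_documents_with_term)


# A 100-word document in which "cow" appears 3 times (filler words are made up).
tokens = ["cow"] * 3 + ["filler"] * 97

tf = term_frequency("cow", tokens)                    # 3
idf = inverse_document_frequency(10_000_000, 1_000)   # log10(10,000), about 4
tf_idf = tf * idf                                     # about 12

print(tf, idf, tf_idf)
```

A rare term (small n_t) gets a large IDF, so the same count of 3 would score higher for a rarer word and lower for a word that appears in nearly every document.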
Application 1: Document Search with Query
Document ID   Cat     Dog     Mouse   Fish    Horse   Cow     Matching Scores
d1            0.397   0.397   0.000   0.475   0.000   0.000   1.268
d2            0.352   0.301   0.680   0.000   0.000   0.000   0.653
d3            0.301   0.363   0.000   0.000   0.669   0.741   0.664
d4            0.376   0.352   0.636   0.558   0.000   0.000   1.286
d5            0.301   0.301   0.000   0.426   0.544   0.544   1.028
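
The slide does not state the query, but the matching scores equal the sum of each document's Cat, Dog, and Fish weights, so the sketch below assumes the query was those three terms. The weights are copied from the slide's table and the function name is hypothetical. For d1 the sum is 1.269 while the slide shows 1.268, presumably because the slide rounded after summing unrounded weights.

```python
# TF-IDF weights per document, copied from the slide's table.
weights = {
    "d1": {"cat": 0.397, "dog": 0.397, "mouse": 0.000, "fish": 0.475, "horse": 0.000, "cow": 0.000},
    "d2": {"cat": 0.352, "dog": 0.301, "mouse": 0.680, "fish": 0.000, "horse": 0.000, "cow": 0.000},
    "d3": {"cat": 0.301, "dog": 0.363, "mouse": 0.000, "fish": 0.000, "horse": 0.669, "cow": 0.741},
    "d4": {"cat": 0.376, "dog": 0.352, "mouse": 0.636, "fish": 0.558, "horse": 0.000, "cow": 0.000},
    "d5": {"cat": 0.301, "dog": 0.301, "mouse": 0.000, "fish": 0.426, "horse": 0.544, "cow": 0.544},
}

# Assumption: inferred from the scores, not stated on the slide.
query = ["cat", "dog", "fish"]


def matching_score(doc_weights, query_terms):
    """Sum the document's weights over the query terms."""
    return sum(doc_weights.get(term, 0.0) for term in query_terms)


scores = {doc_id: round(matching_score(w, query), 3) for doc_id, w in weights.items()}

# Rank documents from best to worst match.
ranking = sorted(scores, key=scores.get, reverse=True)

for doc_id in ranking:
    print(doc_id, scores[doc_id])
```

Under this assumption d4 and d1 rank highest because they are the only documents with sizable weights on all three query terms.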
Application 2: Word Frequencies – Zipf’s Law
• Idea: We use a few words very often, and most words very rarely,
because it’s more effort to use a rare word.

• Zipf’s Law: Product of frequency of word and its rank is [reasonably]
constant

• Empirically demonstrable; holds up over different languages
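
A small Python sketch of how the rank x frequency product can be tabulated. The sample text is synthetic and deliberately built to follow Zipf's Law exactly; a real corpus would only follow it approximately. The function name is hypothetical.

```python
import re
from collections import Counter


def rank_frequency_table(text, top_n=10):
    """Return (rank, word, frequency, rank * frequency) for the top_n words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    most_common = Counter(tokens).most_common(top_n)
    return [
        (rank, word, freq, rank * freq)
        for rank, (word, freq) in enumerate(most_common, start=1)
    ]


# Synthetic text: frequencies 12, 6, 4, 3 for ranks 1 to 4.
sample = " ".join(["the"] * 12 + ["of"] * 6 + ["and"] * 4 + ["to"] * 3)

rows = rank_frequency_table(sample)
for rank, word, freq, product in rows:
    print(rank, word, freq, product)
```

To try this on real text, replace `sample` with the contents of a large document and check how far the last column drifts from a constant.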
Application 2: Word Frequencies – Zipf’s Law
Application 3: Word Cloud - Budweiser Example

http://people.duke.edu/~el113/Visualizations.html
Problems with Word-level Analysis: Sentiment
• Sentiment is often expressed subtly, which makes it difficult to
identify from the terms of a sentence or document when they are
considered in isolation
– A positive or negative sentiment word may have opposite orientations in
different application domains. (“This camera sucks.” -> negative; “This vacuum
cleaner really sucks.” -> positive)
– A sentence containing sentiment words may not express any sentiment. (e.g.
“Can you tell me which Sony camera is good?”)
– Sarcastic sentences with or without sentiment words are hard to deal with. (e.g.
“What a great car! It stopped working in two days.”)
– Many sentences without sentiment words can also imply opinions. (e.g. “This
washer uses a lot of water.” -> negative)

• We have to consider the overall context (semantics of each sentence
or document)
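
To make the failure cases concrete, here is a deliberately naive word-level scorer in Python. The tiny lexicon and the function name are hypothetical, chosen only to cover the example sentences; no real sentiment tool is this simple.

```python
import re

# Hypothetical mini-lexicon for illustration only.
POSITIVE = {"good", "great"}
NEGATIVE = {"sucks", "bad"}


def lexicon_score(sentence):
    """Count positive words minus negative words, ignoring all context."""
    tokens = re.findall(r"[a-z']+", sentence.lower())
    return sum((t in POSITIVE) - (t in NEGATIVE) for t in tokens)


# (sentence, what a human reader would say)
examples = [
    ("This camera sucks.", "negative"),
    ("This vacuum cleaner really sucks.", "positive"),
    ("Can you tell me which Sony camera is good?", "no sentiment"),
    ("What a great car! It stopped working in two days.", "negative (sarcasm)"),
    ("This washer uses a lot of water.", "negative (implied)"),
]

results = [(lexicon_score(sentence), human) for sentence, human in examples]
for (sentence, human), (score, _) in zip(examples, results):
    print(f"{score:+d}  human: {human:<20}  {sentence}")
```

Only the first sentence is scored in line with the human reading. The vacuum cleaner gets the same negative score as the camera, the question and the sarcastic review both score positive, and the washer complaint scores zero.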
Natural Language Processing (NLP) to the Rescue!
• NLP is a field of computer science, artificial intelligence, and
linguistics concerned with the interactions between computers and
human (natural) languages.
• Key idea: Use statistical “machine learning” to automatically learn
the language from data!
• Major tasks in NLP
– Automatic summarization
– Part-of-speech tagging (POS tagging)
– Relationship extraction
– Sentiment analysis
– Topic segmentation and recognition
– Machine translation
Demonstration: POS Tagging – 1/2
• http://cogcomp.cs.illinois.edu/demo/pos/results.php
Demonstration: POS Tagging – 2/2
Demonstration: Sentence-level Sentiment – 1/3
• Stanford Sentiment Analyzer
– http://nlp.stanford.edu:8080/sentiment/rntnDemo.html
Demonstration: Sentence-level Sentiment – 2/3
• Review 1: This movie doesn’t care about cleverness, wit or any other
kind of intelligent humor. -> Negative
Demonstration: Sentence-level Sentiment – 3/3
• There are slow and repetitive parts, but it has just enough spice to
keep it interesting. -> Positive
• Text Mining Demonstration in R: Mining
Twitter Data
Twitter Mining in R – 1/2

Step 0) Install “R” and Packages
R program: http://www.r-project.org/
Package: http://cran.r-project.org/web/packages/tm/index.html
Package: http://cran.r-project.org/web/packages/twitteR/index.html
Package: http://cran.r-project.org/web/packages/wordcloud/index.html
Manual: http://cran.r-project.org/web/packages/tm/vignettes/tm.pdf

Step 1) Retrieving Text from Twitter: Twitter API
(Using twitteR)
Twitter Mining in R – 2/2
Step 2) Transforming Text

Step 3) Stemming Words
Step 4) Build a Term-Document Matrix
Step 5) Frequent Terms and Associations

Step 6) Word Cloud
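
The slides run these steps in R with the tm, twitteR, and wordcloud packages, and that code is not reproduced here. As a language-neutral illustration, the following standard-library Python sketch walks through Steps 2 to 5 on three made-up tweets. The stopword list, the suffix-stripping stemmer (much cruder than the Porter-style stemming usually used in practice), and all function names are assumptions. Term associations and the word cloud are left out.

```python
import re
from collections import Counter

# Hypothetical, very short stopword list.
STOPWORDS = {"a", "an", "and", "the", "is", "are", "to", "of", "in", "for", "it", "this"}


def transform(text):
    """Step 2: lowercase, drop URLs and non-letters, remove stopwords."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)
    text = re.sub(r"[^a-z\s]", " ", text)
    return [token for token in text.split() if token not in STOPWORDS]


def crude_stem(token):
    """Step 3: strip one common suffix; stems need not be real words."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def term_document_matrix(documents):
    """Step 4: rows are terms, columns are documents, cells are counts."""
    counts = [Counter(crude_stem(t) for t in transform(doc)) for doc in documents]
    terms = sorted(set().union(*counts))
    matrix = [[c[term] for c in counts] for term in terms]
    return terms, matrix


def frequent_terms(terms, matrix, min_count):
    """Step 5: terms whose total count across documents reaches min_count."""
    return [term for term, row in zip(terms, matrix) if sum(row) >= min_count]


# Made-up tweets standing in for text retrieved in Step 1.
tweets = [
    "Mining tweets with R is fun! http://example.com",
    "Text mining finds patterns in tweets",
    "R makes mining text easy",
]

terms, matrix = term_document_matrix(tweets)
frequent = frequent_terms(terms, matrix, min_count=2)
print(frequent)
```

The row totals of the term-document matrix are also the weights a word cloud (Step 6) would use to size each term.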
Software for Text Mining
• A number of academic and commercial software packages are available:
– 1. Open source packages in R – e.g. tm
• R program: http://www.r-project.org/
• Package: http://cran.r-project.org/web/packages/tm/index.html
• Manual: http://cran.r-project.org/web/packages/tm/vignettes/tm.pdf

– 2. Stanford CoreNLP
• http://nlp.stanford.edu/software/corenlp.shtml

– 3. SAS Text Miner
– 4. IBM SPSS
– 5. Boos Texter
– 6. StatSoft
– 7. AeroText

• Text Data is everywhere – you can mine it to gain insights!

Introduction to Text Mining
