Topic Modeling
Karol Grzegorczyk
June 3, 2014
2/20
Generative probabilistic models of textual corpora
● In order to introduce automatic processing of natural language, a language model is
needed.
● One of the main goals of language modeling is to assign a probability to a document:
P(D) = P(w1, w2, w3, ..., wm)
● It is assumed that documents in a corpus were randomly generated (it is, of course, only
a theoretical assumption; in reality, in most cases, they are created by humans)
● There are two types of generative language models:
● those that generate each word on the basis of some number of preceding words
● those that generate words based on latent topics
3/20
n-gram language models
● N-gram – a contiguous sequence of n items from a given document
– When n == m, where m is the total number of words in a document:
P(D) = P(w1)P(w2|w1)P(w3|w1,w2)...P(wm|w1,...,wm-1)
● Unigram – words are independent: each word in a document is assumed to be
generated independently of the others
– P(D) = P(w1)P(w2)P(w3)...P(wm)
● Bigram – each word is generated with probability conditioned on the previous word
● In reality, language has long-distance dependencies
– Skip-gram is one of the solutions to this problem
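A minimal sketch of the unigram and bigram document probabilities above, assuming a toy corpus and plain maximum-likelihood counts (the corpus, tokenization, and lack of smoothing are illustrative assumptions, not from the slides):

# Maximum-likelihood unigram and bigram document probabilities on a toy corpus.
from collections import Counter
from itertools import chain

corpus = [["the", "cat", "sat"], ["the", "dog", "sat", "down"]]

unigrams = Counter(chain.from_iterable(corpus))
bigrams = Counter((d[i], d[i + 1]) for d in corpus for i in range(len(d) - 1))
total = sum(unigrams.values())

def p_unigram(doc):
    """P(D) = product of P(w_i) under the unigram model."""
    p = 1.0
    for w in doc:
        p *= unigrams[w] / total
    return p

def p_bigram(doc):
    """P(D) = P(w1) * product of P(w_i | w_{i-1}) under the bigram model."""
    p = unigrams[doc[0]] / total
    for prev, cur in zip(doc, doc[1:]):
        p *= bigrams[(prev, cur)] / unigrams[prev]  # no smoothing: zero for unseen pairs
    return p

print(p_unigram(["the", "cat", "sat"]), p_bigram(["the", "cat", "sat"]))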
4/20
Document Representations in Vector Space
● Each document is represented by an array of features
● Representation types:
– Bag-of-words (a.k.a. unigram count, term frequencies)
● A document is represented by a sparse vector whose size equals the dictionary size
– TF-IDF (term frequency–inverse document frequency)
● Similar to BoW, but term frequencies are weighted to penalize words that are frequent across the corpus
– Topic Models (a.k.a. concept models, latent variable models)
● A document is represented by a low-rank dense vector
● Similarity between documents (or between a document and a query) can be expressed as the
cosine of the angle between their vectors (cosine similarity)
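A minimal sketch of the bag-of-words and TF-IDF representations and cosine similarity, using scikit-learn; the toy documents are illustrative assumptions:

# Sparse BoW counts, TF-IDF reweighting, and document-document cosine similarity.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["topic models discover latent themes",
        "latent semantic analysis uses svd",
        "the cat sat on the mat"]

bow = CountVectorizer().fit_transform(docs)      # sparse bag-of-words counts
tfidf = TfidfVectorizer().fit_transform(docs)    # counts reweighted by inverse document frequency

print(bow.shape)                                 # documents x dictionary size
print(cosine_similarity(tfidf[0], tfidf[1]))     # similarity of the first two documents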
5/20
Topic Modeling
● Topic Modeling is a set of techniques that aim to discover and annotate large archives of
documents with thematic information.
● TM is a set of methods that analyze the words (or other fine-grained features) of the original
documents to discover the themes that run through them, how those themes are connected
to each other, and how they change over time.
● Often, the number of topics to be discovered is predefined.
● Topic modeling can be seen as a dimensionality reduction technique
● Topic modeling, like clustering, does not require any prior annotations or labels, but in
contrast to clustering it can assign a document to multiple topics.
● Semantic information can be derived from a word-document co-occurrence matrix
● Topic Model types:
– Linear algebra based (e.g. LSA)
– Probabilistic modeling based (e.g. pLSA, LDA, Random Projections)
6/20
Latent semantic analysis
C - normalized co-occurrence matrix
D - a diagonal matrix, all cells except the main diagonal are zeros, elements of the main
diagonal are called 'singular values'
● a.k.a. Latent Semantic Indexing
● A low-rank approximation of document-term matrix (typical rank 100-300)
● For comparison, the British National Corpus (BNC) contains 100 million words
● LSA downsizes the co-occurrence tables via Singular Value Decomposition
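A minimal sketch of LSA as a truncated SVD of a TF-IDF document-term matrix; the rank (n_components) and the toy documents are illustrative assumptions, and a real corpus would use a rank of 100-300:

# LSA: low-rank SVD of the document-term matrix.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = ["topic models discover latent themes",
        "latent semantic analysis uses svd",
        "singular value decomposition factorizes a matrix"]

X = TfidfVectorizer().fit_transform(docs)    # documents x terms
svd = TruncatedSVD(n_components=2)           # toy rank; typically 100-300
doc_vectors = svd.fit_transform(X)           # dense low-rank document vectors
term_vectors = svd.components_.T             # terms mapped into the same latent space

print(doc_vectors.shape, term_vectors.shape)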
7/20
Probabilistic Latent Semantic Analysis
pLSA models the probability of each co-occurrence as a mixture of conditionally
independent multinomial distributions:
P(d|c) relates to the V matrix from the previous slide
P(w|c) relates to the U matrix from the previous slide
In contrast to classical LSA, word representations in topics and topic representations in
documents are non-negative and sum up to one.
[T. Hofmann, Probabilistic latent semantic indexing, SIGIR, 1999]
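A minimal EM sketch of pLSA on a word-document count matrix, written in the asymmetric formulation with P(z|d) and P(w|z), where z plays the role of the concept variable c above; the matrix sizes, topic count, and iteration count are toy assumptions:

# EM for pLSA on a toy count matrix n(d, w).
import numpy as np

rng = np.random.default_rng(0)
n_docs, n_words, n_topics = 6, 12, 2
counts = rng.integers(0, 5, size=(n_docs, n_words)).astype(float)  # n(d, w)

p_z_d = rng.dirichlet(np.ones(n_topics), size=n_docs)   # P(z|d), rows sum to 1
p_w_z = rng.dirichlet(np.ones(n_words), size=n_topics)  # P(w|z), rows sum to 1

for _ in range(50):
    # E-step: responsibilities P(z|d,w) proportional to P(z|d) * P(w|z)
    joint = p_z_d[:, :, None] * p_w_z[None, :, :]        # shape (d, z, w)
    resp = joint / joint.sum(axis=1, keepdims=True)      # normalize over topics z
    # M-step: re-estimate P(w|z) and P(z|d) from expected counts n(d,w) * P(z|d,w)
    expected = counts[:, None, :] * resp
    p_w_z = expected.sum(axis=0)
    p_w_z /= p_w_z.sum(axis=1, keepdims=True)
    p_z_d = expected.sum(axis=2)
    p_z_d /= p_z_d.sum(axis=1, keepdims=True)

print(p_z_d.round(2))   # per-document topic proportions: non-negative, summing to one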
8/20
pLSA – Graphical Model
[T. Hofmann, Probabilistic latent semantic indexing, SIGIR, 1999]
● A graphical model is a probabilistic
model for which a graph denotes the
conditional dependence structure
between random variables.
● Only the shaded variable is observed;
all the others have to be inferred.
● We can infer the hidden variables using
maximum likelihood estimation.
● D – total number of documents
● N – total number of words in a document (in fact, it should be Nd)
● T – total number of topics
9/20
Latent Dirichlet allocation
● LDA is similar to pLSA, except that in LDA the topic distribution is assumed to have a
Dirichlet prior
● The Dirichlet distribution is a family of continuous multivariate probability distributions
● This model assumes that documents are generated randomly
● A topic is a distribution over a fixed vocabulary; each topic assigns a probability to every
word in the dictionary, but many words have very low probability
● Each word in a document is drawn from a topic that is itself drawn from the document's
distribution over topics.
● Each document exhibits multiple topics in different proportions.
– In fact, all the documents in the collection share the same set of topics, but each
document exhibits those topics in different proportions
● In reality, the topic structure, per-document topic distributions and the per-document per-
word topic assignments are latent, and have to be inferred from observed documents.
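A minimal numpy sketch of this generative story: draw the topics, then per-document topic proportions, then the words. All sizes and hyperparameters (alpha, eta, vocabulary size, document length) are illustrative assumptions:

# Simulating the LDA generative process.
import numpy as np

rng = np.random.default_rng(1)
vocab_size, n_topics, n_docs, doc_len = 20, 3, 4, 10
alpha, eta = 0.5, 0.1

beta = rng.dirichlet(np.full(vocab_size, eta), size=n_topics)   # topics: K x V

documents = []
for _ in range(n_docs):
    theta = rng.dirichlet(np.full(n_topics, alpha))   # per-document topic proportions
    words = []
    for _ in range(doc_len):
        z = rng.choice(n_topics, p=theta)             # per-word topic assignment
        w = rng.choice(vocab_size, p=beta[z])         # observed word
        words.append(w)
    documents.append(words)

print(documents[0])   # word ids of the first generated document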
10/20
The intuitions behind latent Dirichlet allocation
[D. Blei, Probabilistic topic models, Communications of the ACM, 2012]
11/20
Real inference with LDA
[D. Blei, Probabilistic topic models, Communications of the ACM, 2012]
A 100-topic LDA model was fitted to 17,000 articles from the Science journal.
At right are the top 15 most frequent words from the most frequent topics.
At left are the inferred topic proportions for the example article from the previous slide.
12/20
The graphical model for latent Dirichlet allocation.
[D. Blei, Probabilistic topic models, Communications of the ACM, 2012]
K – total number of topics
βk – a topic, i.e. a distribution over the vocabulary
D – total number of documents
Θd – per-document topic proportions
N – total number of words in a document (in fact, it should be Nd)
Zd,n – per-word topic assignment
Wd,n – observed word
α, η – Dirichlet parameters
● Several inference algorithms are available (e.g. sampling-based; see the sketch below)
● A few extensions to LDA have been proposed:
● Bigram Topic Model
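A minimal sketch of sampling-based inference for LDA: a collapsed Gibbs sampler over the per-word assignments Zd,n. The tiny corpus, hyperparameters, and number of sweeps are toy assumptions:

# Collapsed Gibbs sampling for LDA on a toy corpus of word ids.
import numpy as np

rng = np.random.default_rng(2)
docs = [[0, 1, 2, 1], [2, 3, 3, 4], [0, 4, 1, 2]]   # word ids per document
V, K, alpha, eta = 5, 2, 0.5, 0.1

n_dk = np.zeros((len(docs), K))      # topic counts per document
n_kw = np.zeros((K, V))              # word counts per topic
n_k = np.zeros(K)                    # total words per topic
z = []                               # current topic assignment of every word

for d, doc in enumerate(docs):       # random initialization
    z.append([])
    for w in doc:
        k = rng.integers(K)
        z[d].append(k)
        n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1

for _ in range(200):                 # Gibbs sweeps
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]              # remove the word's current assignment
            n_dk[d, k] -= 1; n_kw[k, w] -= 1; n_k[k] -= 1
            # full conditional: p(z=k | rest) ~ (n_dk + alpha) * (n_kw + eta) / (n_k + V*eta)
            p = (n_dk[d] + alpha) * (n_kw[:, w] + eta) / (n_k + V * eta)
            k = rng.choice(K, p=p / p.sum())
            z[d][i] = k              # re-add it under the newly sampled topic
            n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1

# estimated per-document topic proportions theta_d
print((n_dk + alpha) / (n_dk + alpha).sum(axis=1, keepdims=True))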
13/20
Matrix Factorization Interpretation of LDA
14/20
Tooling
MAchine Learning for LanguagE Toolkit
(MALLET) is a Java-based package for:
● statistical natural language processing
● document classification
● clustering
● topic modeling
● information extraction
● and other machine learning applications to text.
http://mallet.cs.umass.edu
gensim: topic modeling for humans
● Free python library
● Memory independent
● Distributed computing
http://radimrehurek.com/gensim
Stanford Topic Modeling Toolbox
http://nlp.stanford.edu/software/tmt
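A rough usage sketch for gensim (the toy texts and parameters such as num_topics are assumptions, and argument names may differ slightly between gensim versions):

# Training an LDA model with gensim.
from gensim import corpora, models

texts = [["topic", "model", "latent", "theme"],
         ["latent", "semantic", "analysis", "svd"],
         ["gibbs", "sampling", "dirichlet", "topic"]]

dictionary = corpora.Dictionary(texts)                  # word <-> id mapping
corpus = [dictionary.doc2bow(t) for t in texts]         # bag-of-words corpus
lda = models.LdaModel(corpus, id2word=dictionary, num_topics=2, passes=10)

print(lda.print_topics())    # top words per topic
print(lda[corpus[0]])        # topic proportions of the first document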
15/20
Topic modeling applications
● Topic-based text classification
● Classical text classification algorithms (e.g. perceptron, naïve Bayes, k-nearest neighbors,
SVM, AdaBoost) often assume a bag-of-words representation of the input data.
● Topic modeling can be seen as a pre-processing step before applying supervised
learning methods.
● Using a topic-based representation it is possible to gain ~0.039 in precision and ~0.046 in
F1 score [Cai, Hofmann, 2003]
● Collaborative filtering [Hofmann, 2004]
● Finding patterns in genetic data, images, and social networks
[L. Cai and T. Hofmann, Text Categorization by Boosting Automatically Extracted Concepts, SIGIR, 2003]
[T. Hofmann, Latent Semantic Models for Collaborative Filtering, ACM TOIS, 2004]
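A minimal sketch of topic modeling as a pre-processing step for supervised learning: scikit-learn's LDA implementation produces topic features for a linear classifier. The tiny corpus and labels are assumptions, and this is not the boosting setup of Cai and Hofmann (2003):

# Topic-based text classification: BoW -> LDA topic features -> logistic regression.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

docs = ["genes dna protein sequence", "dna genome mutation protein",
        "stocks market trading prices", "market shares prices investors"]
labels = [0, 0, 1, 1]

clf = make_pipeline(CountVectorizer(),
                    LatentDirichletAllocation(n_components=2, random_state=0),
                    LogisticRegression())
clf.fit(docs, labels)
print(clf.predict(["protein dna sequence"]))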
16/20
Word Representations in Vector Space
● Notion of similarity between words
● Continuous-valued vector representation of words
● Neural network language model
● Predicting the semantics of a word based on its context
● Ability to perform simple algebraic operations on the word vectors:
vector(“King”) - vector(“Man”) + vector(“Woman”) yields a vector close to vector(“Queen”)
● word2vec: https://code.google.com/p/word2vec
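A rough sketch of the analogy operation using gensim's Word2Vec (an assumption about tooling; the original word2vec is a C tool at the URL above, argument names vary across gensim versions, and the toy sentences below are far too small to actually recover the King/Queen analogy):

# Training word vectors and querying the king - man + woman analogy.
from gensim.models import Word2Vec

sentences = [["king", "rules", "the", "kingdom"],
             ["queen", "rules", "the", "kingdom"],
             ["man", "and", "woman", "walk"]]

model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, epochs=50)
print(model.wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))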
17/20
Topic Modeling in Computer Vision
● Bag-of-words model in computer vision (a.k.a. bag-of-features)
– Codewords (“visual words”) instead of just words
– Codebook instead of dictionary
● It is assumed that documents (images) exhibit multiple topics and that the whole collection
exhibits the same set of topics
● TM in CV has been used to:
– Classify images
– Build image hierarchies
– Connect images and captions
● The main advantage of this approach is its unsupervised nature: no labels are required for training.
18/20
Codebook example
[L. Fei-Fei, P. Perona, A Bayesian Hierarchical Model for Learning Natural Scene Categories, Computer Vision and Pattern Recognition, 2005]
Obtained from 650 training examples from 13 categories of natural scenes (e.g. highway,
inside of cities, tall buildings, forest, etc.) using the k-means clustering algorithm.
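A minimal sketch of building such a codebook with k-means; the random vectors stand in for real local descriptors (e.g. 128-dimensional SIFT features), which is an assumption:

# Build a visual codebook with k-means, then represent an image as a codeword histogram.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(3)
descriptors = rng.normal(size=(1000, 128))        # stand-in for SIFT-like local descriptors

codebook = KMeans(n_clusters=50, n_init=10, random_state=0).fit(descriptors)
codewords = codebook.cluster_centers_             # the "visual words"

# Each descriptor of a new image is mapped to its nearest codeword; the image is
# then represented by the histogram of codeword ids (bag-of-features).
image_descr = rng.normal(size=(200, 128))
ids = codebook.predict(image_descr)
hist = np.bincount(ids, minlength=50)
print(hist[:10])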
19/20
Bibliography
● D. Blei, Probabilistic topic models, Communications of the ACM, 2012
● M. Steyvers, T. Griffiths, Probabilistic Topic Models, Handbook of latent semantic analysis, 2007
● S. Deerwester, et al. Indexing by latent semantic analysis, JASIS, 1990
● D. Blei, A. Ng, M. Jordan, Latent Dirichlet Allocation, Journal of Machine Learning Research, 2003
● H. Wallach, Topic Modeling: Beyond Bag-of-Words, International Conference on Machine
Learning, 2006
● T. Mikolov, K. Chen, G. Corrado, J. Dean, Efficient estimation of word representations in vector
space, ICLR, 2013
● L. Fei-Fei, P. Perona, A Bayesian Hierarchical Model for Learning Natural Scene Categories,
Computer Vision and Pattern Recognition, 2005
● X. Wang, E. Grimson, Spatial Latent Dirichlet Allocation, Conference on Neural Information
Processing Systems, 2007
20/20
Thank you!
Editor's Notes
  • #5: Explain the difference between cosine and Euclidean distances.
  • #9: Everything within a plate is replicated; the replication count is in the bottom-right corner of the plate. The hidden nodes are unshaded, the observed nodes are shaded. Θ – theta, Φ – phi.
  • #10: A special case of the Dirichlet distribution is the beta distribution, a family of continuous probability distributions defined on the interval [0, 1]. A continuous probability distribution is one that has a probability density function.
  • #13: For each document we draw a topic distribution θd from a distribution over topic distributions (the Dirichlet prior). We know how many words we want in each document, so we draw that many per-word topic assignments Zd,n. The probability of word Wd,n therefore depends on the topic assignment (Zd,n) and on the distribution of words for that topic (βk).