Visualizing Data Using t-SNE

Good visualization
Mathematical framework
Implementation
David Khosid
Dec. 21, 2015

Agenda
Good visualization
Mechanics of t-SNE
Examples: image, text, voice
Scalability: visualizing large datasets, up to tens of millions of points
Implementations: scikit-learn, Matlab, Torch

MNIST visualization with PCA
This PCA visualization is terrible

MNIST visualization with t-SNE in 2D
t-SNE visualization can help you identify various clusters.
YouTube link to 3D t-SNE
Figure: (a) MNIST in t-SNE; (b) learning animation (view with Adobe Reader)

Good visualization (requirements)
Each high-dimensional object is represented by a low-dimensional object.
Neighborhoods are preserved: similar objects are placed close together.
Distant points correspond to dissimilar objects.
Scalability: works for large, high-dimensional data sets.

Manifold Learning
Manifolds
MNIST: 10 intrinsic dimensions in 28x28 images
Images: ~100 dims
Text: ~1000 dims
PCA
PCA, a linear dimensionality reduction method, is mainly concerned with preserving large pairwise distances in the map
Figure: Swiss Roll

Idea of t-SNE
A data point is a point $x_i$ in the original data space $\mathbb{R}^D$.
A map point is a point $y_i$ in the map space $\mathbb{R}^2$ or $\mathbb{R}^3$. Every map point represents one of the original data points.
t-SNE is a visualization algorithm that chooses the positions of the map points in $\mathbb{R}^2$ or $\mathbb{R}^3$.
t-SNE procedure:
1. Compute an N × N similarity matrix in the original $\mathbb{R}^D$ space.
2. Define an N × N similarity matrix in the low-dimensional embedding space; its entries depend on the map points being learned.
3. Define the cost function: the Kullback-Leibler divergence between the two probability distributions.
4. Learn the low-dimensional embedding.
Result: t-SNE focuses on accurately modelling small pairwise distances, i.e., on preserving local data structure in $\mathbb{R}^2$ or $\mathbb{R}^3$.

Conditional similarity between two data points
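Similarity of datapoints ($x_i$) in data space $\mathbb{R}^D$:
$$p_{j|i} = \frac{\exp\left(-\|x_i - x_j\|^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\left(-\|x_i - x_k\|^2 / 2\sigma_i^2\right)}$$
$p_{j|i}$ measures how close $x_j$ is to $x_i$ under a Gaussian distribution centered at $x_i$ with a given variance $\sigma_i^2$.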
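As a minimal sketch of the conditional similarity, the pure-Python snippet below evaluates $p_{j|i}$ for one point of a toy dataset. The function name `conditional_p`, the toy points, and the choice $\sigma_i = 1$ are illustrative assumptions, not part of the slides.

```python
import math

def conditional_p(X, i, sigma):
    """Return the list of p_{j|i} for all j, with p_{i|i} = 0."""
    weights = []
    for j, xj in enumerate(X):
        if j == i:
            weights.append(0.0)  # a point is not counted as its own neighbor
            continue
        sq_dist = sum((a - b) ** 2 for a, b in zip(X[i], xj))
        # Unnormalized Gaussian affinity centered at X[i].
        weights.append(math.exp(-sq_dist / (2.0 * sigma ** 2)))
    total = sum(weights)
    return [w / total for w in weights]

# Toy data (assumed for illustration): one near neighbor, one far point.
X = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0)]
p = conditional_p(X, i=0, sigma=1.0)
print(p)
```

The near neighbor receives almost all of the probability mass, which is the sense in which $p_{j|i}$ encodes closeness.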

Symmetric similarity
Similarity of datapoints ($x_i$) in data space $\mathbb{R}^D$:
$$p_{j|i} = \frac{\exp\left(-\|x_i - x_j\|^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\left(-\|x_i - x_k\|^2 / 2\sigma_i^2\right)} \tag{1}$$
Make the similarity metric $p_{ij}$ symmetric. The main advantage of symmetry is a simpler gradient (learning stage):
$$p_{ij} = \frac{p_{i|j} + p_{j|i}}{2N} \tag{2}$$
We set $p_{ii} = 0$, since we are interested only in pairwise similarities.
$\sigma_i$ is chosen such that the data point has a fixed perplexity (effective number of neighbors).
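The sketch below shows one way to pick $\sigma_i$ for a target perplexity and then symmetrize, assuming the perplexity is $2^{H}$ with $H$ the Shannon entropy (in bits) of the conditional distribution, and assuming a simple bisection on $\sigma_i$. The helper names, search bounds, and toy data are hypothetical choices for illustration.

```python
import math

def conditional_row(sq_dists, i, sigma):
    """p_{j|i} for all j, given squared distances from point i."""
    # Shift by the nearest-neighbor distance so exp() cannot underflow
    # to all zeros; the shift cancels in the normalization.
    d_min = min(d for j, d in enumerate(sq_dists) if j != i)
    w = [0.0 if j == i else math.exp(-(d - d_min) / (2.0 * sigma ** 2))
         for j, d in enumerate(sq_dists)]
    total = sum(w)
    return [v / total for v in w]

def perplexity(row):
    """2 ** (Shannon entropy in bits) of one conditional distribution."""
    entropy = -sum(v * math.log2(v) for v in row if v > 0.0)
    return 2.0 ** entropy

def find_sigma(sq_dists, i, target, lo=1e-3, hi=1e3, steps=60):
    """Bisection on a log scale; perplexity does not decrease as sigma grows."""
    for _ in range(steps):
        mid = math.sqrt(lo * hi)
        if perplexity(conditional_row(sq_dists, i, mid)) > target:
            hi = mid
        else:
            lo = mid
    return math.sqrt(lo * hi)

def joint_p(X, target):
    """Symmetric p_ij = (p_{i|j} + p_{j|i}) / (2N), with p_ii = 0."""
    n = len(X)
    sq = [[sum((a - b) ** 2 for a, b in zip(xi, xj)) for xj in X] for xi in X]
    # cond[i][j] holds p_{j|i}, each row with its own sigma_i.
    cond = [conditional_row(sq[i], i, find_sigma(sq[i], i, target))
            for i in range(n)]
    return [[(cond[i][j] + cond[j][i]) / (2.0 * n) for j in range(n)]
            for i in range(n)]

# Toy 1-D data (assumed): six equally spaced points.
X = [(0.0,), (1.0,), (2.0,), (3.0,), (4.0,), (5.0,)]
P = joint_p(X, target=3.0)
```

Dividing by $2N$ makes the entries of the joint matrix sum to one, because each of the $N$ conditional rows already sums to one.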

Similarity of map points in Low Dimension
Student t-distribution with one degree of freedom (the same as a Cauchy distribution):
$$q_{ij} = \frac{\left(1 + \|y_i - y_j\|^2\right)^{-1}}{\sum_{k \neq m} \left(1 + \|y_k - y_m\|^2\right)^{-1}} \tag{3}$$
We set $q_{ii} = 0$, since we are interested only in pairwise similarities.
Heavy-tailed (discussed later)
Still closely related to the Gaussian
Computationally convenient (no exponential)
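A minimal sketch of the map-space similarities, assuming the map points are given as a list of coordinate tuples. The function name `joint_q` and the toy points are illustrative.

```python
def joint_q(Y):
    """Student-t similarities q_ij of the map points, with q_ii = 0."""
    n = len(Y)
    # Unnormalized weights (1 + ||y_i - y_j||^2)^-1; no exponential needed.
    w = [[0.0 if i == j else
          1.0 / (1.0 + sum((a - b) ** 2 for a, b in zip(Y[i], Y[j])))
          for j in range(n)] for i in range(n)]
    # A single normalizer over all pairs k != m.
    total = sum(sum(row) for row in w)
    return [[v / total for v in row] for row in w]

# Toy map points (assumed for illustration).
Y = [(0.0, 0.0), (1.0, 0.0), (0.0, 3.0)]
Q = joint_q(Y)
```

Unlike $p_{j|i}$, the normalization here runs over all pairs at once, so $Q$ is a single joint distribution over pairs of map points.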

Kullback-Leibler divergence (Cost Function)
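$(p_{ij})$ is fixed, $(q_{ij})$ is flexible.
We want $(p_{ij})$ and $(q_{ij})$ to be as close as possible.
$$C = KL(P \,\|\, Q) = \sum_i \sum_j p_{ij} \log \frac{p_{ij}}{q_{ij}} \tag{4}$$
KL divergence:
is not a distance, since it is asymmetric
large $p_{ij}$ modelled by small $q_{ij}$ → large penalty
small $p_{ij}$ modelled by large $q_{ij}$ → small penalty
KL divergence meaning: the cross-entropy of P and Q minus the entropy of P; since P is fixed, minimizing C is equivalent to minimizing the cross-entropy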
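The asymmetry of the penalty can be checked numerically. The sketch below evaluates single terms of the cost and the full sum for toy matrices; the names `kl_term` and `kl_cost` and all numbers are assumptions chosen for illustration.

```python
import math

def kl_term(p, q):
    """Contribution p * log(p / q) of one pair to the cost C."""
    return p * math.log(p / q)

def kl_cost(P, Q):
    """C = sum_ij p_ij * log(p_ij / q_ij); pairs with p_ij = 0 contribute 0."""
    return sum(kl_term(p, q)
               for row_p, row_q in zip(P, Q)
               for p, q in zip(row_p, row_q) if p > 0.0)

# Similar data points placed far apart: large p_ij, small q_ij.
near_mapped_far = kl_term(0.4, 0.01)
# Dissimilar data points placed close together: small p_ij, large q_ij.
far_mapped_near = kl_term(0.01, 0.4)
print(near_mapped_far, far_mapped_near)

# Toy joint distributions (assumed): P as given, Q uniform off the diagonal.
P = [[0.0, 0.3, 0.1],
     [0.3, 0.0, 0.1],
     [0.1, 0.1, 0.0]]
Q = [[0.0 if i == j else 1.0 / 6.0 for j in range(3)] for i in range(3)]
cost = kl_cost(P, Q)
```

The first term is far larger in magnitude than the second, which is why the cost mainly protects local structure: separating true neighbors is expensive, while placing unrelated points close together is cheap.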

Learning: Gradient of t-SNE
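The t-SNE algorithm minimizes the KL divergence between the P and Q distributions.
$$\frac{\partial C}{\partial y_i} = 4 \sum_{j \neq i} (p_{ij} - q_{ij}) \frac{y_i - y_j}{1 + \|y_i - y_j\|^2} \tag{5}$$
positive $(p_{ij} - q_{ij})$ → attraction, negative → repulsion
dissimilar data points represented by nearby map points → repulsion
repulsions do not go to infinity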
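A minimal sketch of the gradient for a toy problem, assuming a symmetric $P$ that sums to one with a zero diagonal. The function names and the toy values are illustrative; this is not an optimized implementation.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((u - v) ** 2 for u, v in zip(a, b))

def joint_q(Y):
    """Student-t similarities q_ij of the map points, with q_ii = 0."""
    n = len(Y)
    w = [[0.0 if i == j else 1.0 / (1.0 + sq_dist(Y[i], Y[j]))
          for j in range(n)] for i in range(n)]
    total = sum(map(sum, w))
    return [[v / total for v in row] for row in w]

def tsne_gradient(P, Y):
    """dC/dy_i = 4 * sum_j (p_ij - q_ij) (y_i - y_j) / (1 + ||y_i - y_j||^2)."""
    Q = joint_q(Y)
    grads = []
    for i, yi in enumerate(Y):
        g = [0.0] * len(yi)
        for j, yj in enumerate(Y):
            if i == j:
                continue
            # Positive k: attraction between i and j; negative k: repulsion.
            k = 4.0 * (P[i][j] - Q[i][j]) / (1.0 + sq_dist(yi, yj))
            for d in range(len(g)):
                g[d] += k * (yi[d] - yj[d])
        grads.append(g)
    return grads

# Toy symmetric P (sums to 1, zero diagonal) and toy map points (assumed).
P = [[0.0, 0.3, 0.1],
     [0.3, 0.0, 0.1],
     [0.1, 0.1, 0.0]]
Y = [[0.0, 0.0], [2.0, 0.0], [0.0, 1.0]]
grad = tsne_gradient(P, Y)
```

A gradient-descent step would move each map point against its gradient, for example `y_i -= learning_rate * grad[i]`, where `learning_rate` is a hypothetical step size.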

Learning: Physical Analogy
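$$\frac{\partial C}{\partial y_i} = 4 \sum_{j \neq i} (p_{ij} - q_{ij}) \frac{y_i - y_j}{1 + \|y_i - y_j\|^2}$$
Physical analogy: springs obeying Hooke's law, $F = -k \, \Delta x$, that attract or repel the map points.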
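One way to make the analogy explicit is to read each term of the gradient as a spring between map points $y_i$ and $y_j$. The symbols $k_{ij}$ and $F_{ij}$ are introduced here for illustration and do not appear on the slide.

$$k_{ij} = \frac{4\,(p_{ij} - q_{ij})}{1 + \|y_i - y_j\|^2}, \qquad F_{ij} = -k_{ij}\,(y_i - y_j), \qquad -\frac{\partial C}{\partial y_i} = \sum_{j \neq i} F_{ij}$$

Gradient descent moves $y_i$ along the net force. When $p_{ij} > q_{ij}$ the stiffness $k_{ij}$ is positive and the spring pulls $y_i$ toward $y_j$; when $p_{ij} < q_{ij}$ it is negative and pushes them apart.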

Why Student-t for $q_{ij}$ instead of Gaussian?
Q: How many mutually equidistant datapoints can there be in 10 dimensions?
A: 11, but a 2D map can hold only 3 mutually equidistant points.
Crowding problem: the area of the 2D map that is available to accommodate moderately distant datapoints will not be large enough compared with the area available to accommodate nearby datapoints.
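A rough numerical illustration of why the heavy tail helps, under a simplifying assumption: it compares the unnormalized kernels $\exp(-d^2)$ and $(1 + d^2)^{-1}$ and ignores the normalization over all pairs. For a given kernel value, it computes the map distance each kernel needs; the function names are illustrative.

```python
import math

def gaussian_distance(v):
    """Distance d at which the Gaussian kernel exp(-d^2) equals v."""
    return math.sqrt(-math.log(v))

def student_t_distance(v):
    """Distance d at which the Student-t kernel 1 / (1 + d^2) equals v."""
    return math.sqrt(1.0 / v - 1.0)

for v in (0.5, 0.1, 0.01):
    print(f"kernel value {v}: Gaussian d = {gaussian_distance(v):.2f}, "
          f"Student-t d = {student_t_distance(v):.2f}")
```

For the same small similarity, the Student-t kernel places the map points much farther apart than the Gaussian, which gives moderately distant datapoints the extra room they lack in a 2D map.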

t-SNE in sklearn
Follow the example:
http://alexanderfabisch.github.io/t-sne-in-scikit-learn.html
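For orientation, a minimal usage sketch with scikit-learn's `TSNE` class. It is not taken from the linked example: it assumes scikit-learn is installed, uses the small 8x8 `load_digits` dataset rather than MNIST, and the subset size, perplexity, and seed are arbitrary choices.

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

# Small 8x8 digit images bundled with scikit-learn (not the 28x28 MNIST set).
digits = load_digits()
X = digits.data[:300]        # 300 samples, 64 features each
labels = digits.target[:300]

# Embed into 2D; random_state fixes the seed so repeated runs agree.
tsne = TSNE(n_components=2, perplexity=30.0, init="pca", random_state=0)
embedding = tsne.fit_transform(X)
print(embedding.shape)
```

Each row of `embedding` is a map point that can be plotted and colored by `labels` to inspect the clusters.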

Scalability: Barnes-Hut-SNE
The original t-SNE has $O(N^2)$ memory and computational complexity, which limits it to about 10K points.
Barnes-Hut-SNE, a tree-based algorithm, reduces the complexity to $O(N \log N)$ and scales up to tens of millions of data points.
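A back-of-the-envelope sketch of what the two growth rates mean in practice, assuming 8-byte floating-point entries for a dense N × N similarity matrix and ignoring all constant factors. The function names are illustrative.

```python
import math

def dense_matrix_bytes(n, bytes_per_entry=8):
    """Memory for a dense n x n similarity matrix (assumes 8-byte floats)."""
    return n * n * bytes_per_entry

def n_log_n(n):
    """Growth term n * log2(n), constant factors ignored."""
    return n * math.log2(n)

for n in (10_000, 10_000_000):
    print(f"N = {n}: dense matrix = {dense_matrix_bytes(n) / 1e9:.1f} GB, "
          f"N^2 = {n * n:.1e}, N log N = {n_log_n(n):.1e}")
```

At N = 10,000 the dense matrix already needs 0.8 GB under these assumptions, which is consistent with the practical limit quoted above.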

Review of t-SNE for Images, Speech, Text
(Flash Player should be installed on Windows, to see the embedded video)

Additional points
Q: Why do I get a (slightly) different result every time I run t-SNE?
Discussion: KL divergence in information theory
Q: We want $p_{ij} = p_{ji}$ and defined $p_{ij} = \frac{p_{i|j} + p_{j|i}}{2N}$. Why did we choose a symmetric similarity metric?
Discussion: What is the best visualization method for high-dimensional data so far?
Q: Is it feasible to use t-SNE to reduce a dataset to one dimension?
A: Yes

Summary, Q&A
t-SNE is an effective method for visualizing complex datasets
t-SNE exposes natural clusters
Implemented in many languages
Scalable with the $O(N \log N)$ version

References
Laurens van der Maaten's page: https://lvdmaaten.github.io/tsne/
Kevin Murphy, "Machine Learning: A Probabilistic Perspective", MIT Press, 2012
https://www.oreilly.com/learning/an-illustrated-introduction-to-the-t-sne-algorithm