Introduction to theano, case study of Word Embeddings

Introduction to theano
Case study of word embedding models
Shashank Gupta
MS, SIEL IIIT-H

NumPy is fast, so why theano?
NumPy is bound to the CPU
Bound by the Python interpreter (little scope for runtime optimizations such as lazy evaluation)
No symbolic differentiation
The finite difference method exists, but it is prone to numerical errors
SymPy supports symbolic differentiation, but its expression graphs are not optimized the way theano's are

What’s theano?
Math expression compiler
Compiles a math expression to optimized C code
Targets the GPU (falls back to the CPU if no GPU is available)
Strongly typed, in contrast to Python’s dynamic typing
Works as a GPU metaprogrammer (an abstraction on top of GPU programming)

Alternatives to theano
CUDA (PyCUDA is its Python wrapper)
Asynchronous Stochastic Gradient Descent - a smart way to parallelize SGD
Main idea - don’t take a lock for each update; let the threads run asynchronously.
Could speed up ML algorithms more than matrix-vector parallelization
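
A minimal sketch of the lock-free update idea, using only the Python standard library. The function name `async_sgd_fit`, the one-parameter model y = w * x, and the data are assumptions made for illustration. In CPython the interpreter lock serializes the threads, so this shows the structure of the scheme, not a real speed-up.

```python
import threading

def async_sgd_fit(xs, ys, n_threads=4, epochs=200, lr=0.1):
    """Fit y = w * x with SGD, all threads updating w without a lock."""
    w = [0.0]  # shared parameter; a list so every thread sees the same cell

    def worker(shard):
        for _ in range(epochs):
            for x, y in shard:
                grad = (w[0] * x - y) * x  # gradient of 0.5 * (w*x - y)**2
                w[0] -= lr * grad          # unsynchronized read-modify-write

    pairs = list(zip(xs, ys))
    shards = [pairs[i::n_threads] for i in range(n_threads)]
    threads = [threading.Thread(target=worker, args=(s,)) for s in shards]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return w[0]

# Noise-free toy data generated with a true slope of 3.0.
xs = [0.1 * i for i in range(1, 11)]
ys = [3.0 * x for x in xs]
w_hat = async_sgd_fit(xs, ys)
```

Some updates may overwrite each other, but each one still moves w toward the optimum, so the estimate ends up close to the true slope.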

Automatic Differentiation
Crude way - Finite difference method
df / dx ≈ (f(x + h) - f(x - h)) / (2*h)
Prone to coding bugs
Prone to numerical error
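
The central difference formula above can be sketched in a few lines of plain Python. The test function (sin) and the step sizes are choices made for illustration only.

```python
import math

def central_difference(f, x, h=1e-5):
    """Approximate df/dx at x with the central difference formula."""
    return (f(x + h) - f(x - h)) / (2 * h)

exact = math.cos(1.0)  # analytic derivative of sin at x = 1

# A moderate step size gives a good approximation.
error = abs(central_difference(math.sin, 1.0) - exact)

# A step size that is too small lets floating-point round-off dominate.
tiny_h_error = abs(central_difference(math.sin, 1.0, h=1e-15) - exact)
```

Comparing `error` with `tiny_h_error` shows the numerical-error problem: shrinking h does not keep improving the estimate.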

Symbolic differentiation
Calculate gradient analytically
For a given math expression construct symbolic computation graph
Ref: http://cs231n.github.io/optimization-2/

Backprop using symbolic differentiation
Once the graph is built, the forward pass is just the flow of inputs through it to compute the output
Backprop goes backwards from the last node, computing the local gradient at each node and accumulating those local gradients into the global gradient (using the chain rule)
Ex :
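
A minimal sketch of the forward and backward pass, written out by hand for f(x, y, z) = (x + y) * z. The function and the input values are chosen for illustration; the intermediate name `q` is an assumption.

```python
def forward_backward(x, y, z):
    """Return f = (x + y) * z and its gradients with respect to x, y, z."""
    # Forward pass: inputs flow through the graph.
    q = x + y          # add node
    f = q * z          # mul node

    # Backward pass: start at the output and apply the chain rule.
    df_df = 1.0
    df_dq = z * df_df  # local gradient of mul with respect to q is z
    df_dz = q * df_df  # local gradient of mul with respect to z is q
    df_dx = 1.0 * df_dq  # local gradient of add is 1 for each input
    df_dy = 1.0 * df_dq
    return f, (df_dx, df_dy, df_dz)

f, grads = forward_backward(-2.0, 5.0, -4.0)
```

Each node only needs its own local gradient; multiplying it by the gradient arriving from the node above gives the global gradient.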

Cont.
Each entity in the computation graph is a symbolic variable (a node in
computational graph)
Each operation is an ‘op’ node (theano specific)
‘Op’ node takes some inputs and produces some output
An ‘Apply’ node represents the application of an op to specific inputs
‘Type’ nodes carry the type information of a symbolic variable
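
A toy sketch of how these four kinds of node can fit together. The classes below are simplified stand-ins written for illustration; they are not theano's actual classes, and the helper `describe` is hypothetical.

```python
class Type:
    """Type information attached to a symbolic variable."""
    def __init__(self, dtype, ndim):
        self.dtype, self.ndim = dtype, ndim

class Variable:
    """A symbolic variable; `owner` is the Apply node that produced it."""
    def __init__(self, type_, name=None, owner=None):
        self.type, self.name, self.owner = type_, name, owner

class Apply:
    """Records one application of an op to specific input variables."""
    def __init__(self, op, inputs):
        self.op, self.inputs = op, list(inputs)
        self.output = Variable(inputs[0].type, owner=self)

class Op:
    """An operation such as add or mul; calling it extends the graph."""
    def __init__(self, name):
        self.name = name
    def __call__(self, *inputs):
        return Apply(self, inputs).output

def describe(var):
    """Render the graph rooted at `var` as a nested expression string."""
    if var.owner is None:
        return var.name
    args = ", ".join(describe(v) for v in var.owner.inputs)
    return f"{var.owner.op.name}({args})"

add, mul = Op("add"), Op("mul")
scalar = Type("float64", 0)
x, y, z = (Variable(scalar, name=n) for n in "xyz")
out = mul(add(x, y), z)  # builds the graph; nothing is computed yet
graph_text = describe(out)
```

Input variables have no owner, while every computed variable points back to the Apply node that created it, which is what makes the graph walkable from the output.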

Cont.
This is an abstract representation of the math expression
Think of it as the intermediate representation in a standard compiler pipeline
This is how theano represents the computational graph internally

Cont.
Ref: http://deeplearning.net/software/theano/extending/graphstructures.html

Optimization
Once this ‘intermediate representation’ is generated, theano optimizes the graph for efficient computation
Think of it as the compiler's optimization phase
It reorders operations, removes redundant expressions, etc. to generate an ‘equivalent’ graph that produces the same output as the original graph
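
A toy sketch of graph rewriting on expressions stored as nested tuples. The three rewrite rules (constant folding, x * 1, x + 0) are picked for illustration and are not a description of theano's actual optimizer passes; `simplify` and `evaluate` are hypothetical helpers.

```python
def simplify(expr):
    """Rewrite an expression tree into a cheaper equivalent one."""
    if not isinstance(expr, tuple):
        return expr                    # leaf: variable name or constant
    op, a, b = expr
    a, b = simplify(a), simplify(b)    # simplify the children first
    if isinstance(a, (int, float)) and isinstance(b, (int, float)):
        return a + b if op == "add" else a * b  # constant folding
    if op == "mul" and 1 in (a, b):
        return b if a == 1 else a      # x * 1 -> x
    if op == "add" and 0 in (a, b):
        return b if a == 0 else a      # x + 0 -> x
    return (op, a, b)

def evaluate(expr, env):
    """Evaluate an expression tree given values for its variables."""
    if isinstance(expr, str):
        return env[expr]
    if not isinstance(expr, tuple):
        return expr
    op, a, b = expr
    a, b = evaluate(a, env), evaluate(b, env)
    return a + b if op == "add" else a * b

original = ("add", ("mul", "x", 1), ("add", ("mul", 2, 3), 0))
optimized = simplify(original)
```

The optimized tree has fewer nodes but evaluates to the same value as the original for any x, which is the sense in which the two graphs are equivalent.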

Shared variables
Symbolic variables that carry a predefined value
In machine learning, used to define model parameters with initial values (the W and b matrices in a NN)
Stores these values on the GPU (when one is used) in an optimized form

Theano functions
Compile the ‘abstract’ computation graph to optimized C code
Compilation targets the GPU
The C code is highly optimized for numeric computation
Act as the interface between theano code and the Python code that calls it

Case study : Word embedding models
1st Model : Autoencoder based word embedding
Ref: http://arxiv.org/pdf/1412.4930v2.pdf

Cont.
U is the embedding matrix which is learned by optimizing this objective

Code
Refer to the GitHub gist:
https://gist.github.com/shashankg7/aec2303803e7b39b150a9f78cb59db09
Only the theano part is included; I/O and preprocessing are omitted

Model 2 - GloVe word embedding
Optimizes squared loss function
Challenges in practical implementation in theano
Main thing to remember - VECTORIZED implementation
How to handle large matrices
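
A plain-Python reference for the squared loss, assuming the weighted least-squares objective from the GloVe paper with its default weighting (x_max = 100, alpha = 0.75). The function and argument names are hypothetical. It loops over the nonzero co-occurrence entries only, which is one way to avoid holding the full matrix; a vectorized implementation should compute the same value.

```python
import math

def glove_weight(x, x_max=100.0, alpha=0.75):
    """Weighting function f(x) that damps very frequent co-occurrences."""
    return (x / x_max) ** alpha if x < x_max else 1.0

def glove_loss(cooc, W, W_ctx, b, b_ctx):
    """Sum of f(X_ij) * (w_i . w~_j + b_i + b~_j - log X_ij)**2.

    cooc maps (i, j) index pairs to nonzero co-occurrence counts;
    W and W_ctx are lists of word and context vectors.
    """
    total = 0.0
    for (i, j), x in cooc.items():
        dot = sum(wi * wj for wi, wj in zip(W[i], W_ctx[j]))
        diff = dot + b[i] + b_ctx[j] - math.log(x)
        total += glove_weight(x) * diff * diff
    return total

# Tiny made-up example: two words, two dimensions, one co-occurrence.
W = [[1.0, 0.0], [0.0, 1.0]]
W_ctx = [[1.0, 0.0], [0.0, 1.0]]
b = [0.0, 0.0]
b_ctx = [0.0, 0.0]
loss = glove_loss({(0, 1): 100.0}, W, W_ctx, b, b_ctx)
```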

Model 3 : Skip-gram negative sampling
Alpha stage: vectorization of the loss function not yet worked out
