Disclaimer
Presentations are intended for educational
purposes only and do not replace
independent professional judgment.
Statements of fact and opinions expressed
are those of the participants individually
and don’t necessarily reflect those of
blibli.com.
Blibli.com does not endorse or approve,
and assumes no responsibility for, the
content, accuracy or completeness of the
information presented.
Python, data science, and
unsupervised learning
Hendri Karisma
[email protected] /
[email protected]
Hendri Karisma
• Sr. Research and Development
Engineer at blibli.com (PT. Global
Digital Niaga)
• R&D team for machine learning
• Worked on the fraud detection system.
Currently working on the dynamic
recommendation system project.
Definition of Informatics
“Automation of Information” –
Prof. Dr. Ing. Iping Supriana
Solution Approaches
• Analytical (Exact)
Example :
– Analytical solution :
– Numerical solution :
– Error = |7.25 – 22/3| = |7.25 – 7.3333| = 0.08333
• Numerical (Approx.)
– Are numerical methods only the ML methods that we know from
the textbooks?
– Newton-Raphson, Gauss Elimination, Gauss-Jordan, Jacobi method,
Gauss-Seidel, Lagrange, Newton-Gregory, Richardson Interpolation,
etc.
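The formulas behind the example above did not survive in these slides. One integral that reproduces the numbers shown (exact value 22/3, numerical value 7.25) is the integral of 4 - x^2 over [-1, 1], approximated with the trapezoidal rule at step 0.5. That integrand, the interval, and the choice of rule are assumptions made for illustration, not something the slide states. A minimal sketch:

```python
def f(x):
    # Assumed integrand; the slide's own formula was lost.
    return 4.0 - x * x


def trapezoid(func, a, b, n):
    """Composite trapezoidal rule with n equal-width panels on [a, b]."""
    h = (b - a) / n
    total = 0.5 * (func(a) + func(b))
    for i in range(1, n):
        total += func(a + i * h)
    return h * total


exact = 22.0 / 3.0                    # analytical value of the assumed integral
approx = trapezoid(f, -1.0, 1.0, 4)   # 4 panels, i.e. step h = 0.5
error = abs(approx - exact)
print(approx, error)
```

Increasing the number of panels shrinks the error, which is the usual trade-off of the numerical approach: more computation for a closer approximation.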
Machine Learning Definition
“A computer program is said to learn
from experience E with respect to
some class of tasks T and performance
measure P, if its performance at tasks
in T, as measured by P, improves with
experience E.” – Prof. Tom Mitchell
How it works
Machine Learning Perspective
● Information Theory (Decision Tree :
ID-Tree, C4.5, etc)
● Probability (Bayesian : Naive
Bayes, Belief Network, etc)
● Graphical Model (Belief network, HMM,
CRF, Neural Network, etc)
● Numerical Method or Regression
(Stochastic Gradient Descent/Ascent:
Linear Regression, Multiple Linear
Regression, Neural Network, E-M
Algorithm, HMM)
Machine Learning
• Supervised
• Unsupervised
• Reinforcement Learning
• Semi-Supervised
• Deep Learning
The four layers of data mining
Tools/libs in python
● Numpy
● Scipy
● Pandas
● Scikit-learn
● Matplotlib
● seaborn
● Tensorflow
● Other Tech (to support ML) :
– Apache Kafka
– Apache Spark
– Db : mongo, postgre
– elasticsearch
– CUDA/OpenCL
*pydata.org
*anaconda
Numpy, scipy, pandas, and sk-learn
● Numpy & scipy: Arrays, Indexing, Slicing,
and Iterating, Reshaping, Shallow vs deep
copy, Broadcasting, Indexing (advanced),
Matrices, Matrix decompositions, Scipy built
on top of numpy
● Pandas : Reading data, Selecting columns
and rows, Filtering, Vectorized string
operations, Missing values, Handling time,
Time series, Built on top of numpy.
● SK-Learn : Feature extraction, Classification,
Regression, Clustering, Dimension reduction,
Model selection
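A few of the numpy topics listed above (reshaping, shallow vs deep copy, broadcasting, advanced indexing) can be sketched in a handful of lines. The array values are made up for illustration, and only numpy is used here; pandas and scikit-learn are left out to keep the sketch short.

```python
import numpy as np

a = np.arange(6).reshape(2, 3)      # [[0, 1, 2], [3, 4, 5]]

# Slicing returns a view (shallow copy): writes go through to `a`.
view = a[:, 1:]
view[0, 0] = 100

# .copy() returns an independent (deep) copy: `a` is not affected.
deep = a.copy()
deep[1, 2] = -1

# Broadcasting: the shape-(3,) row is stretched across both rows of `a`.
row = np.array([10, 20, 30])
shifted = a + row

# Boolean (advanced) indexing selects the elements that satisfy a condition.
big = a[a > 3]
```

The view/copy distinction is the one most likely to cause surprises in preprocessing code: modifying a slice silently changes the original array unless `.copy()` is called first.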
What we do in blibli using python
● Data flow
● Data pooling
● Data preprocessing
● Machine Learning Service/app
Our systems that use python for ML
● Personalized recommendation system
● Data engineering (especially the
data flow for ML engine)
● Machine learning engine
● Fraud detection experiments
EM Algorithms
Repeat until convergence{
}
What??
EM Algorithms
There are 3 key ingredients that (as far as I know) are
almost always used in the EM-Algorithm :
● Data Distribution
● Maximum Likelihood Estimation (MLE)
● Expectation-Maximization (EM)
*Today we will use the Gaussian distribution as the
sample case
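To make the MLE ingredient concrete for the Gaussian case: the maximum likelihood estimates of a Gaussian's mean and variance are the sample mean and the sample variance that divides by N. The sketch below checks this on synthetic data; the sample, the seed, and the helper name `log_likelihood` are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=2.0, size=10_000)   # synthetic sample

mu_hat = x.mean()                       # MLE of the mean
var_hat = ((x - mu_hat) ** 2).mean()    # MLE of the variance (divides by N, not N - 1)


def log_likelihood(data, mu, var):
    """Sum of log N(x | mu, var) over the sample."""
    return np.sum(-0.5 * np.log(2.0 * np.pi * var) - (data - mu) ** 2 / (2.0 * var))


best = log_likelihood(x, mu_hat, var_hat)
```

Moving either parameter away from its estimate lowers the log-likelihood, which is the property EM relies on in its maximization step.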
EM Algorithms
The algorithm has 2 main steps just like the name
of the algorithm:
– Expectation :
– Maximization:
*repeat until the likelihood is maximized :
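The formulas for the two steps are missing from these slides. In a common textbook form, with observed data X, latent variables Z, and parameters θ (this notation is an assumption and may differ from what the slide showed), they read:

```latex
% E-step: expected complete-data log-likelihood under the current parameters
Q(\theta \mid \theta^{(t)}) = \mathbb{E}_{Z \mid X, \theta^{(t)}}\big[\log p(X, Z \mid \theta)\big]

% M-step: choose the parameters that maximize that expectation
\theta^{(t+1)} = \arg\max_{\theta} \; Q(\theta \mid \theta^{(t)})

% Repeat until the log-likelihood stops improving
\log p(X \mid \theta^{(t+1)}) - \log p(X \mid \theta^{(t)}) < \varepsilon
```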
Gaussian Distribution
Multivariate Gaussian
● Gaussian Distribution :
● Multivariate Gaussian Distribution :
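The two density formulas did not survive in these slides. The standard univariate and multivariate Gaussian densities can be written directly in numpy, as in the sketch below; the function names are hypothetical and the code handles a single point at a time for clarity.

```python
import numpy as np


def gaussian_pdf(x, mu, sigma2):
    """Univariate density: exp(-(x - mu)^2 / (2 sigma2)) / sqrt(2 pi sigma2)."""
    return np.exp(-(x - mu) ** 2 / (2.0 * sigma2)) / np.sqrt(2.0 * np.pi * sigma2)


def multivariate_gaussian_pdf(x, mu, cov):
    """Multivariate density for one d-dimensional point x with mean mu, covariance cov."""
    d = mu.shape[0]
    diff = x - mu
    # Solve cov @ y = diff instead of forming the explicit inverse.
    maha = diff @ np.linalg.solve(cov, diff)
    norm = np.sqrt((2.0 * np.pi) ** d * np.linalg.det(cov))
    return np.exp(-0.5 * maha) / norm


p1 = gaussian_pdf(0.0, 0.0, 1.0)
p2 = multivariate_gaussian_pdf(np.zeros(2), np.zeros(2), np.eye(2))
```

With a diagonal covariance the multivariate density factorizes into a product of univariate densities, which is a quick sanity check for an implementation like this one.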
Gaussian Mixture
EM-Algorithm for Gaussian Mixture
● Expectation :
● Maximization :
*Log likelihood :
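The update formulas for this slide are missing, so the following is only a sketch of the standard EM updates for a one-dimensional Gaussian mixture, not necessarily the exact form presented. The function name `em_gmm_1d`, the quantile-based initialization, the stopping tolerance, and the synthetic data are all assumptions.

```python
import numpy as np


def gaussian_pdf(x, mu, var):
    return np.exp(-(x - mu) ** 2 / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)


def em_gmm_1d(x, k=2, n_iter=200, tol=1e-8):
    """Fit a k-component 1-D Gaussian mixture with EM."""
    n = x.shape[0]
    # Deterministic initialization (an arbitrary choice): means at data quantiles.
    means = np.quantile(x, (np.arange(k) + 0.5) / k)
    variances = np.full(k, x.var())
    weights = np.full(k, 1.0 / k)
    history = []
    for _ in range(n_iter):
        # E-step: responsibility of each component for each point, shape (n, k).
        dens = weights * gaussian_pdf(x[:, None], means, variances)
        total = dens.sum(axis=1, keepdims=True)
        resp = dens / total
        # Log-likelihood under the parameters used in this E-step.
        history.append(np.log(total).sum())
        # M-step: re-estimate parameters from responsibility-weighted data.
        nk = resp.sum(axis=0)
        weights = nk / n
        means = (resp * x[:, None]).sum(axis=0) / nk
        variances = (resp * (x[:, None] - means) ** 2).sum(axis=0) / nk
        if len(history) > 1 and history[-1] - history[-2] < tol:
            break
    return weights, means, variances, history


rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0.0, 1.0, 600), rng.normal(6.0, 1.5, 400)])
weights, means, variances, history = em_gmm_1d(x)
```

The recorded log-likelihood should never decrease from one iteration to the next (up to rounding), which is a useful check when debugging an EM implementation.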
Fraud – without target class/labels
● These are anomalous data
● Anomalous data usually form one or
a few small groups
● A lot of features without labels
------------------------------------------
● We need an unsupervised algorithm
(EM-Algorithm)
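One way to act on unlabeled data like this is to fit a density model and flag the rows it finds least likely. The sketch below uses a single multivariate Gaussian fitted by maximum likelihood, which is the one-component special case of a mixture and a deliberate simplification of the EM approach named above. The data are synthetic stand-ins, not the credit card data from the talk, and the 0.5% threshold is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(42)
# Synthetic stand-in for unlabeled transaction features (hypothetical data).
normal = rng.normal(loc=[50.0, 10.0], scale=[5.0, 2.0], size=(1000, 2))
outliers = np.array([[95.0, 30.0], [5.0, -5.0], [90.0, -8.0]])
X = np.vstack([normal, outliers])

# Fit one multivariate Gaussian by maximum likelihood (no labels used).
mu = X.mean(axis=0)
cov = np.cov(X, rowvar=False, bias=True)

# Log-density of every row under the fitted Gaussian.
diff = X - mu
maha = np.einsum("ij,ij->i", diff @ np.linalg.inv(cov), diff)
d = X.shape[1]
log_density = -0.5 * (maha + d * np.log(2.0 * np.pi) + np.log(np.linalg.det(cov)))

# Flag the lowest-density 0.5% of rows as anomaly candidates.
threshold = np.quantile(log_density, 0.005)
flagged = np.where(log_density <= threshold)[0]
```

With a mixture fitted by EM, the same scoring applies: replace the single-Gaussian log-density with the log of the weighted sum of component densities.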
Case Anomaly Detection
● Credit card data with fraudulent data.
Problem Performance
Distributed System/Scale Out
Supervisor/Service
Using python
Python script
Persistence
Computation
THANK YOU
Any questions?
*we are hiring*