1©MapR Technologies - Confidential
Large-scale Single-pass k-Means
Clustering at Scale
Ted Dunning
Goals
 Cluster very large data sets
 Facilitate large nearest neighbor search
 Allow very large number of clusters
 Achieve good quality
– low average distance to nearest centroid on held-out data
 Based on Mahout Math
 Runs on Hadoop (really MapR) cluster
 FAST – cluster tens of millions of points in minutes
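The quality goal above (low average distance to the nearest centroid on held-out data) can be written down in a few lines. This is a minimal sketch, not the Mahout code: the function name `mean_distance_to_nearest` and the toy data are assumptions made for illustration, and the metric is L2 as the deck states.

```python
import math


def mean_distance_to_nearest(points, centroids):
    """Average Euclidean (L2) distance from each point to its nearest centroid."""
    total = 0.0
    for p in points:
        total += min(math.dist(p, c) for c in centroids)
    return total / len(points)


# Toy held-out set and centroids (hypothetical values).
held_out = [(0.0, 0.0), (1.0, 0.0), (10.0, 10.0)]
centroids = [(0.0, 0.0), (10.0, 10.0)]
score = mean_distance_to_nearest(held_out, centroids)
print(score)
```

Lower is better; evaluating on held-out points rather than the training points keeps the score from rewarding a clustering that merely memorizes its input.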
Non-goals
 Use map-reduce (but it is there)
 Minimize the number of clusters
 Support metrics other than L2
Anti-goals
 Multiple passes over original data
 Scale as O(k n)
Why?
K-nearest Neighbor with
Super Fast k-means
What’s that?
 Find the k nearest training examples
 Use the average value of the target variable from them
 This is easy … but hard
– easy because it is so conceptually simple and you don’t have knobs to turn
or models to build
– hard because of the stunning amount of math
– also hard because we need top 50,000 results
 Initial prototype was massively too slow
– 3K queries x 200K examples takes hours
– needed 20M x 25M in the same time
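The k-nearest-neighbor idea on this slide fits in a few lines when done by brute force, which is also why it is slow: every query scans every training example. The sketch below is illustrative only; `knn_predict` and the toy data are assumed names and values, not part of the original prototype.

```python
import heapq
import math


def knn_predict(query, examples, k):
    """examples is a list of (vector, target) pairs.

    Returns the mean target value of the k examples nearest to query (L2).
    """
    nearest = heapq.nsmallest(k, examples, key=lambda ex: math.dist(query, ex[0]))
    return sum(target for _, target in nearest) / len(nearest)


# Toy training set (hypothetical values).
train = [((0.0, 0.0), 1.0), ((1.0, 0.0), 3.0), ((5.0, 5.0), 100.0)]
prediction = knn_predict((0.2, 0.0), train, k=2)
print(prediction)
```

Each call costs one distance computation per training example, so the total work grows as queries times examples. The searchers described later in the deck aim to avoid that full scan.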
How We Did It
 2-week hackathon with 6 developers from a customer bank
 Agile-ish development
 To avoid IP issues
– all code is Apache Licensed (no ownership question)
– all data is synthetic (no question of private data)
– all development done on individual machines, hosting on GitHub
– open is easier than closed (in this case)
 Goal is new open technology to facilitate new closed solutions
 Ambitious goal of ~ 1,000,000 x speedup
– well, really only 100-1000x after basic hygiene
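The roughly 1,000,000x figure can be checked against the workload sizes quoted earlier in the deck (3K queries x 200K examples for the prototype, 20M x 25M for the target). The calculation below assumes that cost is proportional to the number of query-example pairs, which holds for a brute-force search but is an assumption here.

```python
# Work is assumed proportional to (number of queries) x (number of examples).
prototype_pairs = 3_000 * 200_000           # 6e8 pairs, reported to take hours
target_pairs = 20_000_000 * 25_000_000      # 5e14 pairs, needed in the same time

speedup = target_pairs / prototype_pairs
print(f"required speedup: about {speedup:,.0f}x")
```

The ratio comes out a little under one million, consistent with the stated goal.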
What We Did
 Mechanism for extending Mahout Vectors
– DelegatingVector, WeightedVector, Centroid
 Shared memory matrix
– FileBasedMatrix uses mmap to share very large dense matrices
 Searcher interface
– ProjectionSearch, KmeansSearch, LshSearch, Brute
 Super-fast clustering
– Kmeans, StreamingKmeans
Projection Search
java.util.TreeSet!
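The slide gives only the name and the data structure, so the following is one plausible reading of projection search rather than the Mahout implementation: project every point onto a few random directions, keep each set of projections in sorted order, and for a query look only at points whose projections land near the query's. Python's `bisect` over a sorted list stands in for the Java TreeSet. The class name, `num_projections`, and `search_size` are assumptions for illustration.

```python
import bisect
import math
import random


class ProjectionSearch:
    """Approximate nearest-neighbor search using sorted random projections."""

    def __init__(self, points, num_projections=4, search_size=10, seed=0):
        rng = random.Random(seed)
        dim = len(points[0])
        self.points = points
        self.search_size = search_size
        # Random Gaussian directions to project onto.
        self.directions = [[rng.gauss(0, 1) for _ in range(dim)]
                           for _ in range(num_projections)]
        # One sorted list of (projected value, point index) per direction.
        self.tables = [sorted((self._dot(d, p), i) for i, p in enumerate(points))
                       for d in self.directions]

    @staticmethod
    def _dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    def search(self, query, k=1):
        candidates = set()
        for d, table in zip(self.directions, self.tables):
            # Locate the query in the sorted projections, then take neighbors
            # on both sides of that position as candidates.
            pos = bisect.bisect_left(table, (self._dot(d, query), -1))
            lo = max(0, pos - self.search_size)
            hi = min(len(table), pos + self.search_size)
            candidates.update(i for _, i in table[lo:hi])
        # Rank the candidates by true L2 distance.
        ranked = sorted(candidates, key=lambda i: math.dist(query, self.points[i]))
        return ranked[:k]


rng = random.Random(1)
points = [(rng.random(), rng.random(), rng.random()) for _ in range(200)]
searcher = ProjectionSearch(points)
nearest = searcher.search(points[17], k=1)
print(nearest)
```

Points that are close in the full space are also close in every projection, so true neighbors tend to show up among the candidates. More projections and a larger `search_size` trade speed for accuracy.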
How Many Projections?
K-means Search
 Simple Idea
– pre-cluster the data
– to find the nearest points, search the nearest clusters
 Recursive application
– to search a cluster, use a Searcher!
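The idea above can be sketched as follows, assuming the centroids have already been produced by some clustering step. The class name `KmeansSearch` echoes the deck, but the constructor, the `probes` parameter, and the brute-force scan inside each cluster are illustrative assumptions.

```python
import math


class KmeansSearch:
    """Search only the clusters whose centroids are nearest to the query."""

    def __init__(self, points, centroids):
        self.points = points
        self.centroids = centroids
        # Pre-cluster: record which points belong to each centroid.
        self.members = [[] for _ in centroids]
        for i, p in enumerate(points):
            c = min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            self.members[c].append(i)

    def search(self, query, k=1, probes=2):
        # Rank clusters by centroid distance and keep the closest few.
        order = sorted(range(len(self.centroids)),
                       key=lambda j: math.dist(query, self.centroids[j]))
        candidates = [i for j in order[:probes] for i in self.members[j]]
        # Brute-force scan inside the chosen clusters.
        candidates.sort(key=lambda i: math.dist(query, self.points[i]))
        return candidates[:k]


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
searcher = KmeansSearch(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
result = searcher.search((9.5, 10.2), k=2, probes=1)
print(result)
```

The recursive application mentioned on the slide would replace the brute-force scan inside each cluster with another searcher built over that cluster's members.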
But This Requires k-means!
 Need a new k-means algorithm to get speed
– Hadoop is very slow at iterative map-reduce
– Maybe Pregel clones like Giraph would be better
– Or maybe not
 Streaming k-means is
– One pass (through the original data)
– Very fast (20 μs per data point with threads)
– Very parallelizable
Basic Method
 Use a single pass of k-means with very many clusters
– output is a bad-ish clustering but a good surrogate
 Use weighted centroids from step 1 to do in-memory clustering
– output is a good clustering with fewer clusters
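The second step, clustering the weighted centroids in memory, can be illustrated with a weighted version of Lloyd's k-means iteration. This is a sketch under assumptions: the deck does not say which in-memory algorithm is used, and `weighted_kmeans` and its inputs are hypothetical names and values.

```python
import math


def weighted_kmeans(weighted_points, initial_centroids, iterations=10):
    """weighted_points is a list of (vector, weight) pairs.

    Returns a list of (centroid, weight) pairs, where weight is the total
    weight assigned to that centroid in the final iteration.
    """
    centroids = [tuple(c) for c in initial_centroids]
    dim = len(centroids[0])
    for _ in range(iterations):
        sums = [[0.0] * dim for _ in centroids]
        weights = [0.0] * len(centroids)
        for p, w in weighted_points:
            # Assign to the nearest centroid, counting the point w times.
            j = min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            weights[j] += w
            for d in range(dim):
                sums[j][d] += w * p[d]
        # Move each centroid to the weighted mean of its assigned points;
        # a centroid with no assigned weight stays where it is.
        centroids = [
            tuple(s / weights[j] for s in sums[j]) if weights[j] > 0 else centroids[j]
            for j in range(len(centroids))
        ]
    return list(zip(centroids, weights))


# Rough centroids from a first pass, each standing in for `weight` original points.
rough = [((0.0,), 3.0), ((1.0,), 1.0), ((10.0,), 2.0), ((12.0,), 2.0)]
final = weighted_kmeans(rough, initial_centroids=[(0.0,), (10.0,)])
print(final)
```

Because each rough centroid carries the count of points it absorbed, clustering the small weighted set approximates clustering the original data while fitting comfortably in memory.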
Algorithmic Details
Foreach data point xn
    compute distance to nearest centroid, ∂
    sample u uniformly from [0, 1)
    if u > ∂/ß, add xn to nearest centroid
    else create new centroid at xn
    if number of centroids > 10 log n
        recursively cluster centroids
        set ß = 1.5 ß if number of centroids did not decrease
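The pseudocode above translates fairly directly into Python. The sketch below is an illustration, not the Mahout StreamingKmeans code. Assumptions: `beta` stands for ß and `delta` for ∂, u is uniform on [0, 1), the logarithm is natural, the coin flip ignores centroid weights when the centroids themselves are re-clustered, and the nearest centroid is found by a linear scan.

```python
import math
import random


def absorb(centroids, x, w, beta, rng):
    """Fold the weighted point (x, w) into the centroid list in place."""
    if centroids:
        j = min(range(len(centroids)), key=lambda i: math.dist(x, centroids[i][0]))
        c, cw = centroids[j]
        delta = math.dist(x, c)              # distance to the nearest centroid
        if rng.random() > delta / beta:      # close points usually merge
            total = cw + w
            merged = [(ci * cw + xi * w) / total for ci, xi in zip(c, x)]
            centroids[j] = (merged, total)
            return
    centroids.append((list(x), w))           # otherwise start a new centroid


def streaming_kmeans(points, beta=1.0, seed=0):
    """Single pass over points; returns (weighted centroids, final beta)."""
    rng = random.Random(seed)
    centroids = []
    for n, x in enumerate(points, start=1):
        absorb(centroids, x, 1.0, beta, rng)
        if len(centroids) > 10 * math.log(n):
            before = len(centroids)
            old, centroids = centroids, []
            for c, w in old:                 # re-cluster the centroids themselves
                absorb(centroids, c, w, beta, rng)
            if len(centroids) >= before:     # nothing merged: loosen the threshold
                beta *= 1.5
    return centroids, beta


# Synthetic data: three Gaussian blobs of 100 points each (hypothetical values).
rng = random.Random(7)
points = [(rng.gauss(cx, 0.3), rng.gauss(cy, 0.3))
          for cx, cy in [(0, 0), (5, 5), (0, 5)] for _ in range(100)]
rng.shuffle(points)
centroids, beta = streaming_kmeans(points)
print(len(centroids), "centroids, total weight", sum(w for _, w in centroids))
```

Every merge keeps the weighted mean and adds the weights, so the total weight always equals the number of points seen. That property is what lets the resulting centroids serve as a surrogate for the original data.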
How It Works
 Result is large set of centroids
– these provide approximation of original distribution
– we can cluster centroids to get a close approximation of clustering original
– or we can just use the result directly
Parallel Speedup?
[Figure: time per point (μs) versus number of threads, comparing the threaded version, the non-threaded version, and perfect scaling.]
Warning, Recursive Descent
 Inner loop requires finding nearest centroid
 With lots of centroids, this is slow
 But wait, we have classes to accelerate that!
(Let’s not use k-means searcher, though)
 Empirically, projection search beats 64-bit LSH by a bit
Moving to Scale
 Map-reduce implementation nearly trivial
 Map: rough-cluster input data, output ß, weighted centroids
 Reduce:
– single reducer gets all centroids
– if too many centroids, merge using recursive clustering
– optionally do final clustering in-memory
 Combiner possible, but essentially never important
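The map and reduce steps can be mimicked in a single process to show the data flow. This is a sketch under several assumptions: `mapper`, `reducer`, and `rough_cluster` are hypothetical names, the mapper here uses a fixed ß (written `beta`) rather than adapting it, and the reducer loosens `beta` by 1.5x before each merge pass until the centroid count fits. No Hadoop API is involved.

```python
import math
import random


def rough_cluster(weighted_points, beta, rng):
    """One streaming pass over (vector, weight) pairs; returns weighted centroids."""
    centroids = []
    for x, w in weighted_points:
        if centroids:
            j = min(range(len(centroids)), key=lambda i: math.dist(x, centroids[i][0]))
            c, cw = centroids[j]
            if rng.random() > math.dist(x, c) / beta:
                total = cw + w
                centroids[j] = (tuple((a * cw + b * w) / total for a, b in zip(c, x)), total)
                continue
        centroids.append((tuple(x), w))
    return centroids


def mapper(split, beta, seed):
    """Map: rough-cluster one input split; emit beta and the weighted centroids."""
    rng = random.Random(seed)
    return beta, rough_cluster(((x, 1.0) for x in split), beta, rng)


def reducer(mapper_outputs, max_centroids, seed=0):
    """Reduce: gather all centroids and merge them while there are too many."""
    rng = random.Random(seed)
    beta = max(b for b, _ in mapper_outputs)
    centroids = [c for _, cs in mapper_outputs for c in cs]
    while len(centroids) > max_centroids:
        beta *= 1.5                          # assumed schedule for loosening beta
        centroids = rough_cluster(centroids, beta, rng)
    return centroids


# Synthetic data: two Gaussian blobs, split four ways (hypothetical values).
rng = random.Random(42)
data = [(rng.gauss(cx, 0.5), rng.gauss(cy, 0.5))
        for cx, cy in [(0, 0), (8, 8)] for _ in range(200)]
splits = [data[i::4] for i in range(4)]
outputs = [mapper(s, beta=1.0, seed=i) for i, s in enumerate(splits)]
final = reducer(outputs, max_centroids=20)
print(len(final), "centroids, total weight", sum(w for _, w in final))
```

Each mapper touches its split exactly once, and the reducer only ever sees centroids, which is why a single reducer is enough even for very large inputs.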
 Contact:
– tdunning@maprtech.com
– @ted_dunning
 Slides and such:
– http://info.mapr.com/ted-mlconf
Hash tags: #mlconf #mahout #mapr

Graphlab Ted Dunning Clustering