Machine Learning
to Engage the Customer
Chris.Biow@mongodb.com
@chris_biow
Trigger Warning
This presentation, and materials to
which it links, contains triggers.
These will be triggering reactive,
asynchronous, and message-driven
environments.
A safe room is available in Empire
West, where Alan Viars is
presenting Modernizing National
Health Care.
Objectionable Content
Language Impurities
• Basic Linear Algebra
Subprograms (BLAS - Fortran)
• Node-RED visual programming
• Node.js
• Scala, with Perlish accent
(Ehrmegerd, nerl perlish!)
• Java, C++, Prolog
• Twitter: unfiltered, live feed
• Machine recommendations
• Degenerate cases
Let’s Try It
Node-RED
Twitter with Watson Resonance
db.tweets.aggregate([
  {$group: {
    _id: {
      hour: {$hour: "$date"},
      minute: {$minute: "$date"}
    },
    total: {$sum: "$sentiment.score"},
    average: {$avg: "$sentiment.score"},
    count: {$sum: 1},
    happyTalk: {$push: "$sentiment.positive"}
  }},
  {$unwind: "$happyTalk"},
  {$unwind: "$happyTalk"},
  {$group: {
    _id: "$_id",
    total: {$first: "$total"},
    average: {$first: "$average"},
    count: {$first: "$count"},
    happyTalk: {$addToSet: "$happyTalk"}
  }},
  {$sort: {_id: -1} }
])
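As a rough illustration of what the pipeline above computes, the sketch below repeats the same steps over an in-memory array in plain JavaScript. The sample tweets, the assumption that `sentiment.positive` is an array of words, and the function name `summarizeByMinute` are made up for illustration and do not come from the talk.

```javascript
// In-memory sketch of the aggregation pipeline above (illustrative only).
// Assumption: each tweet has a Date in `date`, a numeric `sentiment.score`,
// and an array of words in `sentiment.positive`.
function summarizeByMinute(tweets) {
  const groups = new Map();
  for (const t of tweets) {
    // First $group: key on UTC hour and minute, as $hour/$minute do.
    const hour = t.date.getUTCHours();
    const minute = t.date.getUTCMinutes();
    const key = hour * 60 + minute;
    if (!groups.has(key)) {
      groups.set(key, { _id: { hour, minute }, total: 0, count: 0, happyTalk: [] });
    }
    const g = groups.get(key);
    g.total += t.sentiment.score;
    g.count += 1;
    // $push of an array field yields an array of arrays.
    g.happyTalk.push(t.sentiment.positive);
  }
  return [...groups.entries()]
    // $sort {_id: -1}: latest hour/minute first.
    .sort((a, b) => b[0] - a[0])
    .map(([, g]) => ({
      _id: g._id,
      total: g.total,
      average: g.total / g.count,
      count: g.count,
      // The two $unwind stages flatten the nested arrays; $addToSet drops duplicates.
      happyTalk: [...new Set(g.happyTalk.flat())],
    }))
    // $unwind discards documents whose array is empty, so such groups disappear.
    .filter((g) => g.happyTalk.length > 0);
}

const sampleTweets = [
  { date: new Date(Date.UTC(2015, 5, 1, 14, 30, 5)), sentiment: { score: 2, positive: ["great", "love"] } },
  { date: new Date(Date.UTC(2015, 5, 1, 14, 30, 40)), sentiment: { score: -1, positive: ["love"] } },
  { date: new Date(Date.UTC(2015, 5, 1, 14, 31, 10)), sentiment: { score: 3, positive: ["happy"] } },
];
const summary = summarizeByMinute(sampleTweets);
console.log(JSON.stringify(summary, null, 2));
```

The second `$unwind` is needed because `$push` collects one array per tweet; a single `$unwind` would only split the outer array. MongoDB does not guarantee the element order that `$addToSet` produces, so the order in this sketch is incidental.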
The What and the Why
Machine Learning
• What: depends who you ask
– learning that is done by machines [my lab partner]
– algorithms that can learn from and make predictions on data [Wikipedia, just now]
– induction and … other algorithms that can be said to “learn” [Kohavi 1998 goo.gl/WvEmNJ]
– whatever the heck we’re selling [cloud vendors]
– common cognitive framework, ingests content, observe, interpret, evaluate, decide [IBM Watson]
– predictive analytics [Microsoft Azure, AWS]
– algorithmic grab-bag [Mahout, MLlib]
• Why: depends what you want
– Engagement, discovery, decision [Watson]
– Prediction: maintenance, demand, resource allocation [Azure]
– Analytics: fraud, personalization, marketing, churn, support [AWS]
Apache Mahout Samsara
• Architectures: standalone, MapReduce, Spark, H2O
• Languages: DSL shell, Java
• Functions
– Collaborative filtering
– Classification
– Clustering
– Dimensionality reduction
– Topic models
– Miscellany
Example: Create topic grouping for Wikipedia articles
Spark MLlib
• Languages: Scala, Java, Python
• Clusters: EC2, YARN, Mesos, standalone
• Linear algebra: Breeze (Scala) / Fortran BLAS
• Data: vector, point, matrix
• Functions
– Basic stats
– Classification and regression
– Collaborative Filtering
– Clustering
– Dimensionality reduction (remove variables)
– Feature extraction & transformation
– Frequent pattern mining
– Optimization (local min/max)
Example: interactive drill-down categories for large result set
The Magic of Alternating Least Squares
Latent Factoring
Which is the real me?
Movies recommended for you:
1: The Sound of Music (1965)
2: Snow White and the Seven Dwarfs (1937)
3: Beauty and the Beast (1991)
4: Charlie Brown Christmas, A (1965)
5: Bambi (1942)
6: Seven Brides for Seven Brothers (1954)
7: Mary Poppins (1964)
8: Pinocchio (1940)
9: Gone with the Wind (1939)
10: The Wizard of Oz (1939)
Movies recommended for you:
1: Maradona by Kusturica (2008)
2: Shadows of Forgotten Ancestors (1964)
3: Rosario Tijeras (2005)
4: Constantine's Sword (2007)
5: Titicut Follies (1967)
6: Lady Chatterley (2006)
7: August Evening (2007)
8: Power of Nightmares: The Rise of the Politics of Fear, The (2004)
9: Sun Alley (Sonnenallee) (1999)
10: Who's Singin' Over There? (a.k.a. Who Sings Over There) (Ko to tamo peva) (1980)
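To make "latent factoring" concrete, here is a minimal sketch of alternating least squares on a tiny ratings matrix, using a single latent factor so that each update has a closed form. The ratings, the regularization constant `lambda`, and all function names are invented for illustration; Spark MLlib's ALS is a distributed, multi-factor implementation and is not what is shown here.

```javascript
// Rank-1 alternating least squares on a small ratings matrix (illustrative only).
// null marks an unobserved rating. All data and names here are hypothetical.
const ratings = [
  [5, 4, null, 1],
  [4, null, 5, 1],
  [1, 1, null, 5],
  [null, 1, 2, 4],
];
const lambda = 0.1; // ridge regularization strength (assumed value)

// Regularized squared error over the observed entries.
function objective(R, u, v) {
  let loss = 0;
  R.forEach((row, i) => row.forEach((r, j) => {
    if (r !== null) loss += (r - u[i] * v[j]) ** 2;
  }));
  const norm = (x) => x.reduce((s, a) => s + a * a, 0);
  return loss + lambda * (norm(u) + norm(v));
}

// Solve for every user factor with the item factors held fixed.
function updateUsers(R, v) {
  return R.map((row) => {
    let num = 0, den = lambda;
    row.forEach((r, j) => { if (r !== null) { num += r * v[j]; den += v[j] * v[j]; } });
    return num / den;
  });
}

// Solve for every item factor with the user factors held fixed.
function updateItems(R, u) {
  return R[0].map((_, j) => {
    let num = 0, den = lambda;
    R.forEach((row, i) => { if (row[j] !== null) { num += row[j] * u[i]; den += u[i] * u[i]; } });
    return num / den;
  });
}

function als(R, iterations) {
  let u = R.map(() => 1);
  let v = R[0].map(() => 1);
  const history = [objective(R, u, v)];
  for (let it = 0; it < iterations; it++) {
    u = updateUsers(R, v); // alternate: users first,
    v = updateItems(R, u); // then items
    history.push(objective(R, u, v));
  }
  return { u, v, history };
}

const model = als(ratings, 20);
// Predicted rating for user 0 on the item they have not rated (index 2).
const predicted = model.u[0] * model.v[2];
console.log(model.history[0].toFixed(2), "->", model.history[20].toFixed(2), predicted.toFixed(2));
```

Each half-step exactly minimizes the regularized objective over one set of factors, so the objective never increases. With only one factor the model cannot represent more than one taste dimension; practical recommenders use many factors and solve a small linear system per user and per item instead of a scalar division.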
Watson Developer Cloud
• Presented as services for Bluemix
• RESTful calls
• Node.js
• Node-RED
Example: Message resonance for
email solicitation
Microsoft Azure
• R and Python
• Flowchart GUI
• Correlation, modeling, trend projection, forecasting
• HDInsight cloud Hadoop
• Publishing for profit via Machine Learning Gallery
– Voice recognition
– Customer churn prediction
– Text extraction: sentiment and key phrase
– Contributor donation propensity
– Frequently bought together
– Classifier
– Clustering
– Linear regression
– … 35 total in market [goo.gl/LhMbUu]
Example: Retail forecasting
AWS
• Create models
• Generate predictions
• Data: S3, Redshift, RDS
• APIs: Java, .NET, Python, PHP, Node, Ruby
• Mobile SDK
• Use cases
– Fraud detection
– Content personalization
– Marketing propensity modeling
– Document classification
– Customer churn prediction
– Customer support solutions
Example: Marketing response prediction
MongoDB
• Next-gen database
– Document-model
– Scalable
– Highly-available
– Secondary indexes
• Agile with schema and query types
• Subsecond query response over multiple indexes
• Low-second aggregation framework for basic analytics
Example: Number of articles by author
• In-database mapReduce
• Hadoop connector
– Mongo[Input|Output]Format
– mongo.[input|output].uri or BSON
– mongo.input.query
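The "number of articles by author" example on this slide maps to a single `$group` stage. The sketch below writes that pipeline out as data and then evaluates the same grouping over an in-memory array, so it runs without a server. The collection name `articles`, the `author` field, and the helper `groupCount` are assumptions for illustration, not MongoDB APIs.

```javascript
// Pipeline one might pass to db.articles.aggregate(...) in the mongo shell.
// The collection and field names are hypothetical.
const pipeline = [
  { $group: { _id: "$author", count: { $sum: 1 } } },
  { $sort: { count: -1 } },
];

// Minimal in-memory stand-in for the two stages above; not a MongoDB API.
function groupCount(docs, field) {
  const counts = new Map();
  for (const d of docs) counts.set(d[field], (counts.get(d[field]) || 0) + 1);
  return [...counts]
    .map(([_id, count]) => ({ _id, count }))
    .sort((a, b) => b.count - a.count);
}

const articles = [
  { title: "A", author: "ada" },
  { title: "B", author: "grace" },
  { title: "C", author: "ada" },
];
const byAuthor = groupCount(articles, "author");
console.log(byAuthor);
```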
MongoDB Data Operations Spectrum
• Retrieve Nothing – infinitely fast
• Document Retrieval – 1ms if in cache, ~10ms from spinning disk
• .find() – per-document cost similar to single document
– _id range
– any secondary index range, can be composite key
– intersect two indexes
– covered indexes even faster
• .count(), .distinct(), .group() – fast, may be covered
• .aggregate() – retrieval cost like find, plus pipeline operations
– $match
– $group
– $project
– $redact
• .mapReduce() – in-database JavaScript
• Hadoop Connector
– mongo.input.query for indexed partial scan
– full scan
(Operations above are listed from faster to slower.)
Using Spark
Topic Detection
• Grouping documents according to topics, especially over time
– Google News
• Latent Dirichlet Allocation
– Corpus of M documents, each of N words
W_ij is the word at position j in document i
– Documents have (latent) topic distributions θ_i for document i, with Dirichlet prior α
– Topics have word distributions φ_k for topic k, with Dirichlet prior β
Z_ij is the topic contributing to the word at position j in document i
– Remove stopwords!
• Tweets
– Large, terse corpus
– Highly sensitive to number of iterations
(10 returned little more than word distribution)
– Requires some iterative stopwording
"Smoothed LDA" by Slxu.public - Own work. Licensed under CC BY-SA 3.0 via Wikimedia Commons -
http://commons.wikimedia.org/wiki/File:Smoothed_LDA.png#/media/File:Smoothed_LDA.png
"Dirichlet distributions" by en:User:ThG - en:Image:Dirichlet_distributions.png. Licensed under Public
Domain via Wikimedia Commons -
http://commons.wikimedia.org/wiki/File:Dirichlet_distributions.png#/media/File:Dirichlet_distributions.png
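The LDA notation above can be made concrete with a toy collapsed Gibbs sampler, one common way to fit the model. The talk does not say which inference method was used, and Spark MLlib ships its own optimizers, which this sketch does not reproduce. The documents, stopword list, hyperparameter values, and every function name below are assumptions for illustration.

```javascript
// Toy collapsed Gibbs sampler for LDA (illustrative only; names and data are hypothetical).
const stopwords = new Set(["the", "a", "is", "rt", "to", "and"]);
const docs = [
  "the spark cluster is fast",
  "spark and hadoop cluster",
  "rt love the movie",
  "a movie to love",
].map((d) => d.split(" ").filter((w) => !stopwords.has(w))); // remove stopwords first

// Small seeded pseudo-random generator so runs are repeatable.
function makeRandom(seed) {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

function lda(docs, K, iterations, alpha, beta, seed) {
  const rand = makeRandom(seed);
  const vocab = [...new Set(docs.flat())];
  const V = vocab.length;
  const wordId = new Map(vocab.map((w, i) => [w, i]));
  const nDocTopic = docs.map(() => new Array(K).fill(0)); // tokens in doc d assigned to topic k
  const nTopicWord = Array.from({ length: K }, () => new Array(V).fill(0)); // word w under topic k
  const nTopic = new Array(K).fill(0); // all tokens assigned to topic k
  // z[d][j]: topic of the word at position j in document d, initialized at random.
  const z = docs.map((doc, d) => doc.map((w) => {
    const k = Math.floor(rand() * K);
    nDocTopic[d][k]++; nTopicWord[k][wordId.get(w)]++; nTopic[k]++;
    return k;
  }));
  for (let it = 0; it < iterations; it++) {
    docs.forEach((doc, d) => doc.forEach((w, j) => {
      const wi = wordId.get(w);
      let k = z[d][j];
      // Remove the current assignment before resampling it.
      nDocTopic[d][k]--; nTopicWord[k][wi]--; nTopic[k]--;
      // p(z = t | rest) is proportional to (n_dt + alpha) * (n_tw + beta) / (n_t + V * beta).
      const weights = nTopic.map((_, t) =>
        (nDocTopic[d][t] + alpha) * (nTopicWord[t][wi] + beta) / (nTopic[t] + V * beta));
      let r = rand() * weights.reduce((s, x) => s + x, 0);
      k = 0;
      while (k < K - 1 && r >= weights[k]) { r -= weights[k]; k++; }
      z[d][j] = k;
      nDocTopic[d][k]++; nTopicWord[k][wi]++; nTopic[k]++;
    }));
  }
  // theta[d]: smoothed topic distribution for document d.
  const theta = nDocTopic.map((row, d) => row.map((n) => (n + alpha) / (docs[d].length + K * alpha)));
  return { vocab, z, theta, nTopicWord };
}

const fit = lda(docs, 2, 200, 0.5, 0.1, 42);
console.log(fit.theta.map((row) => row.map((p) => p.toFixed(2))));
```

The number of sweeps is the `iterations` argument, the knob the slide describes as highly sensitive for tweets. The stopword filter at the top is where iterative stopwording would feed back in: inspect the topics, extend the list, and refit.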
*
*           Form  C := alpha*A**H*B + beta*C.
*
              DO 120 J = 1,N
                  DO 110 I = 1,M
                      TEMP = ZERO
                      DO 100 L = 1,K
                          TEMP = TEMP + CONJG(A(L,I))*B(L,J)
  100                 CONTINUE
                      IF (BETA.EQ.ZERO) THEN
                          C(I,J) = ALPHA*TEMP
                      ELSE
                          C(I,J) = ALPHA*TEMP + BETA*C(I,J)
                      END IF
  110             CONTINUE
  120         CONTINUE
          ELSE
*
*           Form  C := alpha*A**T*B + beta*C
*
              DO 150 J = 1,N
                  DO 140 I = 1,M
                      TEMP = ZERO
                      DO 130 L = 1,K
                          TEMP = TEMP + A(L,I)*B(L,J)
  130                 CONTINUE
                      IF (BETA.EQ.ZERO) THEN
                          C(I,J) = ALPHA*TEMP
                      ELSE
                          C(I,J) = ALPHA*TEMP + BETA*C(I,J)
                      END IF
  140             CONTINUE
  150         CONTINUE
          END IF
      ELSE IF (NOTA) THEN
          IF (CONJB) THEN
*
*           Form  C := alpha*A*B**H + beta*C.
*
              DO 200 J = 1,N
                  IF (BETA.EQ.ZERO) THEN
                      DO 160 I = 1,M
                          C(I,J) = ZERO
  160                 CONTINUE
                  ELSE IF (BETA.NE.ONE) THEN
                      DO 170 I = 1,M
                          C(I,J) = BETA*C(I,J)
  170                 CONTINUE
                  END IF
                  DO 190 L = 1,K
                      IF (B(J,L).NE.ZERO) THEN
                          TEMP = ALPHA*CONJG(B(J,L))
                          DO 180 I = 1,M
                              C(I,J) = C(I,J) + TEMP*A(I,L)
  180                     CONTINUE
                      END IF
  190             CONTINUE
  200         CONTINUE
          ELSE
*
*           Form  C := alpha*A*B**T + beta*C
*
              DO 250 J = 1,N
                  IF (BETA.EQ.ZERO) THEN
                      DO 210 I = 1,M
                          C(I,J) = ZERO
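The Fortran fragment above appears to come from a BLAS general matrix-multiply routine for complex matrices. For readers who do not speak Fortran, the sketch below restates the `C := alpha*A**T*B + beta*C` branch for real-valued matrices stored as arrays of rows. The function name `gemmTransA` and the sample matrices are made up for illustration; this is a restatement of the loop structure, not a BLAS binding.

```javascript
// Real-valued sketch of the "C := alpha*A**T*B + beta*C" branch (illustrative only).
// a is K x M, b is K x N, c is M x N, all stored as arrays of rows.
function gemmTransA(alpha, a, b, beta, c) {
  const K = a.length, M = c.length, N = c[0].length;
  for (let j = 0; j < N; j++) {
    for (let i = 0; i < M; i++) {
      let temp = 0;
      // Dot product of column i of A with column j of B.
      for (let l = 0; l < K; l++) temp += a[l][i] * b[l][j];
      // As in the Fortran, C is not read when beta is zero.
      c[i][j] = beta === 0 ? alpha * temp : alpha * temp + beta * c[i][j];
    }
  }
  return c;
}

const matA = [[1, 2], [3, 4]]; // K = 2, M = 2
const matB = [[5, 6], [7, 8]]; // K = 2, N = 2
const matC = [[1, 1], [1, 1]];
const result = gemmTransA(2, matA, matB, 1, matC);
console.log(result); // [[53, 61], [77, 89]]
```

The special case for `beta === 0` mirrors the original: it lets callers pass an uninitialized C without stale values leaking into the result.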
Create the Resilient Distributed Dataset (RDD)
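// General form: sc.newAPIHadoopRDD(config, inputFormatClass, keyClass, valueClass)
import org.apache.hadoop.conf.Configuration
import org.bson.BSONObject
import com.mongodb.hadoop.MongoInputFormat

// Hadoop Configuration telling the connector where, and what, to read.
val config = new Configuration()
config.set(
  "mongo.input.uri", "mongodb://127.0.0.1:27017/marketdata.minbars")
// Read only documents whose _id is greater than the given date.
config.set(
  "mongo.input.query", """{"_id":{"$gt":{"$date":1182470400000}}}""")
config.set(
  "mongo.output.uri",
  "mongodb://127.0.0.1:27017/marketdata.fiveminutebars")

// sc is the active SparkContext; keys are Object, values are BSONObject.
val minBarRawRDD = sc.newAPIHadoopRDD(
  config,
  classOf[MongoInputFormat],
  classOf[Object],
  classOf[BSONObject])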
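import org.bson.BasicBSONObject

// groupBars (built earlier, not shown on the slide) holds groups of one-minute bars.
// Each group becomes one (key, document) pair: the first bar's fields, with
// Close taken from the last bar, High/Low as the extremes, and Volume summed.
val fiveMinBars = groupBars.map(
  g => (
    g.head.get("_id"),
    new BasicBSONObject(g.head.toMap()).
      append("Close", g.last.get("Close") ).
      append("High", g.map(b => b.get("High").toString.toFloat).reduceLeft(math.max) ).
      append("Low", g.map(b => b.get("Low").toString.toFloat).reduceLeft(math.min) ).
      append("Volume", g.map(b => b.get("Volume").toString.toInt).foldLeft(0)(_ + _) )
  )
)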
Operate through Spark on the RDD Object
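The Scala above rolls each group of one-minute bars into a single five-minute bar. The sketch below shows the same roll-up over plain JavaScript objects, including one possible way to form the groups, which the slide does not show. The sample bars, the `Timestamp` field, and the rule of bucketing by five-minute window are assumptions for illustration.

```javascript
// Roll one-minute bars up into five-minute bars (illustrative only).
// Field names mirror the slide; the sample data and grouping rule are hypothetical.
const FIVE_MINUTES_MS = 5 * 60 * 1000;

function toFiveMinuteBars(minuteBars) {
  // Group bars by the five-minute window that contains their timestamp.
  const groups = new Map();
  for (const bar of minuteBars) {
    const bucket = Math.floor(bar.Timestamp / FIVE_MINUTES_MS);
    if (!groups.has(bucket)) groups.set(bucket, []);
    groups.get(bucket).push(bar);
  }
  return [...groups.values()].map((g) => {
    const sorted = [...g].sort((x, y) => x.Timestamp - y.Timestamp);
    const head = sorted[0];
    const last = sorted[sorted.length - 1];
    return {
      ...head,                                        // keeps Open and Timestamp of the first bar
      Close: last.Close,                              // close of the last bar
      High: Math.max(...sorted.map((b) => b.High)),   // highest high in the window
      Low: Math.min(...sorted.map((b) => b.Low)),     // lowest low in the window
      Volume: sorted.reduce((sum, b) => sum + b.Volume, 0),
    };
  });
}

const oneMinuteMs = 60 * 1000;
const minuteBars = [0, 1, 2, 3, 4, 5].map((i) => ({
  Timestamp: i * oneMinuteMs,
  Open: 10 + i, High: 11 + i, Low: 9 + i, Close: 10.5 + i, Volume: 100,
}));
const fiveMinuteBars = toFiveMinuteBars(minuteBars);
console.log(fiveMinuteBars);
```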
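import org.apache.hadoop.conf.Configuration
import com.mongodb.hadoop.MongoOutputFormat

// Create a separate Configuration for saving data back to MongoDB.
val outputConfig = new Configuration()
outputConfig.set("mongo.output.format", "com.mongodb.hadoop.MongoOutputFormat")
// mongoPort is defined elsewhere (not shown); presumably a host:port string.
outputConfig.set("mongo.output.uri", "mongodb://"
  + mongoPort
  + "/marketdata.fiveminutebars")
// The path argument is a placeholder; the output URI above decides where data goes.
fiveMinBars.saveAsNewAPIHadoopFile(
  "file:///dummy",
  classOf[Any],
  classOf[Any],
  classOf[MongoOutputFormat[_,_]],
  outputConfig)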
Put It Back Where You Found It
Big Data Analytics 3: Machine Learning to Engage the Customer, with Apache Spark, IBM Watson, and MongoDB
