Big Data Analytics 3: Machine Learning to Engage the Customer, with Apache Spark, IBM Watson, and MongoDB

Machine Learning
to Engage the Customer
Chris.Biow@mongodb.com
@chris_biow

Trigger Warning
This presentation, and materials to
which it links, contains triggers.
These will be triggering reactive,
asynchronous, and message-driven
environments.
A safe room is available in Empire
West, where Alan Viars is
presenting Modernizing National
Health Care.

Objectionable Content
Language Impurities
• Basic Linear Algebra
Subprograms (BLAS - Fortran)
• Node-RED visual programming
• Node.js
• Scala, with Perlish accent
(Ehrmegerd, nerl perlish!)
• Java, C++, Prolog
• Twitter: unfiltered, live feed
• Machine recommendations
• Degenerate cases

Node-RED
Twitter with Watson Resonance

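db.tweets.aggregate([
  // Bucket tweets by hour and minute; collect each tweet's sentiment.positive value
  {$group: {
    _id: {
      hour: {$hour: "$date"},
      minute: {$minute: "$date"}
    },
    total: {$sum: "$sentiment.score"},
    average: {$avg: "$sentiment.score"},
    count: {$sum: 1},
    happyTalk: {$push: "$sentiment.positive"}
  }},
  // Two unwinds on purpose: if sentiment.positive is an array, happyTalk is an
  // array of arrays, and the second $unwind reaches the individual entries
  {$unwind: "$happyTalk"},
  {$unwind: "$happyTalk"},
  // Regroup per minute, keeping the statistics and de-duplicating happyTalk
  {$group: {
    _id: "$_id",
    total: {$first: "$total"},
    average: {$first: "$average"},
    count: {$first: "$count"},
    happyTalk: {$addToSet: "$happyTalk"}
  }},
  // Most recent minute first
  {$sort: {_id: -1} }
])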
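For readers without a MongoDB instance at hand, the pipeline above can be approximated in plain Python. This is only a sketch: the sample tweets are invented, and it assumes `sentiment.positive` is an array of words, which the double `$unwind` suggests but the slides do not state. Like `$unwind`, it drops a minute bucket that ends up with no positive words.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical tweets carrying only the fields the pipeline reads.
tweets = [
    {"date": datetime(2015, 6, 1, 14, 5, 10),
     "sentiment": {"score": 2, "positive": ["great", "love"]}},
    {"date": datetime(2015, 6, 1, 14, 5, 40),
     "sentiment": {"score": -1, "positive": []}},
    {"date": datetime(2015, 6, 1, 14, 6, 3),
     "sentiment": {"score": 3, "positive": ["love", "win"]}},
]


def summarize(docs):
    """Per (hour, minute): total, average, count, and distinct positive words."""
    groups = defaultdict(lambda: {"total": 0, "count": 0, "words": set()})
    for doc in docs:
        key = (doc["date"].hour, doc["date"].minute)
        group = groups[key]
        group["total"] += doc["sentiment"]["score"]
        group["count"] += 1
        # Equivalent of $push, two $unwind stages, then $addToSet.
        group["words"].update(doc["sentiment"]["positive"])

    rows = []
    # Equivalent of {$sort: {_id: -1}}: hour descending, then minute descending.
    for key in sorted(groups, reverse=True):
        group = groups[key]
        if not group["words"]:
            continue  # $unwind emits nothing for an empty array
        rows.append({
            "_id": {"hour": key[0], "minute": key[1]},
            "total": group["total"],
            "average": group["total"] / group["count"],
            "count": group["count"],
            "happyTalk": sorted(group["words"]),
        })
    return rows


for row in summarize(tweets):
    print(row)
```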

Machine Learning
• What: depends who you ask
– learning that is done by machines [my lab partner]
– algorithms that can learn from and make predictions on data [Wikipedia, just now]
– induction and … other algorithms that can be said to “learn” [Kohavi 1998 goo.gl/WvEmNJ]
– whatever the heck we’re selling [cloud vendors]
– common cognitive framework, ingests content, observe, interpret, evaluate, decide [IBM Watson]
– predictive analytics [Microsoft Azure, AWS]
– algorithmic grab-bag [Mahout, MLlib]
• Why: depends what you want
– Engagement, discovery, decision [Watson]
– Prediction: maintenance, demand, resource allocation [Azure]
– Analytics: fraud, personalization, marketing, churn, support [AWS]

Apache Mahout Samsara
• Architectures: standalone, MapReduce, Spark, H2O
• Languages: DSL shell, Java
• Functions
– Collaborative filtering
– Classification
– Clustering
– Dimensionality reduction
– Topic models
– Miscellany
Example: Create topic grouping for Wikipedia articles

Spark MLlib
• Languages: Scala, Java, Python
• Clusters: EC2, YARN, Mesos, standalone
• Linear algebra: Scala Breeze / Fortran BLAS
• Data: vector, point, matrix
• Functions
– Basic stats
– Classification and regression
– Collaborative Filtering
– Clustering
– Dimensionality reduction (remove variables)
– Feature extraction & transformation
– Frequent pattern mining
– Optimization (local min/max)
Example: interactive drill-down categories for large result set

The Magic of Alternating Least Squares
Latent Factoring
Which is the real me?
Movies recommended for you:
1: The Sound of Music (1965)
2: Snow White and the Seven Dwarfs (1937)
3: Beauty and the Beast (1991)
4: Charlie Brown Christmas, A (1965)
5: Bambi (1942)
6: Seven Brides for Seven Brothers (1954)
7: Mary Poppins (1964)
8: Pinocchio (1940)
9: Gone with the Wind (1939)
10: The Wizard of Oz (1939)
Movies recommended for you:
1: Maradona by Kusturica (2008)
2: Shadows of Forgotten Ancestors (1964)
3: Rosario Tijeras (2005)
4: Constantine's Sword (2007)
5: Titicut Follies (1967)
6: Lady Chatterley (2006)
7: August Evening (2007)
8: Power of Nightmares: The Rise of the Politics of Fear, The (2004)
9: Sun Alley (Sonnenallee) (1999)
10: Who's Singin' Over There? (a.k.a. Who Sings Over There) (Ko to tamo peva) (1980)
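
The idea behind alternating least squares can be shown at toy scale. The sketch below is not the MLlib implementation: it fits a single latent factor per user and per movie (rank 1) in plain Python, and the function name, ratings, and parameter values are all made up for illustration. With the movie factors held fixed, each user factor has a closed-form regularized least-squares solution, and vice versa, so the two updates simply alternate.

```python
def als_rank1(ratings, n_users, n_items, reg=0.01, iterations=30):
    """Rank-1 ALS. ratings maps (user, item) -> rating; returns (user, item) factors."""
    user_f = [1.0] * n_users
    item_f = [1.0] * n_items
    for _ in range(iterations):
        # Hold item factors fixed; solve each user's one-dimensional least squares.
        for u in range(n_users):
            seen = [(i, r) for (uu, i), r in ratings.items() if uu == u]
            num = sum(r * item_f[i] for i, r in seen)
            den = reg + sum(item_f[i] ** 2 for i, _ in seen)
            user_f[u] = num / den
        # Hold user factors fixed; solve each item's one-dimensional least squares.
        for i in range(n_items):
            seen = [(u, r) for (u, ii), r in ratings.items() if ii == i]
            num = sum(r * user_f[u] for u, r in seen)
            den = reg + sum(user_f[u] ** 2 for u, _ in seen)
            item_f[i] = num / den
    return user_f, item_f


# Hypothetical ratings: user 1 rates everything twice as high as user 0,
# and user 1 has not yet rated item 2.
ratings = {(0, 0): 1.0, (0, 1): 2.0, (0, 2): 3.0, (1, 0): 2.0, (1, 1): 4.0}
user_f, item_f = als_rank1(ratings, n_users=2, n_items=3)
print("predicted rating for user 1, item 2:", user_f[1] * item_f[2])
```

A recommender ranks the unseen items for a user by these predicted ratings, which is why a handful of unusual ratings can swing the list as sharply as the two examples above.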

Watson Developer Cloud
• Presented as services for Bluemix
• RESTful calls
• Node.js
• Node-RED
Example: Message resonance for
email solicitation

Microsoft Azure
• R and Python
• Flowchart GUI
• Correlation, modeling, trend projection, forecasting
• HDInsight cloud Hadoop
• Publishing for profit via Machine Learning Gallery
– Voice recognition
– Customer churn prediction
– Text extraction: sentiment and key phrase
– Contributor donation propensity
– Frequently bought together
– Classifier
– Clustering
– Linear regression
– … 35 total in market [goo.gl/LhMbUu]
Example: Retail forecasting

AWS
• Create models
• Generate predictions
• Data: S3, Redshift, RDS
• APIs: Java, .NET, Python, PHP, Node, Ruby
• Mobile SDK
• Use cases
– Fraud detection
– Content personalization
– Marketing propensity modeling
– Document classification
– Customer churn prediction
– Customer support solutions
Example: Marketing response prediction

MongoDB
• Next-gen database
– Document-model
– Scalable
– Highly-available
– Secondary indexes
• Agile with schema and query types
• Subsecond query response over multiple indexes
• Low-second aggregation framework for basic analytics
Example: Number of articles by author
• In-database mapReduce
• Hadoop connector
– Mongo[Input|Output]Format
– mongo.[input|output].uri or BSON
– mongo.input.query

MongoDB Data Operations Spectrum
• Retrieve Nothing – infinitely fast
• Document Retrieval – 1ms if in cache, ~10ms from spinning disk
• .find() – per-document cost similar to single document
– _id range
– any secondary index range, can be composite key
– intersect two indexes
– covered indexes even faster
• .count(), .distinct(), .group() – fast, may be covered
• .aggregate() – retrieval cost like find, plus pipeline operations
– $match
– $group
– $project
– $redact
• .mapReduce() – in-database JavaScript
• Hadoop Connector
– mongo.input.query for indexed partial scan
– full scan
Faster → Slower

Topic Detection
• Grouping documents according to topics, especially over time
– Google News
• Latent Dirichlet Allocation
– Corpus of M documents, each of N words
Wij is the word at position j in document i
– Documents have (latent) topic distributions: θi for document i, with Dirichlet prior α
– Topics have word distributions: φk for topic k, with Dirichlet prior β
Zij is the topic contributing to the word at position j in document i
– Remove stopwords!
• Tweets
– Large, terse corpus
– Highly sensitive to number of iterations
(10 returned little more than word distribution)
– Requires some iterative stopwording
"Smoothed LDA" by Slxu.public - Own work. Licensed under CC BY-SA 3.0 via Wikimedia Commons - http://commons.wikimedia.org/wiki/File:Smoothed_LDA.png
"Dirichlet distributions" by en:User:ThG - en:Image:Dirichlet_distributions.png. Licensed under Public Domain via Wikimedia Commons - http://commons.wikimedia.org/wiki/File:Dirichlet_distributions.png
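
The LDA notation on this slide is easiest to read as a recipe for generating documents. The sketch below follows that generative story in plain Python; it is not a topic-inference algorithm and not what Mahout or MLlib run. The vocabulary, the parameter values, and the function names are invented for illustration.

```python
import random


def sample_dirichlet(alphas, rng):
    """Draw one point from a Dirichlet distribution by normalizing Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]


def generate_corpus(M, N, K, vocab, alpha, beta, seed=0):
    """Generate M documents of N words from K topics, following the LDA model."""
    rng = random.Random(seed)
    # phi[k]: word distribution of topic k, drawn from a Dirichlet with parameter beta.
    phi = [sample_dirichlet([beta] * len(vocab), rng) for _ in range(K)]
    docs, thetas = [], []
    for i in range(M):
        # theta: topic distribution of document i, drawn with parameter alpha.
        theta = sample_dirichlet([alpha] * K, rng)
        words = []
        for j in range(N):
            z_ij = rng.choices(range(K), weights=theta)[0]   # topic for this position
            w_ij = rng.choices(vocab, weights=phi[z_ij])[0]  # word drawn from that topic
            words.append(w_ij)
        docs.append(words)
        thetas.append(theta)
    return docs, thetas, phi


# Hypothetical vocabulary and settings.
vocab = ["spark", "mongo", "tweet", "movie", "rating", "topic"]
docs, thetas, phi = generate_corpus(M=3, N=8, K=2, vocab=vocab, alpha=0.5, beta=0.5)
for words, theta in zip(docs, thetas):
    print([round(t, 2) for t in theta], " ".join(words))
```

Topic detection runs this story in reverse: given only the words, it estimates the θ and φ that most plausibly produced them, which is why stopwords and iteration counts matter so much on short texts such as tweets.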

*
* Form C := alpha*A**H*B + beta*C.
*
DO 120 J = 1,N
DO 110 I = 1,M
TEMP = ZERO
DO 100 L = 1,K
TEMP = TEMP + CONJG(A(L,I))*B(L,J)
100 CONTINUE
IF (BETA.EQ.ZERO) THEN
C(I,J) = ALPHA*TEMP
ELSE
C(I,J) = ALPHA*TEMP + BETA*C(I,J)
END IF
110 CONTINUE
120 CONTINUE
ELSE
*
* Form C := alpha*A**T*B + beta*C
*
DO 150 J = 1,N
DO 140 I = 1,M
TEMP = ZERO
DO 130 L = 1,K
TEMP = TEMP + A(L,I)*B(L,J)
130 CONTINUE
IF (BETA.EQ.ZERO) THEN
C(I,J) = ALPHA*TEMP
ELSE
C(I,J) = ALPHA*TEMP + BETA*C(I,J)
END IF
140 CONTINUE
150 CONTINUE
END IF
ELSE IF (NOTA) THEN
IF (CONJB) THEN
*
* Form C := alpha*A*B**H + beta*C.
*
DO 200 J = 1,N
IF (BETA.EQ.ZERO) THEN
DO 160 I = 1,M
C(I,J) = ZERO
160 CONTINUE
ELSE IF (BETA.NE.ONE) THEN
DO 170 I = 1,M
C(I,J) = BETA*C(I,J)
170 CONTINUE
END IF
DO 190 L = 1,K
IF (B(J,L).NE.ZERO) THEN
TEMP = ALPHA*CONJG(B(J,L))
DO 180 I = 1,M
C(I,J) = C(I,J) + TEMP*A(I,L)
180 CONTINUE
END IF
190 CONTINUE
200 CONTINUE
ELSE
*
* Form C := alpha*A*B**T + beta*C
*
DO 250 J = 1,N
DO 210 I = 1,M
C(I,J) = ZERO
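
For readers who do not speak Fortran, the first loop nest in the excerpt above (the one commented "Form C := alpha*A**H*B + beta*C") can be restated in a few lines of Python. This is a readability sketch with an invented function name, not a replacement for BLAS: it keeps the same loop order and the same special case for a zero beta, but uses nested lists and returns a new matrix instead of updating C in place.

```python
def gemm_conj_transpose_a(alpha, a, b, beta, c):
    """Return alpha * A^H * B + beta * C.

    a is K x M, b is K x N, c is M x N, each a list of rows of complex numbers.
    """
    k, m, n = len(a), len(a[0]), len(b[0])
    result = [[0j] * n for _ in range(m)]
    for j in range(n):
        for i in range(m):
            temp = 0j
            for l in range(k):
                # CONJG(A(L,I)) * B(L,J) in the Fortran excerpt
                temp += a[l][i].conjugate() * b[l][j]
            if beta == 0:
                # When beta is zero, C is never read.
                result[i][j] = alpha * temp
            else:
                result[i][j] = alpha * temp + beta * c[i][j]
    return result


# Small hypothetical inputs: A is 2 x 2, B and C are 2 x 1.
a = [[1 + 1j, 2], [0, 1j]]
b = [[1], [1]]
c = [[1], [1]]
print(gemm_conj_transpose_a(2, a, b, 1, c))
```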

Create the Resilient Distributed Dataset (RDD)
// General form:
//   rdd = sc.newAPIHadoopRDD(config, MongoInputFormat.class, Object.class, BSONObject.class)
import org.apache.hadoop.conf.Configuration
import org.bson.BSONObject

val config = new Configuration()
config.set(
  "mongo.input.uri", "mongodb://127.0.0.1:27017/marketdata.minbars")
// Read only documents whose _id is later than the given date (milliseconds since the epoch)
config.set(
  "mongo.input.query", """{"_id":{"$gt":{"$date":1182470400000}}}""")
config.set(
  "mongo.output.uri",
  "mongodb://127.0.0.1:27017/marketdata.fiveminutebars")
val minBarRawRDD = sc.newAPIHadoopRDD(
  config,
  classOf[com.mongodb.hadoop.MongoInputFormat],
  classOf[Object],
  classOf[BSONObject])

Operate through Spark on the RDD Object
val fiveMinBars = groupBars.map(
  g => (
    g.head.get("_id"),
    new BasicBSONObject(g.head.toMap()).
      append("Close", g.last.get("Close") ).
      append("High", g.map(b => b.get("High").toString.toFloat).reduceLeft(math.max) ).
      append("Low", g.map(b => b.get("Low").toString.toFloat).reduceLeft(math.min) ).
      append("Volume", g.map(b => b.get("Volume").toString.toInt).foldLeft(0)(_ + _) )
  )
)

Put It Back Where You Found It
// Create a separate Configuration for saving data back to MongoDB.
val outputConfig = new Configuration()
outputConfig.set("mongo.output.format", "com.mongodb.hadoop.MongoOutputFormat")
outputConfig.set("mongo.output.uri", "mongodb://"
  + mongoPort
  + "/marketdata.fiveminutebars")
fiveMinBars.saveAsNewAPIHadoopFile(
  "file:///dummy",
  classOf[Any],
  classOf[Any],
  classOf[MongoOutputFormat[_,_]],
  outputConfig)
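
The roll-up performed inside the Spark map step can be tried without a cluster. The Python sketch below mirrors that logic for one group of one-minute bars: start from the first bar, take the close from the last bar, the extremes of high and low, and the summed volume. The sample bars, the `Open` field, and the function name are assumptions for illustration; how the slides build `groupBars` is not shown, so the group is simply passed in.

```python
def five_minute_bar(group):
    """Collapse a time-ordered list of one-minute bars into (key, five-minute bar)."""
    bar = dict(group[0])  # copy the first bar: keeps its _id and any other fields
    bar["Close"] = group[-1]["Close"]
    bar["High"] = max(float(b["High"]) for b in group)
    bar["Low"] = min(float(b["Low"]) for b in group)
    bar["Volume"] = sum(int(b["Volume"]) for b in group)
    return bar["_id"], bar


# Hypothetical one-minute bars for a single five-minute window.
minute_bars = [
    {"_id": "09:30", "Open": 10.0, "High": 10.5, "Low": 9.9, "Close": 10.2, "Volume": 100},
    {"_id": "09:31", "Open": 10.2, "High": 10.9, "Low": 10.1, "Close": 10.8, "Volume": 250},
    {"_id": "09:32", "Open": 10.8, "High": 10.8, "Low": 9.8, "Close": 10.0, "Volume": 75},
]
key, bar = five_minute_bar(minute_bars)
print(key, bar)
```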

