Elasticsearch & Lucene for
Apache Spark and MLlib
Costin Leau (@costinl)
Mirror, mirror on the wall,
what’s the happiest team of
us all ?
Britta Weber
- Rough translation from German by yours truly -
Purpose of the talk
Improve ML pipelines through IR
Text processing
• Analysis
• Featurize/Vectorize *
* In research / poc / WIP / Experimental phase
Technical Debt
"Machine Learning: The High Interest Credit Card of Technical Debt", Sculley et al.
http://research.google.com/pubs/pub43146.html
Challenge: What team at Elastic is most happy?
Data: Hipchat messages
Training / Test data: http://www.sentiment140.com
Result: Kibana dashboard
ML Pipeline
Chat data → Sentiment Model
Production Data → Apply the rule → Predict the 'class' (☺ / ☹)
Data is King
Example: Word2Vec
Input snippet
http://spark.apache.org/docs/latest/mllib-feature-extraction.html#example
it was introduced into mathematics in the book
disquisitiones arithmeticae by carl friedrich gauss in
one eight zero one ever since however modulo has gained
many meanings some exact and some imprecise
Real data is messy
originally looked like this:
https://en.wikipedia.org/wiki/Modulo_(jargon)
It was introduced into <a
href="https://en.wikipedia.org/wiki/Mathematics"
title="Mathematics">mathematics</a> in the book <i><a
href="https://en.wikipedia.org/wiki/Disquisitiones_Arithmeticae"
title="Disquisitiones Arithmeticae">Disquisitiones Arithmeticae</a></i>
by <a href="https://en.wikipedia.org/wiki/Carl_Friedrich_Gauss"
title="Carl Friedrich Gauss">Carl Friedrich Gauss</a> in 1801. Ever
since, however, "modulo" has gained many meanings, some exact and some
imprecise.
Feature extraction: cleaning up data
"huuuuuuunnnnnnngrrryyy",
"aaaaaamaaazinggggg",
"aaaaaamazing",
"aaaaaammm",
"aaaaaammmazzzingggg",
"aaaaaamy",
"aaaaaan",
"aaaaaand",
"aaaaaannnnnnddd",
"aaaaaanyways"
Does it help to clean that up?
see “Twitter Sentiment Classification using Distant Supervision”, Go et al.
http://www-cs.stanford.edu/people/alecmgo/papers/TwitterDistantSupervision09.pdf
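Go et al. normalize such elongated tokens by collapsing any letter repeated more than twice down to two occurrences, so "huuuuungry" and "huungry" map to the same token. A minimal sketch of that normalization in plain Java (class and method names are illustrative, not from any library):

```java
import java.util.regex.Pattern;

// Sketch of the character-repeat normalization described in Go et al.:
// any character repeated more than twice is collapsed to exactly two
// occurrences, shrinking the noisy vocabulary above considerably.
public class RepeatCollapser {
    private static final Pattern REPEATS = Pattern.compile("(.)\\1{2,}");

    public static String collapse(String token) {
        return REPEATS.matcher(token).replaceAll("$1$1");
    }

    public static void main(String[] args) {
        System.out.println(collapse("huuuuuuunnnnnnngrrryyy")); // huunngrryy
        System.out.println(collapse("aaaaaamazing"));           // aamazing
    }
}
```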
Language matters
读书须用意,一字值千金 ("Read with intent; a single word is worth a thousand gold")
Lucene to the rescue!
High-performance, full-featured text search library
15 years of experience
Widely recognized for its utility
• It’s a primary test bed for new JVM versions
Text processing
Character Filter → Tokenizer → Token Filter → Token Filter → Token Filter
Input: Do <b>Johnny Depp</b> a favor and forget you…
After the character filter and tokenizer: Do (pos: 1), Johnny (pos: 2), …
After the token filters: do (pos: 1), johnny (pos: 2), …
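The three stages can be sketched without Lucene at all (hypothetical helper class, not the Lucene API): a character filter strips markup, a tokenizer splits on whitespace, and a token filter lowercases each token while positions increment.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal sketch of the analysis stages above (not the Lucene API):
// character filter -> tokenizer -> token filter, with 1-based positions.
public class AnalysisChain {
    public static List<String> analyze(String input) {
        String stripped = input.replaceAll("<[^>]+>", "");   // character filter: drop tags
        String[] tokens = stripped.trim().split("\\s+");     // tokenizer: split on whitespace
        List<String> out = new ArrayList<>();
        int pos = 0;
        for (String t : tokens) {                            // token filter: lowercase
            pos++;
            out.add(t.toLowerCase() + " (pos: " + pos + ")");
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(analyze("Do <b>Johnny Depp</b> a favor"));
    }
}
```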
Lucene for text analysis
State-of-the-art text processing
Many extensions available for different languages, use cases, …
however…
…
import org.apache.lucene.analysis…
…
Analyzer a = new Analyzer() {
@Override
protected TokenStreamComponents createComponents(String fieldName) {
Tokenizer tokenizer = new StandardTokenizer();
return new TokenStreamComponents(tokenizer, tokenizer);
}
@Override
protected Reader initReader(String fieldName, Reader reader) {
return new HTMLStripCharFilter(reader);
}
};
TokenStream stream = a.tokenStream(null, "<a href=...>some text</a>");
CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
PositionIncrementAttribute posIncrement = stream.addAttribute(PositionIncrementAttribute.class);
stream.reset();
int pos = 0;
while (stream.incrementToken()) {
pos += posIncrement.getPositionIncrement();
System.out.println(term.toString() + " " + pos);
}
> some 1
> text 2
How about a declarative approach?
Very quick intro to
Elasticsearch
Elasticsearch in 3’
Scalable, real-time search and analytics engine
Data distribution, cluster management
REST APIs
JVM based, uses Apache Lucene internally
Open-source (on Github, Apache 2 License)
Elasticsearch in 3’
Unstructured search
Sorting / Scoring
Pagination
Enrichment
Structured search
https://www.elastic.co/elasticon/2015/sf/unlocking-interplanetary-datasets-with-real-time-search
Machine Learning and Elasticsearch
Term Analysis (tf, idf, bm25)
Graph Analysis
Co-occurrence of Terms (significant terms)
• ChiSquare
Pearson correlation (#16817)
Regression (#17154)
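The chi-square heuristic behind significant terms can be sketched as a score over a 2x2 contingency table: how often a term appears inside a subset of documents versus everywhere else. This is a simplified form, not Elasticsearch's exact implementation:

```java
// Sketch of a chi-square score for significant-terms-style analysis.
// a = docs with the term in the subset, b = docs with the term elsewhere,
// c = other docs in the subset,       d = other docs elsewhere.
public class ChiSquareScore {
    public static double score(long a, long b, long c, long d) {
        double n = a + b + c + d;
        double num = n * Math.pow(a * d - b * c, 2);
        double den = (double) (a + b) * (c + d) * (a + c) * (b + d);
        return den == 0 ? 0 : num / den;
    }

    public static void main(String[] args) {
        // a term concentrated in the subset scores high...
        System.out.println(score(90, 10, 10, 90));
        // ...while an evenly spread term scores zero
        System.out.println(score(50, 50, 50, 50));
    }
}
```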
What about classification/clustering/ etc… ?
It’s not the matching data,
but the metadata that leads to it
How to use Elasticsearch
from Spark ?
Somebody on Stack Overflow
Elasticsearch for Apache Hadoop ™
Elasticsearch Spark – Native integration
Scala & Java API
Understands Scala & Java types
– Case classes
– Java Beans
Available as Spark package
Supports Spark Core & SQL
All 1.x versions (1.0-1.6)
Available for Scala 2.10 and 2.11
Elasticsearch as RDD / Dataset*
import org.elasticsearch.spark._

val sc = new SparkContext(new SparkConf())
val rdd = sc.esRDD("buckethead/albums", "?q=pikes")

import org.elasticsearch.spark._

case class Artist(name: String, albums: Int)
val u2 = Artist("U2", 13)
val bh = Map("name" -> "Buckethead", "albums" -> 255, "age" -> 46)
sc.makeRDD(Seq(u2, bh)).saveToEs("radio/artists")
Elasticsearch as a DataFrame
val df = sql.read.format("es").load("buckethead/albums")
df.filter(df("category").equalTo("pikes").and(df("year").geq(2015)))

{ "query" :
  { "bool" : { "must" : [
      { "match" : { "category" : "pikes" }}
    ],
    "filter" : [
      { "range" : { "year" : { "gte" : "2015" }}}
    ]
}}}
Partition to Partition Architecture
Putting the pieces together
Typical ML pipeline for text
Actual ML code
Pure Spark MLlib
val training = movieReviewsDataTrainingData

val tokenizer = new Tokenizer()
  .setInputCol("text")
  .setOutputCol("words")
val hashingTF = new HashingTF()
  .setNumFeatures(1000)
  .setInputCol(tokenizer.getOutputCol)
  .setOutputCol("features")
val lr = new LogisticRegression()
  .setMaxIter(10)
  .setRegParam(0.001)

val pipeline = new Pipeline()
  .setStages(Array(tokenizer, hashingTF, lr))
val model = pipeline.fit(training)
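What HashingTF does can be sketched as the hashing trick: each token is hashed into one of numFeatures buckets and the bucket counts form the (sparse) feature vector. String.hashCode is a stand-in here; Spark uses its own hash function:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the hashing trick behind HashingTF: no dictionary is kept,
// each word is hashed straight to a bucket index and counted there.
public class HashingTFSketch {
    public static Map<Integer, Integer> transform(String[] words, int numFeatures) {
        Map<Integer, Integer> vector = new HashMap<>();
        for (String w : words) {
            int idx = Math.floorMod(w.hashCode(), numFeatures); // bucket index
            vector.merge(idx, 1, Integer::sum);                 // term frequency
        }
        return vector;
    }

    public static void main(String[] args) {
        System.out.println(transform(new String[]{"good", "movie", "good"}, 1000));
    }
}
```

Hash collisions mean two distinct words can share a bucket, which is the price paid for a fixed-size, dictionary-free representation.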
Pure Spark MLlib
val analyzer = new ESAnalyzer()
  .setInputCol("text")
  .setOutputCol("words")
val hashingTF = new HashingTF()
  .setNumFeatures(1000)
  .setInputCol(analyzer.getOutputCol)
  .setOutputCol("features")
val lr = new LogisticRegression()
  .setMaxIter(10)
  .setRegParam(0.001)
Data movement
Work once – reuse multiple times
// index / analyze the data
training.saveToEs("movies/reviews")
Work once – reuse multiple times
// prepare the spec for vectorize – fast and lightweight
val spec = s"""{ "features" : [{
     |   "field": "text",
     |   "type" : "string",
     |   "tokens" : "all_terms",
     |   "number" : "occurrence",
     |   "min_doc_freq" : 2000
     | }],
     | "sparse" : "true"}""".stripMargin

ML.prepareSpec(spec, "my-spec")
Access the vector directly
// get the features – just another query
val payload = s"""{"script_fields" : { "vector" :
     |   { "script" : { "id" : "my-spec", "lang" : "doc_to_vector" } }
     | }}""".stripMargin

// read the analyzed data
val vectorRDD = sparkCtx.esRDD("ml/data", payload)

// feed the vector to the pipeline
val vectorized = vectorRDD.map ( x =>
  // turn the class into a label and extract the vector
  (if (x._1 == "negative") 0.0d else 1.0d, ML.getVectorFrom(x._2))
).toDF("label", "features")
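The "occurrence" featurization the spec asks for can be sketched as follows (the dictionary is built once up front, server-side in the real system; the helper class here is hypothetical):

```java
import java.util.Arrays;
import java.util.List;

// Sketch of "occurrence" featurization over a fixed dictionary:
// each document becomes a 0/1 vector marking which dictionary terms appear.
public class OccurrenceVector {
    public static int[] vectorize(List<String> dictionary, String[] docTerms) {
        int[] vector = new int[dictionary.size()];
        for (String term : docTerms) {
            int idx = dictionary.indexOf(term);
            if (idx >= 0) vector[idx] = 1;  // occurrence, not frequency
        }
        return vector;
    }

    public static void main(String[] args) {
        List<String> dict = Arrays.asList("good", "bad", "movie");
        System.out.println(Arrays.toString(
            vectorize(dict, new String[]{"good", "good", "movie"})));  // [1, 0, 1]
    }
}
```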
Revised ML pipeline
val vectorized = vectorRDD.map...

val lr = new LogisticRegression()
  .setMaxIter(10)
  .setRegParam(0.001)
val model = lr.fit(vectorized)
Simplify ML pipeline
Once per dataset, regardless of the number of pipelines
Raw data is no longer required
Need to adjust the model? Change the spec
val spec = s"""{ "features" : [{
     |   "field": "text",
     |   "type" : "string",
     |   "tokens" : "given",
     |   "number" : "tf",
     |   "terms": ["term1", "term2", ...]
     | }],
     | "sparse" : "true"}""".stripMargin

ML.prepareSpec(spec)
All this is WIP
Not all features available (currently dictionary, vectors)
Works with data outside or inside Elasticsearch (latter is much faster)
Bind vectors to queries
Other topics WIP:
Focused on document / text classification – numeric support is next
Model importing / exporting – Spark 2.0 ML persistence
Feedback highly sought - Is this useful?
THANK YOU.
j.mp/spark-summit-west-16
elastic.co/hadoop
github.com/elastic | costin | brwe
discuss.elastic.co
@costinl

More Related Content

What's hot (20)

Introduction to Spark with Scala
Introduction to Spark with Scala
Himanshu Gupta
 
Large-Scale Data Science in Apache Spark 2.0
Large-Scale Data Science in Apache Spark 2.0
Databricks
 
Scalable Data Science with SparkR: Spark Summit East talk by Felix Cheung
Scalable Data Science with SparkR: Spark Summit East talk by Felix Cheung
Spark Summit
 
Spark Summit EU talk by Nimbus Goehausen
Spark Summit EU talk by Nimbus Goehausen
Spark Summit
 
Introduction to Apache Spark
Introduction to Apache Spark
Mohamed hedi Abidi
 
Apache Spark Tutorial
Apache Spark Tutorial
Ahmet Bulut
 
Sparkly Notebook: Interactive Analysis and Visualization with Spark
Sparkly Notebook: Interactive Analysis and Visualization with Spark
felixcss
 
Frustration-Reduced PySpark: Data engineering with DataFrames
Frustration-Reduced PySpark: Data engineering with DataFrames
Ilya Ganelin
 
Spark stream - Kafka
Spark stream - Kafka
Dori Waldman
 
Monitoring the Dynamic Resource Usage of Scala and Python Spark Jobs in Yarn:...
Monitoring the Dynamic Resource Usage of Scala and Python Spark Jobs in Yarn:...
Spark Summit
 
Spark on YARN
Spark on YARN
Adarsh Pannu
 
Keeping Spark on Track: Productionizing Spark for ETL
Keeping Spark on Track: Productionizing Spark for ETL
Databricks
 
Tuning and Monitoring Deep Learning on Apache Spark
Tuning and Monitoring Deep Learning on Apache Spark
Databricks
 
A Journey into Databricks' Pipelines: Journey and Lessons Learned
A Journey into Databricks' Pipelines: Journey and Lessons Learned
Databricks
 
Beyond shuffling global big data tech conference 2015 sj
Beyond shuffling global big data tech conference 2015 sj
Holden Karau
 
Easy, scalable, fault tolerant stream processing with structured streaming - ...
Easy, scalable, fault tolerant stream processing with structured streaming - ...
Databricks
 
Extending Spark With Java Agent (handout)
Extending Spark With Java Agent (handout)
Jaroslav Bachorik
 
Apache Spark Introduction - CloudxLab
Apache Spark Introduction - CloudxLab
Abhinav Singh
 
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Spark Summit
 
Opaque: A Data Analytics Platform with Strong Security: Spark Summit East tal...
Opaque: A Data Analytics Platform with Strong Security: Spark Summit East tal...
Spark Summit
 
Introduction to Spark with Scala
Introduction to Spark with Scala
Himanshu Gupta
 
Large-Scale Data Science in Apache Spark 2.0
Large-Scale Data Science in Apache Spark 2.0
Databricks
 
Scalable Data Science with SparkR: Spark Summit East talk by Felix Cheung
Scalable Data Science with SparkR: Spark Summit East talk by Felix Cheung
Spark Summit
 
Spark Summit EU talk by Nimbus Goehausen
Spark Summit EU talk by Nimbus Goehausen
Spark Summit
 
Apache Spark Tutorial
Apache Spark Tutorial
Ahmet Bulut
 
Sparkly Notebook: Interactive Analysis and Visualization with Spark
Sparkly Notebook: Interactive Analysis and Visualization with Spark
felixcss
 
Frustration-Reduced PySpark: Data engineering with DataFrames
Frustration-Reduced PySpark: Data engineering with DataFrames
Ilya Ganelin
 
Spark stream - Kafka
Spark stream - Kafka
Dori Waldman
 
Monitoring the Dynamic Resource Usage of Scala and Python Spark Jobs in Yarn:...
Monitoring the Dynamic Resource Usage of Scala and Python Spark Jobs in Yarn:...
Spark Summit
 
Keeping Spark on Track: Productionizing Spark for ETL
Keeping Spark on Track: Productionizing Spark for ETL
Databricks
 
Tuning and Monitoring Deep Learning on Apache Spark
Tuning and Monitoring Deep Learning on Apache Spark
Databricks
 
A Journey into Databricks' Pipelines: Journey and Lessons Learned
A Journey into Databricks' Pipelines: Journey and Lessons Learned
Databricks
 
Beyond shuffling global big data tech conference 2015 sj
Beyond shuffling global big data tech conference 2015 sj
Holden Karau
 
Easy, scalable, fault tolerant stream processing with structured streaming - ...
Easy, scalable, fault tolerant stream processing with structured streaming - ...
Databricks
 
Extending Spark With Java Agent (handout)
Extending Spark With Java Agent (handout)
Jaroslav Bachorik
 
Apache Spark Introduction - CloudxLab
Apache Spark Introduction - CloudxLab
Abhinav Singh
 
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Spark Summit
 
Opaque: A Data Analytics Platform with Strong Security: Spark Summit East tal...
Opaque: A Data Analytics Platform with Strong Security: Spark Summit East tal...
Spark Summit
 

Similar to Elasticsearch And Apache Lucene For Apache Spark And MLlib (20)

Writing Continuous Applications with Structured Streaming PySpark API
Writing Continuous Applications with Structured Streaming PySpark API
Databricks
 
Intro to Spark and Spark SQL
Intro to Spark and Spark SQL
jeykottalam
 
Leveraging Azure Databricks to minimize time to insight by combining Batch an...
Leveraging Azure Databricks to minimize time to insight by combining Batch an...
Microsoft Tech Community
 
Writing Continuous Applications with Structured Streaming in PySpark
Writing Continuous Applications with Structured Streaming in PySpark
Databricks
 
Easy, scalable, fault tolerant stream processing with structured streaming - ...
Easy, scalable, fault tolerant stream processing with structured streaming - ...
Databricks
 
Introduction to Apache Spark
Introduction to Apache Spark
Anastasios Skarlatidis
 
20170126 big data processing
20170126 big data processing
Vienna Data Science Group
 
Terrastore - A document database for developers
Terrastore - A document database for developers
Sergio Bossa
 
IoT Applications and Patterns using Apache Spark & Apache Bahir
IoT Applications and Patterns using Apache Spark & Apache Bahir
Luciano Resende
 
Just one-shade-of-openstack
Just one-shade-of-openstack
Roberto Polli
 
ElasticSearch for .NET Developers
ElasticSearch for .NET Developers
Ben van Mol
 
Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)
Michael Rys
 
Spark what's new what's coming
Spark what's new what's coming
Databricks
 
Real-Time Spark: From Interactive Queries to Streaming
Real-Time Spark: From Interactive Queries to Streaming
Databricks
 
Designing Structured Streaming Pipelines—How to Architect Things Right
Designing Structured Streaming Pipelines—How to Architect Things Right
Databricks
 
Building a Unified Data Pipline in Spark / Apache Sparkを用いたBig Dataパイプラインの統一
Building a Unified Data Pipline in Spark / Apache Sparkを用いたBig Dataパイプラインの統一
scalaconfjp
 
From HelloWorld to Configurable and Reusable Apache Spark Applications in Sca...
From HelloWorld to Configurable and Reusable Apache Spark Applications in Sca...
Databricks
 
Apache Spark Workshop, Apr. 2016, Euangelos Linardos
Apache Spark Workshop, Apr. 2016, Euangelos Linardos
Euangelos Linardos
 
Spark (Structured) Streaming vs. Kafka Streams
Spark (Structured) Streaming vs. Kafka Streams
Guido Schmutz
 
A Tale of Two APIs: Using Spark Streaming In Production
A Tale of Two APIs: Using Spark Streaming In Production
Lightbend
 
Writing Continuous Applications with Structured Streaming PySpark API
Writing Continuous Applications with Structured Streaming PySpark API
Databricks
 
Intro to Spark and Spark SQL
Intro to Spark and Spark SQL
jeykottalam
 
Leveraging Azure Databricks to minimize time to insight by combining Batch an...
Leveraging Azure Databricks to minimize time to insight by combining Batch an...
Microsoft Tech Community
 
Writing Continuous Applications with Structured Streaming in PySpark
Writing Continuous Applications with Structured Streaming in PySpark
Databricks
 
Easy, scalable, fault tolerant stream processing with structured streaming - ...
Easy, scalable, fault tolerant stream processing with structured streaming - ...
Databricks
 
Terrastore - A document database for developers
Terrastore - A document database for developers
Sergio Bossa
 
IoT Applications and Patterns using Apache Spark & Apache Bahir
IoT Applications and Patterns using Apache Spark & Apache Bahir
Luciano Resende
 
Just one-shade-of-openstack
Just one-shade-of-openstack
Roberto Polli
 
ElasticSearch for .NET Developers
ElasticSearch for .NET Developers
Ben van Mol
 
Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)
Michael Rys
 
Spark what's new what's coming
Spark what's new what's coming
Databricks
 
Real-Time Spark: From Interactive Queries to Streaming
Real-Time Spark: From Interactive Queries to Streaming
Databricks
 
Designing Structured Streaming Pipelines—How to Architect Things Right
Designing Structured Streaming Pipelines—How to Architect Things Right
Databricks
 
Building a Unified Data Pipline in Spark / Apache Sparkを用いたBig Dataパイプラインの統一
Building a Unified Data Pipline in Spark / Apache Sparkを用いたBig Dataパイプラインの統一
scalaconfjp
 
From HelloWorld to Configurable and Reusable Apache Spark Applications in Sca...
From HelloWorld to Configurable and Reusable Apache Spark Applications in Sca...
Databricks
 
Apache Spark Workshop, Apr. 2016, Euangelos Linardos
Apache Spark Workshop, Apr. 2016, Euangelos Linardos
Euangelos Linardos
 
Spark (Structured) Streaming vs. Kafka Streams
Spark (Structured) Streaming vs. Kafka Streams
Guido Schmutz
 
A Tale of Two APIs: Using Spark Streaming In Production
A Tale of Two APIs: Using Spark Streaming In Production
Lightbend
 
Ad

More from Jen Aman (20)

Deep Learning and Streaming in Apache Spark 2.x with Matei Zaharia
Deep Learning and Streaming in Apache Spark 2.x with Matei Zaharia
Jen Aman
 
Snorkel: Dark Data and Machine Learning with Christopher Ré
Snorkel: Dark Data and Machine Learning with Christopher Ré
Jen Aman
 
Deep Learning on Apache® Spark™: Workflows and Best Practices
Deep Learning on Apache® Spark™: Workflows and Best Practices
Jen Aman
 
Deep Learning on Apache® Spark™ : Workflows and Best Practices
Deep Learning on Apache® Spark™ : Workflows and Best Practices
Jen Aman
 
RISELab:Enabling Intelligent Real-Time Decisions
RISELab:Enabling Intelligent Real-Time Decisions
Jen Aman
 
Spatial Analysis On Histological Images Using Spark
Spatial Analysis On Histological Images Using Spark
Jen Aman
 
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
Jen Aman
 
A Graph-Based Method For Cross-Entity Threat Detection
A Graph-Based Method For Cross-Entity Threat Detection
Jen Aman
 
Yggdrasil: Faster Decision Trees Using Column Partitioning In Spark
Yggdrasil: Faster Decision Trees Using Column Partitioning In Spark
Jen Aman
 
Time-Evolving Graph Processing On Commodity Clusters
Time-Evolving Graph Processing On Commodity Clusters
Jen Aman
 
Deploying Accelerators At Datacenter Scale Using Spark
Deploying Accelerators At Datacenter Scale Using Spark
Jen Aman
 
Re-Architecting Spark For Performance Understandability
Re-Architecting Spark For Performance Understandability
Jen Aman
 
Re-Architecting Spark For Performance Understandability
Re-Architecting Spark For Performance Understandability
Jen Aman
 
Low Latency Execution For Apache Spark
Low Latency Execution For Apache Spark
Jen Aman
 
Efficient State Management With Spark 2.0 And Scale-Out Databases
Efficient State Management With Spark 2.0 And Scale-Out Databases
Jen Aman
 
Livy: A REST Web Service For Apache Spark
Livy: A REST Web Service For Apache Spark
Jen Aman
 
GPU Computing With Apache Spark And Python
GPU Computing With Apache Spark And Python
Jen Aman
 
Spark And Cassandra: 2 Fast, 2 Furious
Spark And Cassandra: 2 Fast, 2 Furious
Jen Aman
 
Building Custom Machine Learning Algorithms With Apache SystemML
Building Custom Machine Learning Algorithms With Apache SystemML
Jen Aman
 
Spark on Mesos
Spark on Mesos
Jen Aman
 
Deep Learning and Streaming in Apache Spark 2.x with Matei Zaharia
Deep Learning and Streaming in Apache Spark 2.x with Matei Zaharia
Jen Aman
 
Snorkel: Dark Data and Machine Learning with Christopher Ré
Snorkel: Dark Data and Machine Learning with Christopher Ré
Jen Aman
 
Deep Learning on Apache® Spark™: Workflows and Best Practices
Deep Learning on Apache® Spark™: Workflows and Best Practices
Jen Aman
 
Deep Learning on Apache® Spark™ : Workflows and Best Practices
Deep Learning on Apache® Spark™ : Workflows and Best Practices
Jen Aman
 
RISELab:Enabling Intelligent Real-Time Decisions
RISELab:Enabling Intelligent Real-Time Decisions
Jen Aman
 
Spatial Analysis On Histological Images Using Spark
Spatial Analysis On Histological Images Using Spark
Jen Aman
 
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
Jen Aman
 
A Graph-Based Method For Cross-Entity Threat Detection
A Graph-Based Method For Cross-Entity Threat Detection
Jen Aman
 
Yggdrasil: Faster Decision Trees Using Column Partitioning In Spark
Yggdrasil: Faster Decision Trees Using Column Partitioning In Spark
Jen Aman
 
Time-Evolving Graph Processing On Commodity Clusters
Time-Evolving Graph Processing On Commodity Clusters
Jen Aman
 
Deploying Accelerators At Datacenter Scale Using Spark
Deploying Accelerators At Datacenter Scale Using Spark
Jen Aman
 
Re-Architecting Spark For Performance Understandability
Re-Architecting Spark For Performance Understandability
Jen Aman
 
Re-Architecting Spark For Performance Understandability
Re-Architecting Spark For Performance Understandability
Jen Aman
 
Low Latency Execution For Apache Spark
Low Latency Execution For Apache Spark
Jen Aman
 
Efficient State Management With Spark 2.0 And Scale-Out Databases
Efficient State Management With Spark 2.0 And Scale-Out Databases
Jen Aman
 
Livy: A REST Web Service For Apache Spark
Livy: A REST Web Service For Apache Spark
Jen Aman
 
GPU Computing With Apache Spark And Python
GPU Computing With Apache Spark And Python
Jen Aman
 
Spark And Cassandra: 2 Fast, 2 Furious
Spark And Cassandra: 2 Fast, 2 Furious
Jen Aman
 
Building Custom Machine Learning Algorithms With Apache SystemML
Building Custom Machine Learning Algorithms With Apache SystemML
Jen Aman
 
Spark on Mesos
Spark on Mesos
Jen Aman
 
Ad

Recently uploaded (20)

MICROSOFT POWERPOINT AND USES(BEST)..pdf
MICROSOFT POWERPOINT AND USES(BEST)..pdf
bathyates
 
SAP_S4HANA_PPM_IT_Corporate_Services_Presentation.pptx
SAP_S4HANA_PPM_IT_Corporate_Services_Presentation.pptx
vemulavenu484
 
SAP_S4HANA_EWM_Food_Processing_Industry.pptx
SAP_S4HANA_EWM_Food_Processing_Industry.pptx
vemulavenu484
 
apidays New York 2025 - The Future of Small Business Lending with Open Bankin...
apidays New York 2025 - The Future of Small Business Lending with Open Bankin...
apidays
 
apidays New York 2025 - Beyond Webhooks: The Future of Scalable API Event Del...
apidays New York 2025 - Beyond Webhooks: The Future of Scalable API Event Del...
apidays
 
Advanced_English_Pronunciation_in_Use.pdf
Advanced_English_Pronunciation_in_Use.pdf
leogoemmanguyenthao
 
Veilig en vlot fietsen in Oost-Vlaanderen: Fietssnelwegen geoptimaliseerd met...
Veilig en vlot fietsen in Oost-Vlaanderen: Fietssnelwegen geoptimaliseerd met...
jacoba18
 
apidays Singapore 2025 - Building Finance Innovation Ecosystems by Umang Moon...
apidays Singapore 2025 - Building Finance Innovation Ecosystems by Umang Moon...
apidays
 
apidays New York 2025 - Unifying OpenAPI & AsyncAPI by Naresh Jain & Hari Kri...
apidays New York 2025 - Unifying OpenAPI & AsyncAPI by Naresh Jain & Hari Kri...
apidays
 
apidays New York 2025 - Boost API Development Velocity with Practical AI Tool...
apidays New York 2025 - Boost API Development Velocity with Practical AI Tool...
apidays
 
apidays New York 2025 - The FINOS Common Domain Model for Capital Markets by ...
apidays New York 2025 - The FINOS Common Domain Model for Capital Markets by ...
apidays
 
THE FRIEDMAN TEST ( Biostatics B. Pharm)
THE FRIEDMAN TEST ( Biostatics B. Pharm)
JishuHaldar
 
Untitled presentation xcvxcvxcvxcvx.pptx
Untitled presentation xcvxcvxcvxcvx.pptx
jonathan4241
 
[Eddie Lee] Capstone Project - AI PM Bootcamp - DataFox.pdf
[Eddie Lee] Capstone Project - AI PM Bootcamp - DataFox.pdf
Eddie Lee
 
KLIP2Data voor de herinrichting van R4 West en Oost
KLIP2Data voor de herinrichting van R4 West en Oost
jacoba18
 
apidays New York 2025 - Life is But a (Data) Stream by Sandon Jacobs (Confluent)
apidays New York 2025 - Life is But a (Data) Stream by Sandon Jacobs (Confluent)
apidays
 
SQL-Demystified-A-Beginners-Guide-to-Database-Mastery.pptx
SQL-Demystified-A-Beginners-Guide-to-Database-Mastery.pptx
bhavaniteacher99
 
METHODS OF DATA COLLECTION (Research methodology)
METHODS OF DATA COLLECTION (Research methodology)
anwesha248
 
BODMAS-Rule-&-Unit-Digit-Concept-pdf.pdf
BODMAS-Rule-&-Unit-Digit-Concept-pdf.pdf
SiddharthSean
 
2.5-DESPATCH-ORDINARY MAILS.pptxlminub7b7t6f7h7t6f6g7g6fg
2.5-DESPATCH-ORDINARY MAILS.pptxlminub7b7t6f7h7t6f6g7g6fg
mk1227103
 
MICROSOFT POWERPOINT AND USES(BEST)..pdf
MICROSOFT POWERPOINT AND USES(BEST)..pdf
bathyates
 
SAP_S4HANA_PPM_IT_Corporate_Services_Presentation.pptx
SAP_S4HANA_PPM_IT_Corporate_Services_Presentation.pptx
vemulavenu484
 
SAP_S4HANA_EWM_Food_Processing_Industry.pptx
SAP_S4HANA_EWM_Food_Processing_Industry.pptx
vemulavenu484
 
apidays New York 2025 - The Future of Small Business Lending with Open Bankin...
apidays New York 2025 - The Future of Small Business Lending with Open Bankin...
apidays
 
apidays New York 2025 - Beyond Webhooks: The Future of Scalable API Event Del...
apidays New York 2025 - Beyond Webhooks: The Future of Scalable API Event Del...
apidays
 
Advanced_English_Pronunciation_in_Use.pdf
Advanced_English_Pronunciation_in_Use.pdf
leogoemmanguyenthao
 
Veilig en vlot fietsen in Oost-Vlaanderen: Fietssnelwegen geoptimaliseerd met...
Veilig en vlot fietsen in Oost-Vlaanderen: Fietssnelwegen geoptimaliseerd met...
jacoba18
 
apidays Singapore 2025 - Building Finance Innovation Ecosystems by Umang Moon...
apidays Singapore 2025 - Building Finance Innovation Ecosystems by Umang Moon...
apidays
 
apidays New York 2025 - Unifying OpenAPI & AsyncAPI by Naresh Jain & Hari Kri...
apidays New York 2025 - Unifying OpenAPI & AsyncAPI by Naresh Jain & Hari Kri...
apidays
 
apidays New York 2025 - Boost API Development Velocity with Practical AI Tool...
apidays New York 2025 - Boost API Development Velocity with Practical AI Tool...
apidays
 
apidays New York 2025 - The FINOS Common Domain Model for Capital Markets by ...
apidays New York 2025 - The FINOS Common Domain Model for Capital Markets by ...
apidays
 
THE FRIEDMAN TEST ( Biostatics B. Pharm)
THE FRIEDMAN TEST ( Biostatics B. Pharm)
JishuHaldar
 
Untitled presentation xcvxcvxcvxcvx.pptx
Untitled presentation xcvxcvxcvxcvx.pptx
jonathan4241
 
[Eddie Lee] Capstone Project - AI PM Bootcamp - DataFox.pdf
[Eddie Lee] Capstone Project - AI PM Bootcamp - DataFox.pdf
Eddie Lee
 
KLIP2Data voor de herinrichting van R4 West en Oost
KLIP2Data voor de herinrichting van R4 West en Oost
jacoba18
 
apidays New York 2025 - Life is But a (Data) Stream by Sandon Jacobs (Confluent)
apidays New York 2025 - Life is But a (Data) Stream by Sandon Jacobs (Confluent)
apidays
 
SQL-Demystified-A-Beginners-Guide-to-Database-Mastery.pptx
SQL-Demystified-A-Beginners-Guide-to-Database-Mastery.pptx
bhavaniteacher99
 
METHODS OF DATA COLLECTION (Research methodology)
METHODS OF DATA COLLECTION (Research methodology)
anwesha248
 
BODMAS-Rule-&-Unit-Digit-Concept-pdf.pdf
BODMAS-Rule-&-Unit-Digit-Concept-pdf.pdf
SiddharthSean
 
2.5-DESPATCH-ORDINARY MAILS.pptxlminub7b7t6f7h7t6f6g7g6fg
2.5-DESPATCH-ORDINARY MAILS.pptxlminub7b7t6f7h7t6f6g7g6fg
mk1227103
 

Elasticsearch And Apache Lucene For Apache Spark And MLlib

  • 1. Elasticsearch & Lucene for Apache Spark and MLlib Costin Leau (@costinl)
  • 2. Mirror, mirror on the wall, what’s the happiest team of us all ? Briita Weber - Rough translation from German by yours truly -
  • 3. Purpose of the talk Improve ML pipelines through IR Text processing • Analysis • Featurize/Vectorize * * In research / poc / WIP / Experimental phase
  • 4. Technical Debt Machine Learning: The High Interest Credit Card of Technical Debt”, Sculley et al http://research.google.com/pubs/pub43146.html
  • 5. Technical Debt Machine Learning: The High Interest Credit Card of Technical Debt”, Sculley et al http://research.google.com/pubs/pub43146.html
  • 7. Challenge: What team at Elastic is most happy? Data: Hipchat messages Training / Test data: http://www.sentiment140.com Result: Kibana dashboard
  • 8. ML Pipeline Chat data Sentiment Model Production Data Apply the rule Predict the ‘class’ J / L
  • 10. Example: Word2Vec Input snippet http://spark.apache.org/docs/latest/mllib-feature-extraction.html#example it was introduced into mathematics in the book disquisitiones arithmeticae by carl friedrich gauss in one eight zero one ever since however modulo has gained many meanings some exact and some imprecise
  • 11. Real data is messy originally looked like this: https://en.wikipedia.org/wiki/Modulo_(jargon) It was introduced into <a href="https://en.wikipedia.org/wiki/Mathematics" title="Mathematics">mathematics</a> in the book <i><a href="https://en.wikipedia.org/wiki/Disquisitiones_Arithmeticae" title="Disquisitiones Arithmeticae">Disquisitiones Arithmeticae</a></i> by <a href="https://en.wikipedia.org/wiki/Carl_Friedrich_Gauss" title="Carl Friedrich Gauss">Carl Friedrich Gauss</a> in 1801. Ever since, however, "modulo" has gained many meanings, some exact and some imprecise.
  • 12. Feature extraction Cleaning up data "huuuuuuunnnnnnngrrryyy", "aaaaaamaaazinggggg", "aaaaaamazing", "aaaaaammm", "aaaaaammmazzzingggg", "aaaaaamy", "aaaaaan", "aaaaaand", "aaaaaannnnnnddd", "aaaaaanyways" Does it help to clean that up? see “Twitter Sentiment Classification using Distant Supervision”, Go et al. http://www-cs.stanford.edu/people/alecmgo/papers/TwitterDistantSupervision09.pdf
  • 14. Lucene to the rescue! High-performance, full-featured text search library 15 years of experience Widely recognized for its utility • It’s a primary test bed for new JVM versions
  • 15. Text processing: Character Filter → Tokenizer → Token Filter → Token Filter → Token Filter. Do <b>Johnny Depp</b> a favor and forget you… → Do (pos: 1) Johnny (pos: 2) → do (pos: 1) johnny (pos: 2)
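As a rough, self-contained sketch of that chain (the class and method names here are made up for illustration; this is not Lucene's implementation):

```java
import java.util.ArrayList;
import java.util.List;

// Toy sketch of an analysis chain: char filter -> tokenizer -> token filter.
// Names and logic are illustrative only, not the actual Lucene classes.
public class AnalysisChainSketch {
    // Character filter: strip HTML/markup tags before tokenization
    static String charFilter(String text) {
        return text.replaceAll("<[^>]*>", " ");
    }

    // Tokenizer + lowercase token filter: emit "token pos" entries, 1-based positions
    static List<String> analyze(String text) {
        List<String> out = new ArrayList<>();
        int pos = 0;
        for (String t : charFilter(text).trim().split("\\s+")) {
            if (!t.isEmpty()) {
                pos++;
                out.add(t.toLowerCase() + " " + pos);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(analyze("Do <b>Johnny Depp</b> a favor"));
    }
}
```

Running this on the slide's example sentence yields the same shape of output as the chain in the diagram: lowercased tokens with their positions.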
  • 16. Lucene for text analysis state of the art text processing many extensions available for different languages, use cases,… however…
  • 17. … import org.apache.lucene.analysis… …

    Analyzer a = new Analyzer() {
      @Override
      protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer tokenizer = new StandardTokenizer();
        return new TokenStreamComponents(tokenizer, tokenizer);
      }

      @Override
      protected Reader initReader(String fieldName, Reader reader) {
        return new HTMLStripCharFilter(reader);
      }
    };

    TokenStream stream = a.tokenStream(null, "<a href=...>some text</a>");
    CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
    PositionIncrementAttribute posIncrement = stream.addAttribute(PositionIncrementAttribute.class);
    stream.reset();
    int pos = 0;
    while (stream.incrementToken()) {
      pos += posIncrement.getPositionIncrement();
      System.out.println(term.toString() + " " + pos);
    }

    > some 1
    > text 2
  • 18. (Same code as the previous slide.) How about a declarative approach?
  • 20. Very quick intro to Elasticsearch
  • 21. Elasticsearch in 3 minutes: Scalable, real-time search and analytics engine • Data distribution, cluster management • REST APIs • JVM based, uses Apache Lucene internally • Open source (on GitHub, Apache 2 License)
  • 29. Machine Learning and Elasticsearch
  • 30. Machine Learning and Elasticsearch: Term analysis (tf, idf, bm25) • Graph analysis • Co-occurrence of terms (significant terms, e.g. ChiSquare) • Pearson correlation (#16817) • Regression (#17154) • What about classification/clustering/etc.?
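To sketch the intuition behind chi-square-style significance scoring (a deliberate simplification, not Elasticsearch's actual significant_terms code): a term is "significant" for a subset when its frequency there deviates strongly from its background frequency.

```java
// Toy chi-square-style significance score for a term: compares foreground
// frequency (e.g. one team's chat messages) to background frequency (the
// whole corpus). Elasticsearch's significant_terms heuristics differ in detail.
public class ChiSquareSketch {
    static double score(long fgHits, long fgTotal, long bgHits, long bgTotal) {
        double fgRate = (double) fgHits / fgTotal; // term frequency inside the subset
        double bgRate = (double) bgHits / bgTotal; // term frequency overall
        if (bgRate == 0.0) return 0.0;
        return (fgRate - bgRate) * (fgRate - bgRate) / bgRate;
    }

    public static void main(String[] args) {
        // a term 50x over-represented in one team's messages scores high...
        System.out.println(score(50, 1000, 100, 100000));
        // ...while a term at exactly the background rate scores zero
        System.out.println(score(1, 1000, 100, 100000));
    }
}
```

The point is that the meta-information about matches (frequencies), not the matching documents themselves, drives these analyses.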
  • 31. It’s not the matching data, but the metadata that leads to it
  • 32. How to use Elasticsearch from Spark ? Somebody on Stackoverflow
  • 36. Elasticsearch Spark – Native integration: Scala & Java API • Understands Scala & Java types – case classes – Java Beans • Available as a Spark package • Supports Spark Core & SQL, all 1.x versions (1.0–1.6) • Available for Scala 2.10 and 2.11
  • 37. Elasticsearch as RDD / Dataset*

    import org.elasticsearch.spark._
    val sc = new SparkContext(new SparkConf())
    val rdd = sc.esRDD("buckethead/albums", "?q=pikes")

    import org.elasticsearch.spark._
    case class Artist(name: String, albums: Int)
    val u2 = Artist("U2", 13)
    val bh = Map("name" -> "Buckethead", "albums" -> 255, "age" -> 46)
    sc.makeRDD(Seq(u2, bh)).saveToEs("radio/artists")
  • 38. Elasticsearch as a DataFrame

    val df = sql.read.format("es").load("buckethead/albums")
    df.filter(df("category").equalTo("pikes").and(df("year").geq(2015)))

    { "query" : { "bool" : {
        "must" : [ { "match" : { "category" : "pikes" } } ],
        "filter" : [ { "range" : { "year" : { "gte" : "2015" } } } ]
    }}}
  • 39. Partition to Partition Architecture
  • 40. Putting the pieces together
  • 42. Typical ML pipeline for text Actual ML code
  • 44. Pure Spark MLlib

    val training = movieReviewsDataTrainingData
    val tokenizer = new Tokenizer()
      .setInputCol("text")
      .setOutputCol("words")
    val hashingTF = new HashingTF()
      .setNumFeatures(1000)
      .setInputCol(tokenizer.getOutputCol)
      .setOutputCol("features")
    val lr = new LogisticRegression()
      .setMaxIter(10)
      .setRegParam(0.001)
    val pipeline = new Pipeline()
      .setStages(Array(tokenizer, hashingTF, lr))
    val model = pipeline.fit(training)
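For context, HashingTF implements the "hashing trick": each term is hashed into one of numFeatures buckets and occurrences are counted per bucket. A minimal sketch of that mechanism (Spark uses a different hash function; this only illustrates the idea):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy version of the hashing trick behind HashingTF: each term is hashed into
// one of numFeatures buckets and occurrences are counted per bucket.
// Collisions are possible by design; that's the price of a fixed-size vector.
public class HashingTfSketch {
    static Map<Integer, Integer> hashingTf(List<String> tokens, int numFeatures) {
        Map<Integer, Integer> counts = new HashMap<>();
        for (String t : tokens) {
            // force a non-negative bucket index even for negative hash codes
            int bucket = ((t.hashCode() % numFeatures) + numFeatures) % numFeatures;
            counts.merge(bucket, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // "good" lands in one bucket with count 2, "movie" in another with count 1
        System.out.println(hashingTf(List.of("good", "movie", "good"), 1000));
    }
}
```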
  • 45. Pure Spark MLlib

    val tokenizer = new Tokenizer()
      .setInputCol("text")
      .setOutputCol("words")
    val hashingTF = new HashingTF()
      .setNumFeatures(1000)
      .setInputCol(tokenizer.getOutputCol)
      .setOutputCol("features")
    val lr = new LogisticRegression()
      .setMaxIter(10)
      .setRegParam(0.001)
  • 46. Pure Spark MLlib (same code as the previous slide)
  • 47. Pure Spark MLlib

    val analyzer = new ESAnalyzer()
      .setInputCol("text")
      .setOutputCol("words")
    val hashingTF = new HashingTF()
      .setNumFeatures(1000)
      .setInputCol(analyzer.getOutputCol)
      .setOutputCol("features")
    val lr = new LogisticRegression()
      .setMaxIter(10)
      .setRegParam(0.001)
  • 48. Pure Spark MLlib (same code as the previous slide)
  • 50. Work once – reuse multiple times // index / analyze the data training.saveToEs("movies/reviews")
  • 51. Work once – reuse multiple times

    // prepare the spec for vectorize – fast and lightweight
    val spec = s"""{ "features" : [{
                 |   "field": "text",
                 |   "type" : "string",
                 |   "tokens" : "all_terms",
                 |   "number" : "occurrence",
                 |   "min_doc_freq" : 2000
                 | }],
                 | "sparse" : "true"}""".stripMargin
    ML.prepareSpec(spec, "my-spec")
  • 52. Access the vector directly

    // get the features – just another query
    val payload = s"""{"script_fields" : { "vector" :
                   |  { "script" : { "id" : "my-spec", "lang" : "doc_to_vector" } }
                   |}}""".stripMargin
    // read the vectorized data
    val vectorRDD = sparkCtx.esRDD("ml/data", payload)
    // feed the vector to the pipeline
    val vectorized = vectorRDD.map ( x =>
      // derive the label and extract the vector
      (if (x._1 == "negative") 0.0d else 1.0d, ML.getVectorFrom(x._2))
    ).toDF("label", "features")
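The vector handed to the pipeline is sparse: most vocabulary terms do not occur in any given document. A minimal sketch of the shape such a vector takes (mirroring the layout of MLlib's SparseVector; SparseVec is a made-up name for illustration, not the Spark class):

```java
// Minimal sparse vector: total length plus parallel arrays of the non-zero
// indices and their values; every other position is implicitly 0.
public class SparseVec {
    final int size;
    final int[] indices;
    final double[] values;

    SparseVec(int size, int[] indices, double[] values) {
        this.size = size;
        this.indices = indices;
        this.values = values;
    }

    // look up the value at position i, defaulting to 0.0 for absent indices
    double get(int i) {
        for (int p = 0; p < indices.length; p++)
            if (indices[p] == i) return values[p];
        return 0.0;
    }

    public static void main(String[] args) {
        // a 10-term vocabulary where only terms 2 and 7 occur (3 and 1 times)
        SparseVec v = new SparseVec(10, new int[]{2, 7}, new double[]{3.0, 1.0});
        System.out.println(v.get(2) + " " + v.get(0));
    }
}
```

This is why the "sparse" : "true" flag in the spec matters: only the non-zero entries travel over the wire.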
  • 53. Revised ML pipeline

    val vectorized = vectorRDD.map...
    val lr = new LogisticRegression()
      .setMaxIter(10)
      .setRegParam(0.001)
    val model = lr.fit(vectorized)
  • 54. Simplify ML pipeline Once per dataset, regardless of # of pipelines Raw data is not required any more
  • 55. Need to adjust the model? Change the spec

    val spec = s"""{ "features" : [{
                 |   "field": "text",
                 |   "type" : "string",
                 |   "tokens" : "given",
                 |   "number" : "tf",
                 |   "terms": ["term1", "term2", ...]
                 | }],
                 | "sparse" : "true"}""".stripMargin
    ML.prepareSpec(spec)
  • 57. All this is WIP Not all features available (currently dictionary, vectors) Works with data outside or inside Elasticsearch (latter is much faster) Bind vectors to queries Other topics WIP: Focused on document / text classification – numeric support is next Model importing / exporting – Spark 2.0 ML persistence Feedback highly sought - Is this useful?