© 2015 IBM Corporation
Interactive Analytics Using Apache Spark
Bangalore Spark Enthusiasts Group
http://www.meetup.com/Bangalore-Spark-Enthusiasts/
Bagavath Subramaniam, IBM Analytics
Shally Sangal, IBM Analytics
Agenda
▪ Overview of Interactive Analytics
▪ Spark Application User Types
▪ Spark Context
▪ Spark Shell
▪ Spark Submit
▪ Spark JDBC Thrift Server
▪ Apache Zeppelin
▪ Jupyter
▪ Spark Kernel
▪ Spark Job Server
▪ Livy
Spark Application User Types
Data Scientist
▪ Data Exploration
▪ Data Wrangling
▪ Build Models from Data using Algorithms - Predict/Prescribe
▪ Knowledge in Statistics & Maths
▪ R/Python, Matlab/SPSS
▪ Ad-hoc analysis using Interactive Shells
Data Analyst
▪ Data Exploration and Visualization
▪ Understands data sources and relationships among them in an Enterprise
▪ Relates data to business and derives insights, can talk business language
▪ May have basic programming skills and analytic tools knowledge
▪ Ad-hoc analysis using canned reports
▪ Limited usage of interactive shells
Typical User Roles
Business Analyst
▪ Industry Expert
▪ Understands business needs and works on solutions
▪ Improves business processes and designs new systems to support them
▪ Not a programmer / Analytics expert
▪ Typical user of reporting systems
Data Engineer / Application Developer
▪ Programmer with S/W Engineering background
▪ Builds production data pipelines, data warehouses, reporting solutions and apps
▪ Productionizes models built by data scientists
▪ Builds s/w applications to solve business problems
▪ Maintains, monitors, and tunes the data processing platform and applications
> Roles are often fluid and overlapping
Interactive Tools for Spark
Apache Spark
IBM Spark Kernel (Apache Toree)
Cloudera Livy
Ooyala Spark Job Server
User and Tools
Primary set of tools for each role
Tools: Spark Shell, Spark Submit, Thrift JDBC Server, Zeppelin, Spark Kernel, Jupyter, Livy, Hue, Spark Job Server
Roles: Data Scientist, Data Analyst, Developer, Business Analyst
Spark Context
▪ Common thread for all Spark Interfaces
▪ Main entry point for Spark, represents the connection to a Spark cluster
▪ Standalone, Yarn, Mesos, Local
▪ Holds all the configuration - memory, cores, parallelism, compression
▪ Create RDDs, accumulators, broadcast variables
▪ Run Jobs, Cancel Jobs
▪ Limitation of one SparkContext per JVM; one application ID
▪ Supports parallel jobs from separate threads
▪ Scheduler mode - FIFO / Fair (within an Application)
▪ Fair Scheduler
− Pools - spark.scheduler.pool
− Per-pool weight, scheduling mode (FIFO/FAIR), and minimum share
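Fair scheduler pools are typically declared in a fairscheduler.xml file; a minimal sketch (the pool names are illustrative):

```xml
<?xml version="1.0"?>
<allocations>
  <!-- Pool for latency-sensitive work: higher weight, guaranteed cores -->
  <pool name="production">
    <schedulingMode>FAIR</schedulingMode>
    <weight>2</weight>
    <minShare>2</minShare>
  </pool>
  <!-- Best-effort pool for ad-hoc queries -->
  <pool name="adhoc">
    <schedulingMode>FIFO</schedulingMode>
    <weight>1</weight>
    <minShare>0</minShare>
  </pool>
</allocations>
```

A job then selects its pool from the submitting thread with sc.setLocalProperty("spark.scheduler.pool", "production").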
Spark Shell
▪ Interactive shell (spark-shell for Scala, pyspark for Python)
▪ spark-shell is based on the Scala REPL
▪ Instantiates a SparkContext by default and starts the Spark web UI
▪ Also provides a sqlContext, which is a HiveContext when Spark is built with Hive support
▪ Internally calls spark-submit:
${SPARK_HOME}/bin/spark-submit --class org.apache.spark.repl.Main --name "Spark shell"
▪ All parameters of spark-submit can be passed to spark-shell as well
Spark submit
▪ Launch/submit a Spark application to a Spark cluster
▪ org.apache.spark.launcher.Main gets called with org.apache.spark.deploy.SparkSubmit
as a parameter, along with the other params passed to spark-submit:
org.apache.spark.launcher.Main org.apache.spark.deploy.SparkSubmit "$@"
▪ spark-submit --help => list of supported parameters
▪ Kill a job (spark-submit --kill) and get job status (spark-submit --status)
▪ Defaults are read from spark-defaults.conf in SPARK_CONF_DIR
▪ Precedence - values set explicitly on SparkConf, then flags passed to spark-submit, then
values from spark-defaults.conf
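That precedence order can be sketched as a simple merge. This is a hypothetical helper, not Spark code; the key names are illustrative:

```python
def effective_conf(spark_conf, submit_flags, defaults):
    """Merge Spark configuration sources in precedence order:
    explicit SparkConf settings win over spark-submit flags,
    which win over spark-defaults.conf values."""
    merged = dict(defaults)          # lowest precedence
    merged.update(submit_flags)      # flags override defaults
    merged.update(spark_conf)        # explicit SparkConf wins
    return merged

conf = effective_conf(
    spark_conf={"spark.executor.memory": "4g"},
    submit_flags={"spark.executor.memory": "2g", "spark.executor.cores": "2"},
    defaults={"spark.executor.memory": "1g", "spark.master": "local[*]"},
)
# spark.executor.memory resolves to "4g": the explicit SparkConf value wins
```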
Apache Zeppelin
▪ Web-based notebook for interactive analytics
▪ Provides built-in Spark integration
▪ Supports many interpreters, such as Scala, PySpark, SparkSQL, Hive, Shell, etc.
▪ Starts a Zeppelin server
▪ Spawns one JVM per interpreter group
▪ The server communicates with the interpreter group using Thrift
Zeppelin Demo
To know more : http://zeppelin.incubator.apache.org/
Jupyter Notebook
Web notebook for interactive data analysis. Part of the Jupyter ecosystem.
Evolved from IPython; works on the IPython messaging protocol.
Has the concept of kernels - any language kernel that implements the protocol can be plugged in.
A Spark kernel is available via Apache Toree.
Jupyter Notebook Demo
To know more : http://jupyter.org/
Spark Kernel (Apache Toree)
The kernel provides the foundation for interactive applications to connect to and use Spark.
Provides an interface that allows clients to interact with a Spark cluster. Clients can send code snippets and libraries that are interpreted and run against a preconfigured SparkContext.
Acts as a proxy between your application and the Spark cluster.
Kernel Architecture
The kernel uses ZeroMQ over TCP sockets as its messaging middleware and implements the IPython message protocol.
It is architected in layers, where each layer has a specific purpose in the processing of requests.
Provides concurrency and code isolation through use of the Akka framework.
How does it talk to Spark?
The kernel is launched by a spark-submit process. It works with local Spark, a standalone Spark cluster, as well as Spark on YARN.
SPARK_HOME is a mandatory environment variable.
SPARK_OPTS is an optional environment variable that can be used to configure the Spark master, deploy mode, driver memory, number of executors, etc.
Uses the same Scala interpreter as the Spark shell. The interpreter holds a SparkContext and the class server URI used to host compiled code.
How to communicate with Kernel
Two forms of communication:
1. Client library for code execution
2. Talk to the kernel directly, as the Jupyter notebook does
Kernel Client Library
Written in Scala. Eliminates the need to understand the ZeroMQ message protocol.
Enables treating the kernel as a remote service.
Shares the majority of its code with the kernel's codebase.
Two steps to using the client:
1. Initialize the client with the connection details of the kernel.
2. Use the execute API to run code snippets with attached callbacks.
How to run Kernel and Client:
https://github.com/ibm-et/spark-kernel/wiki/Getting-Started-with-the-Spark-Kernel
https://github.com/ibm-et/spark-kernel/wiki/Guide-for-the-Spark-Kernel-Client
CODE DEMO
Comm API
As part of the IPython message protocol, the Comm API allows developers to define custom messages to communicate data and perform actions on both the frontend (client) and backend (kernel). It is useful when the same actions should be triggered on both sides in response to a message. Either the client or the kernel can initiate messages.
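A comm channel is opened with a comm_open message carrying a comm id, a target name, and arbitrary data; a minimal payload sketch (field names follow the IPython messaging spec, the target name and data values are illustrative):

```python
import json
import uuid

# Content of a comm_open message: opens a custom channel between
# frontend and kernel, identified by comm_id and routed to target_name.
comm_open = {
    "comm_id": str(uuid.uuid4()),
    "target_name": "my_custom_target",  # illustrative handler name
    "data": {"action": "refresh"},      # arbitrary JSON payload
}

# Follow-up comm_msg messages reference the same comm_id.
comm_msg = {
    "comm_id": comm_open["comm_id"],
    "data": {"progress": 0.5},
}

wire = json.dumps(comm_open)  # what actually travels over ZeroMQ
```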
Livy
Livy is an open source REST interface for interacting with Spark. It supports executing code snippets in Python, Scala, and R.
It is currently used to power the Spark snippets of the Hadoop Notebook in Hue.
Supports multiple contexts through multiple sessions, or multiple users sharing the same session.
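A session is created and code submitted through simple JSON-over-HTTP calls. The sketch below only builds the request payloads; the /sessions and /sessions/{id}/statements routes are Livy's documented endpoints, while the server URL and code snippet are illustrative:

```python
import json

LIVY_URL = "http://localhost:8998"  # illustrative; point at your Livy server

# POST /sessions - create an interactive PySpark session
create_session = json.dumps({"kind": "pyspark"})
session_endpoint = LIVY_URL + "/sessions"

# POST /sessions/{id}/statements - run a code snippet in session 0
run_statement = json.dumps({"code": "sc.parallelize(range(100)).count()"})
statement_endpoint = LIVY_URL + "/sessions/0/statements"
```

In practice these bodies are sent with an HTTP client, and the statement result is polled back from the same statements endpoint.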
LIVY CODE EXECUTION DEMO
To know more : https://github.com/cloudera/hue/tree/master/apps/spark/java
Spark Job Server
JobServer provides a REST interface for submitting and managing Spark jobs/jars.
It is intended to be run as one or more independent processes, either separate from the Spark cluster or within it. It works with Mesos as well as YARN.
It supports multiple SparkContexts, each running in its own forked JVM process. This is controlled by the config parameter spark.jobserver.context-per-jvm, which defaults to false for local development mode but is recommended to be set to true for production deployments.
It exposes APIs to upload jars, get contexts, run jobs, get data, configure contexts, etc.
It uses Spray, Akka actors, and Akka Cluster for separate contexts.
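The REST surface can be sketched by assembling typical endpoint URLs. This is a hypothetical helper, not Job Server code; the host and app names are illustrative, and the routes follow the spark-jobserver documentation:

```python
from urllib.parse import urlencode

JOBSERVER = "http://localhost:8090"  # illustrative host:port

def upload_jar_url(app_name):
    """POST binary jar contents here to register an app under app_name."""
    return f"{JOBSERVER}/jars/{app_name}"

def run_job_url(app_name, class_path, context=None):
    """POST here to submit a job; appName and classPath select the
    uploaded jar and the job class. An existing context may be reused."""
    params = {"appName": app_name, "classPath": class_path}
    if context:
        params["context"] = context
    return f"{JOBSERVER}/jobs?" + urlencode(params)

url = run_job_url("wordcount", "spark.jobserver.WordCountExample")
```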
JOB SERVER DEMO
To know more : https://github.com/spark-jobserver/spark-jobserver
24
Ad

Recommended

APACHE TOREE: A JUPYTER KERNEL FOR SPARK by Marius van Niekerk
APACHE TOREE: A JUPYTER KERNEL FOR SPARK by Marius van Niekerk
Spark Summit
 
Data science lifecycle with Apache Zeppelin
Data science lifecycle with Apache Zeppelin
DataWorks Summit/Hadoop Summit
 
Apache Spark: Lightning Fast Cluster Computing
Apache Spark: Lightning Fast Cluster Computing
All Things Open
 
Bringing complex event processing to Spark streaming
Bringing complex event processing to Spark streaming
DataWorks Summit
 
OGCE TeraGrid 2010 Science Gateway Tutorial Intro
OGCE TeraGrid 2010 Science Gateway Tutorial Intro
marpierc
 
Performance Comparison of Streaming Big Data Platforms
Performance Comparison of Streaming Big Data Platforms
DataWorks Summit/Hadoop Summit
 
Spark-Streaming-as-a-Service with Kafka and YARN: Spark Summit East talk by J...
Spark-Streaming-as-a-Service with Kafka and YARN: Spark Summit East talk by J...
Spark Summit
 
Running Apache Spark & Apache Zeppelin in Production
Running Apache Spark & Apache Zeppelin in Production
DataWorks Summit/Hadoop Summit
 
Apache Spark and Apache Ignite: Where Fast Data Meets the IoT with Denis Magda
Apache Spark and Apache Ignite: Where Fast Data Meets the IoT with Denis Magda
Databricks
 
Opaque: A Data Analytics Platform with Strong Security: Spark Summit East tal...
Opaque: A Data Analytics Platform with Strong Security: Spark Summit East tal...
Spark Summit
 
Project Zen: Improving Apache Spark for Python Users
Project Zen: Improving Apache Spark for Python Users
Databricks
 
Helium makes Zeppelin fly!
Helium makes Zeppelin fly!
DataWorks Summit
 
Comparison of various streaming technologies
Comparison of various streaming technologies
Sachin Aggarwal
 
Hortonworks Technical Workshop: HDP everywhere - cloud considerations using...
Hortonworks Technical Workshop: HDP everywhere - cloud considerations using...
Hortonworks
 
Migrating pipelines into Docker
Migrating pipelines into Docker
DataWorks Summit/Hadoop Summit
 
Intro to Big Data Analytics using Apache Spark and Apache Zeppelin
Intro to Big Data Analytics using Apache Spark and Apache Zeppelin
Alex Zeltov
 
Apache Zeppelin and Spark for Enterprise Data Science
Apache Zeppelin and Spark for Enterprise Data Science
Bikas Saha
 
Cooperative Data Exploration with iPython Notebook
Cooperative Data Exploration with iPython Notebook
DataWorks Summit/Hadoop Summit
 
Apache Zeppelin Helium and Beyond
Apache Zeppelin Helium and Beyond
DataWorks Summit/Hadoop Summit
 
SparkOscope: Enabling Apache Spark Optimization through Cross Stack Monitorin...
SparkOscope: Enabling Apache Spark Optimization through Cross Stack Monitorin...
Databricks
 
Streaming Sensor Data Slides_Virender
Streaming Sensor Data Slides_Virender
vithakur
 
Advanced Visualization of Spark jobs
Advanced Visualization of Spark jobs
DataWorks Summit/Hadoop Summit
 
Apache Hive 2.0 SQL, Speed, Scale by Alan Gates
Apache Hive 2.0 SQL, Speed, Scale by Alan Gates
Big Data Spain
 
Spark Summit EU talk by William Benton
Spark Summit EU talk by William Benton
Spark Summit
 
Securing Hadoop in an Enterprise Context
Securing Hadoop in an Enterprise Context
DataWorks Summit/Hadoop Summit
 
Simplify and Boost Spark 3 Deployments with Hypervisor-Native Kubernetes
Simplify and Boost Spark 3 Deployments with Hypervisor-Native Kubernetes
Databricks
 
Lambda architecture on Spark, Kafka for real-time large scale ML
Lambda architecture on Spark, Kafka for real-time large scale ML
huguk
 
Introduction to Big Data Analytics using Apache Spark and Zeppelin on HDInsig...
Introduction to Big Data Analytics using Apache Spark and Zeppelin on HDInsig...
Alex Zeltov
 
Apache Spark Streaming: Architecture and Fault Tolerance
Apache Spark Streaming: Architecture and Fault Tolerance
Sachin Aggarwal
 
Real-Time Spark: From Interactive Queries to Streaming
Real-Time Spark: From Interactive Queries to Streaming
Databricks
 

More Related Content

What's hot (20)

Apache Spark and Apache Ignite: Where Fast Data Meets the IoT with Denis Magda
Apache Spark and Apache Ignite: Where Fast Data Meets the IoT with Denis Magda
Databricks
 
Opaque: A Data Analytics Platform with Strong Security: Spark Summit East tal...
Opaque: A Data Analytics Platform with Strong Security: Spark Summit East tal...
Spark Summit
 
Project Zen: Improving Apache Spark for Python Users
Project Zen: Improving Apache Spark for Python Users
Databricks
 
Helium makes Zeppelin fly!
Helium makes Zeppelin fly!
DataWorks Summit
 
Comparison of various streaming technologies
Comparison of various streaming technologies
Sachin Aggarwal
 
Hortonworks Technical Workshop: HDP everywhere - cloud considerations using...
Hortonworks Technical Workshop: HDP everywhere - cloud considerations using...
Hortonworks
 
Migrating pipelines into Docker
Migrating pipelines into Docker
DataWorks Summit/Hadoop Summit
 
Intro to Big Data Analytics using Apache Spark and Apache Zeppelin
Intro to Big Data Analytics using Apache Spark and Apache Zeppelin
Alex Zeltov
 
Apache Zeppelin and Spark for Enterprise Data Science
Apache Zeppelin and Spark for Enterprise Data Science
Bikas Saha
 
Cooperative Data Exploration with iPython Notebook
Cooperative Data Exploration with iPython Notebook
DataWorks Summit/Hadoop Summit
 
Apache Zeppelin Helium and Beyond
Apache Zeppelin Helium and Beyond
DataWorks Summit/Hadoop Summit
 
SparkOscope: Enabling Apache Spark Optimization through Cross Stack Monitorin...
SparkOscope: Enabling Apache Spark Optimization through Cross Stack Monitorin...
Databricks
 
Streaming Sensor Data Slides_Virender
Streaming Sensor Data Slides_Virender
vithakur
 
Advanced Visualization of Spark jobs
Advanced Visualization of Spark jobs
DataWorks Summit/Hadoop Summit
 
Apache Hive 2.0 SQL, Speed, Scale by Alan Gates
Apache Hive 2.0 SQL, Speed, Scale by Alan Gates
Big Data Spain
 
Spark Summit EU talk by William Benton
Spark Summit EU talk by William Benton
Spark Summit
 
Securing Hadoop in an Enterprise Context
Securing Hadoop in an Enterprise Context
DataWorks Summit/Hadoop Summit
 
Simplify and Boost Spark 3 Deployments with Hypervisor-Native Kubernetes
Simplify and Boost Spark 3 Deployments with Hypervisor-Native Kubernetes
Databricks
 
Lambda architecture on Spark, Kafka for real-time large scale ML
Lambda architecture on Spark, Kafka for real-time large scale ML
huguk
 
Introduction to Big Data Analytics using Apache Spark and Zeppelin on HDInsig...
Introduction to Big Data Analytics using Apache Spark and Zeppelin on HDInsig...
Alex Zeltov
 
Apache Spark and Apache Ignite: Where Fast Data Meets the IoT with Denis Magda
Apache Spark and Apache Ignite: Where Fast Data Meets the IoT with Denis Magda
Databricks
 
Opaque: A Data Analytics Platform with Strong Security: Spark Summit East tal...
Opaque: A Data Analytics Platform with Strong Security: Spark Summit East tal...
Spark Summit
 
Project Zen: Improving Apache Spark for Python Users
Project Zen: Improving Apache Spark for Python Users
Databricks
 
Helium makes Zeppelin fly!
Helium makes Zeppelin fly!
DataWorks Summit
 
Comparison of various streaming technologies
Comparison of various streaming technologies
Sachin Aggarwal
 
Hortonworks Technical Workshop: HDP everywhere - cloud considerations using...
Hortonworks Technical Workshop: HDP everywhere - cloud considerations using...
Hortonworks
 
Intro to Big Data Analytics using Apache Spark and Apache Zeppelin
Intro to Big Data Analytics using Apache Spark and Apache Zeppelin
Alex Zeltov
 
Apache Zeppelin and Spark for Enterprise Data Science
Apache Zeppelin and Spark for Enterprise Data Science
Bikas Saha
 
SparkOscope: Enabling Apache Spark Optimization through Cross Stack Monitorin...
SparkOscope: Enabling Apache Spark Optimization through Cross Stack Monitorin...
Databricks
 
Streaming Sensor Data Slides_Virender
Streaming Sensor Data Slides_Virender
vithakur
 
Apache Hive 2.0 SQL, Speed, Scale by Alan Gates
Apache Hive 2.0 SQL, Speed, Scale by Alan Gates
Big Data Spain
 
Spark Summit EU talk by William Benton
Spark Summit EU talk by William Benton
Spark Summit
 
Simplify and Boost Spark 3 Deployments with Hypervisor-Native Kubernetes
Simplify and Boost Spark 3 Deployments with Hypervisor-Native Kubernetes
Databricks
 
Lambda architecture on Spark, Kafka for real-time large scale ML
Lambda architecture on Spark, Kafka for real-time large scale ML
huguk
 
Introduction to Big Data Analytics using Apache Spark and Zeppelin on HDInsig...
Introduction to Big Data Analytics using Apache Spark and Zeppelin on HDInsig...
Alex Zeltov
 

Viewers also liked (20)

Apache Spark Streaming: Architecture and Fault Tolerance
Apache Spark Streaming: Architecture and Fault Tolerance
Sachin Aggarwal
 
Real-Time Spark: From Interactive Queries to Streaming
Real-Time Spark: From Interactive Queries to Streaming
Databricks
 
Livy: A REST Web Service For Apache Spark
Livy: A REST Web Service For Apache Spark
Jen Aman
 
Building a REST Job Server for Interactive Spark as a Service
Building a REST Job Server for Interactive Spark as a Service
Cloudera, Inc.
 
Building a REST Job Server for interactive Spark as a service by Romain Rigau...
Building a REST Job Server for interactive Spark as a service by Romain Rigau...
Spark Summit
 
700 Queries Per Second with Updates: Spark As A Real-Time Web Service
700 Queries Per Second with Updates: Spark As A Real-Time Web Service
Spark Summit
 
How to deploy Apache Spark 
to Mesos/DCOS
How to deploy Apache Spark 
to Mesos/DCOS
Legacy Typesafe (now Lightbend)
 
Real time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache Spark
Rahul Jain
 
Hadoop spark performance comparison
Hadoop spark performance comparison
arunkumar sadhasivam
 
Big Data Scala by the Bay: Interactive Spark in your Browser
Big Data Scala by the Bay: Interactive Spark in your Browser
gethue
 
Graph Data -- RDF and Property Graphs
Graph Data -- RDF and Property Graphs
andyseaborne
 
Yggdrasil: Faster Decision Trees Using Column Partitioning In Spark
Yggdrasil: Faster Decision Trees Using Column Partitioning In Spark
Jen Aman
 
Huawei Advanced Data Science With Spark Streaming
Huawei Advanced Data Science With Spark Streaming
Jen Aman
 
Building Custom Machine Learning Algorithms With Apache SystemML
Building Custom Machine Learning Algorithms With Apache SystemML
Jen Aman
 
No More “Sbt Assembly”: Rethinking Spark-Submit Using CueSheet: Spark Summit ...
No More “Sbt Assembly”: Rethinking Spark-Submit Using CueSheet: Spark Summit ...
Spark Summit
 
Huohua: A Distributed Time Series Analysis Framework For Spark
Huohua: A Distributed Time Series Analysis Framework For Spark
Jen Aman
 
codecentric AG: CQRS and Event Sourcing Applications with Cassandra
codecentric AG: CQRS and Event Sourcing Applications with Cassandra
DataStax Academy
 
Low Latency Execution For Apache Spark
Low Latency Execution For Apache Spark
Jen Aman
 
Bolt: Building A Distributed ndarray
Bolt: Building A Distributed ndarray
Jen Aman
 
Recent Developments In SparkR For Advanced Analytics
Recent Developments In SparkR For Advanced Analytics
Databricks
 
Apache Spark Streaming: Architecture and Fault Tolerance
Apache Spark Streaming: Architecture and Fault Tolerance
Sachin Aggarwal
 
Real-Time Spark: From Interactive Queries to Streaming
Real-Time Spark: From Interactive Queries to Streaming
Databricks
 
Livy: A REST Web Service For Apache Spark
Livy: A REST Web Service For Apache Spark
Jen Aman
 
Building a REST Job Server for Interactive Spark as a Service
Building a REST Job Server for Interactive Spark as a Service
Cloudera, Inc.
 
Building a REST Job Server for interactive Spark as a service by Romain Rigau...
Building a REST Job Server for interactive Spark as a service by Romain Rigau...
Spark Summit
 
700 Queries Per Second with Updates: Spark As A Real-Time Web Service
700 Queries Per Second with Updates: Spark As A Real-Time Web Service
Spark Summit
 
Real time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache Spark
Rahul Jain
 
Hadoop spark performance comparison
Hadoop spark performance comparison
arunkumar sadhasivam
 
Big Data Scala by the Bay: Interactive Spark in your Browser
Big Data Scala by the Bay: Interactive Spark in your Browser
gethue
 
Graph Data -- RDF and Property Graphs
Graph Data -- RDF and Property Graphs
andyseaborne
 
Yggdrasil: Faster Decision Trees Using Column Partitioning In Spark
Yggdrasil: Faster Decision Trees Using Column Partitioning In Spark
Jen Aman
 
Huawei Advanced Data Science With Spark Streaming
Huawei Advanced Data Science With Spark Streaming
Jen Aman
 
Building Custom Machine Learning Algorithms With Apache SystemML
Building Custom Machine Learning Algorithms With Apache SystemML
Jen Aman
 
No More “Sbt Assembly”: Rethinking Spark-Submit Using CueSheet: Spark Summit ...
No More “Sbt Assembly”: Rethinking Spark-Submit Using CueSheet: Spark Summit ...
Spark Summit
 
Huohua: A Distributed Time Series Analysis Framework For Spark
Huohua: A Distributed Time Series Analysis Framework For Spark
Jen Aman
 
codecentric AG: CQRS and Event Sourcing Applications with Cassandra
codecentric AG: CQRS and Event Sourcing Applications with Cassandra
DataStax Academy
 
Low Latency Execution For Apache Spark
Low Latency Execution For Apache Spark
Jen Aman
 
Bolt: Building A Distributed ndarray
Bolt: Building A Distributed ndarray
Jen Aman
 
Recent Developments In SparkR For Advanced Analytics
Recent Developments In SparkR For Advanced Analytics
Databricks
 
Ad

Similar to Interactive Analytics using Apache Spark (20)

The Analytic Platform behind IBM’s Watson Data Platform by Luciano Resende a...
The Analytic Platform behind IBM’s Watson Data Platform by Luciano Resende a...
Big Data Spain
 
The Analytic Platform behind IBM’s Watson Data Platform - Big Data Spain 2017
The Analytic Platform behind IBM’s Watson Data Platform - Big Data Spain 2017
Luciano Resende
 
Building iot applications with Apache Spark and Apache Bahir
Building iot applications with Apache Spark and Apache Bahir
Luciano Resende
 
An Enterprise Analytics Platform with Jupyter Notebooks and Apache Spark
An Enterprise Analytics Platform with Jupyter Notebooks and Apache Spark
Luciano Resende
 
Apache Spark™ + IBM Watson + Twitter DataPalooza SF 2015
Apache Spark™ + IBM Watson + Twitter DataPalooza SF 2015
Mike Broberg
 
Cloud scale predictive DevOps automation using Apache Spark: Velocity in Amst...
Cloud scale predictive DevOps automation using Apache Spark: Velocity in Amst...
Romeo Kienzler
 
JavaOne 2016: Getting Started with Apache Spark: Use Scala, Java, Python, or ...
JavaOne 2016: Getting Started with Apache Spark: Use Scala, Java, Python, or ...
David Taieb
 
20151015 zagreb spark_notebooks
20151015 zagreb spark_notebooks
Andrey Vykhodtsev
 
Big analytics meetup - Extended Jupyter Kernel Gateway
Big analytics meetup - Extended Jupyter Kernel Gateway
Luciano Resende
 
Jupyter con meetup extended jupyter kernel gateway
Jupyter con meetup extended jupyter kernel gateway
Luciano Resende
 
Luciano Resende - Scaling Big Data Interactive Workloads across Kubernetes Cl...
Luciano Resende - Scaling Big Data Interactive Workloads across Kubernetes Cl...
Codemotion
 
Writing Apache Spark and Apache Flink Applications Using Apache Bahir
Writing Apache Spark and Apache Flink Applications Using Apache Bahir
Luciano Resende
 
DIY Analytics with Apache Spark
DIY Analytics with Apache Spark
Adam Roberts
 
2018 02 20-jeg_index
2018 02 20-jeg_index
Chester Chen
 
Spark Kernel Talk - Apache Spark Meetup San Francisco (July 2015)
Spark Kernel Talk - Apache Spark Meetup San Francisco (July 2015)
Robert "Chip" Senkbeil
 
Apache Spark and Python: unified Big Data analytics
Apache Spark and Python: unified Big Data analytics
Julien Anguenot
 
Apache Spark for Everyone - Women Who Code Workshop
Apache Spark for Everyone - Women Who Code Workshop
Amanda Casari
 
Boston Spark Meetup event Slides Update
Boston Spark Meetup event Slides Update
vithakur
 
Jupyter Enterprise Gateway Overview
Jupyter Enterprise Gateway Overview
Luciano Resende
 
Introduction to pyspark new
Introduction to pyspark new
Anam Mahmood
 
The Analytic Platform behind IBM’s Watson Data Platform by Luciano Resende a...
The Analytic Platform behind IBM’s Watson Data Platform by Luciano Resende a...
Big Data Spain
 
The Analytic Platform behind IBM’s Watson Data Platform - Big Data Spain 2017
The Analytic Platform behind IBM’s Watson Data Platform - Big Data Spain 2017
Luciano Resende
 
Building iot applications with Apache Spark and Apache Bahir
Building iot applications with Apache Spark and Apache Bahir
Luciano Resende
 
An Enterprise Analytics Platform with Jupyter Notebooks and Apache Spark
An Enterprise Analytics Platform with Jupyter Notebooks and Apache Spark
Luciano Resende
 
Apache Spark™ + IBM Watson + Twitter DataPalooza SF 2015
Apache Spark™ + IBM Watson + Twitter DataPalooza SF 2015
Mike Broberg
 
Cloud scale predictive DevOps automation using Apache Spark: Velocity in Amst...
Cloud scale predictive DevOps automation using Apache Spark: Velocity in Amst...
Romeo Kienzler
 
JavaOne 2016: Getting Started with Apache Spark: Use Scala, Java, Python, or ...
JavaOne 2016: Getting Started with Apache Spark: Use Scala, Java, Python, or ...
David Taieb
 
20151015 zagreb spark_notebooks
20151015 zagreb spark_notebooks
Andrey Vykhodtsev
 
Big analytics meetup - Extended Jupyter Kernel Gateway
Big analytics meetup - Extended Jupyter Kernel Gateway
Luciano Resende
 
Jupyter con meetup extended jupyter kernel gateway
Jupyter con meetup extended jupyter kernel gateway
Luciano Resende
 
Luciano Resende - Scaling Big Data Interactive Workloads across Kubernetes Cl...
Luciano Resende - Scaling Big Data Interactive Workloads across Kubernetes Cl...
Codemotion
 
Writing Apache Spark and Apache Flink Applications Using Apache Bahir
Writing Apache Spark and Apache Flink Applications Using Apache Bahir
Luciano Resende
 
DIY Analytics with Apache Spark
DIY Analytics with Apache Spark
Adam Roberts
 
2018 02 20-jeg_index
2018 02 20-jeg_index
Chester Chen
 
Spark Kernel Talk - Apache Spark Meetup San Francisco (July 2015)
Spark Kernel Talk - Apache Spark Meetup San Francisco (July 2015)
Robert "Chip" Senkbeil
 
Apache Spark and Python: unified Big Data analytics
Apache Spark and Python: unified Big Data analytics
Julien Anguenot
 
Apache Spark for Everyone - Women Who Code Workshop
Apache Spark for Everyone - Women Who Code Workshop
Amanda Casari
 
Boston Spark Meetup event Slides Update
Boston Spark Meetup event Slides Update
vithakur
 
Jupyter Enterprise Gateway Overview
Jupyter Enterprise Gateway Overview
Luciano Resende
 
Introduction to pyspark new
Introduction to pyspark new
Anam Mahmood
 
Ad

Recently uploaded (20)

PPT1_CB_VII_CS_Ch3_FunctionsandChartsinCalc.ppsx
PPT1_CB_VII_CS_Ch3_FunctionsandChartsinCalc.ppsx
animaroy81
 
Indigo_Airlines_Strategy_Presentation.pptx
Indigo_Airlines_Strategy_Presentation.pptx
mukeshpurohit991
 
11_L2_Defects_and_Trouble_Shooting_2014[1].pdf
11_L2_Defects_and_Trouble_Shooting_2014[1].pdf
gun3awan88
 
最新版美国约翰霍普金斯大学毕业证(JHU毕业证书)原版定制
最新版美国约翰霍普金斯大学毕业证(JHU毕业证书)原版定制
Taqyea
 
ilide.info-tg-understanding-culture-society-and-politics-pr_127f984d2904c57ec...
ilide.info-tg-understanding-culture-society-and-politics-pr_127f984d2904c57ec...
jed P
 
Flextronics Employee Safety Data-Project-2.pptx
Flextronics Employee Safety Data-Project-2.pptx
kilarihemadri
 
英国毕业证范本利物浦约翰摩尔斯大学成绩单底纹防伪LJMU学生证办理学历认证
英国毕业证范本利物浦约翰摩尔斯大学成绩单底纹防伪LJMU学生证办理学历认证
taqyed
 
All the DataOps, all the paradigms .
All the DataOps, all the paradigms .
Lars Albertsson
 
NVIDIA Triton Inference Server, a game-changing platform for deploying AI mod...
NVIDIA Triton Inference Server, a game-changing platform for deploying AI mod...
Tamanna36
 
Indigo dyeing Presentation (2).pptx as dye
Indigo dyeing Presentation (2).pptx as dye
shreeroop1335
 
Data Visualisation in data science for students
Data Visualisation in data science for students
confidenceascend
 
Presentation by Tariq & Mohammed (1).pptx
Presentation by Tariq & Mohammed (1).pptx
AbooddSandoqaa
 
Communication_Skills_Class10_Visual.pptx
Communication_Skills_Class10_Visual.pptx
namanrastogi70555
 
定制OCAD学生卡加拿大安大略艺术与设计大学成绩单范本,OCAD成绩单复刻
定制OCAD学生卡加拿大安大略艺术与设计大学成绩单范本,OCAD成绩单复刻
taqyed
 
UPS and Big Data intro to Business Analytics.pptx
UPS and Big Data intro to Business Analytics.pptx
sanjum5582
 
Starbucks in the Indian market through its joint venture.
Starbucks in the Indian market through its joint venture.
sales480687
 
Informatics Market Insights AI Workforce.pdf
Informatics Market Insights AI Workforce.pdf
karizaroxx
 
最新版美国芝加哥大学毕业证(UChicago毕业证书)原版定制
最新版美国芝加哥大学毕业证(UChicago毕业证书)原版定制
taqyea
 
Model Evaluation & Visualisation part of a series of intro modules for data ...
Model Evaluation & Visualisation part of a series of intro modules for data ...
brandonlee626749
 
ppt somu_Jarvis_AI_Assistant_presen.pptx
ppt somu_Jarvis_AI_Assistant_presen.pptx
MohammedumarFarhan
 
PPT1_CB_VII_CS_Ch3_FunctionsandChartsinCalc.ppsx
PPT1_CB_VII_CS_Ch3_FunctionsandChartsinCalc.ppsx
animaroy81
 
Indigo_Airlines_Strategy_Presentation.pptx
Indigo_Airlines_Strategy_Presentation.pptx
mukeshpurohit991
 
11_L2_Defects_and_Trouble_Shooting_2014[1].pdf
11_L2_Defects_and_Trouble_Shooting_2014[1].pdf
gun3awan88
 
最新版美国约翰霍普金斯大学毕业证(JHU毕业证书)原版定制
最新版美国约翰霍普金斯大学毕业证(JHU毕业证书)原版定制
Taqyea
 
ilide.info-tg-understanding-culture-society-and-politics-pr_127f984d2904c57ec...
ilide.info-tg-understanding-culture-society-and-politics-pr_127f984d2904c57ec...
jed P
 
Flextronics Employee Safety Data-Project-2.pptx
Flextronics Employee Safety Data-Project-2.pptx
kilarihemadri
 
英国毕业证范本利物浦约翰摩尔斯大学成绩单底纹防伪LJMU学生证办理学历认证
英国毕业证范本利物浦约翰摩尔斯大学成绩单底纹防伪LJMU学生证办理学历认证
taqyed
 
All the DataOps, all the paradigms .
All the DataOps, all the paradigms .
Lars Albertsson
 
NVIDIA Triton Inference Server, a game-changing platform for deploying AI mod...
NVIDIA Triton Inference Server, a game-changing platform for deploying AI mod...
Tamanna36
 
Indigo dyeing Presentation (2).pptx as dye
Indigo dyeing Presentation (2).pptx as dye
shreeroop1335
 
Data Visualisation in data science for students
Data Visualisation in data science for students
confidenceascend
 
Presentation by Tariq & Mohammed (1).pptx
Presentation by Tariq & Mohammed (1).pptx
AbooddSandoqaa
 
Communication_Skills_Class10_Visual.pptx
Communication_Skills_Class10_Visual.pptx
namanrastogi70555
 
定制OCAD学生卡加拿大安大略艺术与设计大学成绩单范本,OCAD成绩单复刻
定制OCAD学生卡加拿大安大略艺术与设计大学成绩单范本,OCAD成绩单复刻
taqyed
 
UPS and Big Data intro to Business Analytics.pptx
UPS and Big Data intro to Business Analytics.pptx
sanjum5582
 
Starbucks in the Indian market through its joint venture.
Starbucks in the Indian market through its joint venture.
sales480687
 
Informatics Market Insights AI Workforce.pdf
Informatics Market Insights AI Workforce.pdf
karizaroxx
 
最新版美国芝加哥大学毕业证(UChicago毕业证书)原版定制
最新版美国芝加哥大学毕业证(UChicago毕业证书)原版定制
taqyea
 
Model Evaluation & Visualisation part of a series of intro modules for data ...
Model Evaluation & Visualisation part of a series of intro modules for data ...
brandonlee626749
 
ppt somu_Jarvis_AI_Assistant_presen.pptx
ppt somu_Jarvis_AI_Assistant_presen.pptx
MohammedumarFarhan
 

Interactive Analytics using Apache Spark

  • 6. User and Tools: primary set of tools for each role. The slide maps the tools (Spark Shell, Spark Submit, Thrift JDBC Server, Zeppelin, Spark Kernel, Jupyter, Livy, Hue, Spark Job Server) to the roles that use them most: Data Scientist, Data Analyst, Developer, Business Analyst.
  • 7. Spark Context
  ▪ Common thread for all Spark interfaces
  ▪ Main entry point for Spark; represents the connection to a Spark cluster
  ▪ Cluster managers: Standalone, YARN, Mesos, Local
  ▪ Holds all the configuration: memory, cores, parallelism, compression
  ▪ Creates RDDs, accumulators, broadcast variables
  ▪ Runs jobs, cancels jobs
  ▪ Limitation: one SparkContext per JVM, hence one application id
  ▪ Supports parallel jobs submitted from separate threads
  ▪ Scheduler mode within an application: FIFO / Fair
  ▪ Fair Scheduler
  − Pools: spark.scheduler.pool
  − Per-pool scheduling mode, weight, and minimum share
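The fair-scheduler pools above are declared in an allocation file pointed to by spark.scheduler.allocation.file. A minimal sketch (the pool name "adhoc" is hypothetical; the element names follow Spark's fair-scheduler file format):

```xml
<?xml version="1.0"?>
<!-- fairscheduler.xml: one pool per workload; a job joins a pool via
     sc.setLocalProperty("spark.scheduler.pool", "adhoc") -->
<allocations>
  <pool name="adhoc">
    <schedulingMode>FAIR</schedulingMode>
    <weight>2</weight>
    <minShare>1</minShare>
  </pool>
</allocations>
```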
  • 8. Spark Shell
  ▪ Interactive shell (spark-shell for Scala, pyspark for Python)
  ▪ spark-shell is based on the Scala REPL
  ▪ Instantiates a SparkContext by default, along with a web UI
  ▪ Also provides sqlContext, which is a HiveContext when Spark is built with Hive support
  ▪ Internally calls spark-submit:
  "${SPARK_HOME}"/bin/spark-submit --class org.apache.spark.repl.Main --name "Spark shell"
  ▪ All spark-submit parameters can be passed to spark-shell as well
  • 9. Spark Submit
  ▪ Launches/submits a Spark application to a Spark cluster
  ▪ org.apache.spark.launcher.Main is invoked with org.apache.spark.deploy.SparkSubmit as a parameter, along with the other params passed to spark-submit:
  org.apache.spark.launcher.Main org.apache.spark.deploy.SparkSubmit "$@"
  ▪ spark-submit --help lists the supported parameters
  ▪ Kill a job (spark-submit --kill) and get job status (spark-submit --status)
  ▪ Reads spark-defaults.conf in SPARK_CONF_DIR
  ▪ Precedence: values set explicitly on SparkConf, then flags passed to spark-submit, then values from spark-defaults.conf
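The precedence rule above amounts to a layered merge, lowest priority first. A minimal sketch (the property names are real Spark keys; the values are made up for illustration):

```python
def effective_conf(defaults, submit_flags, spark_conf):
    """Merge Spark configuration sources in increasing order of precedence:
    spark-defaults.conf < spark-submit flags < explicit SparkConf.set() calls."""
    merged = dict(defaults)
    merged.update(submit_flags)
    merged.update(spark_conf)
    return merged

# Illustrative values only.
defaults     = {"spark.executor.memory": "1g", "spark.master": "local[*]"}
submit_flags = {"spark.executor.memory": "2g"}   # e.g. --executor-memory 2g
spark_conf   = {"spark.executor.memory": "4g"}   # conf.set(...) in application code

conf = effective_conf(defaults, submit_flags, spark_conf)
print(conf["spark.executor.memory"])  # 4g: explicit SparkConf wins
print(conf["spark.master"])           # local[*]: only set in defaults
```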
  • 10. Apache Zeppelin
  ▪ Web-based notebook for interactive analytics
  ▪ Provides built-in Spark integration
  ▪ Supports many interpreters such as Scala, PySpark, Spark SQL, Hive, Shell, etc.
  ▪ Starts a Zeppelin server
  ▪ Spawns one JVM per interpreter group
  ▪ The server communicates with each interpreter group using Thrift
  • 11. Zeppelin Demo. To know more: http://zeppelin.incubator.apache.org/
  • 12. Jupyter Notebook
  ▪ Web notebook for interactive data analysis; part of the Jupyter ecosystem
  ▪ Evolved from IPython; works on the IPython messaging protocol
  ▪ Built around the concept of kernels: any language kernel that implements the protocol can be plugged in
  ▪ A Spark kernel is available via Apache Toree
  • 13. Jupyter Notebook Demo. To know more: http://jupyter.org/
  • 14. Spark Kernel (Apache Toree)
  ▪ Provides the foundation for interactive applications to connect to and use Spark
  ▪ Provides an interface that allows clients to interact with a Spark cluster
  ▪ Clients can send code snippets and libraries that are interpreted and run against a pre-configured SparkContext
  ▪ Acts as a proxy between your application and the Spark cluster
  • 15. Kernel Architecture
  ▪ The kernel uses ZeroMQ as its messaging middleware over TCP sockets and implements the IPython message protocol
  ▪ Architected in layers, where each layer has a specific purpose in processing requests
  ▪ Provides concurrency and code isolation through the Akka framework
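For a feel of the wire format, this is roughly the body of an IPython-protocol execute_request message before it is serialized and signed onto the ZeroMQ shell socket (field names follow the IPython/Jupyter messaging spec; the session id and code snippet are made up):

```python
import uuid
from datetime import datetime, timezone

def execute_request(code, session):
    """Build the dict sent as an IPython 'execute_request' message."""
    return {
        "header": {
            "msg_id": str(uuid.uuid4()),
            "msg_type": "execute_request",
            "session": session,
            "username": "client",
            "date": datetime.now(timezone.utc).isoformat(),
            "version": "5.0",
        },
        "parent_header": {},   # filled in on replies, empty on requests
        "metadata": {},
        "content": {
            "code": code,
            "silent": False,
            "store_history": True,
            "user_expressions": {},
            "allow_stdin": False,
        },
    }

msg = execute_request("sc.parallelize(1 to 10).sum()", session="demo-session")
print(msg["header"]["msg_type"])  # execute_request
```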
  • 16. How Does It Talk to Spark?
  ▪ The kernel is launched by a spark-submit process
  ▪ Works with local Spark, a standalone Spark cluster, and Spark on YARN
  ▪ SPARK_HOME is a mandatory environment variable
  ▪ SPARK_OPTS is optional; use it to configure the Spark master, deploy mode, driver memory, number of executors, etc.
  ▪ Uses the same Scala interpreter as the Spark shell; the interpreter holds a SparkContext and the class server URI used to host compiled code
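A hypothetical environment setup for launching the kernel (the path and resource sizes are made up; SPARK_OPTS accepts the same flags as spark-submit):

```shell
export SPARK_HOME=/opt/spark          # mandatory
export SPARK_OPTS="--master yarn \
  --deploy-mode client \
  --driver-memory 2g \
  --num-executors 4"                  # optional tuning
```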
  • 17. How to Communicate with the Kernel. Two forms of communication:
  1. Client library for code execution
  2. Talk to the kernel directly, as the Jupyter notebook does
  • 18. Kernel Client Library
  ▪ Written in Scala
  ▪ Eliminates the need to understand the ZeroMQ message protocol
  ▪ Enables treating the kernel as a remote service
  ▪ Shares the majority of its code with the kernel's codebase
  ▪ Two steps to using the client:
  1. Initialize the client with the connection details of the kernel.
  2. Use the execute API to run code snippets with attached callbacks.
  • 19. How to run the Kernel and Client: https://github.com/ibm-et/spark-kernel/wiki/Getting-Started-with-the-Spark-Kernel and https://github.com/ibm-et/spark-kernel/wiki/Guide-for-the-Spark-Kernel-Client. Code Demo.
  • 20. Comm API
  ▪ Part of the IPython message protocol; lets developers define custom messages to communicate data and perform actions on both the frontend (client) and backend (kernel)
  ▪ Useful when the same action should be taken on both sides in response to a message
  ▪ Either the client or the kernel can initiate messages
  • 21. Livy
  ▪ Open source REST interface for interacting with Spark
  ▪ Supports executing code snippets in Python, Scala, and R
  ▪ Currently used to power the Spark snippets of the Hadoop notebook in Hue
  ▪ Supports multiple contexts via multiple sessions, or multiple users sharing the same session
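Livy's REST workflow is: POST /sessions to create an interpreter session, then POST /sessions/{id}/statements to run snippets. A sketch that builds the request payloads (the host/port and code snippet are made up; the endpoint paths and "kind" field follow Livy's documented API):

```python
import json

LIVY_URL = "http://localhost:8998"   # hypothetical Livy endpoint

def create_session_payload(kind):
    """Body for POST /sessions; kind selects the interpreter (spark | pyspark | sparkr)."""
    return {"kind": kind}

def statement_payload(code):
    """Body for POST /sessions/{id}/statements."""
    return {"code": code}

session_req = json.dumps(create_session_payload("pyspark"))
stmt_req = json.dumps(statement_payload("sc.parallelize(range(100)).count()"))
print(session_req)  # {"kind": "pyspark"}
```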
  • 22. Livy Code Execution Demo. To know more: https://github.com/cloudera/hue/tree/master/apps/spark/java
  • 23. Spark Job Server
  ▪ Provides a REST interface for submitting and managing Spark jobs/jars
  ▪ Intended to run as one or more independent processes, separate from the Spark cluster or within it; works with Mesos as well as YARN
  ▪ Supports multiple SparkContexts, each running in its own forked JVM process; controlled by the config parameter spark.jobserver.context-per-jvm (false by default for local development mode, recommended true for production deployment)
  ▪ Exposes APIs to upload jars, manage contexts, run jobs, fetch results, configure contexts, etc.
  ▪ Uses Spray and Akka actors, with Akka Cluster for separate contexts
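The Job Server API mentioned above is plain REST. A sketch of the main endpoint URLs (paths and query parameters follow the spark-jobserver README; the host, app, context, and class names are made up):

```python
BASE = "http://localhost:8090"   # hypothetical Job Server address

def upload_jar_url(app_name):
    """POST the jar bytes here to register an application under app_name."""
    return f"{BASE}/jars/{app_name}"

def create_context_url(name, num_cpu=2, mem="512m"):
    """POST here to start a named, long-lived SparkContext with the given resources."""
    return f"{BASE}/contexts/{name}?num-cpu-cores={num_cpu}&memory-per-node={mem}"

def run_job_url(app_name, class_path, context=None):
    """POST the job config here; omit context to let the server create a transient one."""
    url = f"{BASE}/jobs?appName={app_name}&classPath={class_path}"
    return url + (f"&context={context}" if context else "")

print(run_job_url("demo", "spark.jobserver.WordCountExample", "my-context"))
```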
  • 24. Job Server Demo. To know more: https://github.com/spark-jobserver/spark-jobserver