IoT Applications and Patterns
using Apache Spark &
Apache Bahir
Luciano Resende
June 14th, 2018
© 2018 IBM Corporation 1
About me - Luciano Resende
Data Science Platform Architect – IBM – CODAIT
• Have been contributing to open source at the ASF for over 10 years
• Currently contributing to the Jupyter Notebook ecosystem, Apache Bahir, Apache Toree, and Apache Spark, among other projects related to AI/ML platforms
lresende@apache.org
https://www.linkedin.com/in/lresende
@lresende1975
https://github.com/lresende
Open Source Community Leadership
C O D A I T
- Founding Partner, Open Source
- 188+ Project Committers
- 77+ Projects
- Key open source steering committee memberships
- OSS Advisory Board

Center for Open Source Data and AI Technologies (CODAIT)
codait.org
codait (French) = coder/coded
https://m.interglot.com/fr/en/codait
CODAIT aims to make AI solutions dramatically easier to create, deploy, and manage in the enterprise.
Relaunch of the Spark Technology Center (STC) to reflect the expanded mission.
Agenda
Introductions
- Apache Spark
- Apache Bahir
IoT Applications
Live Demo
Summary
References
Q&A
Apache Spark
Apache Spark Introduction
Spark is built around a core engine plus higher-level libraries:
- Spark Core – general compute engine; handles distributed task dispatching, scheduling, and basic I/O functions
- Spark SQL – executes SQL statements
- Spark Streaming – performs streaming analytics using micro-batches
- Spark ML – common machine learning and statistical algorithms
- Spark GraphX – distributed graph processing framework
A large variety of data sources and formats can be supported, both on-premises and in the cloud, e.g. BigInsights (HDFS), Cloudant, dashDB, SQL DB.
Apache Spark Evolution
Apache Spark – Spark SQL
Unified data access APIs: query structured data sets with SQL or Dataset/DataFrame APIs.
Fast, familiar query language across all of your enterprise data, from RDBMS data sources to Structured Streaming data sources.
Apache Spark – Spark SQL
You can run SQL statements through the SparkSession.sql(…) interface:

val spark = SparkSession.builder()
  .appName("Demo")
  .getOrCreate()

spark.sql("create table T1 (c1 int, c2 int) stored as parquet")
val ds = spark.sql("select * from T1")

You can further transform the resulting Dataset:

val ds1 = ds.groupBy("c1").agg("c2" -> "sum")
val ds2 = ds.orderBy("c1")

The result is a DataFrame / Dataset[Row]; ds.show() displays the rows.
Apache Spark – Spark SQL
You can read from data sources using SparkSession.read.format(…):

val spark = SparkSession.builder()
  .appName("Demo")
  .getOrCreate()

case class Bank(age: Integer, job: String, marital: String, education: String, balance: Integer)

// loading CSV data into a Dataset of Bank type
val bankFromCSV = spark.read.csv("hdfs://localhost:9000/data/bank.csv").as[Bank]

// loading JSON data into a Dataset of Bank type
val bankFromJSON = spark.read.json("hdfs://localhost:9000/data/bank.json").as[Bank]

// select a column value from the Dataset
bankFromCSV.select('age).show() // returns all rows of column "age" from this dataset
Apache Spark – Spark SQL
You can also configure a specific data source with specific options:

val spark = SparkSession.builder()
  .appName("Demo")
  .getOrCreate()

case class Bank(age: Integer, job: String, marital: String, education: String, balance: Integer)

// loading CSV data into a Dataset of Bank type
val bankFromCSV = spark.read
  .option("header", "true")      // use first line of all files as header
  .option("inferSchema", "true") // automatically infer data types
  .option("delimiter", " ")
  .csv("/users/lresende/data.csv")
  .as[Bank]

bankFromCSV.select('age).show() // returns all rows of column "age" from this dataset
Apache Spark – Spark SQL – Data Sources
Data Sources under the covers
- Data source registration (e.g. via spark.read.format(…))
- Provide a BaseRelation implementation
  • that implements support for table scans:
    – TableScan, PrunedScan, PrunedFilteredScan, CatalystScan
- Detailed information available at
  • https://developer.ibm.com/code/2016/11/10/exploring-apache-spark-datasource-api/
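As a sketch of the V1 contract above, here is a minimal, hypothetical data source. The RangeRelation/RangeSourceProvider names and the integer-range "table" are illustrative, not part of Spark or Bahir; the interfaces are the Spark 2.x V1 ones:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, RelationProvider, TableScan}
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

// Hypothetical relation that exposes the integers 0 until `size` as a one-column table.
class RangeRelation(val sqlContext: SQLContext, size: Int)
  extends BaseRelation with TableScan {

  // The schema Spark SQL will see for this relation
  override def schema: StructType =
    StructType(StructField("value", IntegerType, nullable = false) :: Nil)

  // Full table scan: return every row as an RDD[Row]
  override def buildScan(): RDD[Row] =
    sqlContext.sparkContext.parallelize(0 until size).map(Row(_))
}

// This is the class name you would pass to spark.read.format(...)
class RangeSourceProvider extends RelationProvider {
  override def createRelation(
      sqlContext: SQLContext,
      parameters: Map[String, String]): BaseRelation =
    new RangeRelation(sqlContext, parameters.getOrElse("size", "10").toInt)
}
```

Reading from it then looks like any other source, e.g. spark.read.format("RangeSourceProvider").option("size", "5").load().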
Apache Spark – Spark SQL – Data Sources
Data Sources V1 limitations
- Leaks upper-level APIs into the data source (DataFrame/SQLContext)
- Hard to extend the Data Sources API for more optimizations
- No transaction guarantees in the write APIs
- Limited extensibility
Apache Spark – Spark SQL – Data Sources
Data Sources V2
- Support for row-based and columnar scans
- Column pruning and filter push-down
- Can report basic statistics and data partitioning
- Transactional write API
- Streaming source and sink support for micro-batch and continuous mode
- Detailed information available at
  • https://developer.ibm.com/code/2018/04/16/introducing-apache-spark-data-sources-api-v2/
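The V2 interfaces can be sketched the same way. This is the Spark 2.3-era shape of the API (it was reworked in later releases), and all class names here are hypothetical:

```scala
import java.util
import scala.collection.JavaConverters._

import org.apache.spark.sql.Row
import org.apache.spark.sql.sources.v2.reader.{DataReader, DataReaderFactory, DataSourceReader}
import org.apache.spark.sql.sources.v2.{DataSourceOptions, DataSourceV2, ReadSupport}
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

// Hypothetical V2 source exposing the numbers 0..9 as a single partition.
class RangeSourceV2 extends DataSourceV2 with ReadSupport {
  override def createReader(options: DataSourceOptions): DataSourceReader =
    new RangeReader
}

class RangeReader extends DataSourceReader {
  // The schema exposed to Spark SQL
  override def readSchema(): StructType =
    StructType(StructField("value", IntegerType, nullable = false) :: Nil)

  // One factory per partition; here a single partition
  override def createDataReaderFactories(): util.List[DataReaderFactory[Row]] =
    List[DataReaderFactory[Row]](new RangeReaderFactory).asJava
}

class RangeReaderFactory extends DataReaderFactory[Row] {
  override def createDataReader(): DataReader[Row] = new DataReader[Row] {
    private var current = -1
    override def next(): Boolean = { current += 1; current < 10 }
    override def get(): Row = Row(current)
    override def close(): Unit = ()
  }
}
```

Compared to V1, nothing from the upper-level DataFrame API leaks into these interfaces, which is exactly the limitation V2 set out to fix.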
Apache Spark – Spark SQL Structured Streaming
Unified programming model for streaming, interactive, and batch queries.
Considers the data stream as an unbounded table.
Image source: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
Apache Spark – Spark SQL Structured Streaming
SQL regular APIs:

val spark = SparkSession.builder()
  .appName("Demo")
  .getOrCreate()

val input = spark.read
  .schema(schema)
  .format("csv")
  .load("input-path")

val result = input
  .select("age")
  .where("age > 18")

result.write
  .format("json")
  .save("dest-path")

Structured Streaming APIs:

val spark = SparkSession.builder()
  .appName("Demo")
  .getOrCreate()

val input = spark.readStream
  .schema(schema)
  .format("csv")
  .load("input-path")

val result = input
  .select("age")
  .where("age > 18")

result.writeStream
  .format("json")
  .start("dest-path")
Apache Spark – Spark Streaming
Micro-batch event processing for near-real-time analytics, e.g. Internet of Things (IoT) devices, Twitter feeds, Kafka (event hub), etc.
No multi-threading or parallel-processing programming required.
Apache Spark – Spark Streaming
Also known as discretized streams, or DStreams:
- Abstracts a continuous stream of data
- Based on micro-batching
- Based on RDDs
Apache Spark – Spark Streaming
val sparkConf = new SparkConf()
.setAppName("MQTTWordCount")
val ssc = new StreamingContext(sparkConf, Seconds(2))
val lines = MQTTUtils.createStream(ssc, brokerUrl, topic, StorageLevel.MEMORY_ONLY_SER_2)
val words = lines.flatMap(x => x.split(" "))
val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
wordCounts.print()
ssc.start()
ssc.awaitTermination()
21
Apache Bahir
Origins of the Apache Bahir Project
MAY/2016: Established as a top-level Apache project.
- PMC formed by Apache Spark committers/PMC members and Apache Members
- Initial contributions imported from Apache Spark
AUG/2016: The Apache Flink community joined Apache Bahir.
- Initial contributions of Flink extensions
- In October 2016 Robert Metzger was elected committer
Origins of the Bahir name
Naming an Apache project is a science!!!
- We needed a name that wasn't used yet
- It needed to be related to Spark
We ended up with: Bahir
- A name of Arabic origin that means "sparkling"
- Also associated with someone who succeeds at everything
Why Apache Bahir
It’s an Apache project
- And if you are here, you know what it means
Benefits of curating your extensions at Apache Bahir
- Apache Governance
- Apache License
- Apache Community
- Apache Brand
Why Apache Bahir
Flexibility
- Release flexibility
  • Not bound to the platform or component release cycle
Shared infrastructure
- Releases, CI, etc.
Shared knowledge
- Collaborate with experts on both platform and component areas
Bahir extensions for Apache Spark
MQTT – Enables reading data from MQTT servers using Spark Streaming or Structured Streaming.
• http://bahir.apache.org/docs/spark/current/spark-sql-streaming-mqtt/
• http://bahir.apache.org/docs/spark/current/spark-streaming-mqtt/
CouchDB/Cloudant – Enables reading data from CouchDB/Cloudant using Spark SQL and Spark Streaming.
Twitter – Enables reading social data from Twitter using Spark Streaming.
• http://bahir.apache.org/docs/spark/current/spark-streaming-twitter/
Akka – Enables reading data from Akka actors using Spark Streaming or Structured Streaming.
• http://bahir.apache.org/docs/spark/current/spark-streaming-akka/
ZeroMQ – Enables reading data from ZeroMQ using Spark Streaming.
• http://bahir.apache.org/docs/spark/current/spark-streaming-zeromq/
Bahir extensions for Apache Spark
Google Cloud Pub/Sub – Adds a Spark Streaming connector for Google Cloud Pub/Sub.
Apache Spark extensions in Bahir
Adding Bahir extensions into your application
- Using SBT
libraryDependencies += "org.apache.bahir" %% "spark-streaming-mqtt" % "2.2.0"
- Using Maven
<dependency>
  <groupId>org.apache.bahir</groupId>
  <artifactId>spark-streaming-mqtt_2.11</artifactId>
  <version>2.2.0</version>
</dependency>
Apache Spark extensions in Bahir
Submitting applications with Bahir extensions to Spark
- spark-shell
bin/spark-shell --packages org.apache.bahir:spark-streaming-mqtt_2.11:2.2.0 ...
- spark-submit
bin/spark-submit --packages org.apache.bahir:spark-streaming-mqtt_2.11:2.2.0 ...
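Once the package is on the classpath, Bahir's MQTT Structured Streaming source can be used like any other source. A sketch, assuming a local Mosquitto broker on the default port; the topic name is a placeholder:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("MQTTStructuredDemo")
  .getOrCreate()

// Bahir registers the Structured Streaming source under this provider class name
val messages = spark.readStream
  .format("org.apache.bahir.sql.streaming.mqtt.MQTTStreamSourceProvider")
  .option("topic", "sensors/elevator")   // placeholder topic
  .load("tcp://localhost:1883")          // placeholder broker URL

// Each row carries the message payload plus its arrival timestamp;
// here we simply echo the stream to the console
val query = messages.writeStream
  .format("console")
  .start()

query.awaitTermination()
```

From here the stream can be filtered, aggregated, and written out with the same Dataset operations shown in the Spark SQL slides.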
Internet of Things – IoT
IoT – Definition by Wikipedia
The Internet of things (IoT) is the network of physical devices, vehicles, home appliances, and other items embedded with electronics, software, sensors, actuators, and network connectivity which enable these objects to connect and exchange data.
IoT – Interaction between multiple entities
(Diagram: Things, Software, and People interact through control, observe, inform, command, and actuate relationships.)
IoT Patterns – Some of them …
• Remote control
• Security analysis
• Edge analytics
• Historical data analysis
• Distributed Platforms
• Real-time decisions
MQTT – M2M / IoT Connectivity Protocol
Connect + Publish + Subscribe
History:
- ~1999: Created by IBM / Eurotech
- 2010: Specification published
- 2011: Eclipse M2M / Paho
- 2014: OASIS open spec
- May 2018: MQTT v5
Minimal overhead: 2-4 byte header (publish), 14 bytes (connect). Tiny clients (Java 170KB). 40+ client implementations.
MQTT – Quality of Service
The MQTT broker supports three QoS levels:
- QoS0 – At most once: no connection failover, never duplicates
- QoS1 – At least once: has connection failover, can duplicate
- QoS2 – Exactly once: has connection failover, never duplicates
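A publisher chooses the QoS per message. A sketch using the Eclipse Paho Java client; the broker URL, topic, and payload are placeholders for your environment:

```scala
import org.eclipse.paho.client.mqttv3.{MqttClient, MqttMessage}
import org.eclipse.paho.client.mqttv3.persist.MemoryPersistence

// Connect to a (placeholder) local broker
val client = new MqttClient(
  "tcp://localhost:1883",
  MqttClient.generateClientId(),
  new MemoryPersistence)
client.connect()

// Publish the same payload at each QoS level
for (qos <- Seq(0, 1, 2)) {
  val message = new MqttMessage("temperature=21.5".getBytes("UTF-8"))
  message.setQos(qos) // 0 = at most once, 1 = at least once, 2 = exactly once
  client.publish("sensors/elevator", message)
}

client.disconnect()
```

QoS is negotiated per message, so a single client can mix fire-and-forget telemetry (QoS0) with commands that must arrive exactly once (QoS2).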
MQTT – World usage
Smart home automation
Messaging
Notable mentions:
- IBM IoT Platform
- AWS IoT
- Microsoft IoT Hub
- Facebook Messenger
Live Demo
IoT simulator using MQTT
The demo environment: https://github.com/lresende/bahir-iot-demo
- Node.js web app that simulates elevator IoT devices
- Elevator simulator metrics: weight, speed, power, temperature, system
- MQTT broker: Mosquitto
Summary – Take-away points
Apache Spark
- IoT analytics runtime with support for "Continuous Applications"
Apache Bahir
- Brings access to IoT data via supported connectors (e.g. MQTT)
IoT Applications
- Use Spark and Bahir to start processing IoT data in near real time using Spark Streaming and Spark Structured Streaming
Join the Apache Bahir community
References
Apache Bahir
http://bahir.apache.org
Documentation for Apache Spark extensions
http://bahir.apache.org/docs/spark/current/documentation/
Source Repositories
https://github.com/apache/bahir
https://github.com/apache/bahir-website
Demo Repository
https://github.com/lresende/bahir-iot-demo
46March 30 2018 / © 2018 IBM Corporation

More Related Content

What's hot (20)

Using Apache Spark with IBM SPSS Modeler
Using Apache Spark with IBM SPSS Modeler
Global Knowledge Training
 
Spark Summit EU talk by Steve Loughran
Spark Summit EU talk by Steve Loughran
Spark Summit
 
Sparkly Notebook: Interactive Analysis and Visualization with Spark
Sparkly Notebook: Interactive Analysis and Visualization with Spark
felixcss
 
Apache Spark: Lightning Fast Cluster Computing
Apache Spark: Lightning Fast Cluster Computing
All Things Open
 
Real-time Streaming Pipelines with FLaNK
Real-time Streaming Pipelines with FLaNK
Data Con LA
 
Apache Zeppelin + Livy: Bringing Multi Tenancy to Interactive Data Analysis
Apache Zeppelin + Livy: Bringing Multi Tenancy to Interactive Data Analysis
DataWorks Summit/Hadoop Summit
 
Apache Zeppelin, Helium and Beyond
Apache Zeppelin, Helium and Beyond
DataWorks Summit/Hadoop Summit
 
S3Guard: What's in your consistency model?
S3Guard: What's in your consistency model?
Hortonworks
 
IoT Edge Processing with Apache NiFi and MiniFi and Apache MXNet for IoT NY 2018
IoT Edge Processing with Apache NiFi and MiniFi and Apache MXNet for IoT NY 2018
Timothy Spann
 
Intro to Big Data Analytics using Apache Spark and Apache Zeppelin
Intro to Big Data Analytics using Apache Spark and Apache Zeppelin
Alex Zeltov
 
Apache Spark Overview
Apache Spark Overview
airisData
 
Migrating pipelines into Docker
Migrating pipelines into Docker
DataWorks Summit/Hadoop Summit
 
Matt Franklin - Apache Software (Geekfest)
Matt Franklin - Apache Software (Geekfest)
W2O Group
 
Apache Spark and Apache Ignite: Where Fast Data Meets the IoT with Denis Magda
Apache Spark and Apache Ignite: Where Fast Data Meets the IoT with Denis Magda
Databricks
 
Streamline Hadoop DevOps with Apache Ambari
Streamline Hadoop DevOps with Apache Ambari
DataWorks Summit/Hadoop Summit
 
Spark,Hadoop,Presto Comparition
Spark,Hadoop,Presto Comparition
Sandish Kumar H N
 
August 2016 HUG: Recent development in Apache Oozie
August 2016 HUG: Recent development in Apache Oozie
Yahoo Developer Network
 
Data Science with Spark & Zeppelin
Data Science with Spark & Zeppelin
Vinay Shukla
 
Apache Fink 1.0: A New Era for Real-World Streaming Analytics
Apache Fink 1.0: A New Era for Real-World Streaming Analytics
Slim Baltagi
 
Troubleshooting Kerberos in Hadoop: Taming the Beast
Troubleshooting Kerberos in Hadoop: Taming the Beast
DataWorks Summit
 
Spark Summit EU talk by Steve Loughran
Spark Summit EU talk by Steve Loughran
Spark Summit
 
Sparkly Notebook: Interactive Analysis and Visualization with Spark
Sparkly Notebook: Interactive Analysis and Visualization with Spark
felixcss
 
Apache Spark: Lightning Fast Cluster Computing
Apache Spark: Lightning Fast Cluster Computing
All Things Open
 
Real-time Streaming Pipelines with FLaNK
Real-time Streaming Pipelines with FLaNK
Data Con LA
 
Apache Zeppelin + Livy: Bringing Multi Tenancy to Interactive Data Analysis
Apache Zeppelin + Livy: Bringing Multi Tenancy to Interactive Data Analysis
DataWorks Summit/Hadoop Summit
 
S3Guard: What's in your consistency model?
S3Guard: What's in your consistency model?
Hortonworks
 
IoT Edge Processing with Apache NiFi and MiniFi and Apache MXNet for IoT NY 2018
IoT Edge Processing with Apache NiFi and MiniFi and Apache MXNet for IoT NY 2018
Timothy Spann
 
Intro to Big Data Analytics using Apache Spark and Apache Zeppelin
Intro to Big Data Analytics using Apache Spark and Apache Zeppelin
Alex Zeltov
 
Apache Spark Overview
Apache Spark Overview
airisData
 
Matt Franklin - Apache Software (Geekfest)
Matt Franklin - Apache Software (Geekfest)
W2O Group
 
Apache Spark and Apache Ignite: Where Fast Data Meets the IoT with Denis Magda
Apache Spark and Apache Ignite: Where Fast Data Meets the IoT with Denis Magda
Databricks
 
Spark,Hadoop,Presto Comparition
Spark,Hadoop,Presto Comparition
Sandish Kumar H N
 
August 2016 HUG: Recent development in Apache Oozie
August 2016 HUG: Recent development in Apache Oozie
Yahoo Developer Network
 
Data Science with Spark & Zeppelin
Data Science with Spark & Zeppelin
Vinay Shukla
 
Apache Fink 1.0: A New Era for Real-World Streaming Analytics
Apache Fink 1.0: A New Era for Real-World Streaming Analytics
Slim Baltagi
 
Troubleshooting Kerberos in Hadoop: Taming the Beast
Troubleshooting Kerberos in Hadoop: Taming the Beast
DataWorks Summit
 

Similar to IoT Applications and Patterns using Apache Spark & Apache Bahir (20)

Building iot applications with Apache Spark and Apache Bahir
Building iot applications with Apache Spark and Apache Bahir
Luciano Resende
 
Spark Concepts - Spark SQL, Graphx, Streaming
Spark Concepts - Spark SQL, Graphx, Streaming
Petr Zapletal
 
Started with-apache-spark
Started with-apache-spark
Happiest Minds Technologies
 
Apache Spark in Industry
Apache Spark in Industry
Dorian Beganovic
 
Building highly scalable data pipelines with Apache Spark
Building highly scalable data pipelines with Apache Spark
Martin Toshev
 
Pyspark presentationsfspfsjfspfjsfpsjfspfjsfpsjfsfsf
Pyspark presentationsfspfsjfspfjsfpsjfspfjsfpsjfsfsf
sasuke20y4sh
 
Apache Spark PDF
Apache Spark PDF
Naresh Rupareliya
 
Apache Spark - A High Level overview
Apache Spark - A High Level overview
Karan Alang
 
Big data processing with Apache Spark and Oracle Database
Big data processing with Apache Spark and Oracle Database
Martin Toshev
 
In Memory Analytics with Apache Spark
In Memory Analytics with Apache Spark
Venkata Naga Ravi
 
Apache Spark Components
Apache Spark Components
Girish Khanzode
 
Glint with Apache Spark
Glint with Apache Spark
Venkata Naga Ravi
 
Apache Spark Streaming
Apache Spark Streaming
Bartosz Jankiewicz
 
Not Your Father's Database by Databricks
Not Your Father's Database by Databricks
Caserta
 
Real-Time Spark: From Interactive Queries to Streaming
Real-Time Spark: From Interactive Queries to Streaming
Databricks
 
New Analytics Toolbox DevNexus 2015
New Analytics Toolbox DevNexus 2015
Robbie Strickland
 
Apache Spark Fundamentals
Apache Spark Fundamentals
Zahra Eskandari
 
APACHE SPARK.pptx
APACHE SPARK.pptx
DeepaThirumurugan
 
Boston Spark Meetup event Slides Update
Boston Spark Meetup event Slides Update
vithakur
 
An introduction To Apache Spark
An introduction To Apache Spark
Amir Sedighi
 
Building iot applications with Apache Spark and Apache Bahir
Building iot applications with Apache Spark and Apache Bahir
Luciano Resende
 
Spark Concepts - Spark SQL, Graphx, Streaming
Spark Concepts - Spark SQL, Graphx, Streaming
Petr Zapletal
 
Building highly scalable data pipelines with Apache Spark
Building highly scalable data pipelines with Apache Spark
Martin Toshev
 
Pyspark presentationsfspfsjfspfjsfpsjfspfjsfpsjfsfsf
Pyspark presentationsfspfsjfspfjsfpsjfspfjsfpsjfsfsf
sasuke20y4sh
 
Apache Spark - A High Level overview
Apache Spark - A High Level overview
Karan Alang
 
Big data processing with Apache Spark and Oracle Database
Big data processing with Apache Spark and Oracle Database
Martin Toshev
 
In Memory Analytics with Apache Spark
In Memory Analytics with Apache Spark
Venkata Naga Ravi
 
Not Your Father's Database by Databricks
Not Your Father's Database by Databricks
Caserta
 
Real-Time Spark: From Interactive Queries to Streaming
Real-Time Spark: From Interactive Queries to Streaming
Databricks
 
New Analytics Toolbox DevNexus 2015
New Analytics Toolbox DevNexus 2015
Robbie Strickland
 
Apache Spark Fundamentals
Apache Spark Fundamentals
Zahra Eskandari
 
Boston Spark Meetup event Slides Update
Boston Spark Meetup event Slides Update
vithakur
 
An introduction To Apache Spark
An introduction To Apache Spark
Amir Sedighi
 
Ad

More from Luciano Resende (20)

A Jupyter kernel for Scala and Apache Spark.pdf
A Jupyter kernel for Scala and Apache Spark.pdf
Luciano Resende
 
Using Elyra for COVID-19 Analytics
Using Elyra for COVID-19 Analytics
Luciano Resende
 
Elyra - a set of AI-centric extensions to JupyterLab Notebooks.
Elyra - a set of AI-centric extensions to JupyterLab Notebooks.
Luciano Resende
 
From Data to AI - Silicon Valley Open Source projects come to you - Madrid me...
From Data to AI - Silicon Valley Open Source projects come to you - Madrid me...
Luciano Resende
 
Ai pipelines powered by jupyter notebooks
Ai pipelines powered by jupyter notebooks
Luciano Resende
 
Strata - Scaling Jupyter with Jupyter Enterprise Gateway
Strata - Scaling Jupyter with Jupyter Enterprise Gateway
Luciano Resende
 
Scaling notebooks for Deep Learning workloads
Scaling notebooks for Deep Learning workloads
Luciano Resende
 
Jupyter Enterprise Gateway Overview
Jupyter Enterprise Gateway Overview
Luciano Resende
 
Inteligencia artificial, open source e IBM Call for Code
Inteligencia artificial, open source e IBM Call for Code
Luciano Resende
 
Open Source AI - News and examples
Open Source AI - News and examples
Luciano Resende
 
Building analytical microservices powered by jupyter kernels
Building analytical microservices powered by jupyter kernels
Luciano Resende
 
An Enterprise Analytics Platform with Jupyter Notebooks and Apache Spark
An Enterprise Analytics Platform with Jupyter Notebooks and Apache Spark
Luciano Resende
 
The Analytic Platform behind IBM’s Watson Data Platform - Big Data Spain 2017
The Analytic Platform behind IBM’s Watson Data Platform - Big Data Spain 2017
Luciano Resende
 
What's new in Apache SystemML - Declarative Machine Learning
What's new in Apache SystemML - Declarative Machine Learning
Luciano Resende
 
Big analytics meetup - Extended Jupyter Kernel Gateway
Big analytics meetup - Extended Jupyter Kernel Gateway
Luciano Resende
 
Jupyter con meetup extended jupyter kernel gateway
Jupyter con meetup extended jupyter kernel gateway
Luciano Resende
 
How mentoring can help you start contributing to open source
How mentoring can help you start contributing to open source
Luciano Resende
 
SystemML - Declarative Machine Learning
SystemML - Declarative Machine Learning
Luciano Resende
 
Luciano Resende's keynote at Apache big data conference
Luciano Resende's keynote at Apache big data conference
Luciano Resende
 
Asf icfoss-mentoring
Asf icfoss-mentoring
Luciano Resende
 
A Jupyter kernel for Scala and Apache Spark.pdf
A Jupyter kernel for Scala and Apache Spark.pdf
Luciano Resende
 
Using Elyra for COVID-19 Analytics
Using Elyra for COVID-19 Analytics
Luciano Resende
 
Elyra - a set of AI-centric extensions to JupyterLab Notebooks.
Elyra - a set of AI-centric extensions to JupyterLab Notebooks.
Luciano Resende
 
From Data to AI - Silicon Valley Open Source projects come to you - Madrid me...
From Data to AI - Silicon Valley Open Source projects come to you - Madrid me...
Luciano Resende
 
Ai pipelines powered by jupyter notebooks
Ai pipelines powered by jupyter notebooks
Luciano Resende
 
Strata - Scaling Jupyter with Jupyter Enterprise Gateway
Strata - Scaling Jupyter with Jupyter Enterprise Gateway
Luciano Resende
 
Scaling notebooks for Deep Learning workloads
Scaling notebooks for Deep Learning workloads
Luciano Resende
 
Jupyter Enterprise Gateway Overview
Jupyter Enterprise Gateway Overview
Luciano Resende
 
Inteligencia artificial, open source e IBM Call for Code
Inteligencia artificial, open source e IBM Call for Code
Luciano Resende
 
Open Source AI - News and examples
Open Source AI - News and examples
Luciano Resende
 
Building analytical microservices powered by jupyter kernels
Building analytical microservices powered by jupyter kernels
Luciano Resende
 
An Enterprise Analytics Platform with Jupyter Notebooks and Apache Spark
An Enterprise Analytics Platform with Jupyter Notebooks and Apache Spark
Luciano Resende
 
The Analytic Platform behind IBM’s Watson Data Platform - Big Data Spain 2017
The Analytic Platform behind IBM’s Watson Data Platform - Big Data Spain 2017
Luciano Resende
 
What's new in Apache SystemML - Declarative Machine Learning
What's new in Apache SystemML - Declarative Machine Learning
Luciano Resende
 
Big analytics meetup - Extended Jupyter Kernel Gateway
Big analytics meetup - Extended Jupyter Kernel Gateway
Luciano Resende
 
Jupyter con meetup extended jupyter kernel gateway
Jupyter con meetup extended jupyter kernel gateway
Luciano Resende
 
How mentoring can help you start contributing to open source
How mentoring can help you start contributing to open source
Luciano Resende
 
SystemML - Declarative Machine Learning
SystemML - Declarative Machine Learning
Luciano Resende
 
Luciano Resende's keynote at Apache big data conference
Luciano Resende's keynote at Apache big data conference
Luciano Resende
 
Ad

Recently uploaded (20)

SQL-Demystified-A-Beginners-Guide-to-Database-Mastery.pptx
SQL-Demystified-A-Beginners-Guide-to-Database-Mastery.pptx
bhavaniteacher99
 
Advanced_English_Pronunciation_in_Use.pdf
Advanced_English_Pronunciation_in_Use.pdf
leogoemmanguyenthao
 
Power BI API Connectors - Best Practices for Scalable Data Connections
Power BI API Connectors - Best Practices for Scalable Data Connections
Vidicorp Ltd
 
2.5-DESPATCH-ORDINARY MAILS.pptxlminub7b7t6f7h7t6f6g7g6fg
2.5-DESPATCH-ORDINARY MAILS.pptxlminub7b7t6f7h7t6f6g7g6fg
mk1227103
 
FME Beyond Data Processing: Creating a Dartboard Accuracy App
FME Beyond Data Processing: Creating a Dartboard Accuracy App
jacoba18
 
apidays New York 2025 - The FINOS Common Domain Model for Capital Markets by ...
apidays New York 2025 - The FINOS Common Domain Model for Capital Markets by ...
apidays
 
Media_Literacy_Index_of_Media_Sector_Employees.pdf
Media_Literacy_Index_of_Media_Sector_Employees.pdf
OlhaTatokhina1
 
Managed Cloud services - Opsio Cloud Man
Managed Cloud services - Opsio Cloud Man
Opsio Cloud
 
BODMAS-Rule-&-Unit-Digit-Concept-pdf.pdf
BODMAS-Rule-&-Unit-Digit-Concept-pdf.pdf
SiddharthSean
 
QUALITATIVE EXPLANATORY VARIABLES REGRESSION MODELS
QUALITATIVE EXPLANATORY VARIABLES REGRESSION MODELS
Ameya Patekar
 
MEDIA_LITERACY_INDEX_OF_EDUCATORS_ENG.pdf
MEDIA_LITERACY_INDEX_OF_EDUCATORS_ENG.pdf
OlhaTatokhina1
 
MICROSOFT POWERPOINT AND USES(BEST)..pdf
MICROSOFT POWERPOINT AND USES(BEST)..pdf
bathyates
 
apidays New York 2025 - Beyond Webhooks: The Future of Scalable API Event Del...
apidays New York 2025 - Beyond Webhooks: The Future of Scalable API Event Del...
apidays
 
REGRESSION DIAGNOSTIC I: MULTICOLLINEARITY
REGRESSION DIAGNOSTIC I: MULTICOLLINEARITY
Ameya Patekar
 
apidays Singapore 2025 - Enhancing Developer Productivity with UX (Government...
apidays Singapore 2025 - Enhancing Developer Productivity with UX (Government...
apidays
 
unit- 5 Biostatistics and Research Methodology.pdf
unit- 5 Biostatistics and Research Methodology.pdf
KRUTIKA CHANNE
 
apidays New York 2025 - The Future of Small Business Lending with Open Bankin...
apidays New York 2025 - The Future of Small Business Lending with Open Bankin...
apidays
 
apidays New York 2025 - Unifying OpenAPI & AsyncAPI by Naresh Jain & Hari Kri...
apidays New York 2025 - Unifying OpenAPI & AsyncAPI by Naresh Jain & Hari Kri...
apidays
 
apidays New York 2025 - API Security and Observability at Scale in Kubernetes...
apidays New York 2025 - API Security and Observability at Scale in Kubernetes...
apidays
 
apidays Singapore 2025 - What exactly are AI Agents by Aki Ranin (Earthshots ...
apidays Singapore 2025 - What exactly are AI Agents by Aki Ranin (Earthshots ...
apidays
 
SQL-Demystified-A-Beginners-Guide-to-Database-Mastery.pptx
SQL-Demystified-A-Beginners-Guide-to-Database-Mastery.pptx
bhavaniteacher99
 
Advanced_English_Pronunciation_in_Use.pdf
Advanced_English_Pronunciation_in_Use.pdf
leogoemmanguyenthao
 
Power BI API Connectors - Best Practices for Scalable Data Connections
Power BI API Connectors - Best Practices for Scalable Data Connections
Vidicorp Ltd
 
2.5-DESPATCH-ORDINARY MAILS.pptxlminub7b7t6f7h7t6f6g7g6fg
2.5-DESPATCH-ORDINARY MAILS.pptxlminub7b7t6f7h7t6f6g7g6fg
mk1227103
 
FME Beyond Data Processing: Creating a Dartboard Accuracy App
FME Beyond Data Processing: Creating a Dartboard Accuracy App
jacoba18
 
apidays New York 2025 - The FINOS Common Domain Model for Capital Markets by ...
apidays New York 2025 - The FINOS Common Domain Model for Capital Markets by ...
apidays
 
Media_Literacy_Index_of_Media_Sector_Employees.pdf
Media_Literacy_Index_of_Media_Sector_Employees.pdf
OlhaTatokhina1
 
Managed Cloud services - Opsio Cloud Man
Managed Cloud services - Opsio Cloud Man
Opsio Cloud
 
BODMAS-Rule-&-Unit-Digit-Concept-pdf.pdf
BODMAS-Rule-&-Unit-Digit-Concept-pdf.pdf
SiddharthSean
 
QUALITATIVE EXPLANATORY VARIABLES REGRESSION MODELS
QUALITATIVE EXPLANATORY VARIABLES REGRESSION MODELS
Ameya Patekar
 
MEDIA_LITERACY_INDEX_OF_EDUCATORS_ENG.pdf
MEDIA_LITERACY_INDEX_OF_EDUCATORS_ENG.pdf
OlhaTatokhina1
 
MICROSOFT POWERPOINT AND USES(BEST)..pdf
MICROSOFT POWERPOINT AND USES(BEST)..pdf
bathyates
 
apidays New York 2025 - Beyond Webhooks: The Future of Scalable API Event Del...
apidays New York 2025 - Beyond Webhooks: The Future of Scalable API Event Del...
apidays
 
REGRESSION DIAGNOSTIC I: MULTICOLLINEARITY
REGRESSION DIAGNOSTIC I: MULTICOLLINEARITY
Ameya Patekar
 
apidays Singapore 2025 - Enhancing Developer Productivity with UX (Government...
apidays Singapore 2025 - Enhancing Developer Productivity with UX (Government...
apidays
 
unit- 5 Biostatistics and Research Methodology.pdf
unit- 5 Biostatistics and Research Methodology.pdf
KRUTIKA CHANNE
 
apidays New York 2025 - The Future of Small Business Lending with Open Bankin...
apidays New York 2025 - The Future of Small Business Lending with Open Bankin...
apidays
 
apidays New York 2025 - Unifying OpenAPI & AsyncAPI by Naresh Jain & Hari Kri...
apidays New York 2025 - Unifying OpenAPI & AsyncAPI by Naresh Jain & Hari Kri...
apidays
 
apidays New York 2025 - API Security and Observability at Scale in Kubernetes...
apidays New York 2025 - API Security and Observability at Scale in Kubernetes...
apidays
 
apidays Singapore 2025 - What exactly are AI Agents by Aki Ranin (Earthshots ...
apidays Singapore 2025 - What exactly are AI Agents by Aki Ranin (Earthshots ...
apidays
 

IoT Applications and Patterns using Apache Spark & Apache Bahir

  • 1. IoT Applications and Patterns using Apache Spark & Apache Bahir Luciano Resende June 14th, 2018 © 2018 IBM Corporation 1
  • 2. About me - Luciano Resende 2 Data Science Platform Architect – IBM – CODAIT • Have been contributing to open source at ASF for over 10 years • Currently contributing to : Jupyter Notebook ecosystem, Apache Bahir, Apache Toree, Apache Spark among other projects related to AI/ML platforms [email protected] https://www.linkedin.com/in/lresende @lresende1975 https://github.com/lresende
  • 3. Open Source Community Leadership C O D A I T Founding Partner 188+ Project Committers 77+ Projects Key Open source steering committee memberships OSS Advisory Board Open Source
  • 4. Center for Open Source Data and AI Technologies CODAIT codait.org codait (French) = coder/coded https://m.interglot.com/fr/en/codait CODAIT aims to make AI solutions dramatically easier to create, deploy, and manage in the enterprise Relaunch of the Spark Technology Center (STC) to reflect expanded mission 5
  • 5. Agenda 6 Introductions - Apache Spark - Apache Bahir IoT Applications Live Demo Summary References Q&A
  • 7. Apache Spark Introduction 8 Spark Core Spark SQL Spark Streaming Spark ML Spark GraphX executes SQL statements performs streaming analytics using micro-batches common machine learning and statistical algorithms distributed graph processing framework general compute engine, handles distributed task dispatching, scheduling and basic I/O functions large variety of data sources and formats can be supported, both on-premise or cloud BigInsights (HDFS) Cloudant dashDB SQL DB
  • 9. Apache Spark – Spark SQL 10 Spark SQL Unified data access APIS: Query structured data sets with SQL or Dataset/DataFrame APIs Fast, familiar query language across all of your enterprise data RDBMS Data Sources Structured Streaming Data Sources
  • 10. Apache Spark – Spark SQL 11 You can run SQL statement with SparkSession.sql(…) interface: val spark = SparkSession.builder() .appName(“Demo”) .getOrCreate() spark.sql(“create table T1 (c1 int, c2 int) stored as parquet”) val ds = spark.sql(“select * from T1”) You can further transform the resultant dataset: val ds1 = ds.groupBy(“c1”).agg(“c2”-> “sum”) val ds2 = ds.orderBy(“c1”) The result is a DataFrame / Dataset[Row] ds.show() displays the rows
  • 11. Apache Spark – Spark SQL
  You can read from data sources using SparkSession.read.format(...):
      val spark = SparkSession.builder().appName("Demo").getOrCreate()
      import spark.implicits._ // needed for the Dataset[Bank] encoder
      case class Bank(age: Integer, job: String, marital: String, education: String, balance: Integer)
      // loading CSV data into a Dataset of Bank type
      val bankFromCSV = spark.read.csv("hdfs://localhost:9000/data/bank.csv").as[Bank]
      // loading JSON data into a Dataset of Bank type
      val bankFromJSON = spark.read.json("hdfs://localhost:9000/data/bank.json").as[Bank]
      // select a column value from the Dataset
      bankFromCSV.select('age).show() // returns all rows of column "age" from this dataset
  • 12. Apache Spark – Spark SQL
  You can also configure a specific data source with specific options:
      val spark = SparkSession.builder().appName("Demo").getOrCreate()
      import spark.implicits._
      case class Bank(age: Integer, job: String, marital: String, education: String, balance: Integer)
      val bankFromCSV = spark.read
        .option("header", "true")      // use the first line of each file as the header
        .option("inferSchema", "true") // automatically infer data types
        .option("delimiter", " ")
        .csv("/users/lresende/data.csv")
        .as[Bank]
      bankFromCSV.select('age).show() // returns all rows of column "age" from this dataset
  • 13. Apache Spark – Spark SQL – Data Sources — Data sources under the covers: data source registration (invoked via spark.read.format(...)); provide a BaseRelation implementation that supports table scans (TableScan, PrunedScan, PrunedFilteredScan, CatalystScan). Detailed information available at: https://developer.ibm.com/code/2016/11/10/exploring-apache-spark-datasource-api/
  • 14. Apache Spark – Spark SQL – Data Sources — Data Sources V1 limitations: leaks upper-level APIs (DataFrame/SQLContext) into the data source; hard to extend the API for more optimizations; no transaction guarantees in the write APIs; limited extensibility.
  • 15. Apache Spark – Spark SQL – Data Sources — Data Sources V2: support for row-based and columnar scans; column pruning and filter push-down; can report basic statistics and data partitioning; transactional write API; streaming source and sink support for micro-batch and continuous mode. Detailed information available at: https://developer.ibm.com/code/2018/04/16/introducing-apache-spark-data-sources-api-v2/
  • 16. Apache Spark – Spark SQL Structured Streaming — Unified programming model for streaming, interactive, and batch queries: considers the data stream as an unbounded table. (Image source: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html)
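  The unbounded-table idea can be sketched outside Spark in a few lines of plain Python (an illustration of the model only, not Spark code): each trigger appends newly arrived rows to the input table, and the same query simply re-runs over the grown table.

  ```python
  # Plain-Python sketch of the Structured Streaming mental model: the stream
  # is an unbounded "input table"; each trigger appends rows and re-runs the
  # same query over the whole table.

  input_table = []  # the unbounded table: grows as events arrive

  def query(table):
      # the "query": count readings per device, like a GROUP BY
      counts = {}
      for device, value in table:
          counts[device] = counts.get(device, 0) + 1
      return counts

  # trigger 1: two rows arrive
  input_table.extend([("elevator-1", 42), ("elevator-2", 17)])
  result1 = query(input_table)

  # trigger 2: one more row arrives; same query, larger table
  input_table.append(("elevator-1", 43))
  result2 = query(input_table)

  print(result1)  # {'elevator-1': 1, 'elevator-2': 1}
  print(result2)  # {'elevator-1': 2, 'elevator-2': 1}
  ```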
  • 17. Apache Spark – Spark SQL Structured Streaming
  Batch and streaming share the same query API; only the read/write entry points change.
  Regular SQL APIs:
      val spark = SparkSession.builder().appName("Demo").getOrCreate()
      val input = spark.read.schema(schema).format("csv").load("input-path")
      val result = input.select("age").where("age > 18")
      result.write.format("json").save("dest-path")
  Structured Streaming APIs:
      val spark = SparkSession.builder().appName("Demo").getOrCreate()
      val input = spark.readStream.schema(schema).format("csv").load("input-path")
      val result = input.select("age").where("age > 18")
      result.writeStream.format("json").start("dest-path")
  • 18. Apache Spark – Spark Streaming — Micro-batch event processing for near-real-time analytics, e.g. Internet of Things (IoT) devices, Twitter feeds, Kafka (event hub), etc. No multi-threading or parallel-process programming required.
  • 19. Apache Spark – Spark Streaming — Also known as a discretized stream, or DStream: abstracts a continuous stream of data; based on micro-batching; built on RDDs.
  • 20. Apache Spark – Spark Streaming
      val sparkConf = new SparkConf().setAppName("MQTTWordCount")
      val ssc = new StreamingContext(sparkConf, Seconds(2))
      val lines = MQTTUtils.createStream(ssc, brokerUrl, topic, StorageLevel.MEMORY_ONLY_SER_2)
      val words = lines.flatMap(x => x.split(" "))
      val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
      wordCounts.print()
      ssc.start()
      ssc.awaitTermination()
  • 22. Origins of the Apache Bahir Project — MAY/2016: established as a top-level Apache project; PMC formed by Apache Spark committers/PMC members and Apache Members; initial contributions imported from Apache Spark. AUG/2016: the Apache Flink community joined Apache Bahir with initial contributions of Flink extensions; in October 2016 Robert Metzger was elected committer.
  • 23. Origins of the Bahir name — Naming an Apache project is a science !!! We needed a name that wasn't used yet, and it needed to be related to Spark. We ended up with Bahir: a name of Arabic origin that means "sparkling", and is also associated with a guy who succeeds at everything.
  • 24. Why Apache Bahir — It's an Apache project, and if you are here, you know what that means. Benefits of curating your extensions at Apache Bahir: Apache Governance, Apache License, Apache Community, Apache Brand.
  • 25. Why Apache Bahir — Flexibility: releases are not tied to the platform's cadence and can follow the component's own. Shared infrastructure: releases, CI, etc. Shared knowledge: collaborate with experts on both platform and component areas.
  • 26. Bahir extensions for Apache Spark
  MQTT – Enables reading data from MQTT servers using Spark Streaming or Structured Streaming. • http://bahir.apache.org/docs/spark/current/spark-sql-streaming-mqtt/ • http://bahir.apache.org/docs/spark/current/spark-streaming-mqtt/
  CouchDB/Cloudant – Enables reading data from CouchDB/Cloudant using Spark SQL and Spark Streaming.
  Twitter – Enables reading social data from Twitter using Spark Streaming. • http://bahir.apache.org/docs/spark/current/spark-streaming-twitter/
  Akka – Enables reading data from Akka actors using Spark Streaming or Structured Streaming. • http://bahir.apache.org/docs/spark/current/spark-streaming-akka/
  ZeroMQ – Enables reading data from ZeroMQ using Spark Streaming. • http://bahir.apache.org/docs/spark/current/spark-streaming-zeromq/
  • 27. Bahir extensions for Apache Spark — Google Cloud Pub/Sub: adds a Spark Streaming connector for Google Cloud Pub/Sub.
  • 28. Apache Spark extensions in Bahir
  Adding Bahir extensions to your application:
  - Using SBT: libraryDependencies += "org.apache.bahir" %% "spark-streaming-mqtt" % "2.2.0"
  - Using Maven: <dependency> <groupId>org.apache.bahir</groupId> <artifactId>spark-streaming-mqtt_2.11</artifactId> <version>2.2.0</version> </dependency>
  • 29. Apache Spark extensions in Bahir
  Submitting applications with Bahir extensions to Spark:
  - spark-shell: bin/spark-shell --packages org.apache.bahir:spark-streaming-mqtt_2.11:2.2.0 …
  - spark-submit: bin/spark-submit --packages org.apache.bahir:spark-streaming-mqtt_2.11:2.2.0 …
  • 31. IoT – Definition by Wikipedia — The Internet of things (IoT) is the network of physical devices, vehicles, home appliances, and other items embedded with electronics, software, sensors, actuators, and network connectivity which enable these objects to connect and exchange data.
  • 32. IoT – Interaction between multiple entities — [diagram] Things, Software, and People interact: things are observed and actuated, software informs and commands, people control and are informed.
  • 33. IoT Patterns – Some of them … • Remote control • Security analysis • Edge analytics • Historical data analysis • Distributed platforms • Real-time decisions
  • 34. MQTT – M2M / IoT Connectivity Protocol — Connect + Publish + Subscribe. History: ~1999, created by IBM / Eurotech; 2010, specification published; 2011, Eclipse M2M / Paho; 2014, OASIS open spec with 40+ client implementations; V5 published May 2018. Minimal overhead: header of 2-4 bytes (publish) or 14 bytes (connect); tiny clients (Java ~170KB).
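  The tiny header sizes come from MQTT's fixed header: one control byte plus a "remaining length" varint of 1-4 bytes (per the MQTT 3.1.1 spec). A minimal sketch that builds a QoS 0 PUBLISH packet by hand, just to show where the 2-byte publish header comes from:

  ```python
  # Hand-built MQTT 3.1.1 QoS 0 PUBLISH packet (illustration, not a client).

  def remaining_length(n):
      """Encode the remaining-length field as MQTT's varint (1-4 bytes)."""
      out = bytearray()
      while True:
          byte = n % 128
          n //= 128
          if n > 0:
              byte |= 0x80  # continuation bit: more length bytes follow
          out.append(byte)
          if n == 0:
              return bytes(out)

  def publish_packet(topic, payload):
      # QoS 0 PUBLISH: no packet identifier needed in the variable header
      body = len(topic).to_bytes(2, "big") + topic.encode() + payload
      first = 0x30  # packet type 3 (PUBLISH) in the high nibble, QoS 0
      return bytes([first]) + remaining_length(len(body)) + body

  pkt = publish_packet("t", b"hi")
  print(pkt[:2].hex())  # '3005' -- the entire fixed header is 2 bytes
  ```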
  • 35. MQTT – Quality of Service — QoS 0: at most once (no connection failover, never duplicates). QoS 1: at least once (connection failover, can duplicate). QoS 2: exactly once (connection failover, never duplicates).
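  The three guarantees can be illustrated with a small simulation (plain Python, not a real MQTT client): a flaky link redelivers one packet; at-least-once delivery passes the duplicate through, while QoS 2-style deduplication by packet identifier suppresses it.

  ```python
  # Toy simulation of MQTT delivery guarantees over a flaky link.

  def deliver(messages, redeliver_ids, dedup):
      """Packets whose id is in redeliver_ids arrive twice (lost ack)."""
      received, seen = [], set()
      for pkt_id, payload in messages:
          copies = 2 if pkt_id in redeliver_ids else 1  # QoS 1 retry
          for _ in range(copies):
              if dedup and pkt_id in seen:  # QoS 2: drop known packet ids
                  continue
              seen.add(pkt_id)
              received.append(payload)
      return received

  msgs = [(1, "a"), (2, "b"), (3, "c")]
  qos1 = deliver(msgs, redeliver_ids={2}, dedup=False)  # at least once
  qos2 = deliver(msgs, redeliver_ids={2}, dedup=True)   # exactly once
  print(qos1)  # ['a', 'b', 'b', 'c'] -- duplicate delivered
  print(qos2)  # ['a', 'b', 'c']      -- never duplicated
  ```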
  • 36. MQTT – World usage — Smart home automation, messaging. Notable mentions: IBM IoT Platform, AWS IoT, Microsoft IoT Hub, Facebook Messenger.
  • 38. IoT Simulator using MQTT — The demo environment: https://github.com/lresende/bahir-iot-demo. A Node.js web app simulates elevator IoT devices, publishing metrics (weight, speed, power, temperature, system) to a Mosquitto MQTT broker.
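  A hedged sketch of the kind of telemetry payload such a simulator might publish per reading; the field names below are illustrative assumptions based on the metrics listed on the slide, not taken from the actual bahir-iot-demo code.

  ```python
  # Hypothetical elevator telemetry payload (field names are assumptions,
  # not the bahir-iot-demo schema) serialized as JSON for an MQTT publish.
  import json

  def elevator_reading(device_id, weight_kg, speed_ms, power_kw, temp_c):
      return json.dumps({
          "device": device_id,
          "weight": weight_kg,       # current load, kg
          "speed": speed_ms,         # cabin speed, m/s
          "power": power_kw,         # motor power draw, kW
          "temperature": temp_c,     # machine-room temperature, C
          "system": "ok",            # overall system status
      })

  payload = elevator_reading("elevator-1", 450, 1.5, 3.2, 21.0)
  print(payload)
  ```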
  • 40. Summary – Take-away points — Apache Spark: an IoT analytics runtime with support for "continuous applications". Apache Bahir: brings access to IoT data via supported connectors (e.g. MQTT). IoT applications: use Spark and Bahir to start processing IoT data in near real time with Spark Streaming and Spark Structured Streaming.
  • 41. Join the Apache Bahir community
  • 42. References — Apache Bahir: http://bahir.apache.org • Documentation for Apache Spark extensions: http://bahir.apache.org/docs/spark/current/documentation/ • Source repositories: https://github.com/apache/bahir, https://github.com/apache/bahir-website • Demo repository: https://github.com/lresende/bahir-iot-demo (Image source: http://az616578.vo.msecnd.net/files/2016/03/21/6359412499310138501557867529_thank-you-1400x800-c-default.gif)
  • 43. March 30, 2018 / © 2018 IBM Corporation