Analyzing Flight Delays with Apache Spark
GraphFrames and MapR-DB
2 © 2018 MapR Technologies, Inc. // MapR Confidential

Agenda
• Introduction to Graphs
• Introduction to GraphFrames with a simple Flight Dataset
• Use GraphFrames with Flight Dataset for 2018

Intro to Graphs

What is a Graph?
• Graph: Models Relations between Objects
• Graph: Vertices connected by Edges
• Vertices: the objects
• Edges: the relationships between Vertices

Undirected Graphs vs Directed Graphs
Undirected graph: edges have no direction, so a relationship holds both ways
Example: Facebook friends
– Ted is a friend of Carol
– Carol is a friend of Ted
Directed graph: edges have a direction
Example: Twitter followers
– Carol follows Oprah
– Oprah does not follow Carol

Property Graph
• Edges and vertices have properties
• A pair of vertices can have multiple directed edges in parallel
• Allows multiple relationships
Spark GraphX supports a distributed property graph.
Vertex properties: City, State
Edge properties: Flight number, Distance, Delay

What is GraphX?
Spark SQL
• Structured Data
• Querying with SQL/HQL
• DataFrames
Spark Streaming
• Processing of live streams
• Micro-batching
MLlib
• Machine Learning
• Multiple types of ML algorithms
GraphX
• Graph processing
• Graph parallel computations
Spark Core
• Task scheduling
• Memory management
• Fault recovery
• Interacting with storage systems

Graph Algorithms and Graph Queries with GraphFrames

Graph Algorithms: PageRank
Web Sites
• Vertices = Web Pages
• Edges = Links between Pages
• PageRank importance = computed iteratively from the number of links to a page and the rank of its linking pages
• Twitter example: who has the most Twitter followers

Graph Algorithms: PageRank
Visualize PageRank
1. Each page sends a message with its “rank” to its neighbors (in the figure, every vertex sends 0.20)
2. Messages are aggregated and calculated at each destination vertex
3. The sum of the messages becomes the new vertex PageRank
4. Repeat
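
The four steps above can be sketched in plain Scala, without Spark. This is an illustrative sketch only: the object and function names are made up for this example, and the damping term (`resetProbability`, 0.15 by default) is an assumption borrowed from the common PageRank formulation rather than something the slide spells out.

```scala
// Minimal PageRank sketch on plain Scala collections (hypothetical names, no Spark).
object PageRankSketch {
  // edges are (src, dst) pairs; returns vertex -> rank after `iterations` rounds
  def pageRank(edges: Seq[(String, String)],
               iterations: Int = 10,
               resetProbability: Double = 0.15): Map[String, Double] = {
    val vertices = edges.flatMap { case (s, d) => Seq(s, d) }.distinct
    val outDegree = edges.groupBy(_._1).map { case (v, es) => v -> es.size }
    var ranks = vertices.map(v => v -> 1.0).toMap

    for (_ <- 1 to iterations) {
      // Step 1: each vertex sends rank / outDegree along every outgoing edge
      val messages = edges.map { case (s, d) => d -> (ranks(s) / outDegree(s)) }
      // Steps 2 and 3: sum the messages per destination, then apply damping
      val sums = messages.groupBy(_._1).map { case (v, ms) => v -> ms.map(_._2).sum }
      ranks = vertices.map { v =>
        v -> (resetProbability + (1 - resetProbability) * sums.getOrElse(v, 0.0))
      }.toMap
      // Step 4: the loop repeats
    }
    ranks
  }
}
```

Calling `PageRankSketch.pageRank(Seq(("a", "c"), ("b", "c")))` gives `c` a higher rank than `a`, because `c` receives messages and `a` receives none.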

Graph Algorithms
• Many Graph Algorithms Aggregate properties of neighbors:
  • PageRank
  • Connected Components
  • Shortest Path
Figures: Connected Components; Shortest Path A to F
Reference: https://en.wikipedia.org/
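
As one illustration of "aggregating properties of neighbors", connected components can be found by label propagation: every vertex repeatedly adopts the smallest label among itself and its neighbors. The sketch below is plain Scala with hypothetical names; it treats edges as undirected and is not how GraphFrames implements the algorithm.

```scala
// Connected components by label propagation (illustrative sketch, no Spark).
object ComponentsSketch {
  def connectedComponents(edges: Seq[(String, String)]): Map[String, String] = {
    // Treat every edge as undirected: record both directions
    val neighbors: Map[String, Set[String]] =
      edges.flatMap { case (a, b) => Seq(a -> b, b -> a) }
        .groupBy(_._1)
        .map { case (v, pairs) => v -> pairs.map(_._2).toSet }

    // Each vertex starts with its own id as its label
    var labels: Map[String, String] = neighbors.keys.map(v => v -> v).toMap
    var changed = true
    while (changed) {
      // Aggregate the neighbors' labels and keep the smallest one
      val next = labels.map { case (v, label) =>
        v -> (neighbors(v).map(n => labels(n)) + label).min
      }
      changed = next != labels
      labels = next
    }
    labels
  }
}
```

Vertices that end with the same label belong to the same component.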

Graph Motif Queries
Graph Motif: a recurrent pattern in a graph
Graph Motif Query: search a graph for occurrences of a given pattern
Twitter Example: who should we recommend for Carol to follow?
• Carol follows Oprah
• Oprah follows Reese Witherspoon
• Recommend that Carol follow Reese Witherspoon
To recommend who to follow in general, search for the pattern:
• A follows B
• B follows C
• A does not follow C

Graph Query: Motif Find Structural Pattern
Twitter: A follows B; B follows C; A doesn’t follow C
graph.find("(a)-[]->(b); (b)-[]->(c); !(a)-[]->(c)")
Search for a pattern:
• Vertex ( ), Edge [ ]
• (a)-[]->(b): a follows b
• (b)-[]->(c): b follows c
• !(a)-[]->(c): a doesn’t follow c
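
To make the pattern concrete, here is a small plain-Scala sketch of the same search over a set of "follows" pairs. The names are hypothetical and this is not the GraphFrames implementation; it also drops matches where `a` and `c` are the same vertex, which the motif string alone does not do.

```scala
// Motif search sketch: a follows b, b follows c, a does not follow c.
object MotifSketch {
  def recommend(follows: Set[(String, String)]): Set[(String, String, String)] =
    for {
      (a, b)  <- follows
      (b2, c) <- follows
      // Same middle vertex, distinct endpoints, and no direct a -> c edge
      if b == b2 && a != c && !follows.contains((a, c))
    } yield (a, b, c)
}
```

For `Set(("Carol", "Oprah"), ("Oprah", "Reese"))` the only match is `("Carol", "Oprah", "Reese")`, so Reese would be recommended to Carol.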

Separate Systems
Image reference: Spark Summit

GraphFrames: Graph Algorithms + Graph Queries
Image reference: Spark Summit

Graph Examples

Real World Graphs: Twitter
Twitter tweets: morally outraged tweets are retweeted within a political sphere, but rarely outside it
Reference: National Academy of Sciences

Graph: Recommendation Engines
Recommendation Engine:
• Vertices = Users, Products
• Edges = Ratings or Purchases
• Calculate how similar users rated similar products

Real World Graphs: Fraud
Healthcare Fraud:
• Vertices = Doctors, Patients, Prescriptions
• Edges = Prescribed
• Calculate narcotic abuse, patient similarity, overprescribing

Real World Graphs: Fraud
Credit Card Application Fraud:
• Vertices = Credit Card Applicant, Phone, Email, Address, SSN
• Edges = Identifier (for example, a shared phone number)
• Detect people sharing identifiers such as a telephone number
Image reference: Capital One at Spark Summit

A Simple Flight Example with GraphFrames

Simple Flight Example with GraphFrames
Originating Airport | Destination Airport | Distance   | Delay
SFO                 | ORD                 | 1800 miles | 40
ORD                 | DFW                 | 800 miles  | 0
DFW                 | SFO                 | 1400 miles | 10

Vertex Table

Edges Table

Create a Vertices DataFrame
// Encoders for createDataset; assumes a SparkSession named spark
import spark.implicits._
case class Airport(id: String, city: String)
val airports = Array(Airport("SFO", "San Francisco"),
  Airport("ORD", "Chicago"), Airport("DFW", "Dallas Fort Worth"))

val vertices = spark.createDataset(airports).toDF
vertices.show
+---+-----------------+
| id|             city|
+---+-----------------+
|SFO|    San Francisco|
|ORD|          Chicago|
|DFW|Dallas Fort Worth|
+---+-----------------+

Create an Edges DataFrame
case class Flight(id: String, src: String, dst: String,
  dist: Double, delay: Double)
val flights = Array(
  Flight("SFO_ORD_2017-01-01_AA", "SFO", "ORD", 1800, 40),
  Flight("ORD_DFW_2017-01-01_UA", "ORD", "DFW", 800, 0),
  Flight("DFW_SFO_2017-01-01_DL", "DFW", "SFO", 1400, 10))
val edges = spark.createDataset(flights).toDF
edges.show
+--------------------+---+---+------+-----+
|                  id|src|dst|  dist|delay|
+--------------------+---+---+------+-----+
|SFO_ORD_2017-01-0...|SFO|ORD|1800.0| 40.0|
|ORD_DFW_2017-01-0...|ORD|DFW| 800.0|  0.0|
|DFW_SFO_2017-01-0...|DFW|SFO|1400.0| 10.0|
+--------------------+---+---+------+-----+

Create the GraphFrame
import org.graphframes.GraphFrame
val graph = GraphFrame(vertices, edges)
graph.vertices.show

+---+-----------------+
| id|             city|
+---+-----------------+
|SFO|    San Francisco|
|ORD|          Chicago|
|DFW|Dallas Fort Worth|
+---+-----------------+

GraphFrame Edges
graph.edges.show

result:
+--------------------+---+---+------+-----+
|                  id|src|dst|  dist|delay|
+--------------------+---+---+------+-----+
|SFO_ORD_2017-01-0...|SFO|ORD|1800.0| 40.0|
|ORD_DFW_2017-01-0...|ORD|DFW| 800.0|  0.0|
|DFW_SFO_2017-01-0...|DFW|SFO|1400.0| 10.0|
+--------------------+---+---+------+-----+

Graph Operators
To answer questions such as:
How many airports are there?
How many flight routes are there?
What are the longest distance routes?
Which airport has the most incoming flights?
What are the top 10 flights?
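
Before turning to the GraphFrame operators, it can help to see what these questions compute on the three-flight example. The sketch below uses plain Scala collections instead of Spark; the object name and the derived values are illustrative, not part of the GraphFrames API.

```scala
// The simple flight example answered with plain Scala collections (no Spark).
object SimpleFlightQueries {
  case class Flight(id: String, src: String, dst: String, dist: Double, delay: Double)

  val flights: Seq[Flight] = Seq(
    Flight("SFO_ORD_2017-01-01_AA", "SFO", "ORD", 1800, 40),
    Flight("ORD_DFW_2017-01-01_UA", "ORD", "DFW", 800, 0),
    Flight("DFW_SFO_2017-01-01_DL", "DFW", "SFO", 1400, 10))

  // How many airports? Distinct vertices touched by any edge
  val airportCount: Int = flights.flatMap(f => Seq(f.src, f.dst)).distinct.size

  // What is the longest distance route?
  val longest: Flight = flights.maxBy(_.dist)

  // Incoming flights per airport (the in-degree of each vertex)
  val inDegrees: Map[String, Int] =
    flights.groupBy(_.dst).map { case (airport, fs) => airport -> fs.size }
}
```

With this data every airport has exactly one incoming flight, and the longest route is SFO to ORD.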

Query the GraphFrame
// How many airports?
graph.vertices.count

result: 3
// How many flights?
graph.edges.count

result: 3

Query the GraphFrame
// Which flight routes are longer than 800 miles?
graph.edges.filter("dist > 800").show
+--------------------+---+---+------+-----+
|                  id|src|dst|  dist|delay|
+--------------------+---+---+------+-----+
|SFO_ORD_2017-01-0...|SFO|ORD|1800.0| 40.0|
|DFW_SFO_2017-01-0...|DFW|SFO|1400.0| 10.0|
+--------------------+---+---+------+-----+

Loading and Exploring the MapR-DB Flight Table with DataFrames

How a Spark Application Runs on a Cluster

Spark Distributed Datasets
• A Dataset is a collection of Typed Objects
  • Dataset[T]
  • (can use SQL and functions)
• A DataFrame is a Dataset of Row objects
  • Dataset[Row]
  • (can use SQL)
• Partitioned across a cluster
• Operated on in parallel
• Can be Cached

Spark SQL Querying MapR-DB JSON
• Spark SQL queries and updates to MapR-DB
• With projection and filter pushdown, custom partitioning, and data locality

Designed for Partitioning and Scaling
Data is automatically partitioned and sorted by id row key!

Spark MapR-DB Connector

Flight Dataset in the MapR-DB JSON Document Store
{
  "id": "ATL_LGA_2017-01-01_AA_1678",
  "dofW": 7,
  "carrier": "AA",
  "src": "ATL",
  "dst": "LGA",
  "crsdephour": 17,
  "crsdeptime": 1700,
  "depdelay": 0.0,
  "crsarrtime": 1912,
  "arrdelay": 0.0,
  "crselapsedtime": 132.0,
  "dist": 762.0
}
Table is automatically partitioned and sorted by id row key!

Row key = Table is Partitioned by src, dst vertices
Data is automatically partitioned by key range and sorted by row key, which begins with src_dst, for example ATL_LGA_2017-01-01_AA_1678
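
The effect of a row key that begins with src_dst can be sketched in plain Scala: once keys are sorted, all flights for one route sit next to each other, so a route lookup becomes a prefix scan. The helper names are hypothetical, and every key except ATL_LGA_2017-01-01_AA_1678 is invented for illustration.

```scala
// Row-key sketch: src_dst_date_carrier_flightNumber, sorted like the table would be.
object RowKeySketch {
  def rowKey(src: String, dst: String, date: String, carrier: String, flightNo: Int): String =
    s"${src}_${dst}_${date}_${carrier}_${flightNo}"

  // Hypothetical flights, apart from the slide's ATL_LGA example
  val keys: Seq[String] = Seq(
    rowKey("ORD", "LGA", "2017-01-01", "UA", 300),
    rowKey("ATL", "LGA", "2017-01-02", "DL", 1200),
    rowKey("ATL", "LGA", "2017-01-01", "AA", 1678),
    rowKey("ATL", "BOS", "2017-01-01", "AA", 1500)
  ).sorted

  // All flights on one route share a key prefix, so they are contiguous after sorting
  def route(src: String, dst: String): Seq[String] =
    keys.filter(_.startsWith(s"${src}_${dst}_"))
}
```

`RowKeySketch.route("ATL", "LGA")` returns the two ATL to LGA keys in date order.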

Airports
SFO, DEN, IAH, ATL, ORD, BOS, LGA, EWR, MIA, SEA, LAX, DFW

Load the data into a Dataset: Define the Schema

Read Dataset from MapR-DB
val tableName = "/user/mapr/flighttable"
val df = spark.sparkSession
  .loadFromMapRDB[Flight](tableName, schema)
Figure: the driver sends tasks to the workers; each worker task processes and caches its share of the data

Show the first rows of the DataFrame
df.show(5)
Data is automatically partitioned and sorted by row key, which begins with src_dst, for example ATL_BOS_2018-01-01_AA_1678

Originating airports with highest number of Departure Delays
df.filter($"depdelay" > 40).groupBy("src")
  .count().orderBy(desc("count")).show(5)
+---+-----+
|src|count|
+---+-----+
|ORD| 4033|
|ATL| 3106|
|DFW| 2782|
|EWR| 2328|
|DEN| 2304|
+---+-----+

MapR-DB Projection and Filter push down
df.filter($"depdelay" > 40).groupBy("src")
  .count.orderBy(desc("count")).explain
== Physical Plan ==
*(3) Sort [count#549L DESC NULLS LAST], true, 0
+- Exchange rangepartitioning(count#549L DESC NULLS LAST, 200)
   +- *(2) HashAggregate(keys=[src#5], functions=[count(1)])
      +- Exchange hashpartitioning(src#5, 200)
         +- *(1) HashAggregate(keys=[src#5], functions=[partial_count(1)])
            +- *(1) Project [src#5]
               +- *(1) Filter (isnotnull(depdelay#9) && (depdelay#9 > 40.0))
                  +- *(1) Scan MapRDBRelation(/user/mapr/flighttable [src#5,depdelay#9] PushedFilters: [IsNotNull(depdelay), GreaterThan(depdelay,40.0)], ReadSchema: struct<src:string,depdelay:double>
Project and Filter pushed into MapR-DB!

Spark MapR-DB Projection Filter push down
Projection and Filter pushdown reduces the amount of data passed between MapR-DB and the Spark engine when selecting and filtering data.
Data is selected and filtered in MapR-DB!

Register DataFrame as a Temporary View
df.cache
df.count()
// Long = 282628
df.createOrReplaceTempView("flights")

Average Departure Delay by Carrier
%sql select carrier, avg(depdelay) from flights
group by carrier

Count of Departure Delays by Origin
%sql select src, count(depdelay) from flights
where depdelay > 40 group by src

Count of Departure Delays by Origin, Destination
%sql select src, dst, count(depdelay) from flights
where depdelay > 40 group by src, dst

Explore MapR-DB Flight Table with GraphFrames

GraphFrame and DataFrame
To answer questions such as:
How many flight routes are there?
What are the longest distance routes?
Which airport has the most incoming flights?
What are the top 10 flight routes?

Read Vertices DataFrame from a JSON File
val airports = spark.read.json(file)
airports.show
+-------------+-------+-----+---+
|         City|Country|State| id|
+-------------+-------+-----+---+
|      Chicago|    USA|   IL|ORD|
|     New York|    USA|   NY|JFK|
|     New York|    USA|   NY|LGA|
|       Boston|    USA|   MA|BOS|
|      Houston|    USA|   TX|IAH|
|       Newark|    USA|   NJ|EWR|
|       Denver|    USA|   CO|DEN|
|        Miami|    USA|   FL|MIA|
|San Francisco|    USA|   CA|SFO|
|      Atlanta|    USA|   GA|ATL|
|       Dallas|    USA|   TX|DFW|
|    Charlotte|    USA|   NC|CLT|
|  Los Angeles|    USA|   CA|LAX|
|      Seattle|    USA|   WA|SEA|
+-------------+-------+-----+---+

Create the GraphFrame
val graph = GraphFrame(airports, df)
// graph.edges is a DataFrame
graph.edges.show

GraphFrame API
Category         | Methods
Graph Topology   | vertices, edges, triplets
Graph Structure  | inDegrees, outDegrees, degrees
Graph Algorithms | pageRank, bfs, aggregateMessages, shortestPaths, connectedComponents, triangleCount
Graph Queries    | Motif find

DataFrame Queries
Operation                       | Description
select(col)                     | Selects set of columns
sort(sortcol)                   | Returns new DataFrame sorted by specified column
filter(expr); where(condition)  | Filter based on the SQL expression or condition
groupBy(cols: Columns)          | Groups DataFrame using specified columns
join(DataFrame, joinExpr)       | Joins with another DataFrame using given join expression
count                           | Count of rows
avg, count, min, max, sum (col) | Average, count, min, max, sum on values in a group

Graph Vertices and Edges are DataFrames
graph.vertices.filter("State='TX'").show
+-------+-------+-----+---+
|   City|Country|State| id|
+-------+-------+-----+---+
|Houston|    USA|   TX|IAH|
| Dallas|    USA|   TX|DFW|
+-------+-------+-----+---+

GraphFrame DataFrame Queries
// How many airports?
graph.vertices.count

result: 13
// How many flights?
graph.edges.count

result: 282628

What are the 4 Longest Distance Flight Routes?
// Show the longest distance flight routes
graph.edges.groupBy("src", "dst")
  .max("dist").sort(desc("max(dist)")).show(4)
+---+---+---------+
|src|dst|max(dist)|
+---+---+---------+
|MIA|SEA|   2724.0|
|SEA|MIA|   2724.0|
|BOS|SFO|   2704.0|
|SFO|BOS|   2704.0|
+---+---+---------+

What is the average delay for delayed flights from Atlanta?
graph.edges.filter("src = 'ATL' and depdelay > 1")
  .groupBy("src", "dst").avg("depdelay").sort(desc("avg(depdelay)")).show
+---+---+------------------+
|src|dst|     avg(depdelay)|
+---+---+------------------+
|ATL|EWR|  58.1085801063022|
|ATL|ORD| 46.42393736017897|
|ATL|DFW|39.454460966542754|
|ATL|LGA| 39.25498489425982|
|ATL|CLT| 37.56777108433735|
|ATL|SFO| 36.83008356545961|
+---+---+------------------+

MapR-DB Projection and Filter push down
graph.edges.filter("src = 'ATL' and depdelay > 1")
  .groupBy("src", "dst").avg("depdelay").sort(desc("avg(depdelay)")).explain
== Physical Plan ==
*(3) Sort [avg(depdelay)#273 DESC NULLS LAST], true, 0
+- Exchange rangepartitioning(avg(depdelay)#273 DESC NULLS LAST, 200)
   +- *(2) HashAggregate(keys=[src#5, dst#6], functions=[avg(depdelay#9)])
      +- Exchange hashpartitioning(src#5, dst#6, 200)
         +- *(1) HashAggregate(keys=[src#5, dst#6], functions=[partial_avg(depdelay#9)])
            +- *(1) Filter (((isnotnull(src#5) && isnotnull(depdelay#9)) && (src#5 = ATL)) && (depdelay#9 > 1.0))
               +- *(1) Scan MapRDBRelation(/user/mapr/flighttable [src#5,dst#6,depdelay#9] PushedFilters: [IsNotNull(src), IsNotNull(depdelay), EqualTo(src,ATL), GreaterThan(depdelay,1.0)], ReadSchema: struct<src:string,dst:string,depdelay:double>

What is the Average Delay for delayed flights from Atlanta by Hour?
z.show( graph.edges
  .filter("src = 'ATL' and depdelay > 1")
  .groupBy("crsdephour")
  .avg("depdelay") )

Which Airports have the most incoming and outgoing flights?
What are the highest degree vertices?
z.show( graph.degrees.orderBy(desc("degree")) )

Use PageRank to find most important airports
val ranks = graph.pageRank.resetProbability(0.15).maxIter(10).run()
ranks.vertices.orderBy($"pagerank".desc).show(5)
+-------------+-------+-----+---+-------------------+
|         City|Country|State| id|           pagerank|
+-------------+-------+-----+---+-------------------+
|      Chicago|    USA|   IL|ORD| 1.5129929839358685|
|      Atlanta|    USA|   GA|ATL| 1.4255481544216664|
|  Los Angeles|    USA|   CA|LAX| 1.2787001001758738|
|       Dallas|    USA|   TX|DFW| 1.1999252171688064|
|       Denver|    USA|   CO|DEN| 1.1275194324360767|
+-------------+-------+-----+---+-------------------+

Aggregate Messages to calculate avg departure delay by originating airport
import org.graphframes.lib.AggregateMessages
import org.apache.spark.sql.functions.avg
val AM = AggregateMessages
// Message: the edge's depdelay, sent to the edge's source vertex
val msgToSrc = AM.edge("depdelay")
val agg = { graph
  .aggregateMessages
  .sendToSrc(msgToSrc)
  .agg(avg(AM.msg).as("avgdelay"))}
agg.show()
+---+------------------+
| id|          avgdelay|
+---+------------------+
|EWR|17.818079459546404|
|MIA|17.768691978431264|
|ORD|  16.5199551010227|
+---+------------------+
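
Conceptually, sending each edge's delay to its source vertex and averaging what arrives is a group-by on the source. The following plain-Scala sketch shows that idea on made-up delay values; the names are hypothetical and it stands in for, rather than reproduces, the GraphFrames aggregateMessages call.

```scala
// sendToSrc + avg, sketched on plain collections (hypothetical data, no Spark).
object AggregateSketch {
  case class Edge(src: String, dst: String, depdelay: Double)

  def avgDelayBySrc(edges: Seq[Edge]): Map[String, Double] =
    edges
      .map(e => e.src -> e.depdelay)          // message sent to the source vertex
      .groupBy(_._1)                          // messages collected per vertex
      .map { case (v, msgs) => v -> (msgs.map(_._2).sum / msgs.size) } // averaged
}
```

For two ATL departures delayed 10 and 30 and one on-time ORD departure, the result is `Map("ATL" -> 20.0, "ORD" -> 0.0)`.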

What are the most Frequent Flight Routes?
// count of flight routes
val flightroutecount = graph.edges
  .groupBy("src", "dst")
  .count().orderBy(desc("count"))
flightroutecount.show(5)
+---+---+-----+
|src|dst|count|
+---+---+-----+
|LGA|ORD| 4442|
|ORD|LGA| 4426|
|LAX|SFO| 4406|
|SFO|LAX| 4354|
|ATL|LGA| 3884|
+---+---+-----+
// how many routes?
flightroutecount.count
Long = 148

What are the most Frequent Flight Routes? (Highest Count of Flights)
z.show(flightroutecount)

Triplets = 2 Vertices and 1 Connecting Edge DataFrames (src, edge, dst)
graph.triplets
  .show(3)
+--------------------+--------------------+--------------------+
|                 src|                edge|                 dst|
+--------------------+--------------------+--------------------+
|[Atlanta, USA, GA...|[ATL_BOS_2018-01-...|[Boston, USA, MA,...|
|[Atlanta, USA, GA...|[ATL_BOS_2018-01-...|[Boston, USA, MA,...|
|[Atlanta, USA, GA...|[ATL_BOS_2018-01-...|[Boston, USA, MA,...|
+--------------------+--------------------+--------------------+

Triplets = 2 Vertices and 1 Connecting Edge DataFrames (src, edge, dst)
Refine the result with DataFrame operations:
graph.triplets
  .filter("src.State='TX'")
  .show
+----------------------+------------------------------------------------------------------------------------------------------+-----------------------+
|src                   |edge                                                                                                  |dst                    |
+----------------------+------------------------------------------------------------------------------------------------------+-----------------------+
|[Dallas, USA, TX, DFW]|[DFW_ATL_2018-01-01_AA_1473, 2018-01-01, 1, 1, AA, DFW, ATL, 10, 1026, 26.0, 1327, 21.0, 121.0, 731.0]|[Atlanta, USA, GA, ATL]|
|[Dallas, USA, TX, DFW]|[DFW_ATL_2018-01-01_AA_1675, 2018-01-01, 1, 1, AA, DFW, ATL, 13, 1255, 32.0, 1557, 16.0, 122.0, 731.0]|[Atlanta, USA, GA, ATL]|
|[Dallas, USA, TX, DFW]|[DFW_ATL_2018-01-01_AA_2408, 2018-01-01, 1, 1, AA, DFW, ATL, 18, 1835, 4.0, 2141, 0.0, 126.0, 731.0]  |[Atlanta, USA, GA, ATL]|
|[Dallas, USA, TX, DFW]|[DFW_ATL_2018-01-01_AA_2479, 2018-01-01, 1, 1, AA, DFW, ATL, 9, 855, 0.0, 1200, 0.0, 125.0, 731.0]    |[Atlanta, USA, GA, ATL]|
|[Dallas, USA, TX, DFW]|[DFW_ATL_2018-01-01_AA_2497, 2018-01-01, 1, 1, AA, DFW, ATL, 21, 2055, 0.0, 2359, 0.0, 124.0, 731.0]  |[Atlanta, USA, GA, ATL]|
+----------------------+------------------------------------------------------------------------------------------------------+-----------------------+

Motif find
Search for a pattern (src, edge, dst):
graph.find("(src)-[edge]->(dst)")
  .show(3)
+--------------------+--------------------+--------------------+
|                 src|                edge|                 dst|
+--------------------+--------------------+--------------------+
|[Atlanta, USA, GA...|[ATL_BOS_2018-01-...|[Boston, USA, MA,...|
|[Atlanta, USA, GA...|[ATL_BOS_2018-01-...|[Boston, USA, MA,...|
|[Atlanta, USA, GA...|[ATL_BOS_2018-01-...|[Boston, USA, MA,...|
+--------------------+--------------------+--------------------+

Next: use flightroutecount with Motif find
// count of flight routes
val flightroutecount = graph.edges
  .groupBy("src", "dst")
  .count().orderBy(desc("count"))

Motif Find Flights with No Direct Connection
val subGraph = GraphFrame(graph.vertices, flightroutecount)
val res = subGraph
  .find("(a)-[]->(b); (b)-[]->(c); !(a)-[]->(c)")
  .filter("c.id != a.id")
Search for a pattern: (a)-[]->(b); (b)-[]->(c); !(a)-[]->(c), where ( ) is a vertex and [ ] is an edge
Refine the result with DataFrames: remove duplicates (matches where c is the same airport as a)

Compute shortest paths from each Airport to LGA
val results = graph.shortestPaths.landmarks(Seq("LGA")).run()
results.select("id", "distances").show()
+---+----------+
| id| distances|
+---+----------+
|IAH|[LGA -> 1]|
|CLT|[LGA -> 1]|
|LAX|[LGA -> 2]|
|DEN|[LGA -> 1]|
|DFW|[LGA -> 1]|
|SFO|[LGA -> 2]|
|LGA|[LGA -> 0]|
|ORD|[LGA -> 1]|
|MIA|[LGA -> 1]|
|SEA|[LGA -> 2]|
|ATL|[LGA -> 1]|
|BOS|[LGA -> 1]|
|EWR|[LGA -> 2]|
+---+----------+
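
The distances in this output are hop counts: the number of flights needed to reach the landmark. A plain-Scala sketch of that computation is a breadth-first search that walks edges backwards from the landmark. The names and the sample routes in the usage note are hypothetical, and this is not the GraphFrames implementation.

```scala
// Hop count from every vertex to a landmark, following edge direction (no Spark).
object ShortestPathSketch {
  def hopsTo(landmark: String, edges: Seq[(String, String)]): Map[String, Int] = {
    // For each destination, the vertices that have an edge into it
    val incoming: Map[String, Seq[String]] =
      edges.groupBy(_._2).map { case (dst, es) => dst -> es.map(_._1).distinct }

    var dist: Map[String, Int] = Map(landmark -> 0)
    var frontier: Seq[String] = Seq(landmark)
    while (frontier.nonEmpty) {
      // Every vertex in the frontier is at the same distance, so read it from the head
      val nextDistance = dist(frontier.head) + 1
      val next = frontier
        .flatMap(v => incoming.getOrElse(v, Seq.empty))
        .distinct
        .filterNot(v => dist.contains(v))
      dist = dist ++ next.map(v => v -> nextDistance)
      frontier = next
    }
    dist
  }
}
```

With routes SEA to LAX, LAX to ORD and ORD to LGA, `hopsTo("LGA", ...)` reports 0 for LGA, 1 for ORD, 2 for LAX and 3 for SEA. Vertices that cannot reach the landmark are simply absent from the result.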

Breadth First Search for Direct Flights between LAX and LGA
graph.bfs.fromExpr("id = 'LAX'")
  .toExpr("id = 'LGA'").maxPathLength(1).run().show()
+----+-------+-----+---+
|City|Country|State| id|
+----+-------+-----+---+
+----+-------+-----+---+

Breadth First Search for Flights between LAX and LGA
graph.bfs.fromExpr("id = 'LAX'")
  .toExpr("id = 'LGA'").maxPathLength(2).run().show(5)
+--------------------+--------------------+--------------------+--------------------+--------------------+
|                from|                  e0|                  v1|                  e1|                  to|
+--------------------+--------------------+--------------------+--------------------+--------------------+
|[Los Angeles, USA...|[LAX_IAH_2018-01-...|[Houston, USA, TX...|[IAH_LGA_2018-01-...|[New York, USA, N...|
|[Los Angeles, USA...|[LAX_IAH_2018-01-...|[Houston, USA, TX...|[IAH_LGA_2018-01-...|[New York, USA, N...|
|[Los Angeles, USA...|[LAX_IAH_2018-01-...|[Houston, USA, TX...|[IAH_LGA_2018-01-...|[New York, USA, N...|
|[Los Angeles, USA...|[LAX_IAH_2018-01-...|[Houston, USA, TX...|[IAH_LGA_2018-01-...|[New York, USA, N...|
|[Los Angeles, USA...|[LAX_IAH_2018-01-...|[Houston, USA, TX...|[IAH_LGA_2018-01-...|[New York, USA, N...|
+--------------------+--------------------+--------------------+--------------------+--------------------+

Motif Search for Flights between LAX and LGA
Search for a pattern, then refine the result with DataFrame filters:
graph.find("(a)-[ab]->(b); (b)-[bc]->(c)")
  .filter("a.id = 'LAX'")
  .filter("c.id = 'LGA'").show(4)

Breadth First Search for Flights between LAX and LGA with AA
BFS, then refine the result with DataFrame operations:
val paths = graph.bfs.fromExpr("id = 'LAX'").toExpr("id = 'LGA'")
  .maxPathLength(3).edgeFilter("carrier = 'AA'").run()
paths.filter("e0.crsarrtime<e1.crsdeptime-60 and e0.fldate=e1.fldate")
  .select("e0.id","e1.id").show(5)
+--------------------------+--------------------------+
|id                        |id                        |
+--------------------------+--------------------------+
|LAX_BOS_2018-02-03_AA_1098|BOS_LGA_2018-02-03_AA_2126|
|LAX_BOS_2018-02-03_AA_1379|BOS_LGA_2018-02-03_AA_2126|
|LAX_CLT_2018-02-14_AA_1905|CLT_LGA_2018-02-14_AA_1740|
|LAX_CLT_2018-02-14_AA_1905|CLT_LGA_2018-02-14_AA_1910|
|LAX_CLT_2018-02-14_AA_1905|CLT_LGA_2018-02-14_AA_1954|
+--------------------------+--------------------------+
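
The refinement step pairs a first leg with a second leg on the same date that leaves late enough to make the connection. A plain-Scala sketch of that filter follows. The names and sample legs are hypothetical. One deliberate difference from the slide: because the sample document stores times as clock values such as 1700, this sketch converts them to minutes before comparing, which is my assumption about the intent rather than what the slide's expression does.

```scala
// Connecting-flight filter sketch on plain collections (hypothetical names, no Spark).
object ConnectionSketch {
  case class Leg(id: String, src: String, dst: String, fldate: String,
                 crsdeptime: Int, crsarrtime: Int)

  // Convert an HHMM clock value such as 1912 to minutes after midnight
  def minutes(hhmm: Int): Int = (hhmm / 100) * 60 + hhmm % 100

  // Pairs of leg ids on the same date with at least `minLayover` minutes between them
  def connections(legs: Seq[Leg], from: String, to: String,
                  minLayover: Int = 60): Seq[(String, String)] =
    for {
      e0 <- legs if e0.src == from
      e1 <- legs if e1.src == e0.dst && e1.dst == to
      if e0.fldate == e1.fldate
      if minutes(e1.crsdeptime) - minutes(e0.crsarrtime) >= minLayover
    } yield (e0.id, e1.id)
}
```

A first leg arriving at 1630 connects to a second leg leaving at 1800 (90 minutes later) but not to one leaving at 1700 (30 minutes later).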

Resources

New Spark Ebook
The link to the code for this webinar is in the appendix of this book:
https://mapr.com/ebook/getting-started-with-apache-spark-v2/

To Learn More: New Spark 2.0 training
• MapR Free ODT: http://learn.mapr.com/

MapR Blog
https://mapr.com/blog/

MapR Data Platform

PDF
Applying Machine Learning to IOT: End to End Distributed Pipeline for Real-Ti...
PDF
Applying Machine Learning to IOT: End to End Distributed Pipeline for Real- T...
PDF
How Big Data is Reducing Costs and Improving Outcomes in Health Care
PDF
Demystifying AI, Machine Learning and Deep Learning
PDF
Applying Machine learning to IOT: End to End Distributed Distributed Pipeline...
PDF
Streaming patterns revolutionary architectures
PDF
Spark machine learning predicting customer churn
PDF
Fast Cars, Big Data How Streaming can help Formula 1
PDF
Applying Machine Learning to Live Patient Data
PDF
Streaming Patterns Revolutionary Architectures with the Kafka API
PDF
Advanced Threat Detection on Streaming Data
PDF
Fast, Scalable, Streaming Applications with Spark Streaming, the Kafka API an...
PDF
Apache Spark Machine Learning
PDF
Build a Time Series Application with Apache Spark and Apache HBase
PDF
Apache Spark streaming and HBase
PDF
Machine Learning Recommendations with Spark
Introduction to machine learning with GPUs
Streaming healthcare Data pipeline using Apache APIs: Kafka and Spark with Ma...
Structured Streaming Data Pipeline Using Kafka, Spark, and MapR-DB
Streaming Machine learning Distributed Pipeline for Real-Time Uber Data Using...
Applying Machine Learning to IOT: End to End Distributed Pipeline for Real-Ti...
Applying Machine Learning to IOT: End to End Distributed Pipeline for Real- T...
How Big Data is Reducing Costs and Improving Outcomes in Health Care
Demystifying AI, Machine Learning and Deep Learning
Applying Machine learning to IOT: End to End Distributed Distributed Pipeline...
Streaming patterns revolutionary architectures
Spark machine learning predicting customer churn
Fast Cars, Big Data How Streaming can help Formula 1
Applying Machine Learning to Live Patient Data
Streaming Patterns Revolutionary Architectures with the Kafka API
Advanced Threat Detection on Streaming Data
Fast, Scalable, Streaming Applications with Spark Streaming, the Kafka API an...
Apache Spark Machine Learning
Build a Time Series Application with Apache Spark and Apache HBase
Apache Spark streaming and HBase
Machine Learning Recommendations with Spark

Recently uploaded (20)

PDF
Which alternative to Crystal Reports is best for small or large businesses.pdf
PDF
Addressing The Cult of Project Management Tools-Why Disconnected Work is Hold...
PPTX
What to Capture When It Breaks: 16 Artifacts That Reveal Root Causes
PPTX
ManageIQ - Sprint 268 Review - Slide Deck
PDF
How to Choose the Right IT Partner for Your Business in Malaysia
PPTX
Materi_Pemrograman_Komputer-Looping.pptx
PPTX
Lecture 3: Operating Systems Introduction to Computer Hardware Systems
PPTX
Transform Your Business with a Software ERP System
PDF
Softaken Excel to vCard Converter Software.pdf
PPTX
Odoo POS Development Services by CandidRoot Solutions
PDF
How to Confidently Manage Project Budgets
PDF
System and Network Administraation Chapter 3
PDF
T3DD25 TYPO3 Content Blocks - Deep Dive by André Kraus
PPTX
ai tools demonstartion for schools and inter college
PPTX
AIRLINE PRICE API | FLIGHT API COST |
PDF
medical staffing services at VALiNTRY
PDF
A REACT POMODORO TIMER WEB APPLICATION.pdf
PDF
Understanding Forklifts - TECH EHS Solution
PPTX
ISO 45001 Occupational Health and Safety Management System
PDF
Digital Strategies for Manufacturing Companies
Which alternative to Crystal Reports is best for small or large businesses.pdf
Addressing The Cult of Project Management Tools-Why Disconnected Work is Hold...
What to Capture When It Breaks: 16 Artifacts That Reveal Root Causes
ManageIQ - Sprint 268 Review - Slide Deck
How to Choose the Right IT Partner for Your Business in Malaysia
Materi_Pemrograman_Komputer-Looping.pptx
Lecture 3: Operating Systems Introduction to Computer Hardware Systems
Transform Your Business with a Software ERP System
Softaken Excel to vCard Converter Software.pdf
Odoo POS Development Services by CandidRoot Solutions
How to Confidently Manage Project Budgets
System and Network Administraation Chapter 3
T3DD25 TYPO3 Content Blocks - Deep Dive by André Kraus
ai tools demonstartion for schools and inter college
AIRLINE PRICE API | FLIGHT API COST |
medical staffing services at VALiNTRY
A REACT POMODORO TIMER WEB APPLICATION.pdf
Understanding Forklifts - TECH EHS Solution
ISO 45001 Occupational Health and Safety Management System
Digital Strategies for Manufacturing Companies

Analyzing Flight Delays with Apache Spark, DataFrames, GraphFrames, and MapR-DB

  • 2. Agenda
        •  Introduction to Graphs
        •  Introduction to GraphFrames with a simple Flight Dataset
        •  Use GraphFrames with Flight Dataset for 2018
  • 4. What is a Graph?
        •  Graph: models relations between objects
        •  Graph: vertices connected by edges
        •  Vertices: the objects
        •  Edges: the relationships between vertices
  • 5. Regular Graphs vs Directed Graphs
        Regular graph: each vertex has the same number of edges
        Example: Facebook friends
        – Ted is a friend of Carol
        – Carol is a friend of Ted
  • 6. Regular Graphs vs Directed Graphs
        Directed graph: edges have a direction
        Example: Twitter followers
        – Carol follows Oprah
        – Oprah does not follow Carol
  • 7. Property Graph
        •  Edges and vertices have properties
        •  A vertex can have multiple directed edges in parallel
        •  Allows multiple relationships
        Spark GraphX supports a distributed property graph.
        Vertex properties (airports): City, State
        Edge properties (flights): Flight number, Distance, Delay
  • 8. What is GraphX?
        •  Spark SQL: structured data; querying with SQL/HQL; DataFrames
        •  Spark Streaming: processing of live streams; micro-batching
        •  MLlib: machine learning; multiple types of ML algorithms
        •  GraphX: graph processing; graph-parallel computations
        •  Spark Core: task scheduling; memory management; fault recovery; interacting with storage systems
  • 10. Graph Algorithms: PageRank
        Web sites:
        •  Vertices = web pages
        •  Edges = links between pages
        •  PageRank importance = computed iteratively from the number of links to a page and the rank of its linking pages
        •  Twitter example: who has the most Twitter followers
  • 11. Graph Algorithms: PageRank
        Visualize PageRank
        1.  Each page sends a message with its "rank" to its neighbors
        (Diagram: five vertices, each with rank 0.20; the message function is sent from each vertex)
  • 12. Graph Algorithms: PageRank
        Visualize PageRank
        1.  Each page sends a message with its "rank" to its neighbors
        2.  Messages are aggregated and calculated at each destination vertex
        3.  The sum of the messages becomes the vertex's new PageRank
        4.  Repeat
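The four steps above can be sketched in plain Scala without Spark. This is only an illustration under stated assumptions: the toy edge list and the names `sampleEdges` and `pageRank` are hypothetical, and the damping term (`reset`, mirroring the `resetProbability(0.15)` setting used later in the deck) is the usual PageRank refinement rather than something spelled out on this slide.

```scala
object PageRankSketch {
  // Hypothetical toy graph: each pair is a directed link (src -> dst).
  // Every vertex has at least one outgoing link, so no rank is lost.
  val sampleEdges: Seq[(String, String)] =
    Seq("A" -> "B", "B" -> "C", "C" -> "A", "A" -> "C")

  def pageRank(edges: Seq[(String, String)],
               iterations: Int,
               reset: Double = 0.15): Map[String, Double] = {
    val vertices = edges.flatMap { case (s, d) => Seq(s, d) }.distinct
    val outDegree = edges.groupBy(_._1).map { case (v, es) => v -> es.size }
    // Start with the same rank on every vertex.
    var ranks: Map[String, Double] = vertices.map(_ -> 1.0 / vertices.size).toMap

    for (_ <- 1 to iterations) {
      // Step 1: each vertex sends its rank, split over its outgoing links.
      val messages = edges.map { case (s, d) => d -> ranks(s) / outDegree(s) }
      // Step 2: messages are aggregated (summed) at each destination vertex.
      val sums = messages.groupBy(_._1).map { case (v, ms) => v -> ms.map(_._2).sum }
      // Step 3: the damped sum becomes the new rank. Step 4: repeat.
      ranks = vertices.map { v =>
        v -> (reset / vertices.size + (1 - reset) * sums.getOrElse(v, 0.0))
      }.toMap
    }
    ranks
  }
}
```

In this toy graph, C receives links from both A and B, so it should end up ranked above B, which receives only half of A's rank.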
  • 13. Graph Algorithms
        Many graph algorithms aggregate properties of neighbors:
        •  PageRank
        •  Connected Components
        •  Shortest Path
        (Figures: connected components; shortest path from A to F. Reference: https://en.wikipedia.org/)
  • 14. Graph Motif Queries
        Graph motif: a recurrent pattern in a graph
        Graph motif query: search a graph for occurrences of a given pattern
        Twitter example: who should we recommend for Carol to follow?
        •  Carol follows Oprah
        •  Oprah follows Reese Witherspoon
        •  Recommend that Carol follow Reese
  • 15. Graph Motif Queries
        Graph motif: a recurrent pattern in a graph
        Graph motif query: search a graph for occurrences of a given pattern
        Twitter example: recommend who to follow? Search for patterns:
        •  A follows B
        •  B follows C
        •  A does not follow C
        (Diagram: A follows B, B follows C; recommend C to A)
  • 16. Graph Query: Motif Find
        Twitter: A follows B; B follows C; A doesn't follow C
        graph.find("(a)-[]->(b); (b)-[]->(c); !(a)-[]->(c)")
        Search for a structural pattern:
        •  ( ) denotes a vertex, e.g. (a), (b), (c)
        •  [ ] denotes an edge
        •  (a)-[]->(b): a follows b
        •  (b)-[]->(c): b follows c
        •  !(a)-[]->(c): a doesn't follow c
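To make the pattern concrete, here is a minimal plain-Scala sketch of the same search over an in-memory set of edges. It does not use GraphFrames: the names `MotifSketch`, `follows`, and `recommend` are hypothetical, the sample edges extend the slides' Carol/Oprah/Reese example with one invented edge, and the extra `a != c` condition is an added assumption so that nobody is recommended to themselves.

```scala
object MotifSketch {
  // Hypothetical "follows" edges (follower -> followed).
  val follows: Set[(String, String)] =
    Set("Carol" -> "Oprah", "Oprah" -> "Reese", "Ted" -> "Carol")

  // All (a, b, c) such that a -> b and b -> c exist but a -> c does not.
  def recommend(edges: Set[(String, String)]): Set[(String, String, String)] =
    for {
      (a, b)  <- edges
      (b2, c) <- edges
      if b == b2 && a != c && !edges.contains(a -> c)
    } yield (a, b, c)
}
```

With these sample edges, `MotifSketch.recommend(MotifSketch.follows)` yields two matches: Carol via Oprah to Reese, and Ted via Carol to Oprah.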
  • 17. Separate Systems (image reference: Spark Summit)
  • 18. GraphFrames: Graph Algorithms + Graph Queries (image reference: Spark Summit)
  • 20. Real World Graphs: Twitter
        Twitter tweets: morally outraged tweets are retweeted within a political sphere but rarely outside it
        (Reference: National Academy of Sciences)
  • 21. Graph: Recommendation Engines
        Recommendation engine:
        •  Vertices = users, products
        •  Edges = ratings or purchases
        •  Calculate how similar users rated similar products
  • 22. Real World Graphs: Fraud
        Healthcare fraud:
        •  Vertices = doctors, patients, prescriptions
        •  Edges = prescribed
        •  Calculate narcotic abuse, patient similarity, over-prescribing
  • 23. Real World Graphs: Fraud
        Credit card application fraud:
        •  Vertices = credit card applicant, phone, email, address, SSN
        •  Edges = identifier
        •  Detect people sharing identifiers such as a telephone number
        (Diagram: shared identifier, phone number. Image reference: Capital One at Spark Summit)
  • 25. Simple Flight Example with GraphFrames
        Originating Airport | Destination Airport | Distance   | Delay
        SFO                 | ORD                 | 1800 miles | 40
        ORD                 | DFW                 | 800 miles  | 0
        DFW                 | SFO                 | 1400 miles | 10
  • 26. Vertex Table
  • 27. Edges Table
  • 28. Create a Vertices DataFrame
        case class Airport(id: String, city: String)
        val airports = Array(Airport("SFO", "San Francisco"), Airport("ORD", "Chicago"), Airport("DFW", "Dallas Fort Worth"))
        val vertices = spark.createDataset(airports).toDF
        vertices.show
        +---+-----------------+
        | id|             city|
        +---+-----------------+
        |SFO|    San Francisco|
        |ORD|          Chicago|
        |DFW|Dallas Fort Worth|
        +---+-----------------+
  • 29. Create an Edges DataFrame
        case class Flight(id: String, src: String, dst: String, dist: Double, delay: Double)
        val flights = Array(
          Flight("SFO_ORD_2017-01-01_AA", "SFO", "ORD", 1800, 40),
          Flight("ORD_DFW_2017-01-01_UA", "ORD", "DFW", 800, 0),
          Flight("DFW_SFO_2017-01-01_DL", "DFW", "SFO", 1400, 10))
        val edges = spark.createDataset(flights).toDF
        edges.show
        +--------------------+---+---+------+-----+
        |                  id|src|dst|  dist|delay|
        +--------------------+---+---+------+-----+
        |SFO_ORD_2017-01-0...|SFO|ORD|1800.0| 40.0|
        |ORD_DFW_2017-01-0...|ORD|DFW| 800.0|  0.0|
        |DFW_SFO_2017-01-0...|DFW|SFO|1400.0| 10.0|
        +--------------------+---+---+------+-----+
  • 30. Create the GraphFrame
        import org.graphframes.GraphFrame
        val graph = GraphFrame(vertices, edges)
        graph.vertices.show
        +---+-----------------+
        | id|             city|
        +---+-----------------+
        |SFO|    San Francisco|
        |ORD|          Chicago|
        |DFW|Dallas Fort Worth|
        +---+-----------------+
  • 31. GraphFrame Edges
        graph.edges.show
        result:
        +--------------------+---+---+------+-----+
        |                  id|src|dst|  dist|delay|
        +--------------------+---+---+------+-----+
        |SFO_ORD_2017-01-0...|SFO|ORD|1800.0| 40.0|
        |ORD_DFW_2017-01-0...|ORD|DFW| 800.0|  0.0|
        |DFW_SFO_2017-01-0...|DFW|SFO|1400.0| 10.0|
        +--------------------+---+---+------+-----+
  • 32. Graph Operators
        To answer questions such as:
        •  How many airports are there?
        •  How many flight routes are there?
        •  What are the longest distance routes?
        •  Which airport has the most incoming flights?
        •  What are the top 10 flights?
  • 33. Query the GraphFrame
        // How many airports?
        graph.vertices.count
        result: 3
        // How many flights?
        graph.edges.count
        result: 3
  • 34. Query the GraphFrame
        // flight routes > 800 miles distance?
        graph.edges.filter("dist > 800").show
        +--------------------+---+---+------+-----+
        |                  id|src|dst|  dist|delay|
        +--------------------+---+---+------+-----+
        |SFO_ORD_2017-01-0...|SFO|ORD|1800.0| 40.0|
        |DFW_SFO_2017-01-0...|DFW|SFO|1400.0| 10.0|
        +--------------------+---+---+------+-----+
  • 36. How a Spark Application Runs on a Cluster
  • 37. Spark Distributed Datasets
        •  A Dataset is a collection of typed objects: Dataset[T] (can use SQL and functions)
        •  A DataFrame is a Dataset of Row objects: Dataset[Row] (can use SQL)
        •  Partitioned across a cluster
        •  Operated on in parallel
        •  Can be cached
  • 38. Spark SQL Querying MapR-DB JSON
        •  Spark SQL queries and updates to MapR-DB
        •  With projection and filter pushdown, custom partitioning, and data locality
  • 39. Designed for Partitioning and Scaling
        Data is automatically partitioned and sorted by the id row key.
  • 40. Spark MapR-DB Connector
  • 41. Flight Dataset
        {
          "id": "ATL_LGA_2017-01-01_AA_1678",
          "dofW": 7, "carrier": "AA", "src": "ATL", "dst": "LGA",
          "crsdephour": 17, "crsdeptime": 1700, "depdelay": 0.0,
          "crsarrtime": 1912, "arrdelay": 0.0, "crselapsedtime": 132.0, "dist": 762.0
        }
        The table is automatically partitioned and sorted by the id row key.
  • 42. MapR-DB JSON Document Store
        Data is automatically partitioned and sorted by the id row key.
        {
          "id": "ATL_LGA_2017-01-01_AA_1678",
          "dofW": 7, "carrier": "AA", "src": "ATL", "dst": "LGA",
          "crsdephour": 17, "crsdeptime": 1700, "depdelay": 0.0,
          "crsarrtime": 1912, "arrdelay": 0.0, "crselapsedtime": 132.0, "dist": 762.0
        }
  • 43. Row key
        The row key starts with src_dst, e.g. ATL_LGA_2017-01-01_AA_1678.
        Data is automatically partitioned by key range and sorted by row key, so the table is partitioned by the src and dst vertices.
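A small plain-Scala sketch of why a key that starts with src_dst keeps flights on the same route together once rows are sorted by key. The helper `rowKey` and its field order are assumptions inferred from the single example key on the slide (in particular, treating the last field as a flight number is a guess), and all sample keys except the first ATL_LGA one are invented for illustration.

```scala
object RowKeySketch {
  // Hypothetical helper; field order inferred from ATL_LGA_2017-01-01_AA_1678.
  def rowKey(src: String, dst: String, date: String,
             carrier: String, flightNumber: Int): String =
    s"${src}_${dst}_${date}_${carrier}_$flightNumber"

  // Illustrative keys, deliberately listed out of order.
  val keys: Seq[String] = Seq(
    rowKey("ORD", "LGA", "2017-01-01", "UA", 100),
    rowKey("ATL", "LGA", "2017-01-01", "AA", 1678),
    rowKey("ATL", "BOS", "2017-01-02", "DL", 200),
    rowKey("ATL", "LGA", "2017-01-02", "AA", 1678))

  // Sorting by key places all rows of one src_dst route next to each other,
  // so a key-range partition holds whole routes rather than scattered flights.
  val sorted: Seq[String] = keys.sorted
}
```

After sorting, the two ATL_LGA keys are adjacent, which is the property a range scan on a src_dst prefix relies on.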
  • 44. Airports: SFO, DEN, IAH, ATL, ORD, BOS, LGA, EWR, MIA, SEA, LAX, DFW
  • 45. Load the data into a Dataset: Define the Schema
  • 46. Read Dataset from MapR-DB
        val tableName = "/user/mapr/flighttable"
        val df = spark.sparkSession
          .loadFromMapRDB[Flight](tableName, schema)
        (Diagram: the driver sends tasks to the workers; each worker processes and caches its share of the data)
  • 47. Show the first rows of the DataFrame
        df.show(5)
        (Figure: DataFrame output with rows and columns labeled)
        Data is automatically partitioned and sorted by row key, which starts with src_dst, e.g. ATL_BOS_2018-01-01_AA_1678.
  • 48. Originating airports with highest number of Departure Delays
        import org.apache.spark.sql.functions.desc
        df.filter($"depdelay" > 40).groupBy("src")
          .count().orderBy(desc("count")).show(5)
        +---+-----+
        |src|count|
        +---+-----+
        |ORD| 4033|
        |ATL| 3106|
        |DFW| 2782|
        |EWR| 2328|
        |DEN| 2304|
        +---+-----+
  • 49. MapR-DB Projection and Filter push down
        df.filter($"depdelay" > 40).groupBy("src")
          .count.orderBy(desc("count")).explain
        == Physical Plan ==
        *(3) Sort [count#549L DESC NULLS LAST], true, 0
        +- Exchange rangepartitioning(count#549L DESC NULLS LAST, 200)
           +- *(2) HashAggregate(keys=[src#5], functions=[count(1)])
              +- Exchange hashpartitioning(src#5, 200)
                 +- *(1) HashAggregate(keys=[src#5], functions=[partial_count(1)])
                    +- *(1) Project [src#5]
                       +- *(1) Filter (isnotnull(depdelay#9) && (depdelay#9 > 40.0))
                          +- *(1) Scan MapRDBRelation(/user/mapr/flighttable [src#5,depdelay#9] PushedFilters: [IsNotNull(depdelay), GreaterThan(depdelay, 40.0)], ReadSchema: struct<src:string,depdelay:double>
        Project and Filter are pushed into MapR-DB.
  • 50. Spark MapR-DB Projection Filter push down
        Projection and filter pushdown reduces the amount of data passed between MapR-DB and the Spark engine when selecting and filtering data.
        Data is selected and filtered in MapR-DB.
  • 51. Register DataFrame as a Temporary View
        df.cache
        df.count()
        Long = 282628
        df.createOrReplaceTempView("flights")
  • 52. Average Departure Delay by Carrier
        %sql select carrier, avg(depdelay) from flights group by carrier
  • 53. Count of Departure Delays by Origin
        %sql select src, count(depdelay) from flights where depdelay > 40 group by src
  • 54. Count of Departure Delays by Origin, Destination
        %sql select src, dst, count(depdelay) from flights where depdelay > 40 group by src, dst
  • 56. GraphFrame and DataFrame
        To answer questions such as:
        •  How many flight routes are there?
        •  What are the longest distance routes?
        •  Which airport has the most incoming flights?
        •  What are the top 10 flight routes?
  • 57. Read Vertices DataFrame from a JSON File
        val airports = spark.read.json(file)
        airports.show
        +-------------+-------+-----+---+
        |         City|Country|State| id|
        +-------------+-------+-----+---+
        |      Chicago|    USA|   IL|ORD|
        |     New York|    USA|   NY|JFK|
        |     New York|    USA|   NY|LGA|
        |       Boston|    USA|   MA|BOS|
        |      Houston|    USA|   TX|IAH|
        |       Newark|    USA|   NJ|EWR|
        |       Denver|    USA|   CO|DEN|
        |        Miami|    USA|   FL|MIA|
        |San Francisco|    USA|   CA|SFO|
        |      Atlanta|    USA|   GA|ATL|
        |       Dallas|    USA|   TX|DFW|
        |    Charlotte|    USA|   NC|CLT|
        |  Los Angeles|    USA|   CA|LAX|
        |      Seattle|    USA|   WA|SEA|
        +-------------+-------+-----+---+
  • 58. Create the GraphFrame
        val graph = GraphFrame(airports, df)
        // graph.edges is a DataFrame
        graph.edges.show
  • 59. GraphFrame API
        Category         | Methods
        Graph Topology   | vertices, edges, triplets
        Graph Structure  | inDegrees, outDegrees, degrees
        Graph Algorithms | pageRank, bfs, aggregateMessages, shortestPaths, connectedComponents, triangleCount
        Graph Queries    | motif find
  • 60. DataFrame Queries
        Operation                       | Description
        select(col)                     | Selects a set of columns
        sort(sortcol)                   | Returns a new DataFrame sorted by the specified column
        filter(expr); where(condition)  | Filters based on the SQL expression or condition
        groupBy(cols: Columns)          | Groups the DataFrame using the specified columns
        join(DataFrame, joinExpr)       | Joins with another DataFrame using the given join expression
        count                           | Count of rows
        avg, count, min, max, sum (col) | Average, count, min, max, sum on values in a group
  • 61. Graph Vertices and Edges are DataFrames
        graph.vertices.filter("State='TX'").show
        +-------+-------+-----+---+
        |   City|Country|State| id|
        +-------+-------+-----+---+
        |Houston|    USA|   TX|IAH|
        | Dallas|    USA|   TX|DFW|
        +-------+-------+-----+---+
  • 62. GraphFrame DataFrame Queries
        // How many airports?
        graph.vertices.count
        result: 13
        // How many flights?
        graph.edges.count
        result: 282628
  • 63. What are the 4 Longest Distance Flights?
        // Show the longest distance flight routes
        graph.edges.groupBy("src", "dst")
          .max("dist").sort(desc("max(dist)")).show(4)
        +---+---+---------+
        |src|dst|max(dist)|
        +---+---+---------+
        |MIA|SEA|   2724.0|
        |SEA|MIA|   2724.0|
        |BOS|SFO|   2704.0|
        |SFO|BOS|   2704.0|
        +---+---+---------+
  • 64. What is the average delay for delayed flights from Atlanta?
        graph.edges.filter("src = 'ATL' and depdelay > 1")
          .groupBy("src", "dst").avg("depdelay").sort(desc("avg(depdelay)")).show
        +---+---+------------------+
        |src|dst|     avg(depdelay)|
        +---+---+------------------+
        |ATL|EWR|  58.1085801063022|
        |ATL|ORD| 46.42393736017897|
        |ATL|DFW|39.454460966542754|
        |ATL|LGA| 39.25498489425982|
        |ATL|CLT| 37.56777108433735|
        |ATL|SFO| 36.83008356545961|
        +---+---+------------------+
  • 65. MapR-DB Projection and Filter push down
        graph.edges.filter("src = 'ATL' and depdelay > 1")
          .groupBy("src", "dst").avg("depdelay").sort(desc("avg(depdelay)")).explain
        == Physical Plan ==
        *(3) Sort [avg(depdelay)#273 DESC NULLS LAST], true, 0
        +- Exchange rangepartitioning(avg(depdelay)#273 DESC NULLS LAST, 200)
           +- *(2) HashAggregate(keys=[src#5, dst#6], functions=[avg(depdelay#9)])
              +- Exchange hashpartitioning(src#5, dst#6, 200)
                 +- *(1) HashAggregate(keys=[src#5, dst#6], functions=[partial_avg(depdelay#9)])
                    +- *(1) Filter (((isnotnull(src#5) && isnotnull(depdelay#9)) && (src#5 = ATL)) && (depdelay#9 > 1.0))
                       +- *(1) Scan MapRDBRelation(/user/mapr/flighttable [src#5,dst#6,depdelay#9] PushedFilters: [IsNotNull(src), IsNotNull(depdelay), EqualTo(src,ATL), GreaterThan(depdelay,1.0)], ReadSchema: struct<src:string,dst:string,depdelay:double>
  • 66. What is the Average Delay for delayed flights from Atlanta by Hour?
        z.show(
          graph.edges
            .filter("src = 'ATL' and depdelay > 1")
            .groupBy("crsdephour")
            .avg("depdelay")
        )
  • 67. GraphFrame API
        Category         | Methods
        Graph Topology   | vertices, edges, triplets
        Graph Structure  | inDegrees, outDegrees, degrees
        Graph Algorithms | pageRank, bfs, aggregateMessages, shortestPaths, connectedComponents
        Graph Queries    | motif find
  • 68. Which Airports have the most incoming and outgoing flights?
        (What are the highest-degree vertices?)
        z.show(graph.degrees.orderBy(desc("degree")))
  • 70. Use PageRank to find most important airports
        val ranks = graph.pageRank.resetProbability(0.15).maxIter(10).run()
        ranks.vertices.orderBy($"pagerank".desc).show(5)
        +-------------+-------+-----+---+-------------------+
        |         City|Country|State| id|           pagerank|
        +-------------+-------+-----+---+-------------------+
        |      Chicago|    USA|   IL|ORD| 1.5129929839358685|
        |      Atlanta|    USA|   GA|ATL| 1.4255481544216664|
        |  Los Angeles|    USA|   CA|LAX| 1.2787001001758738|
        |       Dallas|    USA|   TX|DFW| 1.1999252171688064|
        |       Denver|    USA|   CO|DEN| 1.1275194324360767|
        +-------------+-------+-----+---+-------------------+
  • 72. Aggregate Messages to calculate avg delay
        import org.apache.spark.sql.functions.avg
        import org.graphframes.lib.AggregateMessages
        val AM = AggregateMessages
        // message sent by each edge: its departure delay
        val msgToSrc = AM.edge("depdelay")
        // send the message to the edge's source vertex and average per vertex
        val agg = graph
          .aggregateMessages
          .sendToSrc(msgToSrc)
          .agg(avg(AM.msg).as("avgdelay"))
        agg.show()
        +---+------------------+
        | id|          avgdelay|
        +---+------------------+
        |EWR|17.818079459546404|
        |MIA|17.768691978431264|
        |ORD|  16.5199551010227|
        +---+------------------+
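What the aggregation above computes can be mimicked on a plain Scala collection. This is a simplified sketch, not the GraphFrames implementation: `FlightEdge` and `avgDelayBySrc` are hypothetical names, and the sample delays in the usage note are invented.

```scala
object AggregateMessagesSketch {
  // Hypothetical edge type, loosely modeled on the flight edges in the slides.
  final case class FlightEdge(src: String, dst: String, depdelay: Double)

  // Each edge "sends" its depdelay to its source vertex;
  // the messages arriving at a vertex are then averaged.
  def avgDelayBySrc(edges: Seq[FlightEdge]): Map[String, Double] =
    edges
      .map(e => e.src -> e.depdelay)   // message keyed by the receiving vertex
      .groupBy(_._1)                   // collect messages per vertex
      .map { case (v, msgs) => v -> msgs.map(_._2).sum / msgs.size }
}
```

For example, two ATL departures delayed 10 and 30 minutes plus one ORD departure delayed 5 minutes would give an average of 20.0 for ATL and 5.0 for ORD.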
  • 73. What are the most Frequent Flight Routes?
        // count of flight routes
        val flightroutecount = graph.edges
          .groupBy("src", "dst")
          .count().orderBy(desc("count"))
        flightroutecount.show(5)
        +---+---+-----+
        |src|dst|count|
        +---+---+-----+
        |LGA|ORD| 4442|
        |ORD|LGA| 4426|
        |LAX|SFO| 4406|
        |SFO|LAX| 4354|
        |ATL|LGA| 3884|
        +---+---+-----+
        // how many routes?
        flightroutecount.count
        Long = 148
  • 74. What are the most Frequent Flight Routes? (highest count of flights)
        z.show(flightroutecount)
  • 76. Triplets = 2 Vertices and 1 Connecting Edge
        graph.triplets
          .show(3)
        +--------------------+--------------------+--------------------+
        |                 src|                edge|                 dst|
        +--------------------+--------------------+--------------------+
        |[Atlanta, USA, GA...|[ATL_BOS_2018-01-...|[Boston, USA, MA,...|
        |[Atlanta, USA, GA...|[ATL_BOS_2018-01-...|[Boston, USA, MA,...|
        |[Atlanta, USA, GA...|[ATL_BOS_2018-01-...|[Boston, USA, MA,...|
        +--------------------+--------------------+--------------------+
        The src, edge, and dst columns are DataFrame columns.
  • 77. Triplets = 2 Vertices and 1 Connecting Edge
        // the result is a DataFrame, so it can be refined with a filter
        graph.triplets
          .filter("src.State='TX'")
          .show
        +----------------------+------------------------------------------------------------------------------------------------------+-----------------------+
        |src                   |edge                                                                                                  |dst                    |
        +----------------------+------------------------------------------------------------------------------------------------------+-----------------------+
        |[Dallas, USA, TX, DFW]|[DFW_ATL_2018-01-01_AA_1473, 2018-01-01, 1, 1, AA, DFW, ATL, 10, 1026, 26.0, 1327, 21.0, 121.0, 731.0]|[Atlanta, USA, GA, ATL]|
        |[Dallas, USA, TX, DFW]|[DFW_ATL_2018-01-01_AA_1675, 2018-01-01, 1, 1, AA, DFW, ATL, 13, 1255, 32.0, 1557, 16.0, 122.0, 731.0]|[Atlanta, USA, GA, ATL]|
        |[Dallas, USA, TX, DFW]|[DFW_ATL_2018-01-01_AA_2408, 2018-01-01, 1, 1, AA, DFW, ATL, 18, 1835, 4.0, 2141, 0.0, 126.0, 731.0]  |[Atlanta, USA, GA, ATL]|
        |[Dallas, USA, TX, DFW]|[DFW_ATL_2018-01-01_AA_2479, 2018-01-01, 1, 1, AA, DFW, ATL, 9, 855, 0.0, 1200, 0.0, 125.0, 731.0]    |[Atlanta, USA, GA, ATL]|
        |[Dallas, USA, TX, DFW]|[DFW_ATL_2018-01-01_AA_2497, 2018-01-01, 1, 1, AA, DFW, ATL, 21, 2055, 0.0, 2359, 0.0, 124.0, 731.0]  |[Atlanta, USA, GA, ATL]|
        +----------------------+------------------------------------------------------------------------------------------------------+-----------------------+
  • 78. Motif find
        // search for a pattern
        graph.find("(src)-[edge]->(dst)")
          .show(3)
        +--------------------+--------------------+--------------------+
        |                 src|                edge|                 dst|
        +--------------------+--------------------+--------------------+
        |[Atlanta, USA, GA...|[ATL_BOS_2018-01-...|[Boston, USA, MA,...|
        |[Atlanta, USA, GA...|[ATL_BOS_2018-01-...|[Boston, USA, MA,...|
        |[Atlanta, USA, GA...|[ATL_BOS_2018-01-...|[Boston, USA, MA,...|
        +--------------------+--------------------+--------------------+
  • 79. Next: use flightroutecount with Motif find
        // count of flight routes
        val flightroutecount = graph.edges
          .groupBy("src", "dst")
          .count().orderBy(desc("count"))
        flightroutecount.show(5)
        +---+---+-----+
        |src|dst|count|
        +---+---+-----+
        |LGA|ORD| 4442|
        |ORD|LGA| 4426|
        |LAX|SFO| 4406|
        |SFO|LAX| 4354|
        |ATL|LGA| 3884|
        +---+---+-----+
  • 80. Motif Find Flights with No Direct Connection
        val subGraph = GraphFrame(graph.vertices, flightroutecount)
        val res = subGraph
          .find("(a)-[]->(b); (b)-[]->(c); !(a)-[]->(c)")   // search for a pattern
          .filter("c.id != a.id")                           // DataFrame filter refines the result: remove duplicates
        Pattern: ( ) is a vertex and [ ] is an edge; (a)-[]->(b) and (b)-[]->(c) must exist, and !(a)-[]->(c) means there is no direct edge from a to c.
  • 83. Compute shortest paths from each Airport to LGA
        val results = graph.shortestPaths.landmarks(Seq("LGA")).run()
        +---+----------+
        | id| distances|
        +---+----------+
        |IAH|[LGA -> 1]|
        |CLT|[LGA -> 1]|
        |LAX|[LGA -> 2]|
        |DEN|[LGA -> 1]|
        |DFW|[LGA -> 1]|
        |SFO|[LGA -> 2]|
        |LGA|[LGA -> 0]|
        |ORD|[LGA -> 1]|
        |MIA|[LGA -> 1]|
        |SEA|[LGA -> 2]|
        |ATL|[LGA -> 1]|
        |BOS|[LGA -> 1]|
        |EWR|[LGA -> 2]|
        +---+----------+
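The distances in the table above are hop counts (number of flights) to the landmark. A minimal plain-Scala sketch of that idea, assuming a breadth-first search over reversed edges; `ShortestPathsSketch` and `hopsTo` are hypothetical names, this is not how GraphFrames implements `shortestPaths`, and the routes in the usage note are invented rather than taken from the dataset.

```scala
object ShortestPathsSketch {
  // Hop count from every vertex that can reach `landmark` to `landmark`,
  // following edge direction, via breadth-first search over reversed edges.
  def hopsTo(landmark: String, edges: Seq[(String, String)]): Map[String, Int] = {
    // For each destination, the airports with a direct edge into it.
    val incoming: Map[String, Seq[String]] =
      edges.groupBy(_._2).map { case (dst, es) => dst -> es.map(_._1).distinct }

    @annotation.tailrec
    def loop(frontier: Set[String], dist: Map[String, Int], hops: Int): Map[String, Int] =
      if (frontier.isEmpty) dist
      else {
        // Vertices one hop further away that have not been reached yet.
        val next = frontier
          .flatMap(v => incoming.getOrElse(v, Seq.empty[String]))
          .filterNot(dist.contains)
        loop(next, dist ++ next.map(_ -> (hops + 1)), hops + 1)
      }

    loop(Set(landmark), Map(landmark -> 0), 0)
  }
}
```

With the hypothetical routes LAX to ORD, ORD to LGA, SFO to LAX, and LGA to ORD, `hopsTo("LGA", routes)` gives LGA 0, ORD 1, LAX 2, and SFO 3.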
  • 85. 85 © 2018 MapR Technologies, Inc. // MapR Confidential graph.bfs.fromExpr("id = 'LAX'") .toExpr("id = 'LGA'").maxPathLength(1).run().show() +----+-------+-----+---+ |City|Country|State| id| +----+-------+-----+---+ +----+-------+-----+---+ Breadth First Search for Direct Flights between LAX and LGA
  • 86. Breadth First Search for Flights between LAX and LGA
      graph.bfs.fromExpr("id = 'LAX'")
        .toExpr("id = 'LGA'").maxPathLength(2).run().show(5)
      +--------------------+--------------------+--------------------+--------------------+--------------------+
      |                from|                  e0|                  v1|                  e1|                  to|
      +--------------------+--------------------+--------------------+--------------------+--------------------+
      |[Los Angeles, USA...|[LAX_IAH_2018-01-...|[Houston, USA, TX...|[IAH_LGA_2018-01-...|[New York, USA, N...|
      |[Los Angeles, USA...|[LAX_IAH_2018-01-...|[Houston, USA, TX...|[IAH_LGA_2018-01-...|[New York, USA, N...|
      |[Los Angeles, USA...|[LAX_IAH_2018-01-...|[Houston, USA, TX...|[IAH_LGA_2018-01-...|[New York, USA, N...|
      |[Los Angeles, USA...|[LAX_IAH_2018-01-...|[Houston, USA, TX...|[IAH_LGA_2018-01-...|[New York, USA, N...|
      |[Los Angeles, USA...|[LAX_IAH_2018-01-...|[Houston, USA, TX...|[IAH_LGA_2018-01-...|[New York, USA, N...|
      +--------------------+--------------------+--------------------+--------------------+--------------------+
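The two BFS slides differ only in maxPathLength: with a limit of 1 nothing is returned, and with a limit of 2 the one-stop routes appear. The plain-Scala sketch below mimics that behaviour on a toy edge set, assuming the search stops at the first path length that reaches the target. `BfsSketch` and its `bfs` function are hypothetical and work on airport codes only, not on GraphFrames vertices and edges.

```scala
object BfsSketch {
  // All shortest vertex paths from `from` to `to` that use at most
  // `maxPathLength` edges. Returns Nil when no path that short exists.
  def bfs(from: String, to: String, maxPathLength: Int,
          edges: Set[(String, String)]): List[List[String]] = {
    // Adjacency: source -> every destination reachable in one hop.
    val outgoing: Map[String, Set[String]] =
      edges.groupBy(_._1).map { case (src, es) => src -> es.map(_._2) }

    // Each path is stored reversed, so its head is the current vertex.
    @annotation.tailrec
    def loop(paths: List[List[String]], length: Int): List[List[String]] = {
      val hits = paths.filter(_.head == to)
      if (hits.nonEmpty) hits.map(_.reverse)                 // shortest length found
      else if (length == maxPathLength || paths.isEmpty) Nil // limit reached
      else {
        val extended = for {
          p    <- paths
          next <- outgoing.getOrElse(p.head, Set.empty[String]).toList
          if !p.contains(next)                               // do not revisit a vertex
        } yield next :: p
        loop(extended, length + 1)
      }
    }

    loop(List(List(from)), 0)
  }
}
```

With edges LAX -> IAH, IAH -> LGA, LAX -> DEN and DEN -> LGA, a limit of 1 gives an empty list and a limit of 2 gives the two one-stop routes.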
  • 87. Motif Search for Flights between LAX and LGA
      graph.find("(a)-[ab]->(b); (b)-[bc]->(c)")
        .filter("a.id = 'LAX'")
        .filter("c.id = 'LGA'").show(4)
      Callouts: find searches for a pattern; the result is a DataFrame, which filter then refines.
  • 88. Breadth First Search for Flights between LAX and LGA with AA
      val paths = graph.bfs.fromExpr("id = 'LAX'").toExpr("id = 'LGA'")
        .maxPathLength(3).edgeFilter("carrier = 'AA'").run()
      paths.filter("e0.crsarrtime<e1.crsdeptime-60 and e0.fldate=e1.fldate")
        .select("e0.id","e1.id").show(5)
      +--------------------------+--------------------------+
      |id                        |id                        |
      +--------------------------+--------------------------+
      |LAX_BOS_2018-02-03_AA_1098|BOS_LGA_2018-02-03_AA_2126|
      |LAX_BOS_2018-02-03_AA_1379|BOS_LGA_2018-02-03_AA_2126|
      |LAX_CLT_2018-02-14_AA_1905|CLT_LGA_2018-02-14_AA_1740|
      |LAX_CLT_2018-02-14_AA_1905|CLT_LGA_2018-02-14_AA_1910|
      |LAX_CLT_2018-02-14_AA_1905|CLT_LGA_2018-02-14_AA_1954|
      +--------------------------+--------------------------+
      Callouts: bfs runs the search; the result is a DataFrame, which filter then refines.
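The filter on this slide keeps only pairs of legs that fall on the same date and leave a margin between the first leg's scheduled arrival and the second leg's scheduled departure. The sketch below applies the same two conditions to an in-memory list in plain Scala. The `Flight` case class is hypothetical: its field names follow the slide's filter expression, the integer time type is an assumption, and the slide does not state the units of the time fields, so the `- 60` arithmetic is copied as-is.

```scala
object ConnectionSketch {
  // Hypothetical edge record; field names follow the slide's filter expression.
  final case class Flight(id: String, src: String, dst: String, carrier: String,
                          fldate: String, crsdeptime: Int, crsarrtime: Int)

  // Pairs (e0.id, e1.id) where both legs are flown by `carrier`, e1 departs
  // from the airport where e0 lands, and the slide's two conditions hold.
  def connections(flights: Seq[Flight], carrier: String): Seq[(String, String)] =
    for {
      e0 <- flights if e0.carrier == carrier
      e1 <- flights if e1.carrier == carrier
      if e0.dst == e1.src
      if e0.crsarrtime < e1.crsdeptime - 60 && e0.fldate == e1.fldate
    } yield (e0.id, e1.id)
}
```

In the slide the carrier restriction is applied earlier, through edgeFilter, so the search never follows another carrier's edges; here it is folded into the same comprehension for brevity.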
  • 90. New Spark Ebook. The link to the code for this webinar is in the appendix of this book: https://mapr.com/ebook/getting-started-with-apache-spark-v2/
  • 92. To Learn More: New Spark 2.0 training. MapR Free ODT: http://learn.mapr.com/
  • 93. MapR Blog: https://mapr.com/blog/
  • 94. MapR Data Platform. The link to the code for this webinar is in the appendix of the book: https://mapr.com/ebook/getting-started-with-apache-spark-v2/