WIFI SSID:Spark+AISummit | Password: UnifiedDataAnalytics
Hendrik Frentrup, systemati.co
Maps and Meaning
Graph-based Entity Resolution
#UnifiedDataAnalytics #SparkAISummit
Source: Jordi Guzmán (creative commons)
Data is the new oil
Building Value Streams
Source: Malcolm Manners (creative commons)
Data Extraction
Data Refining
Data Warehousing
Data Pipeline
Source 1
Source 3
Source N
…
Visualisation
Presentation
Dashboards
Machine Learning
Statistical Analysis
Inference
Predictions
Data Extraction
Transformation
Integration
Data Modelling
Upstream integrations
Source 1
Source 3
Source N
…
First Order Transformation:
• Deduplication -> df.distinct()
• Transformations -> df.withColumn(col, expr(col))
• Mapping -> df.withColumnRenamed(old, new)
Second Order Transformation:
• Denormalisation -> lhs.join(rhs, key)
Nth Order Transformation:
• Merge N Sources -> Entity Resolution
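The transformation orders above can be sketched in plain Python, with toy dicts standing in for DataFrames (the data and names here are illustrative, not the talk's Spark code):

```python
# Plain-Python stand-ins for the transformation orders above.
rows = [{"name": "Harry"}, {"name": "Harry"}, {"name": "Sten"}]

# First order, deduplication (cf. df.distinct())
deduped = [dict(t) for t in {tuple(sorted(r.items())) for r in rows}]

# First order, mapping (cf. df.withColumnRenamed(old, new))
renamed = [{"first_name": r["name"]} for r in deduped]

# Second order, denormalisation (cf. lhs.join(rhs, key)):
# join a second "table" of emails on the name key
emails = {"Harry": "harry@mulisch.nl"}
joined = [{**r, "email": emails.get(r["first_name"])} for r in renamed]
```

Nth-order entity resolution has no one-liner equivalent, which is exactly why the rest of the talk is about it.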
Outline
• Motivation
• Entity Resolution Example
• Graph-based Entity Resolution Algorithm
• Data Pipeline Architecture
• Implementation
– In GraphFrames (Python API)
– In GraphX (Scala API)
• The Role of Machine Learning in Entity Resolution
Example: Find Duplicates
• Merge records in your Address Book
| ID | First Name | Last Name | Email                  | Mobile Number | Phone Number |
|  1 | Harry      | Mulisch   | harry@mulisch.nl       | +31 101 1001  |              |
|  2 | HKV        | Mulisch   | Harry.Mulish@gmail.com |               | +31 666 7777 |
|  3 |            |           | author@heaven.nl       | +31 101 1001  |              |
|  4 | Harry      | Mulisch   |                        | +31 123 4567  | +31 666 7777 |

| ID | First Name | Last Name | Email                                                      | Mobile Number              | Phone Number |
|  1 | Harry/HKV  | Mulisch   | harry@mulisch.nl, Harry.Mulish@gmail.com, author@heaven.nl | +31 101 1001, +31 123 4567 | +31 666 7777 |
…such as Google Contacts
| ID | First Name | Last Name | Email                  | Mobile Number | Phone Number | Source  |
|  1 | Harry      | Mulisch   | harry@mulisch.nl       | +31 101 1001  |              | Phone   |
|  2 | S          | Nadolny   |                        | +49 899 9898  |              | Phone   |
|  3 | Harry      | Mulisch   |                        | +31 123 4567  | +31 666 7777 | Phone   |
|  4 |            |           | author@heaven.nl       | +31 101 1001  |              | Gmail   |
|  5 | Sten       | Nadolny   | sten@slow.de           | +49 899 9898  |              | Gmail   |
|  6 | Max        | Frisch    | max@andorra.ch         |               |              | Outlook |
|  7 | HKV        |           | Harry.Mulish@gmail.com |               | +31 666 7777 | Outlook |
Example: Resolving records
Graph Algorithm Walkthrough
[Graph figure: the seven source records drawn as nodes 1-7, each labelled with its name, email and phone values, with no edges yet]
• Each record is a node
• Create edges based on similarities
• Collect connected nodes
• Consolidate information in records
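The four steps above can be run end-to-end in a minimal, non-Spark sketch over the seven example records (here an exact shared value stands in for the similarity rules; a real pipeline would match per column):

```python
# Each record is a node; its attribute values are kept as a set.
records = {
    1: {"Harry Mulisch", "harry@mulisch.nl", "+31 101 1001"},
    2: {"S Nadolny", "+49 899 9898"},
    3: {"Harry Mulisch", "+31 123 4567", "+31 666 7777"},
    4: {"author@heaven.nl", "+31 101 1001"},
    5: {"Sten Nadolny", "sten@slow.de", "+49 899 9898"},
    6: {"Max Frisch", "max@andorra.ch"},
    7: {"HKV", "Harry.Mulish@gmail.com", "+31 666 7777"},
}

# Union-find for connected components.
parent = {i: i for i in records}

def find(i):
    while parent[i] != i:
        parent[i] = parent[parent[i]]  # path compression
        i = parent[i]
    return i

def union(i, j):
    parent[find(i)] = find(j)

# Create edges based on similarities (here: any exact shared value).
ids = sorted(records)
for a in ids:
    for b in ids:
        if a < b and records[a] & records[b]:
            union(a, b)

# Collect connected nodes and consolidate their information.
entities = {}
for i in ids:
    entities.setdefault(find(i), set()).update(records[i])
```

This resolves the seven records into three entities: Harry Mulisch/HKV, Sten/S Nadolny and Max Frisch.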
Copyright 2019 © systemati.co
[Graph figure: the connected nodes collapsed into three resolved entities: "Harry Mulisch/HKV" with all emails and numbers from records 1, 3, 4 and 7, "Sten/S Nadolny" from records 2 and 5, and "Max Frisch" from record 6]
• Each record is a node
• Create edges based on similarities
• Collect connected nodes
• Consolidate information in records
Entity Resolution Pipeline
Architecture
[Pipeline diagram: Source 1 ... Source N are extracted as source copies into the Data Hub/Lake/Warehouse; cleaned records are appended, the Resolve Entities step produces consolidated nodes, and the Merge Entities step yields resolved records]
Technical Implementation
Graphs in Apache Spark
|            | GraphX | GraphFrames |
| Python API |        | 👍          |
| Scala API  | 👍     | 👍          |
With GraphFrames
Create nodes
• Add an id column to the dataframe of records
+---+------------+-----------+-----------+---------+----------+--------------+
| id| ssn| email| phone| address| DoB| Name|
+---+------------+-----------+-----------+---------+----------+--------------+
| 0| 714-12-4462| len@sma.ll| 6088881234| ...| 15/4/1937| Lennie Small |
| 1| 481-33-1024| geo@mil.tn| 6077654980| ...| 15/4/1937| Goerge Milton|
Identifiers Attributes
from pyspark.sql.functions import monotonically_increasing_id
nodes = records.withColumn("id", monotonically_increasing_id())
Edge creation
from functools import reduce
from pyspark.sql.functions import col

match_cols = ["ssn", "email"]
mirrorColNames = [f"_{col}" for col in records.columns]
mirror = records.toDF(*mirrorColNames)
mcond = [col(c) == col(f"_{c}") for c in match_cols]
cond = [(col("id") != col("_id")) &
        reduce(lambda x, y: x | y, mcond)]
edges = records.join(mirror, cond)
cond:
[Column<b'((NOT (id = _id)) AND (((ssn = _ssn) OR (email = _email))
Resolve entities and consolidation
• Connected Components
graph = gf.GraphFrame(nodes, edges)
sc.setCheckpointDir("/tmp/checkpoints")
cc = graph.connectedComponents()
entities = cc.groupBy("component").agg(collect_set("name"))
• Consolidate Components
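The slide gives no code for the consolidation step; one plain-Python way to sketch it (toy rows with a `component` column, mirroring the output of `connectedComponents()`) is to collect the distinct values of every column per component:

```python
# Toy resolved rows; "component" plays the role of the entity id
# assigned by connected components.
rows = [
    {"component": 0, "name": "Harry Mulisch", "email": "harry@mulisch.nl"},
    {"component": 0, "name": "HKV", "email": "Harry.Mulish@gmail.com"},
    {"component": 1, "name": "Max Frisch", "email": "max@andorra.ch"},
]

# Group by component and collect the set of values seen in each column.
consolidated = {}
for row in rows:
    entity = consolidated.setdefault(row["component"], {})
    for column, value in row.items():
        if column != "component":
            entity.setdefault(column, set()).add(value)
```

In Spark this corresponds to a groupBy on the component id with a collect_set aggregation per attribute column.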
With GraphX
Strongly Typed Scala
• Defining the schema of our data
val record_schema = StructType(Seq(
  StructField("id", LongType, nullable = false),
  StructField("name", StringType, nullable = true),
  StructField("email", StringType, nullable = true),
  StructField("ssn", LongType, nullable = true),
  StructField("attr", StringType, nullable = true)
))
Node creation
• Add an ID column to records
• Turn DataFrame into RDD
val nodesRDD = records.map(r => (r.getAs[VertexId]("id"), 1)).rdd
Edge creation
val mirrorColNames = for (col <- records.columns) yield "_"+col.toString
val mirror = records.toDF(mirrorColNames: _*)
def conditions(matchCols: Seq[String]): Column = {
col("id")=!=col("_id") &&
matchCols.map(c => col(c)===col("_"+c)).reduce(_ || _)
}
val edges = records.join(mirror, conditions(Seq("ssn", "email")))
val edgesRDD = edges
.select("id","_id")
.map(r => Edge(r.getAs[VertexId](0),r.getAs[VertexId](1),null))
.rdd
Resolve entities and consolidation
• Connected Components
val graph = Graph(nodesRDD, edgesRDD)
val cc = graph.connectedComponents()
val entities = cc.vertices.toDF()
val resolved_records = records.join(entities, $"id"===$"_1")
val res_records = resolved_records
.withColumnRenamed("_2", "e_id")
.groupBy("e_id")
.agg(collect_set($"name"))
• Consolidate Components
Resolve operation
• Input: DataFrame
• Columns to match: ["ssn", "email"]
• Output: DataFrame
Evaluation
• Number of source records per entity
• Business logic:
– Conflicts (multiple SSNs)
• Distribution of matches
[Bar chart: "Entities by Nr of Sources", counting how many resolved entities were found in one, two, three or four sources, y-axis from 0 to 300]
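The entities-by-number-of-sources evaluation can be computed directly from the resolved records; this sketch uses toy entity assignments taken from the address-book example (the mapping is illustrative):

```python
from collections import Counter

# Which source each record came from, and which entity it resolved to.
source_of = {1: "Phone", 2: "Phone", 3: "Phone", 4: "Gmail",
             5: "Gmail", 6: "Outlook", 7: "Outlook"}
entity_of = {1: "harry", 3: "harry", 4: "harry", 7: "harry",
             2: "sten", 5: "sten", 6: "max"}

# Distinct sources contributing to each entity.
sources_per_entity = {}
for record_id, entity in entity_of.items():
    sources_per_entity.setdefault(entity, set()).add(source_of[record_id])

# Distribution: how many entities appear in 1, 2, 3, ... sources.
distribution = Counter(len(s) for s in sources_per_entity.values())
```

Skewed distributions (or entities carrying conflicting identifiers such as multiple SSNs) are the signal to revisit the matching rules.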
Evolving
Entity Resolution
Machine learning in Entity Resolution
• Pairwise comparison
  – String matching / distance measures
  – Incorporate temporal data into edge creation
  Example: comparing record 1 ("Harry Mulisch", harry@mulisch.nl) against ("H Muiisch", Harry.Mulish@gmail.com) can yield either a binary label { 1, 0 } or a score such as P(match)=0.8762
• Edge creation is the most computationally heavy step
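For instance, a fuzzy string score can replace exact equality when creating candidate edges; this sketch uses the standard library's difflib as one possible distance measure (the threshold is an assumption for illustration, not the model behind the P(match) figure above):

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    # Normalised similarity in [0, 1]; 1.0 means identical after lowercasing.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

score = similarity("Harry Mulisch", "H Muiisch")
is_match = score > 0.6  # tunable threshold, an assumption
```

Since this scoring runs over candidate pairs, blocking or partitioning the comparison space matters as much as the measure itself.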
Machine learning in Entity Resolution
• Structuring connected Data
• Partitioning of the graph based on clustering of records
• Using weighted edges and learning a classifier to evaluate links between records
Feeding a knowledge graph
Human Interface:
• Analytics
• Forensics
• Discovery
Iterative Improvements:
• Data Quality
• Contextual Information
• Use case driven
Get started yourself
• GitHub Project: Resolver & Notebook:
– https://github.com/hendrikfrentrup/maps-meaning
• Docker container with pySpark & GraphFrames:
– https://hub.docker.com/r/hendrikfrentrup/pyspark-graphframes
Key Takeaways
• Data pipeline coalesces into a single record table
• Connected Components at the core of resolving
• Edge creation is the expensive operation
• Batch operation over a single corpus
Thanks!
Any questions?
Comments?
Observations?
DON’T FORGET TO RATE
AND REVIEW THE SESSIONS
SEARCH SPARK + AI SUMMIT
