Hendrik Frentrup, systemati.co
Maps and Meaning
Graph-based Entity Resolution
#UnifiedDataAnalytics #SparkAISummit
Source: Jordi Guzmán (creative commons)
Data is the new oil
Building Value Streams
Source: Malcolm Manners (creative commons)
Data Extraction
Data Refining
Data Warehousing
Data Pipeline
[Diagram: Source 1, Source 3, …, Source N feed the pipeline. Pipeline stages: Data Extraction, Transformation, Integration, Data Modelling. Downstream uses: Visualisation, Presentation, Dashboards; Machine Learning, Statistical Analysis, Inference, Predictions.]
Upstream integrations
Source 1, Source 3, …, Source N
First Order Transformation:
• Deduplication -> df.distinct()
• Transformations -> df.withColumn(col, expr(col))
• Mapping -> df.withColumnRenamed(old, new)
Second Order Transformation:
• Denormalisation -> lhs.join(rhs, key)
Nth Order Transformation:
• Merge N Sources -> Entity Resolution
Outline
• Motivation
• Entity Resolution Example
• Graph-based Entity Resolution Algorithm
• Data Pipeline Architecture
• Implementation
– In GraphFrames (Python API)
– In GraphX (Scala API)
• The Role of Machine Learning in Entity Resolution
Example: Find Duplicates
• Merge records in your Address Book
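+----+------------+-----------+-------------------------+---------------+--------------+
| ID | First Name | Last Name | Email                   | Mobile Number | Phone Number |
+----+------------+-----------+-------------------------+---------------+--------------+
| 1  | Harry      | Mulisch   | harry@mulisch.nl        | +31 101 1001  |              |
| 2  | HKV        | Mulisch   | Harry.Mulish@gmail.com  |               | +31 666 7777 |
| 3  |            |           | author@heaven.nl        | +31 101 1001  |              |
| 4  | Harry      | Mulisch   |                         | +31 123 4567  | +31 666 7777 |
+----+------------+-----------+-------------------------+---------------+--------------+
+----+------------+-----------+-------------------------+---------------+--------------+
| ID | First Name | Last Name | Email                   | Mobile Number | Phone Number |
+----+------------+-----------+-------------------------+---------------+--------------+
| 1  | Harry/HKV  | Mulisch   | harry@mulisch.nl,       | +31 101 1001, | +31 666 7777 |
|    |            |           | Harry.Mulish@gmail.com, | +31 123 4567  |              |
|    |            |           | author@heaven.nl        |               |              |
+----+------------+-----------+-------------------------+---------------+--------------+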
…such as Google Contacts
Example: Resolving records
ID First Name Last Name Email Mobile Number Phone Number Source
1 Harry Mulisch harry@mulisch.nl +31 101 1001 Phone
2 S Nadolny +49 899 9898 Phone
3 Harry Mulisch +31 123 4567 +31 666 7777 Phone
4 author@heaven.nl +31 101 1001 Gmail
5 Sten Nadolny sten@slow.de +49 899 9898 Gmail
6 Max Frisch max@andorra.ch Outlook
7 HKV Harry.Mulish@gmail.com +31 666 7777 Outlook
Graph Algorithm Walkthrough
Node 1: Harry Mulisch, harry@mulisch.nl, +31 101 1001
Node 2: S Nadolny, +49 899 9898
Node 3: Harry Mulisch, +31 123 4567, +31 666 7777
Node 4: author@heaven.nl, +31 101 1001
Node 5: Sten Nadolny, sten@slow.de, +49 899 9898
Node 6: Max Frisch, max@andorra.ch
Node 7: HKV, Harry.Mulish@gmail.com, +31 666 7777
• Each record is a node
• Create edges based on similarities
• Collect connected nodes
• Consolidate information in records
Copyright 2019 © systemati.co
After consolidation:
Nodes 1, 3, 4, 7: Harry Mulisch/HKV; harry@mulisch.nl, author@heaven.nl, Harry.Mulish@gmail.com; +31 123 4567, +31 666 7777, +31 101 1001
Nodes 2, 5: Sten/S Nadolny; sten@slow.de; +49 899 9898
Node 6: Max Frisch; max@andorra.ch
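The four steps of the walkthrough can be sketched in plain Python on the seven example records. This is only an illustration, not the talk's Spark implementation: it uses a union-find structure instead of a distributed connected-components job, and it assumes that two records are "similar" when they share an exact name, email, or number. The helper names (`keys`, `find`, `union`) are hypothetical.

```python
from collections import defaultdict

# The seven records from the example table; None marks an empty field.
records = [
    {"id": 1, "name": "Harry Mulisch", "email": "harry@mulisch.nl", "numbers": ["+31 101 1001"]},
    {"id": 2, "name": "S Nadolny", "email": None, "numbers": ["+49 899 9898"]},
    {"id": 3, "name": "Harry Mulisch", "email": None, "numbers": ["+31 123 4567", "+31 666 7777"]},
    {"id": 4, "name": None, "email": "author@heaven.nl", "numbers": ["+31 101 1001"]},
    {"id": 5, "name": "Sten Nadolny", "email": "sten@slow.de", "numbers": ["+49 899 9898"]},
    {"id": 6, "name": "Max Frisch", "email": "max@andorra.ch", "numbers": []},
    {"id": 7, "name": "HKV", "email": "Harry.Mulish@gmail.com", "numbers": ["+31 666 7777"]},
]

def keys(rec):
    """Values on which two records count as similar (an assumption of this sketch)."""
    out = set(rec["numbers"])
    for field in ("name", "email"):
        if rec[field]:
            out.add(rec[field])
    return out

# Step 1: each record is a node; parent is a union-find forest over node ids.
parent = {r["id"]: r["id"] for r in records}

def find(x):
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path halving
        x = parent[x]
    return x

def union(a, b):
    ra, rb = find(a), find(b)
    if ra != rb:
        parent[max(ra, rb)] = min(ra, rb)  # the smallest id becomes the root

# Step 2: an edge links any two records that share a key.
first_seen = {}
for r in records:
    for k in keys(r):
        if k in first_seen:
            union(first_seen[k], r["id"])
        else:
            first_seen[k] = r["id"]

# Step 3: collect connected nodes.
components = defaultdict(list)
for r in records:
    components[find(r["id"])].append(r)

# Step 4: consolidate the information of each component into one entity.
entities = {
    root: {
        "ids": sorted(m["id"] for m in members),
        "names": sorted({m["name"] for m in members if m["name"]}),
        "emails": sorted({m["email"] for m in members if m["email"]}),
        "numbers": sorted({n for m in members for n in m["numbers"]}),
    }
    for root, members in components.items()
}
```

Under these assumptions the records fall into three entities: nodes 1, 3, 4 and 7 (Mulisch), nodes 2 and 5 (Nadolny), and node 6 (Frisch), which agrees with the consolidated slide.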
Entity Resolution Pipeline Architecture
[Diagram: Source 1, Source 3, …, Source N are extracted into a Data Hub/Lake/Warehouse. Labels shown inside it: Source Copy, Clean Records, Appended records, Resolve Entities, Consolidated Nodes, Merge Entities, Resolved records.]
Technical Implementation
Graphs in Apache Spark
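+------------+--------+-------------+
|            | GraphX | GraphFrames |
+------------+--------+-------------+
| Python API |        | 👍          |
| Scala API  | 👍     | 👍          |
+------------+--------+-------------+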
With GraphFrames
Create nodes
• Add an id column to the dataframe of records
+---+------------+-----------+-----------+---------+----------+--------------+
| id| ssn| email| phone| address| DoB| Name|
+---+------------+-----------+-----------+---------+----------+--------------+
| 0| 714-12-4462| len@sma.ll| 6088881234| ...| 15/4/1937| Lennie Small |
| 1| 481-33-1024| geo@mil.tn| 6077654980| ...| 15/4/1937| Goerge Milton|
Identifiers Attributes
from pyspark.sql.functions import monotonically_increasing_id
nodes = records.withColumn("id", monotonically_increasing_id())
Edge creation
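from functools import reduce
from pyspark.sql.functions import col

match_cols = ["ssn", "email"]
# Mirror the node table (records plus id) with "_"-prefixed column names for a self-join
mirrorColNames = [f"_{c}" for c in nodes.columns]
mirror = nodes.toDF(*mirrorColNames)
# Two records match when their ids differ and any match column agrees
mcond = [col(c) == col(f"_{c}") for c in match_cols]
cond = [(col("id") != col("_id")) &
        reduce(lambda x, y: x | y, mcond)]
# GraphFrames expects the edge columns to be named "src" and "dst"
edges = nodes.join(mirror, cond).select(col("id").alias("src"), col("_id").alias("dst"))
cond:
[Column<b'((NOT (id = _id)) AND ((ssn = _ssn) OR (email = _email)))'>]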
Resolve entities and consolidation
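• Connected Components
import graphframes as gf
from pyspark.sql.functions import collect_set

graph = gf.GraphFrame(nodes, edges)
# connectedComponents() requires a checkpoint directory to be set
sc.setCheckpointDir("/tmp/checkpoints")
cc = graph.connectedComponents()
• Consolidate Components
# The result carries a "component" column that identifies each entity
entities = cc.groupBy("component").agg(collect_set("name"))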
With GraphX
Strongly Typed Scala
• Defining the schema of our data
import org.apache.spark.sql.types._

val record_schema = StructType( Seq(
  StructField(name = "id", dataType = LongType, nullable = false),
  StructField(name = "name", StringType, true),
  StructField(name = "email", StringType, true),
  StructField(name = "ssn", LongType, true),
  StructField(name = "attr", StringType, true)
))
Node creation
• Add an ID column to records
• Turn DataFrame into RDD
import org.apache.spark.graphx.{Edge, Graph, VertexId}
val nodesRDD = records.map(r => (r.getAs[VertexId]("id"), 1)).rdd
Edge creation
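import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.col

// Mirror the records with "_"-prefixed column names for a self-join
val mirrorColNames = for (c <- records.columns) yield "_" + c
val mirror = records.toDF(mirrorColNames: _*)
// Records match when their ids differ and any match column agrees
def conditions(matchCols: Seq[String]): Column = {
  col("id") =!= col("_id") &&
    matchCols.map(c => col(c) === col("_" + c)).reduce(_ || _)
}
val edges = records.join(mirror, conditions(Seq("ssn", "email")))
// Convert to an RDD before mapping, so that no Encoder for Edge is needed
val edgesRDD = edges
  .select("id", "_id")
  .rdd
  .map(r => Edge(r.getAs[VertexId](0), r.getAs[VertexId](1), null))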
Resolve entities and consolidation
• Connected Components
val graph = Graph(nodesRDD, edgesRDD)
val cc = graph.connectedComponents()
// Each vertex is labelled with the lowest vertex id in its component
val entities = cc.vertices.toDF()
• Consolidate Components
import org.apache.spark.sql.functions.collect_set

val resolved_records = records.join(entities, $"id" === $"_1")
val res_records = resolved_records
  .withColumnRenamed("_2", "e_id")
  .groupBy("e_id")
  .agg(collect_set($"name"))
Resolve operation
Columns to match: ["ssn", "email"]
Input: DataFrame
Output: DataFrame
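The resolve operation can be viewed as one function: records and a list of match columns go in, records tagged with an entity id come out. The sketch below shows that interface in plain Python on lists of dicts rather than Spark DataFrames. The function name `resolve`, the `entity_id` field, and rows 2 and 3 are hypothetical. It mirrors the self-join condition from the edge-creation step and labels each component with its smallest record id.

```python
def resolve(records, match_cols):
    """Return a copy of each record with an added 'entity_id' field.

    records: list of dicts, each with a unique integer 'id'.
    match_cols: field names; two records are linked when any of them agree.
    """
    # Edge creation: compare every record with every other record, mirroring
    # the self-join on (id != _id) AND (any match column equal).
    # Empty (None) values never match, as with NULLs in a SQL join.
    edges = [
        (a["id"], b["id"])
        for a in records
        for b in records
        if a["id"] != b["id"]
        and any(a.get(c) is not None and a.get(c) == b.get(c) for c in match_cols)
    ]
    # Connected components: propagate the smallest id along the edges
    # until no label changes any more.
    label = {r["id"]: r["id"] for r in records}
    changed = True
    while changed:
        changed = False
        for src, dst in edges:
            if label[src] < label[dst]:
                label[dst] = label[src]
                changed = True
    return [{**r, "entity_id": label[r["id"]]} for r in records]


rows = [
    {"id": 0, "ssn": "714-12-4462", "email": "len@sma.ll"},
    {"id": 1, "ssn": "481-33-1024", "email": "geo@mil.tn"},
    {"id": 2, "ssn": None, "email": "len@sma.ll"},    # hypothetical extra row
    {"id": 3, "ssn": "481-33-1024", "email": None},   # hypothetical extra row
]
resolved = resolve(rows, ["ssn", "email"])
```

The pairwise comparison is quadratic in the number of records, which is one way to see why edge creation dominates the cost.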
Evaluation
• Number of source records per entity
• Business logic:
– Conflicts (multiple SSNs)
• Distribution of matches
[Bar chart: "Entities by Nr of Source"; categories: in one source, in two sources, in three sources, in four sources; axis ticks from 0 to 300]
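Both evaluation checks are simple aggregations over the resolved records. The following sketch uses made-up data, and the tuple layout (entity id, source, ssn) is an assumption for illustration. It counts how many distinct sources each entity appears in and flags entities that carry more than one SSN.

```python
from collections import Counter, defaultdict

# Hypothetical resolved records: (entity id, source, ssn).
resolved = [
    (0, "Phone", "714-12-4462"),
    (0, "Gmail", "714-12-4462"),
    (0, "Outlook", None),
    (1, "Phone", "481-33-1024"),
    (1, "Gmail", "481-33-1025"),   # conflicting SSN inside one entity
    (2, "Outlook", None),
]

sources = defaultdict(set)
ssns = defaultdict(set)
for entity_id, source, ssn in resolved:
    sources[entity_id].add(source)
    if ssn is not None:
        ssns[entity_id].add(ssn)

# Distribution: how many entities were found in 1, 2, 3, ... sources.
entities_by_nr_of_sources = Counter(len(s) for s in sources.values())

# Business-logic check: an entity with more than one SSN is suspicious.
conflicts = sorted(e for e, values in ssns.items() if len(values) > 1)
```

A conflict such as entity 1 here suggests that an edge was created between records that do not belong together.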
Evolving Entity Resolution
Machine learning in Entity Resolution
• Pairwise comparison
– String matching / distance measures
– Incorporate temporal data into edge creation
[Figure: the records "H Muiisch, Harry.Mulish@gmail.com" and "1, Harry Mulisch, harry@mulisch.nl" are compared pairwise; the result is either a binary decision { 1, 0 } or a probability, P(match)=0.8762]
• Edge creation is the most computationally heavy step
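A pairwise comparison based on string similarity might look like the sketch below. It is not the measure used in the talk: it relies on `difflib.SequenceMatcher` from the Python standard library, and the averaging over fields and the threshold of 0.8 are assumptions chosen for illustration. The function returns both forms mentioned on the slide, a binary decision and a score.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Similarity in [0, 1] between two strings, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def is_match(rec_a, rec_b, threshold=0.8):
    """Binary edge decision plus the underlying score.

    The score is the mean similarity over the compared fields; both the
    field list and the threshold are assumptions of this sketch.
    """
    fields = ("name", "email")
    score = sum(similarity(rec_a[f], rec_b[f]) for f in fields) / len(fields)
    return score >= threshold, score


a = {"name": "H Muiisch", "email": "Harry.Mulish@gmail.com"}
b = {"name": "Harry Mulisch", "email": "harry@mulisch.nl"}
matched, score = is_match(a, b)
```

Replacing exact equality with a score like this makes the edge condition tolerant of typos, at the price of a more expensive comparison for every candidate pair.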
Machine learning in Entity Resolution
• Structuring connected Data
• Partitioning of the graph based on clustering of records
• Using weighted edges and learning a classifier to evaluate links between records
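Weighted edges change the resolution step: only links that a classifier rates highly enough should join two records. The sketch below assumes the weights are match probabilities that some classifier has already produced. The edge list, the thresholds, and the function name `components` are hypothetical. It drops weak edges and then collects connected components with a breadth-first search.

```python
from collections import defaultdict, deque


def components(nodes, weighted_edges, threshold):
    """Connected components using only edges whose weight >= threshold."""
    adjacency = defaultdict(set)
    for src, dst, weight in weighted_edges:
        if weight >= threshold:
            adjacency[src].add(dst)
            adjacency[dst].add(src)
    seen, result = set(), []
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        queue, members = deque([start]), []
        while queue:  # breadth-first search from the start node
            node = queue.popleft()
            members.append(node)
            for neighbour in adjacency[node] - seen:
                seen.add(neighbour)
                queue.append(neighbour)
        result.append(sorted(members))
    return result


# Hypothetical match probabilities between record ids.
edges = [(1, 3, 0.95), (1, 4, 0.90), (3, 7, 0.88), (2, 5, 0.60), (5, 6, 0.10)]
strict = components(range(1, 8), edges, threshold=0.8)
loose = components(range(1, 8), edges, threshold=0.5)
```

With the strict threshold records 2 and 5 stay separate, and with the loose one they merge, so the threshold trades missed matches against false merges.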
Feeding a knowledge graph
Human Interface:
• Analytics
• Forensics
• Discovery
• Iterative Improvements:
• Data Quality
• Contextual Information
• Use case driven
Get started yourself
• GitHub Project: Resolver & Notebook:
– https://github.com/hendrikfrentrup/maps-meaning
• Docker container with pySpark & GraphFrames:
– https://hub.docker.com/r/hendrikfrentrup/pyspark-graphframes
Key Takeaways
• Data pipeline coalesces into a single record table
• Connected Components at the core of resolving
• Edge creation is the expensive operation
• Batch operation over a single corpus
Thanks!
Any questions?
Comments?
Observations?
Maps and Meaning: Graph-based Entity Resolution in Apache Spark & GraphX
