WIFI SSID:Spark+AISummit | Password: UnifiedDataAnalytics
Hendrik Frentrup, systemati.co
Maps and Meaning
Graph-based Entity Resolution
#UnifiedDataAnalytics #SparkAISummit
Source: Jordi Guzmán (creative commons)
Data is the new oil
Building Value Streams
Source: Malcolm Manners (creative commons)
Data Extraction
Data Refining
Data Warehousing
Data Pipeline
Source 1
Source 3
Source N
…
Visualisation
Presentation
Dashboards
Machine Learning
Statistical Analysis
Inference
Predictions
Data Extraction
Transformation
Integration
Data Modelling
Upstream integrations
Source 1
Source 3
Source N
…
First Order Transformation:
• Deduplication -> df.distinct()
• Transformations -> df.withColumn(col, expr(col))
• Mapping -> df.withColumnRenamed(old, new)
Second Order Transformation:
• Denormalisation -> lhs.join(rhs, key)
Nth Order Transformation:
• Merge N Sources -> Entity Resolution
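The transformation orders above can be sketched in plain Python, with toy dicts standing in for DataFrames (the data and names here are illustrative, not the talk's Spark code):

```python
# Plain-Python stand-ins for the transformation orders above.
rows = [{"name": "Harry"}, {"name": "Harry"}, {"name": "Sten"}]

# First order, deduplication (cf. df.distinct())
deduped = [dict(t) for t in {tuple(sorted(r.items())) for r in rows}]

# First order, mapping (cf. df.withColumnRenamed(old, new))
renamed = [{"first_name": r["name"]} for r in deduped]

# Second order, denormalisation (cf. lhs.join(rhs, key)):
# join a second "table" of emails on the name key
emails = {"Harry": "harry@mulisch.nl"}
joined = [{**r, "email": emails.get(r["first_name"])} for r in renamed]
```

Nth-order entity resolution has no one-liner equivalent, which is exactly why the rest of the talk is about it.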
Outline
• Motivation
• Entity Resolution Example
• Graph-based Entity Resolution Algorithm
• Data Pipeline Architecture
• Implementation
– In GraphFrames (Python API)
– In GraphX (Scala API)
• The Role of Machine Learning in Entity Resolution
Example: Find Duplicates
• Merge records in your Address Book
| ID | First Name | Last Name | Email                  | Mobile Number | Phone Number |
|  1 | Harry      | Mulisch   | harry@mulisch.nl       | +31 101 1001  |              |
|  2 | HKV        | Mulisch   | Harry.Mulish@gmail.com |               | +31 666 7777 |
|  3 |            |           | author@heaven.nl       | +31 101 1001  |              |
|  4 | Harry      | Mulisch   |                        | +31 123 4567  | +31 666 7777 |

| ID | First Name | Last Name | Email                                                      | Mobile Number              | Phone Number |
|  1 | Harry/HKV  | Mulisch   | harry@mulisch.nl, Harry.Mulish@gmail.com, author@heaven.nl | +31 101 1001, +31 123 4567 | +31 666 7777 |
…such as Google Contacts
| ID | First Name | Last Name | Email                  | Mobile Number | Phone Number | Source  |
|  1 | Harry      | Mulisch   | harry@mulisch.nl       | +31 101 1001  |              | Phone   |
|  2 | S          | Nadolny   |                        | +49 899 9898  |              | Phone   |
|  3 | Harry      | Mulisch   |                        | +31 123 4567  | +31 666 7777 | Phone   |
|  4 |            |           | author@heaven.nl       | +31 101 1001  |              | Gmail   |
|  5 | Sten       | Nadolny   | sten@slow.de           | +49 899 9898  |              | Gmail   |
|  6 | Max        | Frisch    | max@andorra.ch         |               |              | Outlook |
|  7 | HKV        |           | Harry.Mulish@gmail.com |               | +31 666 7777 | Outlook |
Example: Resolving records
Graph Algorithm Walkthrough
[Graph figure: the seven source records drawn as nodes 1-7, each labelled with its name, email and phone values, with no edges yet]
• Each record is a node
• Create edges based on similarities
• Collect connected nodes
• Consolidate information in records
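The four steps above can be run end-to-end in a minimal, non-Spark sketch over the seven example records (here an exact shared value stands in for the similarity rules; a real pipeline would match per column):

```python
# Each record is a node; its attribute values are kept as a set.
records = {
    1: {"Harry Mulisch", "harry@mulisch.nl", "+31 101 1001"},
    2: {"S Nadolny", "+49 899 9898"},
    3: {"Harry Mulisch", "+31 123 4567", "+31 666 7777"},
    4: {"author@heaven.nl", "+31 101 1001"},
    5: {"Sten Nadolny", "sten@slow.de", "+49 899 9898"},
    6: {"Max Frisch", "max@andorra.ch"},
    7: {"HKV", "Harry.Mulish@gmail.com", "+31 666 7777"},
}

# Union-find for connected components.
parent = {i: i for i in records}

def find(i):
    while parent[i] != i:
        parent[i] = parent[parent[i]]  # path compression
        i = parent[i]
    return i

def union(i, j):
    parent[find(i)] = find(j)

# Create edges based on similarities (here: any exact shared value).
ids = sorted(records)
for a in ids:
    for b in ids:
        if a < b and records[a] & records[b]:
            union(a, b)

# Collect connected nodes and consolidate their information.
entities = {}
for i in ids:
    entities.setdefault(find(i), set()).update(records[i])
```

This resolves the seven records into three entities: Harry Mulisch/HKV, Sten/S Nadolny and Max Frisch.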
Copyright 2019 © systemati.co
[Graph figure: the connected nodes collapsed into three resolved entities: "Harry Mulisch/HKV" with all emails and numbers from records 1, 3, 4 and 7, "Sten/S Nadolny" from records 2 and 5, and "Max Frisch" from record 6]
• Each record is a node
• Create edges based on similarities
• Collect connected nodes
• Consolidate information in records
Entity Resolution Pipeline
Architecture
[Pipeline diagram: Source 1 ... Source N are extracted as source copies into the Data Hub/Lake/Warehouse; cleaned records are appended, the Resolve Entities step produces consolidated nodes, and the Merge Entities step yields resolved records]
Technical Implementation
Graphs in Apache Spark
|            | GraphX | GraphFrames |
| Python API |        | 👍          |
| Scala API  | 👍     | 👍          |
With GraphFrames
Create nodes
• Add an id column to the dataframe of records
+---+------------+-----------+-----------+---------+----------+--------------+
| id| ssn| email| phone| address| DoB| Name|
+---+------------+-----------+-----------+---------+----------+--------------+
| 0| 714-12-4462| len@sma.ll| 6088881234| ...| 15/4/1937| Lennie Small |
| 1| 481-33-1024| geo@mil.tn| 6077654980| ...| 15/4/1937| Goerge Milton|
Identifiers Attributes
from pyspark.sql.functions import monotonically_increasing_id
nodes = records.withColumn("id", monotonically_increasing_id())
Edge creation
from functools import reduce
from pyspark.sql.functions import col

match_cols = ["ssn", "email"]
mirrorColNames = [f"_{col}" for col in records.columns]
mirror = records.toDF(*mirrorColNames)
mcond = [col(c) == col(f"_{c}") for c in match_cols]
cond = [(col("id") != col("_id")) &
        reduce(lambda x, y: x | y, mcond)]
edges = records.join(mirror, cond)
cond:
[Column<b'((NOT (id = _id)) AND (((ssn = _ssn) OR (email = _email))
Resolve entities and consolidation
• Connected Components
graph = gf.GraphFrame(nodes, edges)
sc.setCheckpointDir("/tmp/checkpoints")
cc = graph.connectedComponents()
entities = cc.groupBy("component").agg(collect_set("name"))
• Consolidate Components
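The slide gives no code for the consolidation step; one plain-Python way to sketch it (toy rows with a `component` column, mirroring the output of `connectedComponents()`) is to collect the distinct values of every column per component:

```python
# Toy resolved rows; "component" plays the role of the entity id
# assigned by connected components.
rows = [
    {"component": 0, "name": "Harry Mulisch", "email": "harry@mulisch.nl"},
    {"component": 0, "name": "HKV", "email": "Harry.Mulish@gmail.com"},
    {"component": 1, "name": "Max Frisch", "email": "max@andorra.ch"},
]

# Group by component and collect the set of values seen in each column.
consolidated = {}
for row in rows:
    entity = consolidated.setdefault(row["component"], {})
    for column, value in row.items():
        if column != "component":
            entity.setdefault(column, set()).add(value)
```

In Spark this corresponds to a groupBy on the component id with a collect_set aggregation per attribute column.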
With GraphX
Strongly Typed Scala
• Defining the schema of our data
val record_schema = StructType(Seq(
  StructField("id", LongType, nullable = false),
  StructField("name", StringType, nullable = true),
  StructField("email", StringType, nullable = true),
  StructField("ssn", LongType, nullable = true),
  StructField("attr", StringType, nullable = true)
))
Node creation
• Add an ID column to records
• Turn DataFrame into RDD
val nodesRDD = records.map(r => (r.getAs[VertexId]("id"), 1)).rdd
Edge creation
val mirrorColNames = for (col <- records.columns) yield "_"+col.toString
val mirror = records.toDF(mirrorColNames: _*)
def conditions(matchCols: Seq[String]): Column = {
col("id")=!=col("_id") &&
matchCols.map(c => col(c)===col("_"+c)).reduce(_ || _)
}
val edges = records.join(mirror, conditions(Seq("ssn", "email")))
val edgesRDD = edges
.select("id","_id")
.map(r => Edge(r.getAs[VertexId](0),r.getAs[VertexId](1),null))
.rdd
Resolve entities and consolidation
• Connected Components
val graph = Graph(nodesRDD, edgesRDD)
val cc = graph.connectedComponents()
val entities = cc.vertices.toDF()
val resolved_records = records.join(entities, $"id"===$"_1")
val res_records = resolved_records
.withColumnRenamed("_2", "e_id")
.groupBy("e_id")
.agg(collect_set($"name"))
• Consolidate Components
Resolve operation
• Input: DataFrame
• Columns to match: ["ssn", "email"]
• Output: DataFrame
Evaluation
• Number of source records per entity
• Business logic:
– Conflicts (multiple SSNs)
• Distribution of matches
[Bar chart: "Entities by Nr of Sources", counting how many resolved entities were found in one, two, three or four sources, y-axis from 0 to 300]
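The entities-by-number-of-sources evaluation can be computed directly from the resolved records; this sketch uses toy entity assignments taken from the address-book example (the mapping is illustrative):

```python
from collections import Counter

# Which source each record came from, and which entity it resolved to.
source_of = {1: "Phone", 2: "Phone", 3: "Phone", 4: "Gmail",
             5: "Gmail", 6: "Outlook", 7: "Outlook"}
entity_of = {1: "harry", 3: "harry", 4: "harry", 7: "harry",
             2: "sten", 5: "sten", 6: "max"}

# Distinct sources contributing to each entity.
sources_per_entity = {}
for record_id, entity in entity_of.items():
    sources_per_entity.setdefault(entity, set()).add(source_of[record_id])

# Distribution: how many entities appear in 1, 2, 3, ... sources.
distribution = Counter(len(s) for s in sources_per_entity.values())
```

Skewed distributions (or entities carrying conflicting identifiers such as multiple SSNs) are the signal to revisit the matching rules.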
Evolving
Entity Resolution
Machine learning in Entity Resolution
• Pairwise comparison
  – String matching / distance measures
  – Incorporate temporal data into edge creation
  Example: comparing record 1 ("Harry Mulisch", harry@mulisch.nl) against ("H Muiisch", Harry.Mulish@gmail.com) can yield either a binary label { 1, 0 } or a score such as P(match)=0.8762
• Edge creation is the most computationally heavy step
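For instance, a fuzzy string score can replace exact equality when creating candidate edges; this sketch uses the standard library's difflib as one possible distance measure (the threshold is an assumption for illustration, not the model behind the P(match) figure above):

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    # Normalised similarity in [0, 1]; 1.0 means identical after lowercasing.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

score = similarity("Harry Mulisch", "H Muiisch")
is_match = score > 0.6  # tunable threshold, an assumption
```

Since this scoring runs over candidate pairs, blocking or partitioning the comparison space matters as much as the measure itself.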
Machine learning in Entity Resolution
• Structuring connected Data
• Partitioning of the graph based on clustering of records
• Using weighted edges and learning a classifier to evaluate links between records
Feeding a knowledge graph
Human Interface:
• Analytics
• Forensics
• Discovery
Iterative Improvements:
• Data Quality
• Contextual Information
• Use case driven
Get started yourself
• GitHub Project: Resolver & Notebook:
– https://github.com/hendrikfrentrup/maps-meaning
• Docker container with pySpark & GraphFrames:
– https://hub.docker.com/r/hendrikfrentrup/pyspark-graphframes
Key Takeaways
• Data pipeline coalesces into a single record table
• Connected Components at the core of resolving
• Edge creation is the expensive operation
• Batch operation over a single corpus
Thanks!
Any questions?
Comments?
Observations?
DON’T FORGET TO RATE
AND REVIEW THE SESSIONS
SEARCH SPARK + AI SUMMIT
