Think ahead. Act now.
Managing and querying datasets with Data Factory, Cosmos DB and Azure Functions.
Marc Duiker
The Case
A murder took place in New York City on 29th Jan 2014.
The suspect most likely escaped by using a taxi.
It’s our job to find out which taxi the suspect could have used.
Original data sources
NYPD Complaint Data 2014
• https://data.cityofnewyork.us/Public-Safety/NYPD-Complaint-Data-Historic/qgea-i56i
• 5.6M rows, 1.3 GB
NYC Taxi Trip Data 2014
• https://www.kaggle.com/kentonnlp/2014-new-york-city-taxi-trips
• 15M rows, 2.3 GB
NYPD Complaint Data (trimmed down)
CSV with 39k records for Jan 2014
Attributes
• Date & time
• Offense classification (KY_CD)
• Latitude & longitude
• …
More details in NYPD_Complaint_Data_Column_Descriptions.csv
NYC Taxi Trip Data (trimmed down)
CSV with 477k records for 29th Jan 2014
Attributes
• Pickup date & time
• Pickup latitude & longitude
• …
What can we use on Azure?
• Transfer (CSV): Data Factory
• Storage: Cosmos DB
• Query: Azure Functions
Cosmos DB
Cosmos DB: databases & collections
• Account (gabc-nyc-db), SQL API
  • Database (nycdatabase)
    • Collection (complaints): documents { .. }
    • Collection (taxitrips): documents { .. }
Cosmos DB: GeoJSON
https://docs.microsoft.com/en-us/azure/cosmos-db/geospatial
Azure Cosmos DB supports indexing and querying of
geospatial point data that's represented using
the GeoJSON specification.
{
"type":"Point",
"coordinates":[-73.88, 40.76]
}
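The CSV files store latitude and longitude as separate columns, so each document needs a GeoJSON point before Cosmos DB can index it geospatially. The following is a minimal sketch of that conversion, not the code used in the labs. The field names pickup_latitude, pickup_longitude and pickup_location are assumptions made for illustration. Note that GeoJSON puts longitude first, as in the example above.

```python
from typing import Any


def to_geojson_point(latitude: float, longitude: float) -> dict[str, Any]:
    """Build a GeoJSON Point; GeoJSON lists longitude first, then latitude."""
    if not -90.0 <= latitude <= 90.0:
        raise ValueError(f"latitude out of range: {latitude}")
    if not -180.0 <= longitude <= 180.0:
        raise ValueError(f"longitude out of range: {longitude}")
    return {"type": "Point", "coordinates": [longitude, latitude]}


def add_pickup_location(trip: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of a taxi trip document with a GeoJSON pickup location.

    The field names pickup_latitude, pickup_longitude and pickup_location
    are hypothetical; adjust them to the actual column names.
    """
    updated = dict(trip)
    updated["pickup_location"] = to_geojson_point(
        float(trip["pickup_latitude"]), float(trip["pickup_longitude"])
    )
    return updated


if __name__ == "__main__":
    # Sample values only; CSV cells arrive as strings.
    trip = {"id": "1", "pickup_latitude": "40.76", "pickup_longitude": "-73.88"}
    print(add_pickup_location(trip)["pickup_location"])
```

The range checks catch values that cannot be coordinates at all, but they cannot detect swapped arguments when both numbers are valid in either position.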
Cosmos DB: Change feed
• Cosmos DB records inserts and updates to documents as events in the change feed.
Azure Data Factory (v2)
Source → Sink
• Mapping
• Transforms
• Scheduling
• Throughput
Data Factory: source schema
• When importing a CSV, Data Factory looks at the first line of data to determine the data types.
• Inspect the data for empty and numeric values:
  • “” → empty value for String, Int64 or Double?
  • 0 → Int64 or Double?
• Sometimes numbers are categories (String).
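To see why the first line alone can mislead, it helps to infer the types over every row and compare. The sketch below is a plain Python illustration of that inspection step and does not reproduce Data Factory's own logic. The sample values and the force_string parameter are assumptions; KY_CD, Latitude and Longitude stand in for the columns mentioned earlier.

```python
import csv
import io
from typing import Optional

# Widening order: an Int64 column becomes Double, and anything becomes String.
TYPE_RANK = {"Int64": 0, "Double": 1, "String": 2}


def infer_value_type(value: str) -> Optional[str]:
    """Classify one CSV cell as Int64, Double or String; None when empty."""
    text = value.strip()
    if text == "":
        return None  # an empty cell says nothing about the column type
    try:
        int(text)
        return "Int64"
    except ValueError:
        pass
    try:
        float(text)
        return "Double"
    except ValueError:
        return "String"


def infer_column_types(
    csv_text: str, force_string: frozenset[str] = frozenset()
) -> dict[str, str]:
    """Infer one type per column by widening over every row, not just the first."""
    reader = csv.DictReader(io.StringIO(csv_text))
    inferred: dict[str, str] = {}
    for row in reader:
        for column, value in row.items():
            if column is None:
                continue  # surplus cells beyond the header row
            found = infer_value_type(value or "")
            if found is None:
                continue
            current = inferred.get(column)
            if current is None or TYPE_RANK[found] > TYPE_RANK[current]:
                inferred[column] = found
    for column in reader.fieldnames or []:
        # Columns with no values at all, and category codes, are kept as String.
        if column in force_string or column not in inferred:
            inferred[column] = "String"
    return inferred


# Hypothetical sample: the first data row has an empty KY_CD and an integer-looking latitude.
SAMPLE = "KY_CD,Latitude,Longitude\n,40,-73.88\n341,40.76,-74\n"

if __name__ == "__main__":
    print(infer_column_types(SAMPLE, force_string=frozenset({"KY_CD"})))
```

In the sample, the first row would suggest that Latitude is an integer and would give no hint for KY_CD. Scanning all rows widens Latitude to Double, and force_string keeps the offense code as a category.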
Azure Data Factory
https://datafactoryv2.azure.com/
Azure Functions (Runtime 2)
• Serverless compute service to run code on demand
• Support for various languages, such as C#, F#, JavaScript (Node.js) and Java
• Automatic scaling
• Pay-per-use
Azure Functions Triggers
Integrating the services
• NYPD Complaint CSV and NYC Taxi Trip CSV → Data Factory pipelines → Cosmos DB collections
• Azure Functions: UpdateTaxiTripGeoData, GetComplaintsByOffenseId, GetTaxiTripsWithinRange
• Covered in Lab 1 and Lab 2
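The two query functions named above can be pictured as a filter by offense code followed by a search for taxi pickups close to the crime scene in both space and time. The sketch below runs that logic in memory over plain dictionaries, assuming documents that already carry GeoJSON points. It is not the lab implementation: in Cosmos DB the distance test would typically be expressed in a query with the ST_DISTANCE function, for which the haversine formula here is only a spherical approximation. All field names, the function signatures and the sample documents are assumptions.

```python
import math
from datetime import datetime, timedelta
from typing import Any

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in metres


def distance_m(point_a: dict[str, Any], point_b: dict[str, Any]) -> float:
    """Great-circle (haversine) distance in metres between two GeoJSON Points."""
    lon_a, lat_a = point_a["coordinates"]
    lon_b, lat_b = point_b["coordinates"]
    phi_a, phi_b = math.radians(lat_a), math.radians(lat_b)
    d_phi = phi_b - phi_a
    d_lambda = math.radians(lon_b - lon_a)
    h = (
        math.sin(d_phi / 2) ** 2
        + math.cos(phi_a) * math.cos(phi_b) * math.sin(d_lambda / 2) ** 2
    )
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(h))


def complaints_by_offense(complaints: list[dict], ky_cd: str) -> list[dict]:
    """Keep the complaints whose offense classification matches ky_cd."""
    return [c for c in complaints if c["KY_CD"] == ky_cd]


def taxi_trips_within_range(
    trips: list[dict],
    origin: dict[str, Any],
    occurred_at: datetime,
    max_distance_m: float,
    max_delay: timedelta,
) -> list[dict]:
    """Trips picked up near origin, between occurred_at and occurred_at + max_delay."""
    latest = occurred_at + max_delay
    return [
        t
        for t in trips
        if occurred_at <= datetime.fromisoformat(t["pickup_datetime"]) <= latest
        and distance_m(t["pickup_location"], origin) <= max_distance_m
    ]


def point(lon: float, lat: float) -> dict[str, Any]:
    """Small helper for the sample data below."""
    return {"type": "Point", "coordinates": [lon, lat]}


# Hypothetical sample documents, not taken from the real datasets.
COMPLAINTS = [
    {"id": "c1", "KY_CD": "101", "occurred": "2014-01-29T10:00:00", "location": point(-73.88, 40.76)},
    {"id": "c2", "KY_CD": "341", "occurred": "2014-01-29T09:00:00", "location": point(-73.95, 40.70)},
]
TRIPS = [
    {"id": "A", "pickup_datetime": "2014-01-29T10:05:00", "pickup_location": point(-73.881, 40.7605)},
    {"id": "B", "pickup_datetime": "2014-01-29T10:03:00", "pickup_location": point(-73.98, 40.75)},
    {"id": "C", "pickup_datetime": "2014-01-29T12:00:00", "pickup_location": point(-73.88, 40.76)},
]

if __name__ == "__main__":
    for complaint in complaints_by_offense(COMPLAINTS, "101"):
        candidates = taxi_trips_within_range(
            TRIPS,
            complaint["location"],
            datetime.fromisoformat(complaint["occurred"]),
            max_distance_m=500,
            max_delay=timedelta(minutes=15),
        )
        print(complaint["id"], [t["id"] for t in candidates])
```

The 500 m radius and 15 minute window are arbitrary starting values; widening either one trades a shorter list of candidate taxis for a lower chance of missing the right one.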
Hands-on labs
https://github.com/XpiritBV/GABC2018_HandsOnLabs/
