Analyzing StackExchange data with Azure Data Lake
Tom Kerkhove
http://www.twitter.com/TomKerkhove
https://be.linkedin.com/in/tomkerkhove
Nice to meet you
Tom KERKHOVE
➔ Integration Professional
➔ IoT Competency Lead
➔ Windows Development &
Microsoft Azure MVP
tom.kerkhove@codit.eu
+32 473 701 074
@TomKerkhove
be.linkedin.com/in/tomkerkhove
github.com/tomkerkhove
Agenda
• Why should we care about Big Data?
• Big Data in Azure
• Azure Data Lake
• Demo
• Q & A
Integration of Things
Internet of Things
Connect and scale
with efficiency
Analyze and act
on new data
Integrate and transform
business processes
Event producers & gateways → Ingestion & transformation → Report, Act, Predict
Microsoft Patterns & Practices – IoT Journey
Cluster Management
Languages
Azure services overview (diagram)
Groups shown: Platform Services, Infrastructure Services, Security & Management
Services shown: Web Apps, Mobile Apps, API Management, API Apps, Logic Apps, Notification Hubs, Content Delivery Network (CDN), Media Services, BizTalk Services, Hybrid Connections, Service Bus, Storage Queues, Hybrid Operations, Backup, StorSimple, Azure Site Recovery, Import/Export, SQL Database, DocumentDB, Redis Cache, Azure Search, Storage Tables, Data Warehouse, Azure AD Health Monitoring, AD Privileged Identity Management, Operational Analytics, Cloud Services, Batch, RemoteApp, Service Fabric, Visual Studio, App Insights, Azure SDK, VS Online, Domain Services, HDInsight, Machine Learning, Stream Analytics, Data Factory, Event Hubs, Mobile Engagement, Data Lake, IoT Hub, Data Catalog, Azure Active Directory, Multi-Factor Authentication, Automation, Portal, Key Vault, Store/Marketplace, VM Image Gallery & VM Depot, Azure AD B2C, Scheduler
Overview in Azure
Categories: Data Ingestion, Data Storage, Data Pipelines, Machine Learning, Data Analytics
Services shown: Event Hubs, IoT Hub, Storage, SQL Database, SQL Data Warehouse, DocumentDB, Data Lake (Store & Analytics), Data Factory, Stream Analytics, HDInsight, Virtual Machine
Cortana Analytics Suite
Analysing Big Data in Azure
Azure Data Lake Family
➔ Data Lake Store
  • Unlimited storage
  • WebHDFS Store
➔ HDInsight
  • Managed cluster service
  • Open-source technology
  • Runs on Windows or Linux
➔ Data Lake Analytics
  • Managed job service
  • U-SQL batch-processing
Azure Data Lake Store
➔ WebHDFS compatible
➔ Any size
➔ Any format as-is
➔ Write-once-read-many
➔ Enterprise-grade security
➔ The big data store in Azure
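WebHDFS compatibility means the store can be addressed with the REST operations defined by Hadoop's WebHDFS API (for example `LISTSTATUS` and `OPEN`). The sketch below only builds such request URLs; it is a minimal illustration, not a client. The host pattern `<account>.azuredatalakestore.net`, the account name `contoso`, and the file path are assumptions made for this example, and a real request would also need an Azure Active Directory bearer token, which is not shown.

```python
from urllib.parse import quote, urlencode


def webhdfs_url(account, path, operation, **params):
    """Build a WebHDFS-style request URL for a store account.

    The host pattern is an assumption for illustration; the /webhdfs/v1/
    prefix and the op= query parameter follow the WebHDFS REST convention.
    """
    host = f"{account}.azuredatalakestore.net"
    query = urlencode({"op": operation, **params})
    # quote() keeps "/" intact and escapes characters that are unsafe in a URL path.
    return f"https://{host}/webhdfs/v1/{quote(path.lstrip('/'))}?{query}"


# Hypothetical account and paths, used only to show the URL shape.
LIST_URL = webhdfs_url("contoso", "/stackexchange", "LISTSTATUS")
READ_URL = webhdfs_url("contoso", "/stackexchange/Posts.xml", "OPEN", offset=0, length=1024)

if __name__ == "__main__":
    print(LIST_URL)
    print(READ_URL)
```

Because the interface is plain HTTP, any tool that already speaks WebHDFS can in principle work against the store without a dedicated SDK.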
Data Lake vs Data Warehousing
Characteristics
➔ Data Warehousing
  • Structured data
  • Defined set of schemas
  • Requires Extract-Transform-Load (ETL) before storing
  • Familiar to some of us
  • Exploratory analysis is hard because the data has already been transformed
➔ Data Lake
  • Raw data (unstructured/semi-structured/structured)
  • “Dump” all your data in the lake
  • Data scientists will interpret data from the lake
  • Without metadata, it turns into a data swamp pretty fast
Martin Fowler on Data Lake & Data Warehouses (link)
Azure Data Lake Analytics
➔ Run analytics jobs on managed clusters
➔ Don’t worry about scale
➔ Written in U-SQL
  • SQL Syntax
  • Extensibility in C#
➔ Easily scaled with Analytics Units
➔ Pay for processing time only
Writing U-SQL scripts
➔ Extract from a data source by using built-in or custom extractors
➔ Transform / analyse the data using SQL syntax, in-line C# or C# method calls
➔ Output the result to a data source by using built-in or custom outputters
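The three stages of a script (extract, transform, output) can be mimicked outside U-SQL to make the shape of a job concrete. The following is a minimal Python sketch of that pattern, not U-SQL and not the code used in the demo; the column names `Site` and `PostId` and the sample rows are invented for illustration.

```python
import csv
import io
from collections import Counter


def extract(source):
    """Extract: parse CSV lines into row dicts (stand-in for an extractor)."""
    return list(csv.DictReader(source))


def transform(rows):
    """Transform: count posts per site, most active site first."""
    counts = Counter(row["Site"] for row in rows)
    return counts.most_common()


def output(results, target):
    """Output: write the aggregated rows as CSV (stand-in for an outputter)."""
    writer = csv.writer(target, lineterminator="\n")
    writer.writerow(["Site", "PostCount"])
    writer.writerows(results)


# Hypothetical sample data; the columns are assumptions made for this sketch.
SAMPLE = "Site,PostId\nstackoverflow,1\nserverfault,2\nstackoverflow,3\n"


def run(sample=SAMPLE):
    """Chain the three stages and return the resulting CSV text."""
    target = io.StringIO()
    output(transform(extract(io.StringIO(sample))), target)
    return target.getvalue()


if __name__ == "__main__":
    print(run())
```

In a real job each stage is declarative and the service decides how to distribute the work; the sketch only shows how data flows from one stage to the next.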
Data Lake Analytics - Data Sources
U-SQL queries in Azure Data Lake Analytics can run against:
➔ Azure Storage Blobs
➔ Azure Data Lake Store
➔ Azure SQL Database
➔ Azure SQL Data Warehouse
➔ Azure SQL in VMs
Meet StackExchange
➔ Over 280 sub-sites
➔ 150+ GB of open-source data
➔ Different kinds of data
  • Posts
  • Users
  • Votes
  • ...
➔ A big data sample data set
What Are We Going To Do?
➔ Acquiring The Data
  • Downloading the original data set
➔ Moving The Data
  • Upload data set to Azure
  • Determine what service to use
➔ Aggregating The Data
  • Merging data from each site into one file
  • Conversion from XML to CSV
➔ Analyzing The Data
  • Run business logic on it
  • Attempt to gain knowledge from it
➔ Visualizing The Data
  • Visualize what we’ve learned
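The aggregation step (merging the per-site files and converting XML to CSV) can be sketched in a few lines. This is a hedged illustration and not the tooling used in the demo: it assumes each dump file holds `<row>` elements whose fields are XML attributes, and the site names, column list, and sample rows are made up for the example.

```python
import csv
import io
import xml.etree.ElementTree as ET


def rows_to_csv(xml_text, site, columns, target, write_header=True):
    """Append the <row> elements of one site's XML dump to a CSV target."""
    writer = csv.writer(target, lineterminator="\n")
    if write_header:
        writer.writerow(["Site"] + columns)
    root = ET.fromstring(xml_text)
    for row in root.iter("row"):
        # Missing attributes become empty cells so every line has the same width.
        writer.writerow([site] + [row.get(col, "") for col in columns])


def merge_sites(dumps, columns):
    """Merge several (site, xml_text) dumps into a single CSV string."""
    target = io.StringIO()
    for index, (site, xml_text) in enumerate(dumps):
        rows_to_csv(xml_text, site, columns, target, write_header=(index == 0))
    return target.getvalue()


# Hypothetical miniature dumps standing in for the real per-site files.
DUMPS = [
    ("stackoverflow", '<users><row Id="1" DisplayName="Alice" Reputation="101" /></users>'),
    ("serverfault", '<users><row Id="7" DisplayName="Bob" /></users>'),
]

if __name__ == "__main__":
    print(merge_sites(DUMPS, ["Id", "DisplayName", "Reputation"]))
```

Adding the site name as a column keeps the origin of each row after the merge. For files of many gigabytes, a streaming parser such as `ET.iterparse` would be a better fit than loading a whole document at once.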
Azure Data Lake tools for Visual Studio
➔ Projects / Solutions / Source control
➔ Store Explorer
  • Browse store
  • Download complete / subset of file
  • Preview
➔ Job Visualizer
  • Determine bottlenecks by using heatmaps
  • Playback jobs based on telemetry
➔ Query optimization
➔ Job Profiler
➔ Off-Line execution
Integration with Azure Services
➔ Integrate in your data pipelines in Azure Data Factory
  • Move data from Azure Data Lake Store to other stores
  • Move data to Azure Data Lake Store
  • Run U-SQL query within pipeline
➔ Integration with Azure Data Catalog
  • Register your Azure Data Lake Store assets
Pricing
➔ Data Lake Store
  • $0.08 per GB stored per month
  • $0.14 per 1M transactions (1 transaction is a block of up to 128 kB)
  • Egress will be billed, but the price is not known yet
➔ Data Lake Analytics
  • $0.05 per job
  • $0.05 per minute per Analytics Unit for processing time
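A short worked calculation shows how these prices combine. The sketch below uses only the figures quoted on the slide; treating "128 kB" as 128 × 1024 bytes, adding the per-job fee on top of the processing-time charge, and the example quantities (150 GB stored, 2 million transactions, 10 Analytics Units for 6 minutes) are assumptions made for illustration. Egress is left out because the slide gives no price for it.

```python
import math

# Prices as quoted on the slide (USD).
STORE_PER_GB_MONTH = 0.08
STORE_PER_MILLION_TRANSACTIONS = 0.14
ANALYTICS_PER_JOB = 0.05
ANALYTICS_PER_AU_MINUTE = 0.05

# Assumption: "128 kB" is read as 128 * 1024 bytes.
TRANSACTION_BLOCK_BYTES = 128 * 1024


def transactions_for(bytes_moved):
    """Number of billable transactions needed to move the given number of bytes."""
    return math.ceil(bytes_moved / TRANSACTION_BLOCK_BYTES)


def store_cost(gb_stored, transactions):
    """Monthly Data Lake Store cost: storage plus transactions (egress excluded)."""
    storage = gb_stored * STORE_PER_GB_MONTH
    traffic = transactions / 1_000_000 * STORE_PER_MILLION_TRANSACTIONS
    return storage + traffic


def analytics_cost(jobs, analytics_units, minutes):
    """Data Lake Analytics cost: per-job fee plus processing time per Analytics Unit."""
    return jobs * ANALYTICS_PER_JOB + analytics_units * minutes * ANALYTICS_PER_AU_MINUTE


if __name__ == "__main__":
    # Hypothetical usage: 150 GB stored, 2 million transactions in a month.
    print(f"Store:     ${store_cost(150, 2_000_000):.2f}")
    # Hypothetical usage: one job on 10 Analytics Units for 6 minutes.
    print(f"Analytics: ${analytics_cost(1, 10, 6):.2f}")
```

The analytics term grows with both the number of Analytics Units and the run time, so scaling out only pays off when it shortens the job enough to offset the extra units.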
Azure Data Lake Store vs Blob Storage
➔ No Limitations
  • Store whatever you want in any format
➔ Security
  • Built-in Azure Active Directory support
➔ Pricing
  • More expensive than Storage RA-GRS
➔ Redundancy
  • It’s there but no control over it
➔ Built for Scale
  • Optimized for high-scale reads
➔ Integration
  • With Data Factory, Data Catalog & HDInsight
Summary
➔ Big Data is not just hype, so get ready
➔ Azure Data Lake Store
  • Analyse today & explore tomorrow
  • Data Swamps
➔ Data Lake Analytics
  • No cluster management
  • Re-use existing skills
  • Pay for what we use
➔ Big Data in Azure? Azure Data Lake family and it’s easy!