PASS_Summit_2019_Azure_Storage_Options_for_Analytics

Azure Storage
Options for
Analytics
Dustin Vannoy
Data Engineer
Cloud + Streaming

everything PASS
has to offer
Free online
webinar events
Free 1-day local
training events
Local user groups
around the world
Online special
interest user groups
Business analytics
training
Get involved
Free Online Resources
Newsletters
PASS.org
Explore

Dustin Vannoy
Data Engineering Consultant
Co-founder Data Engineering San
Diego
/dustinvannoy
@dustinvannoy
dustin@dustinvannoy.com
Technologies
• Azure & AWS
• Spark
• Kafka
• Python
Modern Data Systems
• Data Lakes
• Analytics in Cloud
• Streaming

PASS Summit Learning Pathway:
Becoming an Azure Data Engineer
Roles and Responsibilities of the Azure Data Engineer
Jes Borland
Wednesday November 06, 10:15 AM
Room: TCC Tahoma 2
Azure Storage Options for Analytics
Dustin Vannoy
Wednesday, November 06, 3:15 PM
Room: TCC Skagit 4
An Azure Data Engineer’s ETL Toolkit
Simon Whiteley
Thursday, November 07, 3:15 PM
Room: TCC Tahoma 4
Data Modeling Trends for 2019 and Beyond
Ike Ellis
Friday, November 08, 9:30 AM
Room: 2AB

Azure Storage
for Analytics 1. Data Lakes
2. Data Warehouses
3. Analytics

Data Lake Defined
Varied Data
Raw, intermediate,
and fully
processed
Ready for Analysts
Query layer, other
analytic tools access
Big Data Capable
Store first,
evaluate and
model later
* Not just a file system

Store Everything
Why Data Lakes?
• CSV, JSON, Logs, Text
• No schema on write
• Cheaper storage
Reason #1

Massive Scale (Big Data)
Why Data Lakes?
• Serverless Hadoop
• Span hot and cold
storage
• Pay for what you use
Reason #2

Reason #3
Storage + Compute
Separate
Why Data Lakes?
• Cost savings
• Multiple analytics tools /
same data

D E M O
Example Data
Lake Querying

Data Lake Best Practices
• Metadata portal
• Not just raw data
• Dataset certification
• Not too much governance

Azure Blob Storage
• Storage for pretty much
anything
• Can choose from Block blob,
Append blob, or Page blob
• Low cost: $

Azure Blob Storage
Structure
Storage Account
Containers
Blobs

ADLS Gen 1 ADLS Gen 2
Azure Data Lake Storage
File system semantics
Granular security
Scale
Benefits from Gen 1
+ Low cost
+ Hierarchical namespace

Data Lake Storage, Gen 2
• Built on Azure Blob Storage
• Hadoop compatible access
• Optimized for cloud analytics
• Low cost: $$

ADLS Gen 2
Structure
Storage Account
File System
Files

Options for Import
Getting Data into ADLS Gen 2
• Azure Databricks
• Azure Data Factory
• AzCopy
• Azure Storage Explorer

Options for Access
Accessing Data From ADLS Gen 2
• Azure Databricks
• HD Insight
• Polybase (SQL DW / SQL Server)
• Power BI

D E M O
ADLS Gen 2:
Setup and Upload

Archive Storage
• Still part of Azure Blob Storage
• Seamless integration with hot/cool
• Keep everything
• Very low cost
but...
• High read cost
• Early deletion charges

Cost Comparison – Hot LRS
Type
Storage
(Dollars/GB)
Reads
(per 10,000)
Writes
(per 10,000)
Blob Storage (Hot) .021 .004 .055
ADLS Gen 2 (Hot) .021 .006 .072
* for ADLS every 4MB is considered an operation

Cost Comparison – Cool LRS
Type
Storage
(Dollars/GB)
Reads
(per 10,000)
Writes
(per 10,000)
Blob Storage (Cool) .015 .010 .100
ADLS Gen 2 (Cool) .015 .013 .130

Cost Comparison – Archive LRS
Type
Storage
(Dollars/GB)
Reads
(per 10,000)
Writes
(per 10,000)
Blob Storage
(Archive)
.002 5.500 .110
ADLS Gen 2 (Archive) .002 7.15 .143

Storage Redundancy Options
Review redundancy and cost implications: https://azure.microsoft.com/en-
us/pricing/details/storage/

Data Warehouse Defined
Structured Data
Processed and
modeled for
analytics use
Interactive queries
Analysts can get
answers to
questions quickly
BI tool support
Reporting tools
can query
efficiently

Speed of thought
Why Data Warehouses?
• Fast query response
• Indexing or column store
• SQL with analytic functions
Reason #1

Reason #2
Ready to use data
• Useful column names
• Cleaned and standardized
• Focused

Update/Delete
• Support for real-time
ingestion
• Keep latest view or
manage history
Reason #3

Data Warehouse Best Practices
• Staging data off limits
• Star schema design
• Indexing strategies
• Read replicas

Azure SQL DB
• Good ole relational database
• Less DBA work required
• Scalable on demand
• Medium cost: $$ - $$$$
Managed SQL Server

Azure SQL DB – Elastic pools
• DBs can auto-scale within the pool
• Can move DB to different pool
• Want DBs peak usage at different times
• Important to understand utilization of DBs
Resources shared among DBs

Azure SQL DB – Managed Instances
Most on-premise features supported
• SQL Agent jobs
• Change Data Capture
• Enabled CLR
• Cross database queries
• DB Mail
• Service Broker
• Transactional Replication
Best for migrations

Azure SQL DB – Hyperscale
• Storage, Compute, and Log scale separately
• Backups, restores and scaling not tied to volume of data
• Optimized for OLTP, but supports analytical workloads
• One way migration
Highly scalable storage and compute

Hyperscale
Architecture
http://aka.ms/
SQLDB_Hyperscale

D E M O
Azure SQL DB:
Analytics querying

Azure Synapse Analytics - SQL DW
• MPP - fast reads, many users
• Supports Polybase
• Scalable on demand
• High cost: $$$$
High performance Analytic DB

D E M O
Synapse Analytics
(SQL DW):
Analytics querying

Cosmos DB
• Useful for in-app analytics
• Best with known search key, e.g. CustomerID
• Key-value, Column-family, Document, Graph
• SQL, Cassandra, MongoDB, Gremlin, Table, etcd, Spark
• Medium cost: $$ - $$$
Managed NoSQL

Shared semantic model Cache data
Azure Analysis Services
Build calculations and
aggregations into a model
that can be used by many
analytics tools
Improve query speeds by
caching data

Visual report tool Supports most sources
Power BI
Build interactive dashboards
and reports or do
exploratory data analysis
Connects to everything
Azure and many other
source types

D E M O
Power BI:
Connect to Data
Lake

Keep Learning!
Databricks / ETL
10 Cool Things You Can Do With Azure Databricks – Ike, Simon, Dustin
An Azure Data Engineer's ETL Toolkit – Simon Whiteley
Code Like a Snake Charmer - Introduction to Python! – Jamey Johnston
Code Like a Snake Charmer – Advanced Data Modeling in Python! – Jamey Johnston
Cosmos
Cosmic DBA - Cosmos DB for SQL Server Admins and Developers – Michael Donnelly
CosmosDB - Designing and Troubleshooting Lessons – Neil Hambly
Data Modeling
Data Modeling Trends for 2019 and Beyond – Ike Ellis
Innovative Data Modeling for Cool Data Warehouses – Jeff Renz, Leslie Weed
Data Warehouse / SQL DB
Best, Better, Hyperscale! The Last Database You will Ever Need in the Cloud – Denzil Ribeiro
Introducing Azure Synapse Analytics: The End-to-End Analytics Platform Built for Every Data Professional – Saveen
Reddy
Azure SQL Database: Maximizing Cloud Performance and Availability – Joe Sack, Denzil Ribeiro
Delivering a Data Warehouse in the Cloud – Jeff Renz
Data Warehousing: Which of the Many Cloud Products is the Right One for You? – Ginger Grant

Session
Evaluations
Submit by 5pm Friday,
November 15th to
win prizes.
Download the GuideBook App and
search: PASS Summit 2019
Follow the QR code link on session
signage
Go to PASSsummit.com
3 W A Y S T O A C C E S S

Thank You
Dustin Vannoy
@dustinvannoy
dustin@dustinvannoy.com

PASS_Summit_2019_Azure_Storage_Options_for_Analytics

More Related Content

Similar to PASS_Summit_2019_Azure_Storage_Options_for_Analytics (20)

Recently uploaded (20)

PASS_Summit_2019_Azure_Storage_Options_for_Analytics

Editor's Notes