SQLBits 2016
(http://www.slideshare.net/MichaelRys)
Azure Data Lake &
U-SQL
Michael Rys, @MikeDoesBigData
http://www.azure.com/datalake
{mrys, usql}@microsoft.com
The Data Lake Approach
Implement Data Warehouse
Reporting &
Analytics
Development
Reporting &
Analytics Design
Physical Design
Dimension Modelling
ETL
Development
ETL Design
Install and Tune
Setup Infrastructure
Traditional data warehousing approach
Data sources
ETL
BI and analytics
Data warehouse
Understand
Corporate
Strategy
Gather
Requirements
Business
Requirements
Technical
Requirements
The Data Lake approach
Ingest all data
regardless of
requirements
Store all data
in native format
without schema
definition
Do analysis
Using analytic
engines like Hadoop
Interactive queries
Batch queries
Machine Learning
Data warehouse
Real-time analytics
Devices
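A minimal U-SQL sketch of the "store in native format, apply the schema when you analyze" idea above. The file path, column names, and the 300-second threshold are illustrative assumptions, not taken from the deck; Extractors.Tsv() and Outputters.Csv() are U-SQL's built-in extractor and outputter.

```usql
// Schema on read: the file stays in its native TSV form in the store.
// Column names and C# types are supplied only here, at query time.
// (The column list is assumed to match the columns present in the file.)
@searchlog =
    EXTRACT UserId   int,
            Start    DateTime,
            Region   string,
            Query    string,
            Duration int?,
            Urls     string
    FROM "/input/searchlog.tsv"
    USING Extractors.Tsv();

// Ordinary SQL-style filtering over the rowset that was just given a schema.
@long_queries =
    SELECT UserId, Region, Query, Duration
    FROM @searchlog
    WHERE Duration > 300;

OUTPUT @long_queries
TO "/output/long_queries.csv"
USING Outputters.Csv();
```

A different script could read the same file with a different column list, which is the point of deferring the schema to the query.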
[Chart: MICROSOFT DOUBLES SEARCH SHARE. US search share by year: 2009: 9%, 2010: 11%, 2011: 15%, 2012: 16%, 2013: 18%, 2014: 19%, 2015: 20%. Source: ComScore 2009-2015 Search Report US]
How Microsoft has used
Big Data
We needed to better leverage data and
analytics to win in search
We changed our approach
• More experiments by more people!
So we…
Built an Exabyte-scale data lake for everyone
to put their data.
Built tools approachable by any developer.
Built machine learning tools for collaborating
across large experiment models.
Introducing Azure Data Lake
Big Data Made Easy
Analytics
Storage
HDInsight
(“managed clusters”)
Azure Data Lake Analytics
Azure Data Lake Storage
Azure Data Lake
Azure Data Lake
Storage Service
No limits to SCALE
Store ANY DATA in its native format
HADOOP FILE SYSTEM (HDFS) for the cloud
ENTERPRISE GRADE access control, encryption
at rest
Optimized for analytic workload
PERFORMANCE
Azure Data Lake
Store
A hyper scale repository for big
data analytics workloads
IN PREVIEW
Azure Data Lake
Analytics Service
Azure Data Lake (Store, HDInsight, Analytics)
[Diagram. Analytics: ADL Analytics (U-SQL) and HDInsight (Hive), on YARN. Storage: ADL Store, accessed through WebHDFS.]
ADLA complements HDInsight
Target the same scenarios, tools, and customers
HDInsight
For developers familiar with the
Open Source: Java, Eclipse, Hive, etc.
Clusters offer customization, control,
and flexibility in a managed Hadoop
cluster
ADLA
Enables customers to leverage
existing experience with C#, SQL &
PowerShell
Offers convenience, efficiency,
automatic scale, and management in
a “job service” form factor
No limits to SCALE
Includes U-SQL, a language that unifies the
benefits of SQL with the expressive power of C#
Optimized to work with ADL STORE
FEDERATED QUERY across Azure data sources
ENTERPRISE GRADE role-based access control
and auditing
Pay PER QUERY and scale PER QUERY
Azure Data Lake
Analytics
A distributed analytics service
built on Apache YARN that
dynamically scales to your
needs
IN PREVIEW
ADL and SQLDW
Query data where it lives
Easily query data in multiple Azure data stores without moving it to a single store
Benefits
• Avoid moving large amounts of data across the
network between stores
• Single view of data irrespective of physical location
• Minimize data proliferation issues caused by
maintaining multiple copies
• Single query language for all data
• Each data store maintains its own sovereignty
• Design choices based on the need
• Push SQL expressions to remote SQL sources
• Filters
• Joins
U-SQL
Query
Query
Azure
Storage Blobs
Azure SQL
in VMs
Azure
SQL DB
Azure Data
Lake Analytics
Azure
SQL Data Warehouse
Azure
Data Lake Storage
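A hedged sketch of what "query data where it lives" can look like in U-SQL. All object names (SalesDs, SalesCredential, SalesDb, dbo.Orders, and the columns) are hypothetical, the credential is assumed to have been registered in the catalog beforehand, and the statement forms follow the preview-era syntax, so they should be checked against the U-SQL reference before use.

```usql
// Remote data source definition (hypothetical names throughout).
CREATE DATA SOURCE IF NOT EXISTS SalesDs
FROM AZURESQLDB
WITH
(
    PROVIDER_STRING = "Database=SalesDb;Trusted_Connection=False;Encrypt=True",
    CREDENTIAL = SalesCredential,
    REMOTABLE_TYPES = (bool, byte, short, ushort, int, decimal)
);

// External table: a local schema in C# types over the remote table.
CREATE EXTERNAL TABLE IF NOT EXISTS dbo.RemoteOrders
(
    oid    int,
    cid    int,
    amount decimal
)
FROM SalesDs LOCATION "dbo.Orders";

// Query through the external table; eligible filters can be pushed to the remote source.
@big_orders =
    SELECT cid, amount
    FROM dbo.RemoteOrders
    WHERE amount > 100;

// Pass-through query: sent as written, in the remote dialect (T-SQL here).
@order_counts =
    SELECT *
    FROM EXTERNAL SalesDs
    EXECUTE @"SELECT cid, COUNT(*) AS order_count FROM dbo.Orders GROUP BY cid";

OUTPUT @big_orders TO "/output/big_orders.csv" USING Outputters.Csv();
OUTPUT @order_counts TO "/output/order_counts.csv" USING Outputters.Csv();
```

The external table fixes a local schema up front, while the pass-through form leaves both the query text and the result shape to the remote system.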
Azure Data Lake
U-SQL
ADL/U-SQL Introduction (SQLBits 2016)
Some sample use cases
Digital Crime Unit – Analyze complex attack patterns
to understand BotNets and to predict and mitigate
future attacks by analyzing log records with
complex custom algorithms
Image Processing – Large-scale image feature
extraction and classification using custom code
Shopping Recommendation – Complex pattern
analysis and prediction over shopping records
using proprietary algorithms
Characteristics
of Big Data
Analytics
•Requires processing
of any type of data
•Allow use of custom
algorithms
•Scale to any size and
be efficient
Status Quo:
SQL for
Big Data
• Declarativity does scaling and parallelization for you
• Extensibility is bolted on and not “native”
  • hard to work with anything other than structured data
  • difficult to extend with custom code
Status Quo:
Programming
Languages for
Big Data
• Extensibility through custom code is “native”
• Declarativity is bolted on and not “native”
  • User often has to care about scale and performance
  • SQL is second-class, embedded within strings
  • Often no code reuse/sharing across queries
Why U-SQL?
• Declarativity and Extensibility are equally native to the language!
Get benefits of both!
Makes it easy for you by unifying:
• Unstructured and structured data processing
• Declarative SQL and custom imperative Code
• Local and remote Queries
• Increase productivity and agility from Day 1 and
at Day 100 for YOU!
The origins
of U-SQL
SCOPE – Microsoft’s internal
Big Data language
• SQL and C# integration model
• Optimization and Scaling model
• Runs hundreds of thousands of jobs daily
Hive
• Complex data types (Maps, Arrays)
• Data format alignment for text files
T-SQL/ANSI SQL
• Many of the SQL capabilities (windowing functions, meta
data model etc.)
Built-in operators,
functions, aggregates
C# expressions (in SELECT expressions)
User-defined aggregates (UDAGGs)
User-defined functions (UDFs)
User-defined operators (UDOs)
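The extension points listed above can be sketched in one short script. The assembly name MyAssembly and everything under MyNamespace are hypothetical placeholders for user code assumed to be registered in the database already; the input path and columns are assumptions too. User-defined operators are not shown here.

```usql
// Hypothetical user assembly, assumed to be registered in the current database.
REFERENCE ASSEMBLY MyAssembly;

@orders =
    EXTRACT oid int, cid int, odate DateTime, amount double
    FROM "/input/orders.txt"
    USING Extractors.Csv();

// Inline C# expressions and a user-defined function (hypothetical) in one SELECT.
@shaped =
    SELECT cid,
           odate.DayOfWeek.ToString() AS weekday,   // C# property access and method call
           amount
    FROM @orders
    WHERE MyNamespace.MyFunctions.IsValid(oid);     // UDF, called by its C# name

// Built-in aggregate next to a user-defined aggregate (hypothetical).
@stats =
    SELECT cid,
           weekday,
           SUM(amount) AS total,
           AGG<MyNamespace.MySum>(amount) AS custom_total
    FROM @shaped
    GROUP BY cid, weekday;

OUTPUT @stats TO "/output/stats.csv" USING Outputters.Csv();
```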
U-SQL Language Philosophy
Declarative Query and Transformation Language:
• Uses SQL’s SELECT FROM WHERE with GROUP
BY/Aggregation, Joins, SQL Analytics functions
• Optimizable, Scalable
Expression-flow programming style:
• Easy to use functional lambda composition
• Composable, globally optimizable
Operates on Unstructured & Structured Data
• Schema on read over files
• Relational metadata objects (e.g. database, table)
Extensible from ground up:
• Type system is based on C#
• Expression language IS C#
• User-defined functions (U-SQL and C#)
• User-defined Aggregators (C#)
• User-defined Operators (UDO) (C#)
U-SQL provides the parallelization and scale-out
framework for user code
• EXTRACTOR, OUTPUTTER, PROCESSOR, REDUCER,
COMBINER, APPLIER
Federated query across distributed data sources
REFERENCE ASSEMBLY MyDB.MyAssembly;

CREATE TABLE T( cid int, first_order DateTime
              , last_order DateTime, order_count int
              , order_amount float );

@o = EXTRACT oid int, cid int, odate DateTime, amount float
     FROM "/input/orders.txt"
     USING Extractors.Csv();

@c = EXTRACT cid int, name string, city string
     FROM "/input/customers.txt"
     USING Extractors.Csv();

// Per-customer order statistics; dates and amounts come from the orders rowset @o.
@j = SELECT c.cid, MIN(o.odate) AS firstorder
          , MAX(o.odate) AS lastorder, COUNT(o.oid) AS ordercnt
          , AGG<MyAgg.MySum>(o.amount) AS totalamount
     FROM @c AS c LEFT OUTER JOIN @o AS o ON c.cid == o.cid
     WHERE c.city.StartsWith("New")
           && MyNamespace.MyFunction(o.odate) > 10
     GROUP BY c.cid;

OUTPUT @j TO "/output/result.txt"
USING new MyData.Write();

INSERT INTO T SELECT * FROM @j;
Expression-flow
Programming Style
Automatic "in-lining" of U-SQL
expressions – whole script leads to a
single execution model.
Execution plan that is optimized out-of-the-box and without user
intervention.
Per-job and user-driven level of
parallelization.
Detailed visibility into execution steps,
for debugging.
Heatmap-like functionality to identify
performance bottlenecks.
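A small sketch of the expression-flow style described above: each rowset variable names an expression that later statements build on, and the script as a whole is compiled into one plan rather than run statement by statement. Paths, columns, the date, and the threshold are illustrative assumptions.

```usql
@orders =
    EXTRACT oid int, cid int, odate DateTime, amount double
    FROM "/input/orders.txt"
    USING Extractors.Csv();

// Step 1: name a filtered expression.
@recent =
    SELECT cid, amount
    FROM @orders
    WHERE odate >= DateTime.Parse("2015-01-01");

// Step 2: build on the previous expression.
@totals =
    SELECT cid, SUM(amount) AS total
    FROM @recent
    GROUP BY cid;

// Step 3: build on it again.
@top =
    SELECT cid, total
    FROM @totals
    WHERE total > 10000;

// The rowset variables above are in-lined into this output's plan.
OUTPUT @top
TO "/output/top_customers.csv"
ORDER BY total DESC
USING Outputters.Csv();
```

Splitting the logic into named steps is for readability only; the optimizer is free to combine or reorder them.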
Unifies natively SQL’s declarativity and C#’s extensibility
Unifies querying structured and unstructured data
Unifies local and remote queries
Increase productivity and agility from Day 1 forward for
YOU!
Sign up for an Azure Data Lake account and join the Public Preview
http://www.azure.com/datalake and give us your feedback via
http://aka.ms/adlfeedback or at http://aka.ms/u-sql-survey!
Additional
resources
• Tools:
• http://aka.ms/adltoolsVS
• Blogs, videos and community page:
• http://usql.io (Link to Github with code samples)
• http://blogs.msdn.com/b/visualstudio/
• http://azure.microsoft.com/en-us/blog/topics/big-data/
• https://channel9.msdn.com/Search?term=U-SQL#ch9Search
• Documentation and articles and slides:
• http://aka.ms/usql_reference
• https://azure.microsoft.com/en-us/documentation/services/data-lake-analytics/
• https://msdn.microsoft.com/en-us/magazine/mt614251
• http://www.slideshare.net/MichaelRys
• ADL forums and feedback
• http://aka.ms/adlfeedback
• https://social.msdn.microsoft.com/Forums/azure/en-US/home?forum=AzureDataLake
• http://stackoverflow.com/questions/tagged/u-sql
http://aka.ms/AzureDataLake

Editor's Notes

  • #4: A data warehouse follows a top-down approach built around a well-architected information store and an enterprise-wide BI solution. The company’s corporate strategy is defined first, followed by gathering the business and technical requirements for the warehouse. The warehouse is then implemented through dimension modelling and ETL design, followed by the actual development. All of this is done before any data is collected. The approach uses a rigorous, formalized methodology because a true enterprise data warehouse supports many users and applications within an organization in making better decisions.
  • #5: A data lake is an enterprise-wide repository of every type of data, collected in a single place. Data of all types can be stored in the data lake before any formal definition of requirements or schema, for the purposes of operational and exploratory analytics. Advanced analytics can be done using Hadoop or machine learning tools, or the lake can act as a lower-cost data preparation location before curated data is moved into a data warehouse. In these cases, customers load data into the data lake before defining any transformation logic. This is bottom-up because data is collected first, and the data itself gives you the insight and helps derive conclusions or predictive models.
  • #17: Other points to make here that are not called out above:
      • Built on Apache YARN
      • Scales dynamically with the turn of a dial
      • Supports Azure AD for access control, roles, and integration with on-prem identity systems
      • U-SQL’s scalable runtime processes data across multiple Azure data sources
  • #19:
      • DATA SOURCE: Represents a remote data source such as Azure SQL Database. You have to specify all the details (connection string, credentials, etc.) required to connect to it and issue queries.
      • EXTERNAL TABLE: A local table, with columns defined in C# types, that redirects queries issued against it to the remote table it is based on. U-SQL automatically does the type conversion. External tables let you impose a specific schema on the remote data, shielding you from remote schema changes. You can issue queries that join external and local tables.
      • PASS-THROUGH queries: These queries are issued directly against the remote data source in the syntax of the remote data source (say, T-SQL for Azure SQL Database).
      • REMOTABLE_TYPES: For every external data source you have to specify the list of remotable types. This list constrains the types of queries that will be remoted. Example: REMOTABLE_TYPES = (bool, byte, short, ushort, int, decimal);
      • LAZY METADATA LOADING: Here the remote data is schematized only when the query is actually issued to the remote data source. Your program must be able to deal with remote schema changes.
  • #22: Add velocity?
  • #23: Hard to operate on unstructured data: even Hive requires metadata to be created before it can operate on unstructured data. Adding custom Java functions, aggregators and SerDes involves a lot of steps, often requires access to the server’s head node, and differs based on the type of operation. Requires many tools and steps. Some examples:
      • Hive UDAgg
        • Code and compile .java into .jar
        • Extend the AbstractGenericUDAFResolver class: does type checking, argument checking and overloading
        • Extend the GenericUDAFEvaluator class: implements the logic in 8 methods
        • Deploy: deploy the jar into the class path on the server; edit FunctionRegistry.java to register as built-in; update the content of show functions with ant
      • Hive UDF (as of v0.13)
        • Code
        • Load JAR into head node or at URI
        • CREATE FUNCTION USING JAR to register and load the jar into the classpath for every function (instead of registering the jar once and just using the functions)
  • #24: Spark supports custom “inputters and outputters” for defining custom RDDs. No UDAGGs. Simple integration of UDFs, but only for the duration of the program; no reuse/sharing. Cloud Dataflow? Requires the user to care about scale and perf.
      • Spark UDAgg: not yet supported (SPARK-3947)
      • Spark UDF
        • Write an inline function: def westernState(state: String) = Seq("CA", "OR", "WA", "AK").contains(state)
        • For SQL usage, register the table: customerTable.registerTempTable("customerTable")
        • Register each UDF: sqlContext.udf.register("westernState", westernState _)
        • Call it: val westernStates = sqlContext.sql("SELECT * FROM customerTable WHERE westernState(state)")
  • #25: Offers auto-scaling and performance. Operates on unstructured data without needing tables. Easy to extend declaratively with custom code: consistent model for UDO, UDF and UDAgg. Easy to query remote sources even without external tables.
      • U-SQL UDAgg
        • Code and compile .cs file: implement IAggregate’s 3 methods: Init(), Accumulate(), Terminate(). C# takes care of type checking, generics etc.
        • Deploy with tooling: one-click registration of the assembly in the user db
        • Deploy by hand: copy the file to ADL, then CREATE ASSEMBLY to register the assembly
        • Use via AGG<MyNamespace.MyAggregate<T>>(a)
      • U-SQL UDF: code in C#, register the assembly once, call by C# name.
  • #26: Remove SCOPE for external customers?
  • #27: C# is the extension story for U-SQL:
      • Expressions in SELECT statement
      • User-defined operators (UDOs)
      • User-defined functions (UDFs)
      • User-defined aggregates (UDAGGs)
      • User-defined types (UDTs)
    UDOs are central to the U-SQL user experience. UDFs, UDAGGs, UDOs and UDTs require assemblies to be registered (one-time cost, fixed assembly version) and are automatically available after referencing the assembly in a script. One version of an assembly per database; an assembly with the same short name is not allowed. Tooling provides a code-behind and auto-deploy experience.
  • #29: Use for language experts