Big Data Processing with .NET and Spark
Michael Rys
Principal Program Manager, Azure Data
@MikeDoesBigData
Agenda
• What is Apache Spark
• Why .NET for Apache Spark
• What is .NET for Apache Spark
• Demos
• How does it perform
• Where does it run
• Special Announcement & Call to Action
• Apache Spark is an OSS fast analytics engine for big data and machine learning
• Improves efficiency through:
  • General computation graphs beyond map/reduce
  • In-memory computing primitives
• Allows developers to scale out their user code & write in their language of choice
  • Rich APIs in Java, Scala, Python, R, SparkSQL, etc.
  • Batch processing, streaming and interactive shell
• Available on Azure via Azure Synapse, Azure Databricks, Azure HDInsight, IaaS/Kubernetes
.NET Developers 💖 Apache Spark…
• A lot of business logic usable for big data (millions of lines of code) is written in .NET!
• It is expensive and difficult to translate into Python/Scala/Java!
• .NET developers are locked out from big data processing due to the lack of .NET support in OSS big data solutions
• In a recently conducted .NET developer survey (>1,000 developers), more than 70% expressed interest in Apache Spark!
• They would like to tap into the OSS eco-system for code libraries, support, and hiring
Goal: .NET for Apache Spark is aimed at providing
.NET developers a first-class experience when
working with Apache Spark.
Non-Goal: Converting existing Scala/Python/Java
Spark developers.
We are developing it in the open!
Contributions to foundational OSS projects:
• Apache Spark Core: SPARK-28271, SPARK-28278, SPARK-28283, SPARK-28282, SPARK-28284,
SPARK-28319, SPARK-28238, SPARK-28856, SPARK-28970, SPARK-29279, SPARK-29373
• Apache Arrow: ARROW-4997, ARROW-5019, ARROW-4839, ARROW-4502, ARROW-4737,
ARROW-4543, ARROW-4435, ARROW-4503, ARROW-4717, ARROW-4337, ARROW-5887,
ARROW-5908, ARROW-6314, ARROW-6682
• Pyrolite (Pickling Library): Improve pickling/unpickling performance, Add a Strong Name to
Pyrolite, Improve Pickling Performance, Hash set handling, Improve unpickling performance
.NET for Apache Spark is open source
• Website: https://dot.net/spark
• GitHub: https://github.com/dotnet/spark
• Frequent releases (about every 6 weeks), current release v0.12.1
• Integrates with .NET Interactive (https://github.com/dotnet/interactive) and
nteract/Jupyter
Spark project improvement proposals:
• Interop support for Spark language extensions: SPARK-26257
• .NET bindings for Apache Spark: SPARK-27006
Journey so far
• ~2k GitHub unique visitors/wk
• ~8k GitHub page views/wk
• 260 GitHub issues closed
• 246 GitHub PRs merged
• 127k NuGet downloads
• 39 GitHub contributors
Customer Success: O365’s MSAI
Microsoft Search, Assistant & Intelligence Team: Towards Modern Workspaces in O365 (scale: ~50 TB)
Job: Build ML/deep learning models on top of substrate data to infuse intelligence into Office 365 products. Our data resides in Azure Data Lake Storage. We cook/featurize data that in turn gets fed into our ML models.
Why Spark.NET? Given that our business logic (e.g., featurizers, tokenizers for normalizing text) is written in C#, Spark.NET is an ideal candidate for our workloads. We leverage Spark.NET to run those libraries at scale.
Experience: Very promising; a stable & highly vibrant community that is helping us iterate at the agility we want. Looking forward to a longer working relationship and broader adoption across Substrate Intelligence / MSAI.
.NET provides full-spectrum Spark support
• Spark DataFrames with SparkSQL: works with Spark v2.3.x/v2.4.x and includes ~300 SparkSQL functions, Grouped Map, Delta Lake, and .NET Spark UDFs
• Batch & streaming: including Spark Structured Streaming and all Spark-supported data sources
• .NET Standard 2.0: works with .NET Framework v4.6.1+ and .NET Core v2.1/v3.1 and includes C#/F# support
• Data science: including access to ML.NET and interactive notebooks with a C# REPL
• Speed & productivity: performance-optimized interop, as fast or faster than PySpark, with support for HW vectorization
https://github.com/dotnet/spark/examples
Introduction to Spark Programming: DataFrame

UserId   State   Salary
Terry    WA      XX
Rahul    WA      XX
Dan      WA      YY
Tyson    CA      ZZ
Ankit    WA      YY
Michael  WA      YY
.NET for Apache Spark programmability

var spark = SparkSession.Builder().GetOrCreate();
var df = spark.Read().Json("input.json");

var concat =
    Udf<int?, string, string>((age, name) => name + age);

df.Filter(df["age"] > 21)
    .Select(concat(df["age"], df["name"])).Show();
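For context, the snippet can be hosted in a minimal console app. This is a sketch: the app name and input file are illustrative, and it assumes the Microsoft.Spark NuGet package is referenced and the app is launched through spark-submit/DotnetRunner.

```csharp
using Microsoft.Spark.Sql;
using static Microsoft.Spark.Sql.Functions;

class Program
{
    static void Main(string[] args)
    {
        // Connects to the session created by spark-submit/DotnetRunner.
        SparkSession spark = SparkSession.Builder()
            .AppName("HelloSpark")   // illustrative name
            .GetOrCreate();

        DataFrame df = spark.Read().Json("input.json");

        // Define the UDF before referencing it in a query.
        var concat = Udf<int?, string, string>((age, name) => name + age);

        df.Filter(df["age"] > 21)
          .Select(concat(df["age"], df["name"]))
          .Show();
    }
}
```

The UDF runs in the CLR worker process, so it must be declared in the app before the query that uses it is executed.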
Language comparison: TPC-H Query 2 (Scala first, then C#)
val europe = region.filter($"r_name" === "EUROPE")
.join(nation, $"r_regionkey" === nation("n_regionkey"))
.join(supplier, $"n_nationkey" === supplier("s_nationkey"))
.join(partsupp,
supplier("s_suppkey") === partsupp("ps_suppkey"))
val brass = part.filter(part("p_size") === 15
&& part("p_type").endsWith("BRASS"))
.join(europe, europe("ps_partkey") === $"p_partkey")
val minCost = brass.groupBy(brass("ps_partkey"))
.agg(min("ps_supplycost").as("min"))
brass.join(minCost, brass("ps_partkey") === minCost("ps_partkey"))
.filter(brass("ps_supplycost") === minCost("min"))
.select("s_acctbal", "s_name", "n_name",
"p_partkey", "p_mfgr", "s_address",
"s_phone", "s_comment")
.sort($"s_acctbal".desc,
$"n_name", $"s_name", $"p_partkey")
.limit(100)
.show()
var europe = region.Filter(Col("r_name") == "EUROPE")
.Join(nation, Col("r_regionkey") == nation["n_regionkey"])
.Join(supplier, Col("n_nationkey") == supplier["s_nationkey"])
.Join(partsupp,
supplier["s_suppkey"] == partsupp["ps_suppkey"]);
var brass = part.Filter(part["p_size"] == 15
& part["p_type"].EndsWith("BRASS"))
.Join(europe, europe["ps_partkey"] == Col("p_partkey"));
var minCost = brass.GroupBy(brass["ps_partkey"])
.Agg(Min("ps_supplycost").As("min"));
brass.Join(minCost, brass["ps_partkey"] == minCost["ps_partkey"])
.Filter(brass["ps_supplycost"] == minCost["min"])
.Select("s_acctbal", "s_name", "n_name",
"p_partkey", "p_mfgr", "s_address",
"s_phone", "s_comment")
.Sort(Col("s_acctbal").Desc(),
Col("n_name"), Col("s_name"), Col("p_partkey"))
.Limit(100)
.Show();
Similar syntax – dangerously copy/paste friendly!
Scala vs. C# differences:
• Column references: $"col_name" (Scala) vs. Col("col_name") (C#)
• Capitalization of method names (e.g., filter vs. Filter)
• Equality operator: === (Scala) vs. == (C#)
Demo 1: Getting started locally
Submitting a Spark Application

spark-submit (Scala):
spark-submit `
  --class <user-app-main-class> `
  --master local `
  <path-to-user-jar> `
  <argument(s)-to-your-app>

spark-submit (.NET):
spark-submit `
  --class org.apache.spark.deploy.DotnetRunner `
  --master local `
  <path-to-microsoft-spark-jar> `
  <path-to-your-app-exe> <argument(s)-to-your-app>

The microsoft-spark JAR is provided by the .NET for Apache Spark library; the app executable is provided by the user and contains the business logic.
Demo 2: Locally debugging a .NET for Spark App

spark-submit --class org.apache.spark.deploy.DotnetRunner `
  --master local <path-to-microsoft-spark-jar> debug
Debugging User-defined Code
https://github.com/dotnet/spark/pull/294
Step 1: Write your app code
Step 2: set DOTNET_WORKER_DEBUG=1, then run spark-submit with the debug argument
Step 3: Switch to your app code, add a breakpoint at your business logic, press F5
Step 4: In the "Choose Just-In-Time Debugger" dialog, choose "New Instance of …" and select your app code CS file
Step 5: That’s it! Have fun 🙂
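The steps above can be sketched as a command sequence (Windows cmd syntax; the jar path is a placeholder, as in the slides):

```cmd
REM Step 2: enable worker debugging, then submit with the debug argument
set DOTNET_WORKER_DEBUG=1
spark-submit --class org.apache.spark.deploy.DotnetRunner ^
  --master local <path-to-microsoft-spark-jar> debug

REM Steps 3-5: in Visual Studio, set a breakpoint in your business logic,
REM press F5, and attach via the "Choose Just-In-Time Debugger" dialog.
```

With DOTNET_WORKER_DEBUG set, the worker waits for a debugger to attach before running your UDF code.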
Demo 3: Twitter analysis in the Cloud
What is happening when you write .NET Spark code?
Your .NET program uses .NET for Apache Spark to build a DataFrame/SparkSQL operation tree that is handed to Spark.
Did you define a .NET UDF?
• No: regular execution path (no .NET runtime during execution); same speed as with Scala Spark
• Yes: interop between Spark and .NET; faster than with PySpark
On the Spark Worker Node, the JVM hosts the Spark Executor, and the CLR hosts Microsoft.Spark.Worker with the .NET UDF library and the user Spark library. A task with a UDF runs as follows:
1. Run a task with a UDF
2. Launch the worker executable
3. Serialize UDFs & data
4. Execute user-defined operations
5. Write serialized result rows
Legend: Interop (Scala), Interop (.NET)
Challenge: How to serialize data between JVM & CLR?
• Pickling: row-oriented (the default)
• Apache Arrow: column-oriented
Performance: Worker-side Interop
df.GroupBy("age")
.Apply(
new StructType(new[]
{
new StructField("age", new IntegerType()),
new StructField("nameCharCount", new IntegerType())
}),
batch => CountCharacters(batch, "age", "name"))
.Show();
Simplifying experience with Arrow
// FxDataFrame is an alias for Microsoft.Data.Analysis.DataFrame,
// used to disambiguate it from Spark's DataFrame.
private static FxDataFrame CountCharacters(
    FxDataFrame df,
    string groupColName,
    string summaryColName)
{
    int charCount = 0;
    for (long i = 0; i < df.RowCount; ++i)
    {
        charCount += ((string)df[summaryColName][i]).Length;
    }

    return new FxDataFrame(new[] {
        new PrimitiveColumn<int>(groupColName,
            new[] { (int?)df[groupColName][0] }),
        new PrimitiveColumn<int>(summaryColName,
            new[] { charCount }) });
}
private static RecordBatch CountCharacters(
RecordBatch records,
string groupColName,
string summaryColName)
{
int summaryColIndex = records.Schema.GetFieldIndex(summaryColName);
StringArray stringValues = records.Column(summaryColIndex) as StringArray;
int charCount = 0;
for (int i = 0; i < stringValues.Length; ++i)
{
charCount += stringValues.GetString(i).Length;
}
int groupColIndex = records.Schema.GetFieldIndex(groupColName);
Field groupCol = records.Schema.GetFieldByIndex(groupColIndex);
return new RecordBatch(
new Schema.Builder()
.Field(groupCol)
.Field(f => f.Name(summaryColName).DataType(Int32Type.Default))
.Build(),
new IArrowArray[]
{
records.Column(groupColIndex),
new Int32Array.Builder().Append(charCount).Build()
},
records.Length);
}
Simplifying experience with Arrow: of the two CountCharacters implementations above, the FxDataFrame-based one is the new experience; the raw Arrow RecordBatch-based one is the previous experience.
Performance – warm cluster runs for Pickling serialization (for Arrow improvements, see the next slide)
• Takeaway 1: Where UDF performance does not matter, .NET is on par with Python
• Takeaway 2: Where UDF performance is critical, .NET is ~2x faster than Python!

Performance – warm cluster runs for C#, Pickling vs. Arrow serialization
• Takeaway: Since Q1 is interop bound, we see a 33% perf improvement with better serialization

Performance – warm cluster runs for Arrow serialization in C# vs. Python
• Takeaway: Since serialization inefficiencies have been removed, we are left with similar perf across languages; if you like .NET, you can stick with .NET 🙂
Works everywhere!
• Cross platform: Windows, Ubuntu, macOS
• Cross cloud: Azure & AWS
• Installed out of the box: Azure Synapse
• Installation docs on GitHub: Databricks, AWS EMR Spark, Azure HDI Spark
Using .NET for Spark in Azure Synapse: Batch Submission
https://docs.microsoft.com/en-us/azure/synapse-analytics/spark/spark-dotnet
• cd mySparkApp
• dotnet publish -c Release -f netcoreapp3.1 -r ubuntu.16.04-x64
• Zip the folder
• Upload the ZIP file to your cloud storage
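The bullets above can be sketched as concrete commands; the publish output path shown is the .NET default for this framework/runtime identifier, but verify it against your own project layout:

```shell
cd mySparkApp
dotnet publish -c Release -f netcoreapp3.1 -r ubuntu.16.04-x64
cd bin/Release/netcoreapp3.1/ubuntu.16.04-x64/publish
zip -r mySparkApp.zip .
# Upload mySparkApp.zip to your cloud storage (e.g., an ADLS Gen2 container)
```

Zipping the contents of the publish folder (rather than the folder itself) keeps the app executable at the root of the archive.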
Using .NET for Spark in Azure Synapse: Batch Submission (submission fields)
• Language: selects the semantics of the submission fields
• ZIP file: contains the Spark application, including UDF DLLs, and even the Spark or .NET runtime if a different version is needed
• Main Program (Unix)
• Program parameters, as needed
• Additional resource and library files that are not included in the ZIP (e.g., shared DLLs, config files)
Using .NET for Spark in Azure Synapse: Notebooks with .NET Interactive
• Language selects the type of notebook: Interactive C#
• The Spark context spark is built in
• NuGet packages can be imported into the notebook
Using .NET for Spark in Azure Databricks
• Not available out of the box, but can be used in batch submission
• https://github.com/dotnet/spark/blob/master/deployment/README.md#databricks
Note: Traditional Databricks notebooks are proprietary and cannot integrate .NET. Please contact @Databricks if you want it out of the box 🙂
VSCode extension for Spark .NET
Author:
• Spark .NET project creation
• Dependency packaging
• Language service
• Sample code
• Reference management
Run:
• Spark local run
• Spark cluster run (e.g. HDInsight)
Debug
Extension to VSCode:
• Tap into VSCode for C# programming
• Automate Maven and Spark dependencies for environment setup
• Facilitate first project success through a project template and sample code
• Support Spark local run and cluster run
• Integrate with Azure for HDInsight cluster navigation
• Azure Databricks integration planned
ANNOUNCING: .NET for Apache Spark v1.0 is released!
• First-class C# and F# bindings to Apache Spark, bringing the power of big data analytics to .NET developers
• Apache Spark 2.4/3.0: DataFrames, Structured Streaming, Delta Lake
• .NET Standard 2.0: C# and F#, ML.NET
• Performance optimized with Apache Arrow and HW vectorization
• First-class integration in Azure Synapse: batch submission, interactive .NET notebooks
Learn more at http://dot.net/Spark
What’s next?
• More programming experiences in .NET (UDAF, UDT support, multi-language UDFs)
• Spark data connectors in .NET (e.g., Apache Kafka, Azure Blob Store, Azure Data Lake)
• Tooling experiences (e.g., Jupyter/nteract, VS Code, Visual Studio, others?)
• Idiomatic experiences for C# and F# (LINQ, Type Provider)
• Out-of-box experiences (Azure Synapse, Azure HDInsight, Azure Databricks, Cosmos DB Spark, SQL 2019 BDC, …)
Go to https://github.com/dotnet/spark and let us know what is important to you!
Call to action: Engage, use & guide us!
Related session:
• Big Data and Data Warehousing Together with Azure Synapse Analytics
Useful links:
• http://github.com/dotnet/spark
• https://www.nuget.org/packages/Microsoft.Spark
• https://aka.ms/GoDotNetForSpark
• https://docs.microsoft.com/dotnet/spark
Website:
• https://dot.net/spark (Request a Demo!)
Starter videos: .NET for Apache Spark 101
• Watch on YouTube
• Watch on Channel 9
Available out of the box on Azure Synapse & Azure HDInsight Spark
Running .NET for Spark anywhere: https://aka.ms/InstallDotNetForSpark
You & @MikeDoesBigData #DotNetForSpark
© Copyright Microsoft Corporation. All rights reserved.
Big Data Processing with .NET and Spark (SQLBits 2020)

  • 1. Big Data Processing with .NET and Spark Michael Rys Principal Program Manager, Azure Data @MikeDoesBigData
  • 2. Agenda What is Apache Spark Why .NET for Apache Spark What is .NET for Apache Spark Demos How does it perform Where does it run Special Announcement & Call to Action
  • 3.  Apache Spark is an OSS fast analytics engine for big data and machine learning  Improves efficiency through:  General computation graphs beyond map/reduce  In-memory computing primitives  Allows developers to scale out their user code & write in their language of choice  Rich APIs in Java, Scala, Python, R, SparkSQL etc.  Batch processing, streaming and interactive shell  Available on Azure via Azure Synapse Azure Databricks Azure HDInsight IaaS/Kubernetes
  • 4. .NET Developers 💖 Apache Spark… A lot of big data-usable business logic (millions of lines of code) is written in .NET! Expensive and difficult to translate into Python/Scala/Java! Locked out from big data processing due to lack of .NET support in OSS big data solutions In a recently conducted .NET Developer survey (> 1000 developers), more than 70% expressed interest in Apache Spark! Would like to tap into OSS eco-system for: Code libraries, support, hiring
  • 5. Goal: .NET for Apache Spark is aimed at providing .NET developers a first-class experience when working with Apache Spark. Non-Goal: Converting existing Scala/Python/Java Spark developers.
  • 6. We are developing it in the open! Contributions to foundational OSS projects: • Apache Spark Core: SPARK-28271, SPARK-28278, SPARK-28283, SPARK-28282, SPARK-28284, SPARK-28319, SPARK-28238, SPARK-28856, SPARK-28970, SPARK-29279, SPARK-29373 • Apache Arrow: ARROW-4997, ARROW-5019, ARROW-4839, ARROW-4502, ARROW-4737, ARROW-4543, ARROW-4435, ARROW-4503, ARROW-4717, ARROW-4337, ARROW-5887, ARROW-5908, ARROW-6314, ARROW-6682 • Pyrolite (Pickling Library): Improve pickling/unpickling performance, Add a Strong Name to Pyrolite, Improve Pickling Performance, Hash set handling, Improve unpickling performance .NET for Apache Spark is open source • Website: https://dot.net/spark • GitHub: https://github.com/dotnet/spark • Frequent releases (about every 6 weeks), current release v0.12.1 • Integrates with .NET Interactive (https://github.com/dotnet/interactive) and nteract/Jupyter Spark project improvement proposals: • Interop support for Spark language extensions: SPARK-26257 • .NET bindings for Apache Spark: SPARK-27006
  • 7. Journey so far ~2k GitHub unique visitors/wk ~8k GitHub page views/wk 260 GitHub issues closed 246 GitHub PRs merged 127k Nuget Downloads 39 GitHub Contributors
  • 9. Customer Success: O365’s MSAI Job: Build ML/deep models on top of substrate data to infuse intelligence into Office 365 products. Our data resides in Azure Data Lake Storage. We cook/featurize that data, which in turn gets fed into our ML models. Why Spark.NET? Since our business logic (e.g., featurizers, tokenizers for normalizing text) is written in C#, Spark.NET is an ideal candidate for our workloads. We leverage Spark.NET to run those libraries at scale. Experience: Very promising; a stable & highly vibrant community that is helping us iterate with the agility we want. Looking forward to a longer working relationship and broader adoption across Substrate Intelligence / MSAI. Microsoft Search, Assistant & Intelligence Team: Towards Modern Workspaces in O365. Scale: ~50 TB
  • 10. .NET provides full-spectrum Spark support Spark DataFrames with SparkSQL Works with Spark v2.3.x/v2.4.x and includes ~300 SparkSQL functions Grouped Map Delta Lake .NET Spark UDFs Batch & streaming Including Spark Structured Streaming and all Spark-supported data sources .NET Standard 2.0 Works with .NET Framework v4.6.1+ and .NET Core v2.1/v3.1 and includes C#/F# support .NET Standard Data Science Including access to ML.NET Interactive Notebook with C# REPL Speed & productivity Performance optimized interop, as fast or faster than pySpark, Support for HW Vectorization https://github.com/dotnet/spark/examples
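The structured-streaming support listed above can be sketched in a few lines of C#. This is a minimal illustration, not from the deck; the socket source, host, and port are hypothetical placeholders, and any Spark-supported source works the same way:

```csharp
using Microsoft.Spark.Sql;
using Microsoft.Spark.Sql.Streaming;

// Requires a Spark runtime launched via the DotnetRunner (see the
// spark-submit slides below); this will not run as a bare console app.
SparkSession spark = SparkSession.Builder().AppName("StreamingSketch").GetOrCreate();

// Hypothetical streaming source: read lines from a local socket.
DataFrame lines = spark.ReadStream()
    .Format("socket")
    .Option("host", "localhost")
    .Option("port", 9999)
    .Load();

// Echo each micro-batch to the console as it arrives.
StreamingQuery query = lines.WriteStream()
    .OutputMode("append")
    .Format("console")
    .Start();

query.AwaitTermination();
```

The same `ReadStream`/`WriteStream` pattern applies to batch jobs via `Read`/`Write`, which is what makes the API feel familiar across both modes.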
  • 11. Introduction to Spark Programming: DataFrame. Example DataFrame (UserId | State | Salary): Terry | WA | XX, Rahul | WA | XX, Dan | WA | YY, Tyson | CA | ZZ, Ankit | WA | YY, Michael | WA | YY
  • 12. .NET for Apache Spark programmability var spark = SparkSession.Builder().GetOrCreate(); var df = spark.Read().Json("input.json"); var concat = Udf<int?, string, string>((age, name) => name + age); df.Filter(df["age"] > 21) .Select(concat(df["age"], df["name"])) .Show();
  • 13. Language comparison: TPC-H Query 2 (Scala) val europe = region.filter($"r_name" === "EUROPE") .join(nation, $"r_regionkey" === nation("n_regionkey")) .join(supplier, $"n_nationkey" === supplier("s_nationkey")) .join(partsupp, supplier("s_suppkey") === partsupp("ps_suppkey")) val brass = part.filter(part("p_size") === 15 && part("p_type").endsWith("BRASS")) .join(europe, europe("ps_partkey") === $"p_partkey") val minCost = brass.groupBy(brass("ps_partkey")) .agg(min("ps_supplycost").as("min")) brass.join(minCost, brass("ps_partkey") === minCost("ps_partkey")) .filter(brass("ps_supplycost") === minCost("min")) .select("s_acctbal", "s_name", "n_name", "p_partkey", "p_mfgr", "s_address", "s_phone", "s_comment") .sort($"s_acctbal".desc, $"n_name", $"s_name", $"p_partkey") .limit(100) .show() (C#) var europe = region.Filter(Col("r_name") == "EUROPE") .Join(nation, Col("r_regionkey") == nation["n_regionkey"]) .Join(supplier, Col("n_nationkey") == supplier["s_nationkey"]) .Join(partsupp, supplier["s_suppkey"] == partsupp["ps_suppkey"]); var brass = part.Filter(part["p_size"] == 15 & part["p_type"].EndsWith("BRASS")) .Join(europe, europe["ps_partkey"] == Col("p_partkey")); var minCost = brass.GroupBy(brass["ps_partkey"]) .Agg(Min("ps_supplycost").As("min")); brass.Join(minCost, brass["ps_partkey"] == minCost["ps_partkey"]) .Filter(brass["ps_supplycost"] == minCost["min"]) .Select("s_acctbal", "s_name", "n_name", "p_partkey", "p_mfgr", "s_address", "s_phone", "s_comment") .Sort(Col("s_acctbal").Desc(), Col("n_name"), Col("s_name"), Col("p_partkey")) .Limit(100) .Show(); Similar syntax – dangerously copy/paste friendly! Watch for: Scala $"col_name" vs. C# Col("col_name"), Scala === vs. C# ==, and method capitalization.
  • 14. Demo 1: Getting started locally
  • 15. Submitting a Spark Application spark-submit ` --class <user-app-main-class> ` --master local ` <path-to-user-jar> <argument(s)-to-your-app> spark-submit (Scala) spark-submit ` --class org.apache.spark.deploy.DotnetRunner ` --master local ` <path-to-microsoft-spark-jar> ` <path-to-your-app-exe> <argument(s)-to-your-app> spark-submit (.NET) Provided by .NET for Apache Spark Library Provided by User & has business logic
  • 16. Demo 2: Locally debugging a .NET for Spark App spark-submit --class org.apache.spark.deploy.DotnetRunner ` --master local <path-to-microsoft-spark-jar> `
  • 17. Debugging User-defined Code https://github.com/dotnet/spark/pull/294 Step 1 Write your app code Step 2 set DOTNET_WORKER_DEBUG=1 Run spark-submit with debug argument Step 3 Switch to app code, add breakpoint at your business logic, F5 Step 4 In the `Choose Just-In-Time Debugger`, choose “New Instance of …”, select your app code CS file Step 5 That’s it! Have fun 
  • 18. Demo 3: Twitter analysis in the Cloud
  • 19. What is happening when you write .NET Spark code? DataFrame SparkSQL .NET for Apache Spark .NET Program Did you define a .NET UDF? Regular execution path (no .NET runtime during execution) Same Speed as with Scala Spark Interop between Spark and .NET Faster than with PySpark No Yes Spark operation tree
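The branch described above can be sketched in C# (the input file and column names are hypothetical, not from the deck). The first query contains no .NET UDF, so the whole plan executes inside the JVM at native Spark speed; the second registers a .NET UDF, so matching rows are shuttled to the CLR worker and back:

```csharp
using Microsoft.Spark.Sql;
using static Microsoft.Spark.Sql.Functions;

// Requires a Spark runtime launched via the DotnetRunner.
SparkSession spark = SparkSession.Builder().GetOrCreate();
DataFrame df = spark.Read().Json("people.json"); // hypothetical input

// No .NET UDF: regular execution path, no .NET runtime involved at run time.
df.Filter(df["age"] > 21).Select(df["name"]).Show();

// With a .NET UDF: rows cross the JVM/CLR interop boundary per the diagram.
Func<Column, Column> shout = Udf<string, string>(name => name.ToUpper());
df.Select(shout(df["name"])).Show();
```

This is why UDF-free queries run at the same speed as Scala Spark: the .NET program only builds the operation tree, and Spark executes it unchanged.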
  • 20. Performance: Worker-side interop. A Spark Worker Node hosts the JVM (Spark Executor) and the CLR (Microsoft.Spark.Worker). Steps: 1. Run a task with a UDF; 2. Launch the worker executable; 3. Serialize UDFs & data; 4. Execute user-defined operations (.NET UDF Library, User Spark Library); 5. Write serialized result rows. Legend: Interop (Scala), Interop (.NET). Challenge: How to serialize data between JVM & CLR? Pickling (row-oriented, the default) vs. Apache Arrow (column-oriented).
  • 21. df.GroupBy("age") .Apply( new StructType(new[] { new StructField("age", new IntegerType()), new StructField("nameCharCount", new IntegerType()) }), batch => CountCharacters(batch, "age", "name")) .Show(); Simplifying experience with Arrow
  • 22. private static FxDataFrame CountCharacters( FxDataFrame df, string groupColName, string summaryColName) { int charCount = 0; for (long i = 0; i < df.RowCount; ++i) { charCount += ((string)df[summaryColName][i]).Length; } return new FxDataFrame(new[] { new PrimitiveColumn<int>(groupColName, new[] { (int?)df[groupColName][0] }), new PrimitiveColumn<int>(summaryColName, new[] { charCount }) }); } private static RecordBatch CountCharacters( RecordBatch records, string groupColName, string summaryColName) { int summaryColIndex = records.Schema.GetFieldIndex(summaryColName); StringArray stringValues = records.Column(summaryColIndex) as StringArray; int charCount = 0; for (int i = 0; i < stringValues.Length; ++i) { charCount += stringValues.GetString(i).Length; } int groupColIndex = records.Schema.GetFieldIndex(groupColName); Field groupCol = records.Schema.GetFieldByIndex(groupColIndex); return new RecordBatch( new Schema.Builder() .Field(groupCol) .Field(f => f.Name(summaryColName).DataType(Int32Type.Default)) .Build(), new IArrowArray[] { records.Column(groupColIndex), new Int32Array.Builder().Append(charCount).Build() }, records.Length); } Previous Experience New Experience Simplifying experience with Arrow
  • 23. Performance – warm cluster runs for Pickling Serialization (Arrow improvements see next slide) Takeaway 1: Where UDF performance does not matter, .NET is on-par with Python Takeaway 2: Where UDF performance is critical, .NET is ~2x faster than Python!
  • 24. Performance – Warm Cluster Runs for C# Pickling vs. Arrow Serialization Takeaway: Since Q1 is interop bound, we see 33% perf improvement with better serialization
  • 25. Performance – Warm Cluster Runs for Arrow Serialization in C# vs. Python Takeaway: Since serialization inefficiencies have been removed, we are left with similar perf across languages – if you like .NET, you can stick with .NET 
  • 26. Works everywhere! Cross platform Cross Cloud Windows Ubuntu Azure & AWS Databricks macOS AWS EMR Spark Azure HDI Spark Installed out of the box Azure Synapse Installation docs on Github
  • 27. • cd mySparkApp dotnet publish -c Release -f netcoreapp3.1 -r ubuntu.16.04-x64 • Zip the folder • Upload ZIP file to your cloud storage Using .NET for Spark in Azure Synapse Batch Submission https://docs.microsoft.com/en-us/azure/synapse-analytics/spark/spark-dotnet
  • 28. • cd mySparkApp dotnet publish -c Release -f netcoreapp3.1 -r ubuntu.16.04-x64 • Zip the folder • Upload ZIP file to your cloud storage Using .NET for Spark in Azure Synapse Batch Submission Language selects semantics of submission fields ZIP file that contains the Spark application, including UDF DLLs, and even the Spark or .NET Runtime if a different version is needed Main Program (Unix) Program Parameters as needed Additional resource and library files that are not included in the ZIP (e.g., shared DLLs, config files) https://docs.microsoft.com/en-us/azure/synapse-analytics/spark/spark-dotnet
  • 29. Using .NET for Spark in Azure Synapse Notebooks with .NET Interactive Language selects Type of notebook Interactive C# Spark context spark is built-in
  • 30. Using .NET for Spark in Azure Synapse Notebooks with .NET Interactive – importing nuget packages
  • 31. Using .NET for Spark in Azure Databricks • Not available out of the box but can be used in batch submission • https://github.com/dotnet/spark/blob/master/deployment/README.md#databricks Note: Traditional Databricks notebooks are proprietary and cannot integrate .NET. Please contact @Databricks if you want to use it out of the box 
  • 32. VSCode extension for Spark .NET • Spark .NET Project creation​ • Dependency packaging​ • Language service • Sample code Author • Reference management • Spark local run​ • Spark cluster run (e.g. HDInsight) Run • DebugFix Extension to VSCode  Tap into VSCode for C# programming  Automate Maven and Spark dependency for environment setup  Facilitate first project success through project template and sample code  Support Spark local run and cluster run  Integrate with Azure for HDInsight clusters navigation  Azure Databricks integration planned
  • 33. ANNOUNCING: .NET for Apache Spark v1.0 is released!  First-class C# and F# bindings to Apache Spark, bringing the power of big data analytics to .NET developers Apache Spark 2.4/3.0 Data Frames, Structured Streaming, Delta Lake .NET Standard 2.0 C# and F# ML.NET Performance optimized with Apache Arrow and HW Vectorization First class integration in Azure Synapse: Batch Submission Interactive .NET notebooks Learn more at http://dot.net/Spark
  • 34. What’s next? More programming experiences in .NET (UDAF, UDT support, multi-language UDFs) Spark data connectors in .NET (e.g., Apache Kafka, Azure Blob Store, Azure Data Lake) Tooling experiences (e.g., Jupyter/nteract, VS Code, Visual Studio, others?) Idiomatic experiences for C# and F# (LINQ, Type Provider) Out-of-Box Experiences (Azure Synapse, Azure HDInsight, Azure Databricks, Cosmos DB Spark, SQL 2019 BDC, …) Go to https://github.com/dotnet/spark and let us know what is important to you!
  • 35. Call to action: Engage, use & guide us! Related session: • Big Data and Data Warehousing Together with Azure Synapse Analytics Useful links: • http://github.com/dotnet/spark • https://www.nuget.org/packages/Microsoft.Spark https://aka.ms/GoDotNetForSpark • https://docs.microsoft.com/dotnet/spark Website: • https://dot.net/spark (Request a Demo!) Starter Videos .NET for Apache Spark 101: • Watch on YouTube • Watch on Channel 9 Available out-of-box on Azure Synapse & Azure HDInsight Spark Running .NET for Spark anywhere— https://aka.ms/InstallDotNetForSpark You & @MikeDoesBigData #DotNetForSpark
  • 36. © Copyright Microsoft Corporation. All rights reserved.

Editor's Notes

  • #10: “Spark.Net team helped enhance the user experience which was a major issue for adoption in Satori”
  • #11: No RDD support.