The Developer Data Scientist – Creating New Analytics Driven Applications using Apache Spark with Azure Databricks
Big Data & Advanced Analytics at a Glance
Sources: Business apps, Custom apps, Sensors and devices
Ingest: Data Factory (data movement, pipelines & orchestration), Event Hub, IoT Hub, Kafka
Store: Blobs, Data Lake
Prep & Train: Databricks, HDInsight, Data Lake Analytics, Machine Learning
Model & Serve: Cosmos DB, SQL Data Warehouse, Analysis Services, SQL Database
Intelligence: Analytical dashboards, Predictive apps, Operational reports
Azure Databricks
A fast, easy and collaborative Apache® Spark™ based analytics platform optimized for Azure
Best of Databricks
• Designed in collaboration with the founders of Apache Spark
• One-click set up; streamlined workflows
• Interactive workspace that enables collaboration between data scientists, data engineers, and business analysts
Best of Microsoft
• Native integration with Azure services (Power BI, SQL DW, Cosmos DB, Blob Storage)
• Enterprise-grade Azure security (Active Directory integration, compliance, enterprise-grade SLAs)
Rapid experimentation
Data visualization
Cross-team collaboration
Easy sharing of insights
• Infrastructure management
• Data exploration and visualization at scale
• Time to value: from model iterations to intelligence
• Integrating with various ML tools to stitch a solution together
• Operationalize ML models to integrate them into applications
Azure Databricks
Optimized Databricks Runtime Engine: Databricks I/O, Serverless, Apache Spark
Collaborative Workspace: Data Engineer, Data Scientist, Business Analyst
Deploy Production Jobs & Workflows: Multi-stage pipelines, Job scheduler, Notification & logs
Data sources: Cloud storage, Data warehouses, Hadoop storage, IoT / streaming data
Outputs: REST APIs, Machine learning models, BI tools, Data exports, Data warehouses
Enhance Productivity
Build on secure & trusted cloud
Scale without limits
Azure Databricks
• Easy to create and manage compute clusters that auto-scale
• Rapid development using the integrated workspace that facilitates cross-team collaboration
• Interactive exploration with notebooks and dashboards
• Seamless integration with ML ecosystem libraries and tools
• Deep Learning support with GPUs (coming soon in next release)
Spark: Spark SQL, Streaming, MLlib, GraphX
ML workflow: Datasource 1, Datasource 2 → Extract features → Feature transform 1, Feature transform 2, Feature transform 3 → Train model 1, Train model 2 → Ensemble → Evaluate
Simple construction, tuning, and testing for ML workflows
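# Without a Pipeline: every stage is chained by hand, and the
# feature transforms have to be written out twice.
model = est2.fit(
    est1.fit(tf2.transform(tf1.transform(data)))
        .transform(tf2.transform(tf1.transform(data)))
)

# With a Pipeline: the same stages, declared once and fitted in order.
from pyspark.ml import Pipeline
model = Pipeline(stages=[tf1, tf2, est1, est2]).fit(data)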
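To make the equivalence of the two forms above concrete, here is a minimal plain-Python sketch of what fitting a pipeline amounts to. It is not Spark's implementation: `AddOne`, `Double`, `CenterEstimator`, `MaxAbsEstimator` and `SimplePipeline` are hypothetical stand-ins invented for this illustration, and the data is a toy list rather than a DataFrame.

```python
class AddOne:
    """Hypothetical transformer: adds 1 to every value."""
    def transform(self, data):
        return [x + 1 for x in data]


class Double:
    """Hypothetical transformer: doubles every value."""
    def transform(self, data):
        return [x * 2 for x in data]


class CenterModel:
    """Fitted model that subtracts a learned mean."""
    def __init__(self, mean):
        self.mean = mean

    def transform(self, data):
        return [x - self.mean for x in data]


class CenterEstimator:
    """Hypothetical estimator: learns the mean of its input."""
    def fit(self, data):
        return CenterModel(sum(data) / len(data))


class MaxAbsModel:
    """Fitted model that divides by a learned scale."""
    def __init__(self, scale):
        self.scale = scale

    def transform(self, data):
        return [x / self.scale for x in data]


class MaxAbsEstimator:
    """Hypothetical estimator: learns the largest absolute value."""
    def fit(self, data):
        return MaxAbsModel(max(abs(x) for x in data))


class SimplePipelineModel:
    """Fitted pipeline: applies every fitted stage in order."""
    def __init__(self, stages):
        self.stages = stages

    def transform(self, data):
        for stage in self.stages:
            data = stage.transform(data)
        return data


class SimplePipeline:
    """Fits estimators on the output of all earlier stages."""
    def __init__(self, stages):
        self.stages = stages

    def fit(self, data):
        fitted, current = [], data
        for stage in self.stages:
            # Estimators expose fit(); plain transformers are used as-is.
            model = stage.fit(current) if hasattr(stage, "fit") else stage
            current = model.transform(current)
            fitted.append(model)
        return SimplePipelineModel(fitted)


data = [1.0, 2.0, 3.0]
tf1, tf2, est1, est2 = AddOne(), Double(), CenterEstimator(), MaxAbsEstimator()

# Hand-chained version, mirroring the nested calls on the slide.
features = tf2.transform(tf1.transform(data))
model1 = est1.fit(features)
model2 = est2.fit(model1.transform(features))
manual_out = model2.transform(model1.transform(features))

# Pipeline version: one declaration, one fit.
pipeline_out = SimplePipeline([tf1, tf2, est1, est2]).fit(data).transform(data)
print(manual_out, pipeline_out)
```

Both paths produce the same result; the pipeline form simply removes the need to repeat the feature transforms by hand.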
Cross Validation
Feature Extraction → Model Training
regularization parameter: {0.0, 0.1, ...}
Cross Validation
Feature Extraction → Model #1 Training, Model #2 Training, Model #3 Training, ... → Best Model
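The idea behind the cross-validation slides can be sketched in plain Python: train one model per candidate regularization value, score each on held-out folds, and keep the best. This is only an illustration under stated assumptions, not how Spark's cross validation is implemented. The one-feature ridge model, the toy data, the fold assignment and the grid value 1.0 are all hypothetical choices made for the example.

```python
def fit_ridge(xs, ys, reg):
    """One-feature ridge regression without intercept: w = sum(x*y) / (sum(x*x) + reg)."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + reg)


def mean_squared_error(weight, xs, ys):
    return sum((weight * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def cross_validate(xs, ys, reg_values, num_folds=3):
    """Return (best_reg, model refit on all data, average error per reg value)."""
    scores = {}
    for reg in reg_values:
        fold_errors = []
        for fold in range(num_folds):
            # Row i belongs to fold i % num_folds; that fold is held out.
            train = [i for i in range(len(xs)) if i % num_folds != fold]
            held_out = [i for i in range(len(xs)) if i % num_folds == fold]
            weight = fit_ridge([xs[i] for i in train], [ys[i] for i in train], reg)
            fold_errors.append(
                mean_squared_error(weight, [xs[i] for i in held_out], [ys[i] for i in held_out])
            )
        scores[reg] = sum(fold_errors) / num_folds
    best_reg = min(scores, key=scores.get)
    # Refit on the full data set with the winning parameter.
    return best_reg, fit_ridge(xs, ys, best_reg), scores


# Toy, noise-free data with y = 2x, so no regularization should win.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.0 * x for x in xs]
best_reg, best_weight, scores = cross_validate(xs, ys, reg_values=[0.0, 0.1, 1.0])
print(best_reg, best_weight)
```

On real, noisy data a non-zero regularization value may well score better; the point of the sketch is only the train, evaluate, select loop.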
Advanced Analytics: Pipeline
Data Science / Software Engineering
Prototype (Python/R)
Create model
Re-implement model for production (Java)
Deploy model

Data Science / Software Engineering
Prototype (Python/R)
Create Pipeline
• Extract raw features
• Transform features
• Select key features
• Fit multiple models
• Combine results to make prediction
• Extra implementation work
• Different code paths
• Synchronization overhead
Re-implement Pipeline for production (Java)
Deploy Pipeline

Data Science / Software Engineering
Prototype (Python/R)
Create Pipeline
Persist model or Pipeline:
model.save("path://...")
Load Pipeline (Scala/Java)
Model.load("path://...")
Deploy in production
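The persist-then-load handoff can be illustrated with a small plain-Python sketch. It only mimics the shape of the workflow: `ThresholdModel`, its JSON file format, the threshold 0.5 and the feature value 0.9 are hypothetical, and Spark's own `save`/`load` write a different on-disk layout.

```python
import json
import os
import tempfile


class ThresholdModel:
    """Hypothetical fitted model: predicts 1.0 when the feature exceeds a threshold."""

    def __init__(self, threshold):
        self.threshold = threshold

    def predict(self, feature):
        return 1.0 if feature > self.threshold else 0.0

    def save(self, path):
        # Persist only the learned parameters, not the code.
        with open(path, "w") as handle:
            json.dump({"threshold": self.threshold}, handle)

    @classmethod
    def load(cls, path):
        with open(path) as handle:
            return cls(json.load(handle)["threshold"])


with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.json")

    # Data science side: persist the fitted model.
    ThresholdModel(threshold=0.5).save(path)

    # Engineering side: load the same model instead of re-implementing it.
    loaded = ThresholdModel.load(path)

# Score one record; the id is the one shown on the output slide.
output = {"id": 5923937, "prediction": loaded.predict(0.9)}
print(json.dumps(output))
```

Because the production side loads exactly what the prototype saved, there is a single code path and nothing to keep in sync.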
Output
{
  "id": 5923937,
  "prediction": 1.0
}
Classification
• Logistic regression w/ elastic net
• Naive Bayes
• Streaming logistic regression
• Linear SVMs
• Decision trees
• Random forests
• Gradient-boosted trees
• Multilayer perceptron
• One-vs-rest
Regression
• Least squares w/ elastic net
• Isotonic regression
• Decision trees
• Random forests
• Gradient-boosted trees
• Streaming linear methods
Recommendation
• Alternating Least Squares
Frequent itemsets
• FP-growth
• Prefix span
Clustering
• Gaussian mixture models
• K-Means
• Streaming K-Means
• Latent Dirichlet Allocation
• Power Iteration Clustering
Statistics
• Pearson correlation
• Spearman correlation
• Online summarization
• Chi-squared test
• Kernel density estimation
Linear algebra
• Local dense & sparse vectors & matrices
• Distributed matrices
• Block-partitioned matrix
• Row matrix
• Indexed row matrix
• Coordinate matrix
• Matrix decompositions
Model import/export
Pipelines
Feature extraction & selection
• Binarizer
• Bucketizer
• Chi-Squared selection
• CountVectorizer
• Discrete cosine transform
• ElementwiseProduct
• Hashing term frequency
• Inverse document frequency
• MinMaxScaler
• Ngram
• Normalizer
• One-Hot Encoder
• PCA
• PolynomialExpansion
• RFormula
• SQLTransformer
• Standard scaler
• StopWordsRemover
• StringIndexer
• Tokenizer
• VectorAssembler
• VectorIndexer
• VectorSlicer
• Word2Vec
And more…
• Classification
• Regression
• Recommendation
• Clustering
• Frequent itemsets
• Model import/export
• Pipelines
• DataFrames
• Cross validation
• Feature extraction & selection
• Statistics
• Linear algebra
• Use Azure Databricks for scaling out ML tasks
• Leverage well-known model architectures
• MLlib Pipeline API simplifies ML workflows
• Leverage pre-trained models for common tasks
DeepImageFeaturizer.transform
10 minutes
6 hours
Graph of airports: JFK, IAD, LAX, SFO, SEA, DFW
edge
src dest delay tripid
SFO SEA 45 1058923
vertex (node)
id city state
SEA Seattle WA
Graph of airports: JFK, IAD, LAX, SFO, SEA, DFW
edges DataFrame
src dest delay tripid
SFO SEA 45 1058923
LAX JFK 52 4100224
vertices DataFrame
id city state
SEA Seattle WA
SFO San Francisco CA
JFK New York NY
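As a rough illustration of the two-table representation, the sketch below holds vertices and edges as plain Python lists of rows and joins them by airport id. It uses only the rows visible on the slide (with `tripid` left out for brevity); the helper `describe_flights` is a hypothetical name, and real GraphFrames code would operate on Spark DataFrames instead.

```python
# Vertex table: one row per airport (the three rows shown on the slide).
vertices = [
    {"id": "SEA", "city": "Seattle", "state": "WA"},
    {"id": "SFO", "city": "San Francisco", "state": "CA"},
    {"id": "JFK", "city": "New York", "state": "NY"},
]

# Edge table: one row per flight; src and dest refer to vertex ids.
edges = [
    {"src": "SFO", "dest": "SEA", "delay": 45},
    {"src": "LAX", "dest": "JFK", "delay": 52},
]

city_by_id = {vertex["id"]: vertex["city"] for vertex in vertices}


def describe_flights(edge_rows, cities):
    """Join each edge to both endpoint vertices; skip edges with an unknown airport."""
    described = []
    for edge in edge_rows:
        if edge["src"] in cities and edge["dest"] in cities:
            described.append((cities[edge["src"]], cities[edge["dest"]], edge["delay"]))
    return described


flights = describe_flights(edges, city_by_id)

# Out-degree per airport: number of departing flights in the edge table.
out_degree = {}
for edge in edges:
    out_degree[edge["src"]] = out_degree.get(edge["src"], 0) + 1

print(flights, out_degree)
```

The LAX flight drops out of the join only because the slide does not list a vertex row for LAX; with a complete vertex table it would be resolved as well.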
Graph of airports: JFK, IAD, LAX, SFO, SEA, DFW, with a motif over vertices (a), (b), (c)
Search for structural patterns within a graph.
val paths: DataFrame =
  g.find("(a)-[e1]->(b); (b)-[e2]->(c); !(c)-[]->(a)")
Then filter using vertex & edge data.
paths.filter("e1.delay > 20")
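What the motif query above asks for can be sketched in plain Python over a small edge list: find every pair of consecutive edges a→b→c for which no edge goes back from c to a, then keep the matches whose first edge has a delay above 20. This is an illustrative brute-force search, not the GraphFrames algorithm. Only the SFO→SEA and LAX→JFK rows come from the slides; the other edges and the helper `find_two_hop_without_return` are hypothetical.

```python
edges = [
    {"src": "SFO", "dest": "SEA", "delay": 45},  # from the slide
    {"src": "LAX", "dest": "JFK", "delay": 52},  # from the slide
    {"src": "SEA", "dest": "JFK", "delay": 10},  # hypothetical
    {"src": "JFK", "dest": "IAD", "delay": 5},   # hypothetical
    {"src": "IAD", "dest": "LAX", "delay": 30},  # hypothetical
]


def find_two_hop_without_return(edge_rows):
    """Match (a)-[e1]->(b); (b)-[e2]->(c); !(c)-[]->(a) by brute force."""
    existing = {(edge["src"], edge["dest"]) for edge in edge_rows}
    matches = []
    for e1 in edge_rows:
        for e2 in edge_rows:
            if e2["src"] != e1["dest"]:
                continue  # e2 must start where e1 ends
            a, b, c = e1["src"], e1["dest"], e2["dest"]
            if (c, a) in existing:
                continue  # negated term: no edge from c back to a
            matches.append({"a": a, "b": b, "c": c, "e1": e1, "e2": e2})
    return matches


paths = find_two_hop_without_return(edges)

# Equivalent of paths.filter("e1.delay > 20").
delayed = [path for path in paths if path["e1"]["delay"] > 20]
delayed_routes = [(path["a"], path["b"], path["c"]) for path in delayed]
print(delayed_routes)
```

The nested loop is quadratic in the number of edges, which is fine for a toy list; a distributed engine expresses the same pattern as joins over the edge table.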
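Save & load the DataFrames.
from graphframes import GraphFrame

# Load: read the vertex and edge DataFrames, then rebuild the graph.
vertices = sqlContext.read.parquet(...)
edges = sqlContext.read.parquet(...)
g = GraphFrame(vertices, edges)

# Save: write the two DataFrames back out separately.
g.vertices.write.parquet(...)
g.edges.write.parquet(...)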
Ingestion in data pipelines with Managed Kafka Clusters in Azure HDInsight
Microsoft Tech Community
 
Getting Started with Visual Studio Tools for AI
Getting Started with Visual Studio Tools for AIGetting Started with Visual Studio Tools for AI
Getting Started with Visual Studio Tools for AI
Microsoft Tech Community
 
Mobile Workforce Location Tracking with Bing Maps
Mobile Workforce Location Tracking with Bing MapsMobile Workforce Location Tracking with Bing Maps
Mobile Workforce Location Tracking with Bing Maps
Microsoft Tech Community
 
Cognitive Services Labs in action Anomaly detection
Cognitive Services Labs in action Anomaly detectionCognitive Services Labs in action Anomaly detection
Cognitive Services Labs in action Anomaly detection
Microsoft Tech Community
 
Ad

Recently uploaded (20)

Artificial Intelligence in the Nonprofit Boardroom.pdf
Artificial Intelligence in the Nonprofit Boardroom.pdfArtificial Intelligence in the Nonprofit Boardroom.pdf
Artificial Intelligence in the Nonprofit Boardroom.pdf
OnBoard
 
Viral>Wondershare Filmora 14.5.18.12900 Crack Free Download
Viral>Wondershare Filmora 14.5.18.12900 Crack Free DownloadViral>Wondershare Filmora 14.5.18.12900 Crack Free Download
Viral>Wondershare Filmora 14.5.18.12900 Crack Free Download
Puppy jhon
 
Creating an Accessible Future-How AI-powered Accessibility Testing is Shaping...
Creating an Accessible Future-How AI-powered Accessibility Testing is Shaping...Creating an Accessible Future-How AI-powered Accessibility Testing is Shaping...
Creating an Accessible Future-How AI-powered Accessibility Testing is Shaping...
Impelsys Inc.
 
Developing Schemas with FME and Excel - Peak of Data & AI 2025
Developing Schemas with FME and Excel - Peak of Data & AI 2025Developing Schemas with FME and Excel - Peak of Data & AI 2025
Developing Schemas with FME and Excel - Peak of Data & AI 2025
Safe Software
 
“Solving Tomorrow’s AI Problems Today with Cadence’s Newest Processor,” a Pre...
“Solving Tomorrow’s AI Problems Today with Cadence’s Newest Processor,” a Pre...“Solving Tomorrow’s AI Problems Today with Cadence’s Newest Processor,” a Pre...
“Solving Tomorrow’s AI Problems Today with Cadence’s Newest Processor,” a Pre...
Edge AI and Vision Alliance
 
If You Use Databricks, You Definitely Need FME
If You Use Databricks, You Definitely Need FMEIf You Use Databricks, You Definitely Need FME
If You Use Databricks, You Definitely Need FME
Safe Software
 
No-Code Workflows for CAD & 3D Data: Scaling AI-Driven Infrastructure
No-Code Workflows for CAD & 3D Data: Scaling AI-Driven InfrastructureNo-Code Workflows for CAD & 3D Data: Scaling AI-Driven Infrastructure
No-Code Workflows for CAD & 3D Data: Scaling AI-Driven Infrastructure
Safe Software
 
How Advanced Environmental Detection Is Revolutionizing Oil & Gas Safety.pdf
How Advanced Environmental Detection Is Revolutionizing Oil & Gas Safety.pdfHow Advanced Environmental Detection Is Revolutionizing Oil & Gas Safety.pdf
How Advanced Environmental Detection Is Revolutionizing Oil & Gas Safety.pdf
Rejig Digital
 
Secure Access with Azure Active Directory
Secure Access with Azure Active DirectorySecure Access with Azure Active Directory
Secure Access with Azure Active Directory
VICTOR MAESTRE RAMIREZ
 
Oracle Cloud and AI Specialization Program
Oracle Cloud and AI Specialization ProgramOracle Cloud and AI Specialization Program
Oracle Cloud and AI Specialization Program
VICTOR MAESTRE RAMIREZ
 
Cisco ISE Performance, Scalability and Best Practices.pdf
Cisco ISE Performance, Scalability and Best Practices.pdfCisco ISE Performance, Scalability and Best Practices.pdf
Cisco ISE Performance, Scalability and Best Practices.pdf
superdpz
 
vertical-cnc-processing-centers-drillteq-v-200-en.pdf
vertical-cnc-processing-centers-drillteq-v-200-en.pdfvertical-cnc-processing-centers-drillteq-v-200-en.pdf
vertical-cnc-processing-centers-drillteq-v-200-en.pdf
AmirStern2
 
Kubernetes Security Act Now Before It’s Too Late
Kubernetes Security Act Now Before It’s Too LateKubernetes Security Act Now Before It’s Too Late
Kubernetes Security Act Now Before It’s Too Late
Michael Furman
 
“State-space Models vs. Transformers for Ultra-low-power Edge AI,” a Presenta...
“State-space Models vs. Transformers for Ultra-low-power Edge AI,” a Presenta...“State-space Models vs. Transformers for Ultra-low-power Edge AI,” a Presenta...
“State-space Models vs. Transformers for Ultra-low-power Edge AI,” a Presenta...
Edge AI and Vision Alliance
 
How to Detect Outliers in IBM SPSS Statistics.pptx
How to Detect Outliers in IBM SPSS Statistics.pptxHow to Detect Outliers in IBM SPSS Statistics.pptx
How to Detect Outliers in IBM SPSS Statistics.pptx
Version 1 Analytics
 
Establish Visibility and Manage Risk in the Supply Chain with Anchore SBOM
Establish Visibility and Manage Risk in the Supply Chain with Anchore SBOMEstablish Visibility and Manage Risk in the Supply Chain with Anchore SBOM
Establish Visibility and Manage Risk in the Supply Chain with Anchore SBOM
Anchore
 
Domino IQ – What to Expect, First Steps and Use Cases
Domino IQ – What to Expect, First Steps and Use CasesDomino IQ – What to Expect, First Steps and Use Cases
Domino IQ – What to Expect, First Steps and Use Cases
panagenda
 
Your startup on AWS - How to architect and maintain a Lean and Mean account
Your startup on AWS - How to architect and maintain a Lean and Mean accountYour startup on AWS - How to architect and maintain a Lean and Mean account
Your startup on AWS - How to architect and maintain a Lean and Mean account
angelo60207
 
Introduction to Typescript - GDG On Campus EUE
Introduction to Typescript - GDG On Campus EUEIntroduction to Typescript - GDG On Campus EUE
Introduction to Typescript - GDG On Campus EUE
Google Developer Group On Campus European Universities in Egypt
 
Edge-banding-machines-edgeteq-s-200-en-.pdf
Edge-banding-machines-edgeteq-s-200-en-.pdfEdge-banding-machines-edgeteq-s-200-en-.pdf
Edge-banding-machines-edgeteq-s-200-en-.pdf
AmirStern2
 
Artificial Intelligence in the Nonprofit Boardroom.pdf
Artificial Intelligence in the Nonprofit Boardroom.pdfArtificial Intelligence in the Nonprofit Boardroom.pdf
Artificial Intelligence in the Nonprofit Boardroom.pdf
OnBoard
 
Viral>Wondershare Filmora 14.5.18.12900 Crack Free Download
Viral>Wondershare Filmora 14.5.18.12900 Crack Free DownloadViral>Wondershare Filmora 14.5.18.12900 Crack Free Download
Viral>Wondershare Filmora 14.5.18.12900 Crack Free Download
Puppy jhon
 
Creating an Accessible Future-How AI-powered Accessibility Testing is Shaping...
Creating an Accessible Future-How AI-powered Accessibility Testing is Shaping...Creating an Accessible Future-How AI-powered Accessibility Testing is Shaping...
Creating an Accessible Future-How AI-powered Accessibility Testing is Shaping...
Impelsys Inc.
 
Developing Schemas with FME and Excel - Peak of Data & AI 2025
Developing Schemas with FME and Excel - Peak of Data & AI 2025Developing Schemas with FME and Excel - Peak of Data & AI 2025
Developing Schemas with FME and Excel - Peak of Data & AI 2025
Safe Software
 
“Solving Tomorrow’s AI Problems Today with Cadence’s Newest Processor,” a Pre...
“Solving Tomorrow’s AI Problems Today with Cadence’s Newest Processor,” a Pre...“Solving Tomorrow’s AI Problems Today with Cadence’s Newest Processor,” a Pre...
“Solving Tomorrow’s AI Problems Today with Cadence’s Newest Processor,” a Pre...
Edge AI and Vision Alliance
 
If You Use Databricks, You Definitely Need FME
If You Use Databricks, You Definitely Need FMEIf You Use Databricks, You Definitely Need FME
If You Use Databricks, You Definitely Need FME
Safe Software
 
No-Code Workflows for CAD & 3D Data: Scaling AI-Driven Infrastructure
No-Code Workflows for CAD & 3D Data: Scaling AI-Driven InfrastructureNo-Code Workflows for CAD & 3D Data: Scaling AI-Driven Infrastructure
No-Code Workflows for CAD & 3D Data: Scaling AI-Driven Infrastructure
Safe Software
 
How Advanced Environmental Detection Is Revolutionizing Oil & Gas Safety.pdf
How Advanced Environmental Detection Is Revolutionizing Oil & Gas Safety.pdfHow Advanced Environmental Detection Is Revolutionizing Oil & Gas Safety.pdf
How Advanced Environmental Detection Is Revolutionizing Oil & Gas Safety.pdf
Rejig Digital
 
Secure Access with Azure Active Directory
Secure Access with Azure Active DirectorySecure Access with Azure Active Directory
Secure Access with Azure Active Directory
VICTOR MAESTRE RAMIREZ
 
Oracle Cloud and AI Specialization Program
Oracle Cloud and AI Specialization ProgramOracle Cloud and AI Specialization Program
Oracle Cloud and AI Specialization Program
VICTOR MAESTRE RAMIREZ
 
Cisco ISE Performance, Scalability and Best Practices.pdf
Cisco ISE Performance, Scalability and Best Practices.pdfCisco ISE Performance, Scalability and Best Practices.pdf
Cisco ISE Performance, Scalability and Best Practices.pdf
superdpz
 
vertical-cnc-processing-centers-drillteq-v-200-en.pdf
vertical-cnc-processing-centers-drillteq-v-200-en.pdfvertical-cnc-processing-centers-drillteq-v-200-en.pdf
vertical-cnc-processing-centers-drillteq-v-200-en.pdf
AmirStern2
 
Kubernetes Security Act Now Before It’s Too Late
Kubernetes Security Act Now Before It’s Too LateKubernetes Security Act Now Before It’s Too Late
Kubernetes Security Act Now Before It’s Too Late
Michael Furman
 
“State-space Models vs. Transformers for Ultra-low-power Edge AI,” a Presenta...
“State-space Models vs. Transformers for Ultra-low-power Edge AI,” a Presenta...“State-space Models vs. Transformers for Ultra-low-power Edge AI,” a Presenta...
“State-space Models vs. Transformers for Ultra-low-power Edge AI,” a Presenta...
Edge AI and Vision Alliance
 
How to Detect Outliers in IBM SPSS Statistics.pptx
How to Detect Outliers in IBM SPSS Statistics.pptxHow to Detect Outliers in IBM SPSS Statistics.pptx
How to Detect Outliers in IBM SPSS Statistics.pptx
Version 1 Analytics
 
Establish Visibility and Manage Risk in the Supply Chain with Anchore SBOM
Establish Visibility and Manage Risk in the Supply Chain with Anchore SBOMEstablish Visibility and Manage Risk in the Supply Chain with Anchore SBOM
Establish Visibility and Manage Risk in the Supply Chain with Anchore SBOM
Anchore
 
Domino IQ – What to Expect, First Steps and Use Cases
Domino IQ – What to Expect, First Steps and Use CasesDomino IQ – What to Expect, First Steps and Use Cases
Domino IQ – What to Expect, First Steps and Use Cases
panagenda
 
Your startup on AWS - How to architect and maintain a Lean and Mean account
Your startup on AWS - How to architect and maintain a Lean and Mean accountYour startup on AWS - How to architect and maintain a Lean and Mean account
Your startup on AWS - How to architect and maintain a Lean and Mean account
angelo60207
 
Edge-banding-machines-edgeteq-s-200-en-.pdf
Edge-banding-machines-edgeteq-s-200-en-.pdfEdge-banding-machines-edgeteq-s-200-en-.pdf
Edge-banding-machines-edgeteq-s-200-en-.pdf
AmirStern2
 

The Developer Data Scientist – Creating New Analytics Driven Applications using Apache Spark with Azure Databricks

  • 3. Big Data & Advanced Analytics at a Glance: Model & Serve; Prep & Train; Databricks; HDInsight; Data Lake Analytics; Custom apps; Sensors and devices; Store; Blobs; Data Lake; Ingest; Data Factory (data movement, pipelines & orchestration); Machine Learning; Cosmos DB; SQL Data Warehouse; Analysis Services; Event Hub; IoT Hub; SQL Database; Analytical dashboards; Predictive apps; Operational reports; Intelligence; Business apps; SQL; Kafka
  • 4. A fast, easy and collaborative Apache® Spark™ based analytics platform optimized for Azure. Best of Databricks, best of Microsoft: designed in collaboration with the founders of Apache Spark; one-click setup and streamlined workflows; an interactive workspace that enables collaboration between data scientists, data engineers, and business analysts; native integration with Azure services (Power BI, SQL DW, Cosmos DB, Blob Storage); enterprise-grade Azure security (Active Directory integration, compliance, enterprise-grade SLAs).
  • 5. Azure Databricks (Microsoft Azure)
  • 9. Infrastructure management; Data exploration and visualization at scale; Time to value: from model iterations to intelligence; Integrating with various ML tools to stitch a solution together; Operationalizing ML models to integrate them into applications
  • 10. Azure Databricks: Optimized Databricks Runtime Engine; Databricks I/O; Serverless; Collaborative Workspace; Cloud storage; Data warehouses; Hadoop storage; IoT / streaming data; REST APIs; Machine learning models; BI tools; Data exports; Data warehouses; Enhance Productivity; Deploy Production Jobs & Workflows; Apache Spark; Multi-stage pipelines; Data engineer; Job scheduler; Notification & logs; Data scientist; Business analyst; Build on secure & trusted cloud; Scale without limits
  • 11. Easy to create and manage compute clusters that auto-scale; Rapid development using the integrated workspace that facilitates cross-team collaboration; Interactive exploration with notebooks and dashboards; Seamless integration with ML ecosystem libraries and tools; Deep Learning support with GPUs (coming soon in next release)
  • 22. Datasource 1; Datasource 2; Datasource 2; Extract features; Extract features; Feature transform 1; Feature transform 2; Feature transform 3; Train model 1; Train model 2; Ensemble; Evaluate
  • 23. Simple construction, tuning, and testing for ML workflows
  • 27. Chained calls: model = est2.fit(est1.fit(tf2.transform(tf1.transform(data))).transform(tf2.transform(tf1.transform(data)))) versus the equivalent Pipeline: model = Pipeline(stages=[tf1, tf2, est1, est2]).fit(data)
  • 29. Cross Validation ... Best Model; Model #1 Training; Model #2 Training; Feature Extraction; Model #3 Training
  • 33. Data Science: Prototype (Python/R); Create model. Software Engineering: Re-implement model for production (Java); Deploy model.
  • 34. Data Science: Prototype (Python/R); Create Pipeline (extract raw features, transform features, select key features, fit multiple models, combine results to make prediction). Software Engineering: Re-implement Pipeline for production (Java); Deploy Pipeline. Costs: extra implementation work, different code paths, synchronization overhead.
  • 35. Data Science: Prototype (Python/R); Create Pipeline; Persist model or Pipeline: model.save("path://..."). Software Engineering: Load Pipeline (Scala/Java): Model.load("path://…"); Deploy in production.
  • 41. Classification: Logistic regression w/ elastic net; Naive Bayes; Streaming logistic regression; Linear SVMs; Decision trees; Random forests; Gradient-boosted trees; Multilayer perceptron; One-vs-rest.
       Regression: Least squares w/ elastic net; Isotonic regression; Decision trees; Random forests; Gradient-boosted trees; Streaming linear methods.
       Recommendation: Alternating Least Squares.
       Frequent itemsets: FP-growth; Prefix span.
       Clustering: Gaussian mixture models; K-Means; Streaming K-Means; Latent Dirichlet Allocation; Power Iteration Clustering.
       Statistics: Pearson correlation; Spearman correlation; Online summarization; Chi-squared test; Kernel density estimation.
       Linear algebra: Local dense & sparse vectors & matrices; Distributed matrices; Block-partitioned matrix; Row matrix; Indexed row matrix; Coordinate matrix; Matrix decompositions.
       Model import/export. Pipelines.
       Feature extraction & selection: Binarizer; Bucketizer; Chi-Squared selection; CountVectorizer; Discrete cosine transform; ElementwiseProduct; Hashing term frequency; Inverse document frequency; MinMaxScaler; Ngram; Normalizer; One-Hot Encoder; PCA; PolynomialExpansion; RFormula; SQLTransformer; Standard scaler; StopWordsRemover; StringIndexer; Tokenizer; VectorAssembler; VectorIndexer; VectorSlicer; Word2Vec; and more.
  • 42. Classification; Regression; Recommendation; Clustering; Frequent itemsets; Model import/export; Pipelines; DataFrames; Cross validation; Feature extraction & selection; Statistics; Linear algebra
  • 48. Use Azure Databricks for scaling out ML tasks; Leverage well-known model architectures; The MLlib Pipeline API simplifies ML workflows; Leverage pre-trained models for common tasks
  • 54. Airports: JFK, IAD, LAX, SFO, SEA, DFW. Edge (src, dest, delay, tripid): (SFO, SEA, 45, 105892 3). Vertex (node) (id, city, state): (SEA, Seattle, WA).
  • 58. Airports: JFK, IAD, LAX, SFO, SEA, DFW. edges DataFrame (src, dest, delay, tripid): (SFO, SEA, 45, 105892 3); (LAX, JFK, 52, 410022 4). vertices DataFrame (id, city, state): (SEA, Seattle, WA); (SFO, San Francisco, CA); (JFK, New York, NY).
  • 60. es)
  • 61. Airports: JFK, IAD, LAX, SFO, SEA, DFW, with vertices labelled (a), (b), (c). Search for structural patterns within a graph: val paths: DataFrame = g.find("(a)-[e1]->(b); (b)-[e2]->(c); !(c)-[]->(a)"). Then filter using vertex & edge data: paths.filter("e1.delay > 20")
  • 63. Save & load the DataFrames. vertices = sqlContext.read.parquet(...) edges = sqlContext.read.parquet(...) g = GraphFrame(vertices, edges) g.vertices.write.parquet(...) g.edges.write.parquet(...)
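
The motif on slide 61 can be read as "find two-hop routes a to b to c where there is no direct edge from c back to a, then keep the ones whose first leg is delayed by more than 20". The sketch below spells out that reading in plain Python so the semantics are visible without a Spark cluster. It is not the GraphFrames API: the function name `find_open_two_hop_paths` is invented for illustration, and only the first two edges come from the slides; the other two are hypothetical, added so that the negated term has something to exclude.

```python
# Plain-Python illustration of the motif "(a)-[e1]->(b); (b)-[e2]->(c); !(c)-[]->(a)".
# This is NOT GraphFrames code; it only mirrors what the pattern selects.

edges = [
    {"src": "SFO", "dest": "SEA", "delay": 45},  # from the slides
    {"src": "LAX", "dest": "JFK", "delay": 52},  # from the slides
    {"src": "SEA", "dest": "JFK", "delay": 10},  # hypothetical
    {"src": "JFK", "dest": "SFO", "delay": 30},  # hypothetical
]


def find_open_two_hop_paths(edge_list):
    """Return every a -> b -> c path that has no direct edge from c back to a."""
    connected = {(e["src"], e["dest"]) for e in edge_list}
    paths = []
    for e1 in edge_list:
        for e2 in edge_list:
            # e2 must start where e1 ends, so that both edges share vertex b.
            if e1["dest"] != e2["src"]:
                continue
            a, b, c = e1["src"], e1["dest"], e2["dest"]
            # Negated term of the motif: skip paths closed by an edge c -> a.
            if (c, a) in connected:
                continue
            paths.append({"a": a, "b": b, "c": c, "e1": e1, "e2": e2})
    return paths


paths = find_open_two_hop_paths(edges)

# Equivalent of paths.filter("e1.delay > 20"): keep paths whose first leg is delayed.
delayed = [p for p in paths if p["e1"]["delay"] > 20]
routes = [(p["a"], p["b"], p["c"]) for p in delayed]
print(routes)
```

As in the slide's pattern, the sketch does not require a, b and c to be distinct vertices; a real query may want to add that condition as an extra filter.
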

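Slide 27 contrasts hand-chained fit and transform calls with a single Pipeline. A rough way to see why the two agree is to write the fit/transform contract out in plain Python. Everything in the sketch below is a toy stand-in: `Scale`, `Shift`, `MeanCenter` and `MaxAbsScale` are invented stages, and this `Pipeline` class only imitates the behaviour the slide describes; it is not Spark's implementation.

```python
# Toy stages in plain Python; none of these classes are Spark APIs.
# Transformers expose transform(); estimators expose fit() and return a fitted transformer.

class Scale:
    """Transformer: multiply every value by a fixed factor."""
    def __init__(self, factor):
        self.factor = factor

    def transform(self, rows):
        return [x * self.factor for x in rows]


class Shift:
    """Transformer: add a fixed offset to every value."""
    def __init__(self, offset):
        self.offset = offset

    def transform(self, rows):
        return [x + self.offset for x in rows]


class MeanCenter:
    """Estimator: learn the mean, return a Shift that subtracts it."""
    def fit(self, rows):
        return Shift(-sum(rows) / len(rows))


class MaxAbsScale:
    """Estimator: learn the largest magnitude, return a Scale that divides by it."""
    def fit(self, rows):
        return Scale(1.0 / max(abs(x) for x in rows))


class PipelineModel:
    """Fitted pipeline: applies each fitted stage in order."""
    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


class Pipeline:
    """Fits estimators in order, feeding each stage the output of the previous ones."""
    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            model = stage.fit(rows) if hasattr(stage, "fit") else stage
            fitted.append(model)
            rows = model.transform(rows)
        return PipelineModel(fitted)


data = [1.0, 2.0, 3.0]
tf1, tf2 = Scale(2.0), Shift(1.0)
est1, est2 = MeanCenter(), MaxAbsScale()

# Hand-chained version, following the nesting on the slide.
features = tf2.transform(tf1.transform(data))
model1 = est1.fit(features)
model2 = est2.fit(model1.transform(features))
manual = model2.transform(model1.transform(features))

# Pipeline version: one fit call, one transform call.
pipeline_model = Pipeline(stages=[tf1, tf2, est1, est2]).fit(data)
piped = pipeline_model.transform(data)
print(manual, piped)
```

The hand-chained version has to repeat the feature transforms and keep the intermediate models in step by hand, which is the bookkeeping the Pipeline takes over.
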
Editor's Notes

  • #18: Contributions estimated from GitHub commit logs, with some effort to de-duplicate entities.
  • #24: No time to mention: user-defined functions (UDFs); optimizations such as code generation and predicate pushdown.
  • #29: Model training / tuning. Regularization: a parameter that controls how well the linear model does on unseen data. There is no single good value for the regularization parameter. One common method to find one is to try out different values. This technique is called cross-validation (CV): you split your training data into two sets, one used to learn the model parameters with a given regularization parameter, and another used to evaluate how well the model does with that parameter.
  • #36: Note this is loading into Spark.
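
Note #29 describes choosing a regularization parameter by holding out part of the training data and scoring each candidate value on it. A minimal sketch of that idea, assuming a one-feature ridge regression without an intercept and a 3-fold split rather than a single hold-out set, might look like this. The data points, the candidate values and the helper names are all made up for illustration; this is plain Python, not Spark's cross-validation tooling.

```python
# Plain-Python sketch of picking a regularization parameter by cross-validation.
# Data, candidate values and function names are hypothetical.

def fit_ridge_1d(xs, ys, lam):
    """Closed-form weight for y ~ w * x with an L2 penalty lam * w**2."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)


def cross_validated_error(xs, ys, lam, k=3):
    """Average held-out mean squared error over k folds (fold = index modulo k)."""
    fold_errors = []
    for fold in range(k):
        train = [(x, y) for i, (x, y) in enumerate(zip(xs, ys)) if i % k != fold]
        valid = [(x, y) for i, (x, y) in enumerate(zip(xs, ys)) if i % k == fold]
        # Learn the weight on the training part only.
        w = fit_ridge_1d([x for x, _ in train], [y for _, y in train], lam)
        # Score it on the part the model has not seen.
        fold_errors.append(sum((y - w * x) ** 2 for x, y in valid) / len(valid))
    return sum(fold_errors) / k


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]  # roughly y = 2x with small noise

candidates = [0.0, 0.1, 1.0, 10.0]
scores = {lam: cross_validated_error(xs, ys, lam) for lam in candidates}
best_lam = min(scores, key=scores.get)
print(scores, best_lam)
```

Using several folds instead of one split means every point is held out exactly once, which makes the comparison between candidate values less dependent on how the data happened to be divided.
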