The Polyglot Data Scientist
Adventures with R, Python, and SQL
Audience Survey
• How many here have used:
– SQL?
– Python?
– R?
• What job titles do people have?
What We Won’t Cover
• Theories behind data science and machine learning
• Deep dive into Python
• Deep dive into R
• Deep dive into SQL Server
Azure Support
There is a data science VM available on Azure. It won’t be covered in this presentation.
See https://docs.microsoft.com/en-us/sql/advanced-analytics/getting-started-with-machine-learning-services for details.
What We Will Cover
• The Problem with Being a Polyglot
• What SQL Server + R or SQL Server + Python Solves
• A Glance at these in Action
Not a Microsoft salesperson…
• Microsoft MVP in Visual Studio
• Been into exploring data most of my life
• Been in tech over 20 years
• Practitioner and hobbyist, not researcher
Sample Problem: Sensor Data
• Domain: House of Sadukie
• Problem: Temperature data is stored miserably
• Goal: Display data in a visualization that makes sense
Current Outcome – via MySQL & R
Polyglot
Knowing or using several languages
SQL Server
Data Scientist
A person employed to analyze and interpret complex digital data, such as the usage statistics of a website, especially in order to assist a business in its decision-making
Multi-Faceted Data Science
• Various categories:
– Statistics – modeling, sampling, clustering, reduction
– Mathematics – NSA, astronomers, military
– Data engineering – database/memory/file optimization, Hadoop, data flows
– Machine learning and algorithms
– Business – ROI optimization, decision sciences
– Software engineering – primarily polyglots in production code
– Visualization
– Spatial
Source: https://www.datasciencecentral.com/profiles/blogs/six-categories-of-data-scientists
The Problem with Being a Polyglot
• Understanding strengths and weaknesses of the languages
• Knowing which language is appropriate for what situation
Multiple tools… multiple solutions… how many programs do I have to use?!?
And wouldn’t it be awesome if I could use one tool to do most of the work?
What R and Python Have to Offer for SQL
• Libraries specialized to handle data science domain problems, including:
– Visualization
– Data exploration
– Statistical and Mathematical Analysis
– Trending
– Regression
• Libraries + Data right from the source = quicker exploratory analysis
• Python and R work well when starting from one large table and branching in different directions
– Which can inspire additional analyses
Sample Problem: Sensor Data
• Number of rows: 400k+
• 1 Table
• Questions to look into:
– What are temperature trends over time?
– When are sensors going offline?
– What temperatures look spot on?
– What sensors are wavering in reads and showing inconsistencies?
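Two of the questions above (offline sensors and wavering reads) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the readings, the sensor names, the 30-minute gap threshold, and the standard-deviation threshold are all hypothetical and do not come from the actual sensor table.

```python
from datetime import datetime, timedelta
from statistics import pstdev

# Hypothetical readings: (sensor name, timestamp, temperature).
start = datetime(2018, 1, 1, 0, 0)
readings = [
    ("attic", start + timedelta(minutes=10 * i), 60.0 + (5.0 if i % 2 else -5.0))
    for i in range(6)
] + [
    ("office", start + timedelta(minutes=m), 70.0 + 0.1 * k)
    for k, m in enumerate([0, 10, 20, 90, 100])
]

MAX_GAP = timedelta(minutes=30)  # assumed threshold for "offline"
MAX_SPREAD = 2.0                 # assumed std-dev threshold for "wavering"


def group_by_sensor(rows):
    """Collect (timestamp, temperature) pairs per sensor, sorted by time."""
    grouped = {}
    for sensor, ts, temp in rows:
        grouped.setdefault(sensor, []).append((ts, temp))
    for series in grouped.values():
        series.sort()
    return grouped


def offline_gaps(series, max_gap=MAX_GAP):
    """Return (last_seen, next_seen) pairs where the sensor was silent too long."""
    return [
        (a_ts, b_ts)
        for (a_ts, _), (b_ts, _) in zip(series, series[1:])
        if b_ts - a_ts > max_gap
    ]


def is_wavering(series, max_spread=MAX_SPREAD):
    """Flag a sensor whose temperatures spread more than the assumed threshold."""
    return pstdev(temp for _, temp in series) > max_spread


grouped = group_by_sensor(readings)
gaps = {name: offline_gaps(series) for name, series in grouped.items()}
wavering = {name: is_wavering(series) for name, series in grouped.items()}
print(gaps)
print(wavering)
```

Both thresholds are judgment calls; in practice they would be tuned by looking at the real data first.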
Bringing the Computation to the Data
Advanced Analytics in SQL Server 2016/2017
• SQL Server 2016
– SQL Server R Services / Machine Learning Services
• SQL Server 2017
– SQL Server R Services / Machine Learning Services
– Python Support
Sample Problem: Sensor Data
• Possible Strategy:
– Use SQL to gather the data into a dataset that has the most data to observe.
– Use Python or R to manipulate the data results and allow for easy analysis and substantial predictions based on observations.
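The two-step strategy above can be sketched as follows. To keep the sketch runnable anywhere, Python's built-in sqlite3 module stands in for SQL Server; the column names (sensor, read_at, temp), the sample rows, and the naive "trend" calculation are assumptions for illustration only.

```python
import sqlite3
from statistics import mean

# sqlite3 is a stand-in for SQL Server so the sketch is self-contained.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Sensors (sensor TEXT, read_at TEXT, temp REAL)")
conn.executemany(
    "INSERT INTO Sensors VALUES (?, ?, ?)",
    [
        ("attic", "2018-01-01", 58.0),
        ("attic", "2018-01-01", 62.0),
        ("attic", "2018-01-02", 64.0),
        ("office", "2018-01-01", 70.0),
        ("office", "2018-01-02", 71.0),
        ("office", "2018-01-02", 73.0),
    ],
)

# Step 1: SQL gathers the data into one ordered result set.
rows = conn.execute(
    "SELECT sensor, read_at, temp FROM Sensors ORDER BY sensor, read_at"
).fetchall()

# Step 2: Python reshapes the result into a per-sensor daily average.
daily = {}
for sensor, day, temp in rows:
    daily.setdefault((sensor, day), []).append(temp)
daily_avg = {key: mean(temps) for key, temps in daily.items()}


def trend(sensor):
    """Naive trend: last day's average minus first day's average."""
    days = sorted(day for (name, day) in daily_avg if name == sensor)
    return daily_avg[(sensor, days[-1])] - daily_avg[(sensor, days[0])]


print(daily_avg)
print({name: trend(name) for name in ("attic", "office")})
conn.close()
```

The split mirrors the strategy: the database does the gathering and ordering, and the scripting language does the reshaping and the arithmetic.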
Not Just Windows!
R Server for Windows
R Server for Linux
- CentOS
- RHEL
- Ubuntu
- SUSE
R Server for Hadoop – cluster in the cloud
R Server for Teradata – not as Machine Learning Server
SQL Server as our Base
R and/or Python on Top
Additional pieces provided by Microsoft ML:
Microsoft Machine Learning Services, RevoScaleR, RevoScalePy
Microsoft Machine Learning Services
Machine Learning Services in SQL Server
• Allows integration of other languages in SQL Server
– SQL Server 2016 can work with R
– SQL Server 2017 introduces Python support
• Scalable in that you can develop and test on a single machine and then deploy to distributed or parallel processing platforms. Platforms include:
– SQL Server on Windows
– Hadoop
– Spark
SQL Server Machine Learning Services (In-Database)
• SQL Server R Services (In-Database) started in SQL Server 2016
• With SQL Server 2017, SQL Server Machine Learning Services (In-Database) allows us to use R and Python within SQL Server
• No need to open both an IDE and SQL tools to accomplish the work – no context switching needed!
• Can call libraries from Python or R to process data right within SQL
Python vs R?
• SQL Server 2016? R
• SQL Server 2017? R and/or Python
• What are you familiar with?
• Look at tutorials – what makes sense?
• What features do you need and how are they supported by Microsoft ML?
Python Support
• CPython 3.5
• revoscalepy – Python equivalents of RevoScaleR
• Remote compute contexts
• Also supports familiar libraries such as:
– scikit-learn
– TensorFlow
– Caffe
– Theano/Keras
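R Code in SQL
-- R script: copy the input data frame to the output name and print a summary.
DECLARE @rscript NVARCHAR(MAX);
SET @rscript = N'
SensorData <- SqlData;
print(summary(SensorData))';
-- Query whose result set is handed to R as the SqlData data frame.
DECLARE @sqlscript NVARCHAR(MAX);
SET @sqlscript = N'
SELECT * FROM Sensors;';
EXEC sp_execute_external_script
@language = N'R',
@script = @rscript,
@input_data_1 = @sqlscript,
@input_data_1_name = N'SqlData',
@output_data_1_name = N'SensorData'; -- R data frame returned as the result set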
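Python Code in SQL
-- InputDataSet is the default name of the data frame built from @input_data_1.
EXECUTE sp_execute_external_script
@language = N'Python',
@script = N'
# Call describe() on the data frame itself, so no pandas import is needed.
summary = InputDataSet.describe()
print(summary.transpose())
',
@input_data_1 = N'SELECT * FROM Sensors';
GO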
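For readers who have not used a describe-style summary, the sketch below computes the same kind of statistics (count, mean, standard deviation, min, quartiles, max) for one numeric column using only the Python standard library. The `describe` helper and the sample temperatures are hypothetical, and quartile interpolation rules can differ between libraries, so treat the numbers as illustrative rather than as a reproduction of any particular library's output.

```python
from statistics import mean, quantiles, stdev


def describe(values):
    """Summarize one numeric column: count, mean, sample std dev, quartiles."""
    ordered = sorted(values)
    # "inclusive" treats the data as the full population range when interpolating.
    q1, median, q3 = quantiles(ordered, n=4, method="inclusive")
    return {
        "count": len(ordered),
        "mean": mean(ordered),
        "std": stdev(ordered),
        "min": ordered[0],
        "25%": q1,
        "50%": median,
        "75%": q3,
        "max": ordered[-1],
    }


# Hypothetical temperature readings for a single sensor.
temps = [68.0, 70.0, 71.0, 69.0, 72.0]
summary = describe(temps)
for name, value in summary.items():
    print(f"{name:>5}: {value:.2f}")
```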
RevoScaleR and RevoScalePy
What is RevoScaleR?
• A library written in R that includes functions for importing, transforming, and analyzing data
• Scalable, portable, and easily distributable
• Things it can do include:
– Descriptive statistics
– Generalized linear models
– Logistic Regression
– Classification trees
– Decision forest
• Multithreaded and multinode
Running RevoScaleR
• Part of the Machine Learning Server and Microsoft R products
• Can use any R IDE to write scripts that use RevoScaleR
• Needs to be run on a computer with the interpreter and libraries
• Two modalities:
– Locally
– Remote compute context – shift execution to the server (Windows server, Hadoop, or Spark)
Prediction
• Linear models
• Logistic regression models
• Generalized linear models
• Covariance and correlation
• Decision forest
• K-means clustering
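As a reminder of what the simplest item in this list does, here is a minimal ordinary least squares fit of a straight line in plain Python. It is a teaching sketch, not the RevoScaleR implementation: the `fit_line` and `predict` helpers and the hourly temperature data are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations in x, and of cross-deviations in x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Evaluate the fitted line at x."""
    slope, intercept = model
    return intercept + slope * x


# Hypothetical data: temperature rising steadily over five hours.
hours = [0, 1, 2, 3, 4]
temps = [60.0, 62.0, 64.0, 66.0, 68.0]
model = fit_line(hours, temps)
print(model)
print(predict(model, 6))
```

With real sensor data the fit would not be exact, and a library routine that also reports residuals and confidence intervals would be the better choice.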
Understanding Data with RevoScaleR
Typical Workflow with RevoScaleR
• Move Data – Import / Export
• Tidy Data – Clean, Manipulate, Transform
• Present Data – Visualize
• Make Decisions – Analyze, Learn, Predict
Key Pieces for Analysis with RevoScaleR
• Data Source
• Compute Context
• Analytic Function
Data Sources
• Comma-delimited text data
• SAS
• SPSS
• XDF
• ODBC
• Teradata
• SQL Server
Graphing with RevoScaleR
• rxHistogram
• rxLinePlot
• rxLorenz
• rxRocCurve
Descriptive Statistics
• rxQuantile
• rxSummary
• rxCrossTabs
• rxCube
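To make the idea of a cross-tabulation concrete, the sketch below counts readings per sensor and temperature band in plain Python. It does not call RevoScaleR; the readings, the `band` helper, and the band cut-offs (60 and 72) are hypothetical choices for illustration.

```python
from collections import Counter

# Hypothetical readings: (sensor name, temperature).
readings = [
    ("attic", 41.0), ("attic", 75.5), ("attic", 68.0),
    ("office", 70.0), ("office", 71.5), ("office", 66.0),
]


def band(temp):
    """Bucket a temperature into an assumed comfort band."""
    if temp < 60.0:
        return "cold"
    if temp < 72.0:
        return "comfortable"
    return "warm"


# Cross-tabulation: how many readings fall in each (sensor, band) cell.
crosstab = Counter((sensor, band(temp)) for sensor, temp in readings)

bands = ["cold", "comfortable", "warm"]
print("sensor".ljust(8) + "".join(b.rjust(12) for b in bands))
for sensor in sorted({s for s, _ in readings}):
    print(sensor.ljust(8) + "".join(str(crosstab[(sensor, b)]).rjust(12) for b in bands))
```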
Two Use Cases for Remote Compute Context
• Running R in T-SQL scripts or stored procedures
• Calling RevoScaleR in R from a SQL context
Visual Studio 2017: One IDE with Common Tools
• Python Tools for Visual Studio
• R Tools for Visual Studio
• SQL Server capabilities within Visual Studio
Additional Support
Polyglot Data Scientist Presentation Resources
• R Services in SQL Server 2016 (Channel 9)
• Built-in machine learning in Microsoft SQL Server 2017 with Python (Build 2017)
• MicrosoftML 1.3.0: What’s new for machine learning in Microsoft R Server (Channel 9)
• Using Visual Studio for Machine Learning (Build 2017)
• Performance patterns for machine learning services in SQL Server (Microsoft Ignite 2017)
Learn More
Resources
• Kaggle: The Home of Data Science and Machine Learning
• DataCamp: Learn R, Python, and Data Science Online
• Difference between Machine Learning, Data Science, AI, Deep Learning, and Statistics – Vincent Granville
• Python Tutorial from Mode Analytics
• Coursera
– Mastering Software Development in R Specialization
– Data Science Specialization
– Applied Data Science with Python Specialization
– Executive Data Science Specialization
Contact Me
• Twitter: @sadukie
• Blog: http://codinggeekette.com
• Email: sarah@cletechconsulting.com
Sarah Dutkiewicz
Cleveland Tech Consulting, LLC
Owner