Introducing
Microsoft R Server &
Microsoft R Open
Krit Kamtuo
Technical Evangelist
Microsoft (Thailand) Limited
What is R?
Language and Platform
• A programming language for statistics, analytics, and data science
• A data visualization framework
• Provided as open source
Community
• Used by 2.5M+ data scientists, statisticians and analysts
• Taught in most university statistics programs
• Active and thriving user groups across the world
Ecosystem
• CRAN: 7000+ freely available algorithms, test data and evaluation
• Many of these are applicable to big data if scaled
• New and recent graduates prefer it
History of R
1993: Research project in New Zealand
1995: Open source project
1997: R-Core Group
2000: R-1.0.0 released
2003: R Foundation
2004: First user conference
2009: New York Times article
2015: R-3.2.0 and R Consortium (Microsoft is a founding member)
Challenges posed by open source R
• Uncertain total cost of ownership
• Inadequate access to important business data
• Limited business agility
• Limited business value
R from Microsoft brings
Microsoft R Products
Microsoft R Open
• Free and open source R distribution
• Enhanced and distributed by Revolution Analytics
SQL Server R Services
• Built-in advanced analytics and standalone server capability
• Leverages the benefits of SQL Server 2016 Enterprise Edition
Microsoft R Server
• Microsoft R Server for Red Hat Linux
• Microsoft R Server for SUSE Linux
• Microsoft R Server for Teradata DB
• Microsoft R Server for Hadoop on Red Hat
Introducing SQL Server 2016 R Services
• Enterprise speed and performance
• Near-DB analytics
• Parallel threading and processing
• Model on-premises, store in cloud, or vice versa
• Hybrid memory and disk scalability
• Not bound by memory limits, enabling larger datasets
• Included in SQL Server 2016
• Reuse and optimize existing R code
• Eliminate data movement across machines
• Write once, deploy anywhere
Microsoft R Server for distributed computing
The First NIDA Business Analytics and Data Sciences Contest/Conference
1-2 September 2016, Navamindradhiraj Building, National Institute of Development Administration (NIDA)
- Introduction to Microsoft R Server
- How distributed computing works and what benefits it offers
- Introduction to configuring distributed computing
https://businessanalyticsnida.wordpress.com
https://www.facebook.com/BusinessAnalyticsNIDA/
Krit Kamtuo,
Technical Evangelist,
Microsoft (Thailand)
- Distributed computing with Big Data
- Analytics on R Server
- Demonstration and teaching in workshop format
Computer Lab 2, 10th floor, Siam Boromrajakumari Building
1 September 2016, 9:00-12:30
Scalable in-database analytics
Data Scientist
• Interacts directly with data
• Creates models and experiments
Data Analyst/DBA
• Manages data and analytics together
Example Solutions
• Fraud detection
• Sales forecasting
• Warehouse efficiency
• Predictive maintenance
[Diagram labels: Relational Data, Extensibility, R, R Integration, Analytic Library (Open Source R, Revolution PEMA), T-SQL Interface]
How is it Integrated?
• T-SQL calls a stored procedure
• Script is run in SQL through the extensibility model
• Result sets are sent through a Web API to the database or applications
Benefits
• Faster deployment of ML models
• Less data movement, faster insights
• Work with large datasets: mitigate R memory and scalability limitations
Cost effectiveness
• Best advanced analytics value
• R Services and PolyBase are built in
  o Part of SQL Server 2016 Enterprise Edition
• In-database analytics shrinks analysis cost and time
  o No data movement reduces costs
• No proprietary hardware requirement
  o Can be installed on commodity hardware
• Integration between cloud and open source offerings
SQL Server 2016: $648K, plus $120 per user for Power BI
Costs based on a server with 2 processors / 8 cores
Introducing Microsoft R Server
Microsoft R Server is a broadly deployable, enterprise-class analytics platform based on R that is supported, scalable and secure. Supporting a variety of big data statistics, predictive modeling and machine learning capabilities, R Server covers the full range of analytics: exploration, analysis, visualization and modeling.
High-performance open source R plus:
• Data source connectivity to big-data objects
• Big-data advanced analytics
• Multi-platform environment support
• In-Hadoop and in-Teradata predictive modeling
• Development and production environment support
• IDE for data scientist developers
• Secure, scalable R deployment
Components: R Open, R Server, DevelopR, DeployR
The Microsoft R Server Platform
ConnectR
• High-speed & direct connectors
• Available for:
  o High-performance XDF
  o SAS, SPSS, delimited & fixed-format text data files
  o Hadoop HDFS (text & XDF)
  o Teradata Database & Aster
  o EDWs and ADWs
  o ODBC
ScaleR
• Ready-to-use high-performance big data big analytics
• Fully parallelized analytics
• Data prep & data distillation
• Descriptive statistics & statistical tests
• Range of predictive functions
• User tools for distributing customized R algorithms across nodes
• Wide data sets supported: thousands of variables
DistributedR
• Distributed computing framework
• Delivers cross-platform portability
R+CRAN
• Open source R interpreter
• R 3.1.2
• Freely available huge range of R algorithms
• Algorithms callable by RevoR
• Embeddable in R scripts
• 100% compatible with existing R scripts, functions and packages
RevoR
• Performance-enhanced R interpreter
• Based on open source R
• Adds high-performance math library to speed up linear algebra functions
ScaleR – Parallel + “Big Data”
Stream data into RAM in blocks. “Big Data” can be any data size: we handle megabytes to gigabytes to terabytes…
Our ScaleR algorithms work inside multiple cores / nodes in parallel at high speed.
Interim results are collected and combined analytically to produce the output on the entire data set.
The XDF file format is optimised to work with the ScaleR library and significantly speeds up iterative algorithm processing.
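The block-wise idea described on this slide can be sketched in base R. This is only an illustration of the general technique (accumulating per-block cross-products for a linear regression), not ScaleR's actual implementation. The simulated data frame, the `block_size` value and all variable names are assumptions made for the example, and in a real setting each block would be read from disk rather than sliced from memory.

```r
# Fit a linear model by visiting the data one block at a time and combining
# per-block cross-products, so no step needs the whole data set at once.
# NOTE: simulated data and block size are hypothetical, for illustration only.
set.seed(42)
n <- 10000
full_data <- data.frame(x1 = rnorm(n), x2 = runif(n))
full_data$y <- 1.5 + 2 * full_data$x1 - 3 * full_data$x2 + rnorm(n, sd = 0.1)

block_size <- 1000
block_starts <- seq(1, n, by = block_size)

# Per-block step: each block contributes its own X'X and X'y.
block_stats <- lapply(block_starts, function(start) {
  rows <- start:min(start + block_size - 1, n)
  block <- full_data[rows, ]
  X <- cbind(intercept = 1, x1 = block$x1, x2 = block$x2)
  list(xtx = crossprod(X), xty = crossprod(X, block$y))
})

# Combine step: interim results are summed, then the normal equations
# are solved once for the entire data set.
xtx <- Reduce(`+`, lapply(block_stats, `[[`, "xtx"))
xty <- Reduce(`+`, lapply(block_stats, `[[`, "xty"))
chunked_coef <- drop(solve(xtx, xty))

# Reference fit on the full data, for comparison.
reference_coef <- coef(lm(y ~ x1 + x2, data = full_data))
print(rbind(chunked = chunked_coef, reference = reference_coef))
```

Because the per-block results are plain sums, the blocks could equally be processed on different cores or nodes before the combine step.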
SQL Server 2016 Enterprise Edition: SQL Server R Services
Integration Facilities:
• Component Integration
• Launchers
• Parameter Passing
• Results Return
• Console Output Return
• Parallel Data Exchange (RTM)
• Stored Procedures
• Package Administration
SQL Server Query Processor
Algorithm Library (fast, parallel, storage-efficient algorithms):
• Data Prep
• Descriptive Stats
• Sampling
• Statistical Tests
• Predictive Models
• Variable Selection
• Clustering
• Classification
• Custom APIs for R + CRAN
• Parallel Scoring
Microsoft R Open (open source R interpreter):
• 100% Open Source R
• Fully CRAN Compatible
• Accelerated Math
Run R In-Database from T-SQL
SQL Server 2016: in-database execution of R + CRAN + SQL
In-Database Execution of:
• R Code
• CRAN Packages
Move the work to the data:
• Run R from the query processor
• Retrieve models, scores, transformed data, plots/images
• Operationalise scoring/prediction in database for data batches or real-time
Run Parallel Algorithms in Database from an R client
In-Database Execution:
• Remote Execution
• Parallelized Compute
Explore and Model:
• In parallel, in-database
• Parallelize distributable R and CRAN
Operationalize:
• Score in parallel
Move BIG work to the data
[Diagram labels: SQL Server 2016, Remote Execution Context, Parallel Worker Tasks, Parallel Algorithm, Iterate/Sequence, Large Data Sets in Chunks]
Parallelized Algorithms in Database
Conceptual Flow
[Diagram labels: R IDE; SQL 2016 with SQL Processor; XSP; R Interpreter (RTerm.exe, R.dll, MSLP$SQL16); BxlServer.exe (MSLP$SQL16); ScaleR PEMAs: fast, parallel, storage-efficient algorithms; ScaleR Master Process; three Worker Processes; MPI Ring]
• Input data set via ODBC
• Data segments passed to the worker processes (in CTP3, via files)
• Master process: spawn worker processes, assemble intermediate results, iterate/sequence
• Returned: results (models, data) and console output
ScaleR: Parallelized Algorithms & Functions
Data Preparation
• Data import: delimited, fixed, SAS, SPSS, ODBC
• Variable creation & transformation
• Recode variables
• Factor variables
• Missing value handling
• Sort, Merge, Split
• Aggregate by category (means, sums)
Descriptive Statistics
• Min / Max, Mean, Median (approx.)
• Quantiles (approx.)
• Standard Deviation
• Variance
• Correlation
• Covariance
• Sum of Squares (cross product matrix for set variables)
• Pairwise Cross tabs
• Risk Ratio & Odds Ratio
• Cross-Tabulation of Data (standard tables & long form)
• Marginal Summaries of Cross Tabulations
Statistical Tests
• Chi Square Test
• Kendall Rank Correlation
• Fisher’s Exact Test
• Student’s t-Test
Sampling
• Subsample (observations & variables)
• Random Sampling
Predictive Models
• Sum of Squares (cross product matrix for set variables)
• Multiple Linear Regression
• Generalized Linear Models (GLM): exponential family distributions (binomial, Gaussian, inverse Gaussian, Poisson, Tweedie); standard link functions (cauchit, identity, log, logit, probit); user-defined distributions & link functions
• Covariance & Correlation Matrices
• Logistic Regression
• Classification & Regression Trees
• Predictions/scoring for models
• Residuals for all models
Cluster Analysis
• K-Means
Classification
• Decision Trees
• Decision Forests
• Gradient Boosted Decision Trees
• Naïve Bayes
Variable Selection
• Stepwise Regression
Simulation
• Simulation (e.g. Monte Carlo)
• Parallel Random Number Generation
Combination
• rxDataStep
• rxExec
New
• PEMA-R API Custom Algorithms
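The simulation and parallel random number generation entries can be illustrated with a small base R sketch. It does not use ScaleR; instead it uses the `parallel` package that ships with R, whose `nextRNGStream()` derives independent L'Ecuyer-CMRG streams so that each task draws from its own stream. The function name `estimate_pi` and its parameters are hypothetical, and the tasks are run with `lapply()` for simplicity; each stream could instead be handed to a separate worker.

```r
# Monte Carlo estimate of pi using one independent random-number stream per
# task, so results are reproducible regardless of where each task runs.
# NOTE: estimate_pi and its arguments are hypothetical names for this sketch.
estimate_pi <- function(n_tasks, n_per_task, seed) {
  RNGkind("L'Ecuyer-CMRG")
  set.seed(seed)

  # Derive one independent stream seed per task.
  stream <- .Random.seed
  streams <- vector("list", n_tasks)
  for (i in seq_len(n_tasks)) {
    streams[[i]] <- stream
    stream <- parallel::nextRNGStream(stream)
  }

  # Each task installs its own stream, then counts points inside the
  # unit quarter-circle.
  run_task <- function(stream_seed) {
    assign(".Random.seed", stream_seed, envir = globalenv())
    x <- runif(n_per_task)
    y <- runif(n_per_task)
    sum(x^2 + y^2 <= 1)
  }

  hits <- vapply(streams, run_task, numeric(1))
  4 * sum(hits) / (n_tasks * n_per_task)
}

pi_estimate <- estimate_pi(n_tasks = 4, n_per_task = 50000, seed = 123)
print(pi_estimate)
```

Fixing the streams up front, rather than seeding inside each worker, is what keeps the estimate identical from run to run.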
ScaleR - Performance comparison
Microsoft R Server is not limited by the size of available RAM. Open source R fails when it operates on data sets that exceed RAM. In contrast, Microsoft R Server scales linearly well beyond RAM limits, and its parallel algorithms are much faster.
Benchmark setup:
• US flight data for 20 years
• Linear regression on arrival delay
• Run on a 4-core laptop with 16GB RAM and a 500GB SSD
DistributedR - Model development and model compute choice: “Write Once. Deploy Anywhere.”
Code Portability Across Platforms (items marked + are on the roadmap)
• In the Cloud: Azure Marketplace, + Azure ML
• Workstations & Servers: Linux, Windows, + R Tools for Visual Studio
• EDW: Teradata, + SQL Server v16
• Hadoop: Hortonworks, Cloudera, MapR, + HD Insights, + Hadoop Spark
DistributedR - How Does Remote Execution Work?
• “Pack and ship” requests (work) to remote environments
• The algorithm master distributes work and compiles results
• The predictive algorithm loads the big data one block at a time and analyzes blocks in parallel
• Results are returned
The Results:
• Even faster computation
• Larger data set capacity
• Fewer security concerns
• No data movement, no copies
Microsoft R Server functions
• A compute context defines the remote connection
• Microsoft R functions are prefixed with rx
• The current compute context determines the processing location
DistributedR - Revolution Code Portability

Compute context, option A: In-Hadoop
### SETUP HADOOP ENVIRONMENT VARIABLES ###
myHadoopCC <- RxHadoopMR()
### HADOOP COMPUTE CONTEXT ###
rxSetComputeContext(myHadoopCC)
### CREATE HDFS, DIRECTORY AND FILE OBJECTS ###
hdfsFS <- RxHdfsFileSystem()
AirlineDataSet <- RxXdfData("AirlineDemoSmall/AirlineDemoSmall.xdf",
                            fileSystem = hdfsFS)

Compute context, option B: local parallel processing (Linux or Windows)
### SETUP LOCAL ENVIRONMENT VARIABLES ###
myLocalCC <- "localpar"
### LOCAL COMPUTE CONTEXT ###
rxSetComputeContext(myLocalCC)
### CREATE LINUX, DIRECTORY AND FILE OBJECTS ###
linuxFS <- RxNativeFileSystem()
AirlineDataSet <- RxXdfData("AirlineDemoSmall/AirlineDemoSmall.xdf",
                            fileSystem = linuxFS)

Analytical processing (unchanged for either compute context)
### ANALYTICAL PROCESSING ###
### Statistical summary of the data
rxSummary(~ ArrDelay + DayOfWeek, data = AirlineDataSet, reportProgress = 1)
### Cross-tabulate the data
rxCrossTabs(ArrDelay ~ DayOfWeek, data = AirlineDataSet, means = TRUE)
### Linear model and plot
hdfsXdfArrLateLinMod <- rxLinMod(ArrDelay ~ DayOfWeek + 0, data = AirlineDataSet)
plot(hdfsXdfArrLateLinMod$coefficients)
ScaleR models can be deployed from a server or edge node to run in Hadoop without any functional R model re-coding for MapReduce.
• Compute context R script: sets where the model will run
• Functional model R script: does not need to change to run in Hadoop
DistributedR - In-Hadoop
• Uses Hadoop nodes for R computations
• Eliminates data movement latency on very large data
• Removes data duplication
• Faster model development
• No MapReduce R coding
• Develop better models using all the data
MRS and Hadoop Architecture options
[Diagram labels: Microsoft R Server (R) on each node, ScaleR, Production, RStudio Server Pro; data access options: 1. Copy, 2. Stream, 3. Send]
DistributedR - Hadoop Processing Methods
Method 1 (“beside” or “edge”): local (Linux) parallel processing using all cores on one node, copying data from HDFS to store in the local Linux file system.
• Compute context: local parallel
• Processing: 1 edge node
• Data: CSV or XDF files copied from HDFS to a local file, then read and written on the Linux file system (1:n disks)
Method 2: local (Linux) parallel processing using all cores on one node, streaming data from / to HDFS.
• Compute context: local parallel
• Processing: 1 edge node
• Data: CSV or XDF files in HDFS (1:n nodes, 1:(n x number of nodes) disks)
Method 3 (“inside”): Hadoop (MapReduce) parallel processing using all cores on n nodes, using HDFS data on each node.
• Compute context: Hadoop
• Processing: 1:n data nodes; the R script is sent to the data nodes
• Data: CSV or XDF files read from and written to HDFS (1:(n x number of nodes) disks)
R model script sent to Master Node:
1. Starts a master process
2. Distributes work
3. Master tasks for each node
4. Master initiates distributed work
   1. Hadoop schedules a mapper for each split
   2. Algorithm computes an intermediate result
   3. Reducer combines intermediate results
5. Master process evaluates completion
6. Iterates as required by the algorithm
7. Returns consolidated answer to script
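The master / mapper / reducer loop listed above can be sketched in base R with a small k-means example. This is an illustrative stand-in, not the ScaleR or Hadoop implementation: the splits are slices of an in-memory matrix, the mappers run sequentially through `lapply()`, and the names `map_split` and `reduce_results` as well as the simulated data are assumptions made for the example.

```r
# Iterative master loop: mappers compute per-split intermediate results,
# a reducer combines them, and the master decides whether to iterate again.
# NOTE: data, split layout and function names are hypothetical.
set.seed(1)
points <- rbind(matrix(rnorm(200, mean = 0), ncol = 2),
                matrix(rnorm(200, mean = 5), ncol = 2))
splits <- split(seq_len(nrow(points)), rep(1:4, length.out = nrow(points)))

# Mapper: assign each point in one split to its nearest center and return
# per-cluster coordinate sums and counts (the intermediate result).
map_split <- function(rows, centers) {
  block <- points[rows, , drop = FALSE]
  k <- nrow(centers)
  distances <- sapply(seq_len(k), function(j) {
    rowSums(sweep(block, 2, centers[j, ])^2)
  })
  nearest <- max.col(-distances, ties.method = "first")
  sums <- matrix(0, k, ncol(block))
  counts <- numeric(k)
  for (j in seq_len(k)) {
    members <- block[nearest == j, , drop = FALSE]
    sums[j, ] <- colSums(members)
    counts[j] <- nrow(members)
  }
  list(sums = sums, counts = counts)
}

# Reducer: combine two intermediate results.
reduce_results <- function(a, b) {
  list(sums = a$sums + b$sums, counts = a$counts + b$counts)
}

# Master: distribute work, combine, evaluate completion, iterate as required.
centers <- points[c(1, nrow(points)), , drop = FALSE]
for (iteration in 1:50) {
  partial <- lapply(splits, map_split, centers = centers)
  total <- Reduce(reduce_results, partial)
  new_centers <- total$sums / total$counts
  converged <- max(abs(new_centers - centers)) < 1e-8
  centers <- new_centers
  if (converged) break
}
print(centers)
```

Only the small sums and counts travel from the mappers back to the master; the points themselves stay in their splits.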
DistributedR - What processing mode to use, when?
Analytic data set size and processing complexity (e.g. simple summary statistics vs. an iterative algorithm) guide the choice between Methods 1 and 2 (edge node / server Linux local processing) and Method 3 (in-Hadoop processing).
[Chart: processing complexity (low, medium, high) against data size (small: < 10GB; medium: < 50GB; bigger: > 50GB). Legend: edge node Linux processing vs. in-Hadoop processing; local Linux file system vs. Hadoop file system]
While Open Source R delivers:
• Capability
  o 6500+ algorithm & connector packages available for free in CRAN
• Simplicity
  o R skills transfer / lower cost of talent
  o Ease of integration with other analytics packages & data
  o Access to huge libraries of R analytical algorithms
• Speed
  o Intel-optimized computation
SQL Server R Services and Microsoft R Server deliver:
• Peace of mind
  o Knowledge that your business is using a stable platform backed with commercial support and services
  o Platform longevity for more predictability around costs
• Speed and scalability
  o Faster decisions using advanced analytics that were previously unachievable
  o In-Hadoop & in-Teradata analysis
• Efficiency
  o Continue getting returns on existing hardware and software investments
  o Developers can write code once and deploy it anywhere, keeping costs low
• Flexibility and agility
  o Model data in a hybrid environment: on-premises, in the cloud, or both
  o Scripting, modeling, and in-database analytics across platforms shrink analysis time and enable agile response to business needs
Introducing Microsoft R Open
• Enhanced open source R distribution
  o Based on the latest open source R (3.1.2)
  o Built, tested and distributed by Microsoft
  o Enhanced by the Intel MKL library to speed up linear algebra functions
• Compatible with all R-related software
  o CRAN packages, RStudio, third-party R integrations, …
• Revolutions open source R packages
  o Reproducible R Toolkit: checkpoint, miniCRAN
  o ParallelR: parallelise execution via ‘foreach’ loop
  o RHadoop: rhdfs, rhbase, ravro, rmr2, plyrmr
  o AzureML: read/write data to Azure ML, publish R code as an ML API
• MRAN website: mran.revolutionanalytics.com
  o Enhanced documentation and learning resources
  o Discover 6500 free add-on R packages
• Open source (GPLv2 license): 100% free to download, use and share
CRAN, MRO, MRS Comparison
Data size
  CRAN R: In-memory
  Microsoft R Open: In-memory
  Microsoft R Server: In-memory or disk based
Speed of analysis
  CRAN R: Single threaded
  Microsoft R Open: Multi-threaded
  Microsoft R Server: Multi-threaded, parallel processing 1:N servers
Support
  CRAN R: Community
  Microsoft R Open: Community
  Microsoft R Server: Community + commercial
Analytic breadth & depth
  CRAN R: 7500+ innovative analytic packages
  Microsoft R Open: 7500+ innovative analytic packages
  Microsoft R Server: 7500+ innovative packages + commercial parallel high-speed functions
Licence
  CRAN R: Open source
  Microsoft R Open: Open source
  Microsoft R Server: Commercial license. Supported release with indemnity
CRAN R compared to Microsoft R Open
Microsoft R Open provides more efficient, multi-threaded math computation. This benefits math-intensive processing, but not program logic or data transforms.
• Matrix calculation: up to 27x faster
• Matrix functions: up to 16x faster
• Program logic: no speedup
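One way to see this split on a given R installation is to time a BLAS-backed matrix operation next to an interpreted loop. The sketch below is a rough illustration only: the matrix size is arbitrary, the variable names are made up for the example, and it makes no claim to reproduce the speedups quoted above, which depend on the math library R is linked against.

```r
# Time a math-heavy operation (handled by the linked BLAS library) and a
# plain interpreted loop (which a faster math library cannot accelerate).
# NOTE: matrix size and variable names are arbitrary choices for this sketch.
set.seed(7)
n <- 300
a <- matrix(rnorm(n * n), n, n)

# Matrix calculation: t(a) %*% a is delegated to BLAS.
matrix_time <- system.time(product <- crossprod(a))[["elapsed"]]

# Program logic: element-by-element loop run by the R interpreter.
loop_time <- system.time({
  total <- 0
  for (value in a) total <- total + value
})[["elapsed"]]

cat(sprintf("crossprod: %.3fs, interpreted loop: %.3fs\n",
            matrix_time, loop_time))
```

Running the same script under CRAN R and under a distribution linked to a multi-threaded math library should change the first timing far more than the second.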
Ad

Recommended

PDF
Data Analytics in your IoT Solution Fukiat Julnual, Technical Evangelist, Mic...
BAINIDA
 
PDF
R Tool for Visual Studio āđāļĨāļ°āļāļēāļĢāļ—āļģāļ‡āļēāļ™āļĢāđˆāļ§āļĄāļāļąāļ™āđ€āļ›āđ‡āļ™āļ—āļĩāļĄ āđ‚āļ”āļĒ āđ€āļ‰āļĨāļīāļĄāļ§āļ‡āļĻāđŒ āļ§āļīāļˆāļīāļ•āļĢāļ›āļīāļĒāļ°āļāļļ...
BAINIDA
 
PDF
Towards Personalization in Global Digital Health
Databricks
 
PDF
R+Hadoop - Ask Bigger (and New) Questions and Get Better, Faster Answers
Revolution Analytics
 
PPTX
3 Ways Tableau Improves Predictive Analytics
Nandita Nityanandam
 
PDF
Starting Your Modern DataOps Journey
CloverDX
 
ODP
Big Data Testing Strategies
Knoldus Inc.
 
PDF
A3P Exec Overview Whitepaper
David Knox
 
PPTX
Zsolt VÃĄrnai, Principal Software Engineer at Skyscanner - "The advantages of...
Dataconomy Media
 
PDF
Software Engineering for Data Scientists
Domino Data Lab
 
PPTX
Agile Data Science
Alexander Bauer
 
PPTX
Data warehouse testing
Er. Nawaraj Bhandari
 
PDF
Data Discoverability at SpotHero
Maggie Hays
 
PDF
DataOps - Production ML
Al Zindiq
 
PPT
Splunk .conf2011: Search Language: Intermediate
Erin Sweeney
 
PDF
Delivering Real-Time Business Value for Healthcare
SAP Technology
 
PDF
Building data "Py-pelines"
Rob Winters
 
PDF
Data profiling-best-practices
Blaise Cheuteu
 
PPTX
ALTERYX TOOL
Sagnik Banerjee
 
PDF
Building Real-Time Data Pipeline for Diabetes Medication Recommender System U...
Databricks
 
PPTX
Optier presentation for open analytics event
Open Analytics
 
PPTX
Ikanow oanyc summit
Open Analytics
 
PDF
Building A Self Service Analytics Platform on Hadoop
Craig Warman
 
PPTX
Real-time Recommendations for Retail: Architecture, Algorithms, and Design
Juliet Hougland
 
PDF
Innovating With Data and Analytics
VMware Tanzu
 
PPTX
B6 - An initiative to healthcare analytics with Office 365 & PowerBI - Thuan ...
SPS Paris
 
PDF
R server and spark
BAINIDA
 
PDF
Microsoft R Server for Data Sciencea
Data Science Thailand
 
PPTX
R at Microsoft
Revolution Analytics
 
PDF
Performance and Scale Options for R with Hadoop: A comparison of potential ar...
Revolution Analytics
 

More Related Content

What's hot (18)

PPTX
Zsolt VÃĄrnai, Principal Software Engineer at Skyscanner - "The advantages of...
Dataconomy Media
 
PDF
Software Engineering for Data Scientists
Domino Data Lab
 
PPTX
Agile Data Science
Alexander Bauer
 
PPTX
Data warehouse testing
Er. Nawaraj Bhandari
 
PDF
Data Discoverability at SpotHero
Maggie Hays
 
PDF
DataOps - Production ML
Al Zindiq
 
PPT
Splunk .conf2011: Search Language: Intermediate
Erin Sweeney
 
PDF
Delivering Real-Time Business Value for Healthcare
SAP Technology
 
PDF
Building data "Py-pelines"
Rob Winters
 
PDF
Data profiling-best-practices
Blaise Cheuteu
 
PPTX
ALTERYX TOOL
Sagnik Banerjee
 
PDF
Building Real-Time Data Pipeline for Diabetes Medication Recommender System U...
Databricks
 
PPTX
Optier presentation for open analytics event
Open Analytics
 
PPTX
Ikanow oanyc summit
Open Analytics
 
PDF
Building A Self Service Analytics Platform on Hadoop
Craig Warman
 
PPTX
Real-time Recommendations for Retail: Architecture, Algorithms, and Design
Juliet Hougland
 
PDF
Innovating With Data and Analytics
VMware Tanzu
 
PPTX
B6 - An initiative to healthcare analytics with Office 365 & PowerBI - Thuan ...
SPS Paris
 
Zsolt VÃĄrnai, Principal Software Engineer at Skyscanner - "The advantages of...
Dataconomy Media
 
Software Engineering for Data Scientists
Domino Data Lab
 
Agile Data Science
Alexander Bauer
 
Data warehouse testing
Er. Nawaraj Bhandari
 
Data Discoverability at SpotHero
Maggie Hays
 
DataOps - Production ML
Al Zindiq
 
Splunk .conf2011: Search Language: Intermediate
Erin Sweeney
 
Delivering Real-Time Business Value for Healthcare
SAP Technology
 
Building data "Py-pelines"
Rob Winters
 
Data profiling-best-practices
Blaise Cheuteu
 
ALTERYX TOOL
Sagnik Banerjee
 
Building Real-Time Data Pipeline for Diabetes Medication Recommender System U...
Databricks
 
Optier presentation for open analytics event
Open Analytics
 
Ikanow oanyc summit
Open Analytics
 
Building A Self Service Analytics Platform on Hadoop
Craig Warman
 
Real-time Recommendations for Retail: Architecture, Algorithms, and Design
Juliet Hougland
 
Innovating With Data and Analytics
VMware Tanzu
 
B6 - An initiative to healthcare analytics with Office 365 & PowerBI - Thuan ...
SPS Paris
 

Viewers also liked (20)

PDF
R server and spark
BAINIDA
 
PDF
Microsoft R Server for Data Sciencea
Data Science Thailand
 
PPTX
R at Microsoft
Revolution Analytics
 
PDF
Performance and Scale Options for R with Hadoop: A comparison of potential ar...
Revolution Analytics
 
PPTX
The Value of Open Source Communities
Revolution Analytics
 
PDF
CopyofAResume
William Jones
 
PPT
Psicopedagoga rj.com.br Cadastro
PsicopedagogaRJ
 
PDF
Copia de resumen quÃĐ son los mapas conceptuales.doc%0 a
noeliavillar
 
PPTX
Pre Production (Planning)
Rahul Karavadra
 
PDF
Context Based Learning for GIS: an Interdisciplinary Approach
Patrick Rickles
 
PPTX
Creep Coursework Presentation
kess1a
 
PDF
Portfolio Draft
Victoria Esser
 
PPT
ÐŋÐūҀ҂҄ÐūÐŧÐļÐū ÐģÐūÐŧŅƒÐąÐūÐēÐļ҇
golubovicholga
 
PDF
Continuous Delivery at Oracle Database Insights
Michael Medin
 
PDF
Using puppet to leverage DevOps in Large Enterprise Oracle Environments
Bert Hajee
 
PPTX
Edition Based Redefinition - Continuous Database Application Evolution with O...
Lucas Jellema
 
PPTX
Nature and animal conservation by art
ART Raviteja akarapu
 
PPSX
Continuous Integration - Oracle Database Objects
Prabhu Ramasamy
 
PDF
ckitterman resume
craig kitterman
 
PPT
Twenty is Plenty
Bob Ward
 
R server and spark
BAINIDA
 
Microsoft R Server for Data Sciencea
Data Science Thailand
 
R at Microsoft
Revolution Analytics
 
Performance and Scale Options for R with Hadoop: A comparison of potential ar...
Revolution Analytics
 
The Value of Open Source Communities
Revolution Analytics
 
CopyofAResume
William Jones
 
Psicopedagoga rj.com.br Cadastro
PsicopedagogaRJ
 
Copia de resumen quÃĐ son los mapas conceptuales.doc%0 a
noeliavillar
 
Pre Production (Planning)
Rahul Karavadra
 
Context Based Learning for GIS: an Interdisciplinary Approach
Patrick Rickles
 
Creep Coursework Presentation
kess1a
 
Portfolio Draft
Victoria Esser
 
ÐŋÐūҀ҂҄ÐūÐŧÐļÐū ÐģÐūÐŧŅƒÐąÐūÐēÐļ҇
golubovicholga
 
Continuous Delivery at Oracle Database Insights
Michael Medin
 
Using puppet to leverage DevOps in Large Enterprise Oracle Environments
Bert Hajee
 
Edition Based Redefinition - Continuous Database Application Evolution with O...
Lucas Jellema
 
Nature and animal conservation by art
ART Raviteja akarapu
 
Continuous Integration - Oracle Database Objects
Prabhu Ramasamy
 
ckitterman resume
craig kitterman
 
Twenty is Plenty
Bob Ward
 
Ad

Similar to microsoft r server for distributed computing (20)

PPTX
DataMass Summit - Machine Learning for Big Data in SQL Server
Łukasz Grala
 
PPTX
20160317 - PAZUR - PowerBI & R
Łukasz Grala
 
PDF
Michal MaruÅĄan: Scalable R
GapData Institute
 
PDF
In-Database Analytics Deep Dive with Teradata and Revolution
Revolution Analytics
 
PPTX
Revolution R Enterprise - Portland R User Group, November 2013
Revolution Analytics
 
PPTX
eRum2016 -RevoScaleR - Performance and Scalability R
Łukasz Grala
 
PPTX
Building a Scalable Data Science Platform with R
DataWorks Summit/Hadoop Summit
 
PDF
Advanced analytics with R and SQL
MSDEVMTL
 
PDF
High Performance Predictive Analytics in R and Hadoop
Revolution Analytics
 
PPTX
Decision trees in hadoop
Revolution Analytics
 
PPTX
High Performance Predictive Analytics in R and Hadoop
DataWorks Summit
 
PDF
High Performance Predictive Analytics in R and Hadoop
Revolution Analytics
 
PDF
6° Sessione - Ambiti applicativi nella ricerca di tecnologie statistiche avan...
JÞrgen Ambrosi
 
PPTX
Analytics Beyond RAM Capacity using R
Alex Palamides
 
PDF
What's New in Revolution R Enterprise 6.2
Revolution Analytics
 
PPTX
Microsoft Data Platform Airlift 2017 Rui Quintino Machine Learning with SQL S...
Rui Quintino
 
PDF
Analytics with R in SQL Server 2016
HARIHARAN R
 
PDF
Introduction to Microsoft R Services
Gregg Barrett
 
PPTX
Introduction to basic statistics
IBM
 
PDF
Bluegranite AA Webinar FINAL 28JUN16
Andy Lathrop
 
DataMass Summit - Machine Learning for Big Data in SQL Server
Łukasz Grala
 
20160317 - PAZUR - PowerBI & R
Łukasz Grala
 
Michal MaruÅĄan: Scalable R
GapData Institute
 
In-Database Analytics Deep Dive with Teradata and Revolution
Revolution Analytics
 
Revolution R Enterprise - Portland R User Group, November 2013
Revolution Analytics
 
eRum2016 -RevoScaleR - Performance and Scalability R
Łukasz Grala
 
Building a Scalable Data Science Platform with R
DataWorks Summit/Hadoop Summit
 
Advanced analytics with R and SQL
MSDEVMTL
 
High Performance Predictive Analytics in R and Hadoop
Revolution Analytics
 
Decision trees in hadoop
Revolution Analytics
 
High Performance Predictive Analytics in R and Hadoop
DataWorks Summit
 
High Performance Predictive Analytics in R and Hadoop
Revolution Analytics
 
6° Sessione - Ambiti applicativi nella ricerca di tecnologie statistiche avan...
JÞrgen Ambrosi
 
Analytics Beyond RAM Capacity using R
Alex Palamides
 
What's New in Revolution R Enterprise 6.2
Revolution Analytics
 
Microsoft Data Platform Airlift 2017 Rui Quintino Machine Learning with SQL S...
Rui Quintino
 
Analytics with R in SQL Server 2016
HARIHARAN R
 
Introduction to Microsoft R Services
Gregg Barrett
 
Introduction to basic statistics
IBM
 
Bluegranite AA Webinar FINAL 28JUN16
Andy Lathrop
 
Ad

More from BAINIDA (20)

PDF
āļ”āļ™āļ•āļĢāļĩāļ‚āļ­āļ‡āļžāļĢāļ°āđ€āļˆāđ‰āļēāđāļœāđˆāļ™āļ”āļīāļ™ āļ­āļēāļ™āļ™āļ—āđŒ āļĻāļąāļāļ”āļīāđŒāļ§āļĢāļ§āļīāļŠāļāđŒ āļŠāļļāļĢāļžāļ‡āļĐāđŒ āļšāđ‰āļēāļ™āđ„āļāļĢāļ—āļ­āļ‡ āļŦāļ­āļ›āļĢāļ°āļŠāļļāļĄāļ§āļ›āļ­ 7...
BAINIDA
 
PDF
Mixed methods in social and behavioral sciences
BAINIDA
 
PDF
Advanced quantitative research methods in political science and pa
BAINIDA
 
PPTX
Latest thailand election2019report
BAINIDA
 
PDF
Data science in medicine
BAINIDA
 
PPTX
Nursing data science
BAINIDA
 
PDF
Financial time series analysis with R@the 3rd NIDA BADS conference by Asst. p...
BAINIDA
 
PDF
Statistics and big data for justice and fairness
BAINIDA
 
PDF
Data science and big data for business and industrial application
BAINIDA
 
PDF
Update trend: Free digital marketing metrics for start-up
BAINIDA
 
PDF
Advent of ds and stat adjustment
BAINIDA
 
PPTX
āđ€āļĄāļ·āđˆāļ­ Data Science āđ€āļ‚āđ‰āļēāļĄāļē āļŠāļ–āļīāļ•āļīāļĻāļēāļŠāļ•āļĢāđŒāļˆāļ°āļ›āļĢāļąāļšāļ•āļąāļ§āļ­āļĒāđˆāļēāļ‡āđ„āļĢ
BAINIDA
 
PPTX
Data visualization. map
BAINIDA
 
PPTX
Dark data by Worapol Alex Pongpech
BAINIDA
 
PDF
Deepcut Thai word Segmentation @ NIDA
BAINIDA
 
PPTX
Professionals and wanna be in Business Analytics and Data Science
BAINIDA
 
PDF
Deep learning and image analytics using Python by Dr Sanparit
BAINIDA
 
PDF
Visualizing for impact final
BAINIDA
 
PPTX
Python programming workshop
BAINIDA
 
PDF
Second prize business plan @ the First NIDA business analytics and data scien...
BAINIDA
 
āļ”āļ™āļ•āļĢāļĩāļ‚āļ­āļ‡āļžāļĢāļ°āđ€āļˆāđ‰āļēāđāļœāđˆāļ™āļ”āļīāļ™ āļ­āļēāļ™āļ™āļ—āđŒ āļĻāļąāļāļ”āļīāđŒāļ§āļĢāļ§āļīāļŠāļāđŒ āļŠāļļāļĢāļžāļ‡āļĐāđŒ āļšāđ‰āļēāļ™āđ„āļāļĢāļ—āļ­āļ‡ āļŦāļ­āļ›āļĢāļ°āļŠāļļāļĄāļ§āļ›āļ­ 7...
BAINIDA
 
Mixed methods in social and behavioral sciences
BAINIDA
 
Advanced quantitative research methods in political science and pa
BAINIDA
 
Latest thailand election2019report
BAINIDA
 
Data science in medicine
BAINIDA
 
Nursing data science
BAINIDA
 
Financial time series analysis with R@the 3rd NIDA BADS conference by Asst. p...
BAINIDA
 
Statistics and big data for justice and fairness
BAINIDA
 
Data science and big data for business and industrial application
BAINIDA
 
Update trend: Free digital marketing metrics for start-up
BAINIDA
 
Advent of ds and stat adjustment
BAINIDA
 
āđ€āļĄāļ·āđˆāļ­ Data Science āđ€āļ‚āđ‰āļēāļĄāļē āļŠāļ–āļīāļ•āļīāļĻāļēāļŠāļ•āļĢāđŒāļˆāļ°āļ›āļĢāļąāļšāļ•āļąāļ§āļ­āļĒāđˆāļēāļ‡āđ„āļĢ
BAINIDA
 
Data visualization. map
BAINIDA
 
Dark data by Worapol Alex Pongpech
BAINIDA
 
Deepcut Thai word Segmentation @ NIDA
BAINIDA
 
Professionals and wanna be in Business Analytics and Data Science
BAINIDA
 
Deep learning and image analytics using Python by Dr Sanparit
BAINIDA
 
Visualizing for impact final
BAINIDA
 
Microsoft R Server for distributed computing

  • 1. Introducing Microsoft R Server & Microsoft R Open. Krit Kamtuo, Technical Evangelist, Microsoft (Thailand) Limited
  • 2. What is R? (Language, Platform, Community, Ecosystem)
      • A programming language for statistics, analytics, and data science
      • A data visualization framework
      • Provided as open source
      • Used by 2.5M+ data scientists, statisticians and analysts
      • Taught in most university statistics programs
      • Active and thriving user groups across the world
      • CRAN: 7000+ freely available algorithms, test data and evaluation
      • Many of these are applicable to big data if scaled
      • New and recent graduates prefer it
  • 3. History of R
      • 1993: Research project in New Zealand
      • 1995: Open source project
      • 1997: R-Core Group
      • 2000: R-1.0.0 released
      • 2003: R Foundation
      • 2004: First user
      • 2009: New York Times article
      • 2015: R-3.2.0 and R Consortium (founded by Microsoft)
  • 4. Challenges posed by open source R
      • Uncertain total cost of ownership
      • Inadequate access to important business data
      • Limited business agility
      • Limited business value
  • 6. Microsoft R Products
      Microsoft R Open
          • Free and open source R distribution
          • Enhanced and distributed by Revolution Analytics
      SQL Server R Services
          • Built-in advanced analytics and stand-alone server capability
          • Leverages the benefits of SQL 2016 Enterprise Edition
  • 7. Microsoft R Server
      • Microsoft R Server for Red Hat Linux
      • Microsoft R Server for SUSE Linux
      • Microsoft R Server for Teradata DB
      • Microsoft R Server for Hadoop on Red Hat
  • 8. Introducing SQL Server 2016 R Services
      • Enterprise speed and performance
      • Near-DB analytics
      • Parallel threading and processing
      • Model on-premises, store in cloud, or vice versa
      • Hybrid memory and disk scalability
      • Not bound by memory limits, enabling larger datasets
      • Included in SQL Server 2016
      • Reuse and optimize existing R code
      • Eliminate data movement across machines
      • Write once, deploy anywhere
  • 9. Microsoft R Server for distributed computing. The First NIDA Business Analytics and Data Sciences Contest/Conference, 1-2 September 2559 BE (2016), Navamindradhiraj Building, National Institute of Development Administration (NIDA)
      • Introduction to Microsoft R Server
      • How distributed computing works and what its benefits are
      • How to configure distributed computing
      https://businessanalyticsnida.wordpress.com
      https://www.facebook.com/BusinessAnalyticsNIDA/
      Krit Kamtuo, Technical Evangelist, Microsoft (Thailand)
      • Distributed computing with Big Data
      • Analytics on R Server
      • Demonstration and teaching in workshop format
      Computer Lab 2, 10th floor, Siam Boromrajakumari Building, 1 September 2559 BE (2016), 9.00-12.30
  • 10. Scalable in-database analytics
      Data Scientist
          • Interacts directly with data
          • Creates models and experiments
      Data Analyst/DBA
          • Manages data and analytics together
      Example solutions
          • Fraud detection
          • Sales forecasting
          • Warehouse efficiency
          • Predictive maintenance
      Diagram labels: Relational Data; Extensibility; R Integration; Analytic Library (Open Source R, Revolution PEMA); T-SQL Interface
      How is it integrated?
          • T-SQL calls a stored procedure
          • Script is run in SQL through the extensibility model
          • Result sets are sent through Web API to database or applications
      Benefits
          • Faster deployment of ML models
          • Less data movement, faster insights
          • Work with large datasets: mitigate R memory and scalability limitations
  • 11. Cost effectiveness
      • Best advanced analytics value
      • R Services and PolyBase are built in
          o Part of SQL Server 2016 Enterprise Edition
      • In-DB analytics shrinks analysis cost and time
          o No data movement reduces costs
      • No proprietary hardware requirement
          o Can be installed on commodity hardware
      • Integration between cloud and open source offerings
      SQL Server 2016: $648K + $120 per user for Power BI. Costs based on a server with 2 processors / 8 cores.
  • 13. Introducing Microsoft R Server
      Microsoft R Server is a broadly deployable enterprise-class analytics platform based on R that is supported, scalable and secure. Supporting a variety of big data statistics, predictive modeling and machine learning capabilities, R Server supports the full range of analytics: exploration, analysis, visualization and modeling.
      Components: R Open, R Server, DevelopR, DeployR
      High-performance open source R plus:
          • Data source connectivity to big-data objects
          • Big-data advanced analytics
          • Multi-platform environment support
          • In-Hadoop and in-Teradata predictive modeling
          • Development and production environment support
          • IDE for data scientist developers
          • Secure, scalable R deployment
  • 14. The Microsoft R Server Platform (R Open, Microsoft R Server, DeployR, DevelopR)
      R+CRAN
          • Open source R interpreter
          • R 3.1.2
          • Freely available huge range of R algorithms
          • Algorithms callable by RevoR
          • Embeddable in R scripts
          • 100% compatible with existing R scripts, functions and packages
      RevoR
          • Performance-enhanced R interpreter
          • Based on open source R
          • Adds high-performance math library to speed up linear algebra functions
      DistributedR
          • Distributed computing framework
          • Delivers cross-platform portability
      ScaleR
          • Ready-to-use high-performance big data analytics
          • Fully parallelized analytics
          • Data prep & data distillation
          • Descriptive statistics & statistical tests
          • Range of predictive functions
          • User tools for distributing customized R algorithms across nodes
          • Wide data sets supported: thousands of variables
      ConnectR
          • High-speed & direct connectors
          • Available for:
              • High-performance XDF
              • SAS, SPSS, delimited & fixed-format text data files
              • Hadoop HDFS (text & XDF)
              • Teradata Database & Aster
              • EDWs and ADWs
              • ODBC
  • 15. ScaleR: Parallel + "Big Data"
      • Stream data into RAM in blocks. "Big Data" can be any data size. We handle megabytes to gigabytes to terabytes...
      • Our ScaleR algorithms work inside multiple cores / nodes in parallel at high speed
      • Interim results are collected and combined analytically to produce the output on the entire data set
      • The XDF file format is optimised to work with the ScaleR library and significantly speeds up iterative algorithm processing
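
The block-wise idea on this slide can be illustrated with a minimal base R sketch. It is not the ScaleR implementation and uses no ScaleR functions. The in-memory vector `x` and the `block_size` value are assumptions chosen for illustration. Each block yields a few interim sums, and the combined sums give the mean and variance of the whole data set.

```r
# Sketch only: base R, not ScaleR. The vector x stands in for a data set
# that would normally be streamed from disk one block at a time.
set.seed(42)
x <- rnorm(10000, mean = 5, sd = 2)

block_size <- 1000                       # hypothetical block size
starts <- seq(1, length(x), by = block_size)

# Process one block at a time and keep only small interim results.
interim <- lapply(starts, function(s) {
  block <- x[s:min(s + block_size - 1, length(x))]
  c(n = length(block), sum = sum(block), sumsq = sum(block^2))
})

# Combine the interim results analytically into whole-data statistics.
totals <- Reduce("+", interim)
chunked_mean <- totals[["sum"]] / totals[["n"]]
chunked_var <- (totals[["sumsq"]] - totals[["n"]] * chunked_mean^2) /
  (totals[["n"]] - 1)
```

Because only the three sums per block are kept, memory use depends on the block size rather than on the total number of rows. The `lapply` call could in principle be swapped for a parallel equivalent without changing the combine step.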
  • 17. SQL Server 2016 Enterprise Edition: SQL Server R Services
      SQL Server Query Processor, integration facilities:
          • Component integration
          • Launchers
          • Parameter passing
          • Results return
          • Console output return
          • Parallel data exchange (RTM)
          • Stored procedures
          • Package administration
      Algorithm library (fast, parallel, storage-efficient algorithms):
          • Data prep
          • Descriptive stats
          • Sampling
          • Statistical tests
          • Predictive models
          • Variable selection
          • Clustering
          • Classification
          • Custom APIs for R + CRAN
          • Parallel scoring
      Microsoft R Open (open source R interpreter):
          • 100% open source R
          • Fully CRAN compatible
          • Accelerated math
  • 18. Run R In-Database from T-SQL
      SQL Server 2016: in-database execution of R + CRAN + SQL
      • In-database execution of R code and CRAN packages
      • Move the work to the data
      • Run R from the query processor
      • Retrieve models, scores, transformed data, plots/images
      • Operationalise scoring/prediction in database for data batches or real time
  • 19. Run Parallel Algorithms in Database from an R client
      SQL in-database execution:
          • Remote execution
          • Parallelized compute
      SQL Server remote execution context
      Explore and model:
          • In parallel, in-database
          • Parallelize distributable R and CRAN
      Operationalize:
          • Score in parallel
      Move BIG work to the data
      Diagram labels: parallel worker tasks; large data sets in chunks; parallel algorithm; iterate/sequence
  • 20. Conceptual Flow (diagram): SQL 2016; R interpreter; ScaleR PEMAs (fast, parallel, storage-efficient algorithms)
  • 21. Parallelized Algorithms in Database (diagram)
      Components: R IDE; SQL Processor; XSP; RTerm.exe; R.dll (MSLP$SQL16); BxlServer.exe (MSLP$SQL16); ScaleR master process; worker processes; MPI ring
      Flow: input data set via ODBC; data segments (CTP3 is via files); spawn worker processes; assemble intermediate results; iterate/sequence; console out; results (models, data)
  • 24. ScaleR: Parallelized Algorithms & Functions
      Data Preparation
          • Data import: delimited, fixed, SAS, SPSS, ODBC
          • Variable creation & transformation
          • Recode variables
          • Factor variables
          • Missing value handling
          • Sort, merge, split
          • Aggregate by category (means, sums)
      Descriptive Statistics
          • Min / max, mean, median (approx.)
          • Quantiles (approx.)
          • Standard deviation
          • Variance
          • Correlation
          • Covariance
          • Sum of squares (cross-product matrix for set variables)
          • Pairwise cross tabs
          • Risk ratio & odds ratio
          • Cross-tabulation of data (standard tables & long form)
          • Marginal summaries of cross tabulations
      Statistical Tests
          • Chi-square test
          • Kendall rank correlation
          • Fisher's exact test
          • Student's t-test
      Sampling
          • Subsample (observations & variables)
          • Random sampling
      Predictive Models
          • Sum of squares (cross-product matrix for set variables)
          • Multiple linear regression
          • Generalized linear models (GLM): exponential family distributions (binomial, Gaussian, inverse Gaussian, Poisson, Tweedie); standard link functions (cauchit, identity, log, logit, probit); user-defined distributions & link functions
          • Covariance & correlation matrices
          • Logistic regression
          • Classification & regression trees
          • Predictions/scoring for models
          • Residuals for all models
      Variable Selection
          • Stepwise regression
      Simulation
          • Simulation (e.g. Monte Carlo)
          • Parallel random number generation
      Cluster Analysis
          • K-means
      Classification
          • Decision trees
          • Decision forests
          • Gradient boosted decision trees
          • Naïve Bayes
      Combination / Custom Algorithms
          • rxDataStep
          • rxExec
          • PEMA-R API (new)
  • 25. ScaleR - Performance comparison
      Microsoft R Server has no data size limits in relation to the size of available RAM. When open source R operates on data sets that exceed RAM, it will fail. In contrast, Microsoft R Server scales linearly well beyond RAM limits, and parallel algorithms are much faster.
      • US flight data for 20 years
      • Linear regression on arrival delay
      • Run on a 4-core laptop, 16GB RAM and 500GB SSD
  • 26. DistributedR - Model development and model compute choice: "Write Once. Deploy Anywhere."
      Code portability across platforms (Microsoft R Server: DistributedR, ScaleR, ConnectR, DevelopR)
      Platform labels on the slide:
          • In the cloud: Azure Marketplace
          • Workstations & servers: Linux, Windows
          • EDW: Teradata
          • Hadoop: Hortonworks, Cloudera, MapR
          • Roadmap: + HD Insights, + Hadoop Spark, + R Tools for Visual Studio, + Azure ML, + SQL Server v16
  • 27. DistributedR - How Does Remote Execution Work?
      Diagram labels: algorithm master; big data; predictive algorithm; analyze blocks in parallel; load block at a time; distribute work, compile results; "pack and ship" requests to remote environments; results
      The results:
          • Even faster computation
          • Larger data set capacity
          • Fewer security concerns
          • No data movement, no copies
      Microsoft R Server functions:
          • A compute context defines the remote connection
          • Microsoft R functions are prefixed with rx
          • The current compute context determines the processing location
  • 28. DistributedR - Revolution Code Portability
      ScaleR models can be deployed from a server or edge node to run in Hadoop without any functional R model re-coding for MapReduce. The compute context R script sets where the model will run; the functional model R script does not need to change to run in Hadoop.
      Compute context script, local parallel processing (Linux or Windows):
          library(RevoScaleR)  # ScaleR functions (rx*, Rx*)
          ### SETUP LOCAL ENVIRONMENT VARIABLES ###
          myLocalCC <- "localpar"
          ### LOCAL COMPUTE CONTEXT ###
          rxSetComputeContext(myLocalCC)
          ### CREATE LINUX, DIRECTORY AND FILE OBJECTS ###
          linuxFS <- RxNativeFileSystem()
          AirlineDataSet <- RxXdfData("AirlineDemoSmall/AirlineDemoSmall.xdf", fileSystem = linuxFS)
      Compute context script, in-Hadoop:
          library(RevoScaleR)  # ScaleR functions (rx*, Rx*)
          ### SETUP HADOOP ENVIRONMENT VARIABLES ###
          myHadoopCC <- RxHadoopMR()
          ### HADOOP COMPUTE CONTEXT ###
          rxSetComputeContext(myHadoopCC)
          ### CREATE HDFS, DIRECTORY AND FILE OBJECTS ###
          hdfsFS <- RxHdfsFileSystem()
          AirlineDataSet <- RxXdfData("AirlineDemoSmall/AirlineDemoSmall.xdf", fileSystem = hdfsFS)
      Functional model script (the same for both compute contexts):
          ### ANALYTICAL PROCESSING ###
          ### Statistical Summary of the data
          rxSummary(~ ArrDelay + DayOfWeek, data = AirlineDataSet, reportProgress = 1)
          ### CrossTab the data
          rxCrossTabs(ArrDelay ~ DayOfWeek, data = AirlineDataSet, means = TRUE)
          ### Linear Model and plot
          hdfsXdfArrLateLinMod <- rxLinMod(ArrDelay ~ DayOfWeek + 0, data = AirlineDataSet)
          plot(hdfsXdfArrLateLinMod$coefficients)
  • 29. DistributedR - In-Hadoop (Microsoft R Server)
      • Uses Hadoop nodes for R computations
      • Eliminate data movement latency on very large data
      • Remove data duplication
      • Faster model development
      • No MapReduce R coding
      • Develop better models using all the data
  • 30. MRS and Hadoop Architecture options (diagram): RStudio Server Pro; Microsoft R Server; ScaleR; Production. Options: 1. Copy; 2. Stream; 3. Send
  • 31. DistributedR - Hadoop Processing Methods
      Method 1 ("beside" or "edge", copy to local file): Local (Linux) parallel processing using all cores on one node, copying data from HDFS to store in the local Linux file system.
          Diagram labels: compute context Local Parallel; processing on 1 edge node; data (Csv, Xdf) read from and written to the Linux file system; HDFS on 1:n data nodes with 1:n disks, i.e. 1:(n x number of nodes) disks
      Method 2: Local (Linux) parallel processing using all cores on one node, streaming data from / to HDFS.
          Diagram labels: compute contexts Local Parallel and Hadoop; processing on 1 edge node; data (Csv, Xdf) in HDFS on 1:n nodes with 1:n disks, i.e. 1:(n x number of nodes) disks
  • 32. Method 3 ("inside"): Hadoop (MapReduce) parallel processing using all cores on n nodes, using HDFS data on each node.
      Diagram labels: compute context Hadoop; R script sent to data nodes from 1 edge node; data (Csv, Xdf) read from and written to HDFS on 1:n nodes with 1:n disks, i.e. 1:(n x number of nodes) disks
      R model script sent to the master node:
          1. Starts a master process
          2. Distributes work
          3. Master tasks for each node
          4. Master initiates distributed work
              1. Hadoop schedules a mapper for each split
              2. Algorithm computes intermediate result
              3. Reducer combines intermediate results
          5. Master process evaluates completion
          6. Iterates as required by the algorithm
          7. Returns consolidated answer to script
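
The mapper / intermediate result / reducer steps in this flow can be sketched for one concrete case, linear regression, in base R. This is an illustration only. It runs in a single R session, uses no Hadoop or ScaleR calls, and the simulated data, the four-way split and the helper name `map_split` are all assumptions. Each split computes its cross-product matrices, the reduce step sums them, and the master solves the normal equations once.

```r
# Sketch only: a single-process imitation of the map/reduce flow.
set.seed(1)
n <- 6000
dat <- data.frame(x1 = rnorm(n), x2 = runif(n))
dat$y <- 2 + 3 * dat$x1 - 1.5 * dat$x2 + rnorm(n, sd = 0.1)

# "Splits": in Hadoop these would be HDFS blocks held on different nodes.
splits <- split(dat, rep(1:4, length.out = n))

# Map step (hypothetical helper): one intermediate result per split.
map_split <- function(d) {
  X <- cbind(Intercept = 1, x1 = d$x1, x2 = d$x2)
  list(xtx = crossprod(X), xty = crossprod(X, d$y))
}
intermediate <- lapply(splits, map_split)

# Reduce step: sum the intermediate results across all splits.
combined <- Reduce(
  function(a, b) list(xtx = a$xtx + b$xtx, xty = a$xty + b$xty),
  intermediate
)

# Master step: solve the normal equations on the combined result.
beta <- drop(solve(combined$xtx, combined$xty))
```

Linear regression needs only one pass, so the "iterates as required" step does not apply here. An iterative algorithm such as logistic regression would repeat the map and reduce steps with updated coefficients until the master judges that it has converged.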
  • 33. DistributedR - What processing mode to use, when?
      Analytic data set size and processing complexity (e.g. simple summary statistics vs. iterative algorithm) guide the use of Methods 1 and 2 (edge node / server Linux local processing) vs. Method 3 (in-Hadoop processing).
      Chart axes: processing complexity (low, medium, high) against data size (small data < 10GB, medium data < 50GB, bigger data > 50GB)
      Legend: edge node Linux processing; in-Hadoop processing; local Linux file system; Hadoop file system
  • 34. While Open Source R delivers:
      • Capability
          • 6500+ algorithm & connector packages available for free in CRAN
      • Simplicity
          • R skills transfer / lower cost of talent
          • Ease of integration with other analytics packages & data
          • Access to huge libraries of R analytical algorithms
      • Speed
          • Intel-optimized computation
      SQL Server R Services and Microsoft R Server deliver:
      • Peace of mind
          • Knowledge that your business is using a stable platform backed with commercial support and services
          • Platform longevity for more predictability around costs
      • Speed and scalability
          • Faster decisions using advanced analytics that were previously unachievable
          • In-Hadoop & in-Teradata analysis
      • Efficiency
          • Continue getting returns on existing hardware and software investments
          • Developers can write code once and deploy it anywhere, keeping costs low
      • Flexibility and agility
          • Model data in a hybrid environment: on-premises, in the cloud, or both
          • Scripting, modeling, and in-database analytics across platforms shrink analysis time and enable agile response to business needs
  • 36. Introducing Microsoft R Open
      • Enhanced open source R distribution
          • Based on the latest open source R (3.1.2)
          • Built, tested and distributed by Microsoft
          • Enhanced by the Intel MKL library to speed up linear algebra functions
      • Compatible with all R-related software
          • CRAN packages, RStudio, third-party R integrations, ...
      • Revolutions open-source R packages
          • Reproducible R Toolkit: checkpoint, miniCRAN
          • ParallelR: parallelise execution via 'foreach' loop
          • RHadoop: rhdfs, rhbase, ravro, rmr2, plyrmr
          • AzureML: read/write data to Azure ML, publish R code as ML API
      • MRAN website: mran.revolutionanalytics.com
          • Enhanced documentation and learning resources
          • Discover 6500 free add-on R packages
      • Open source (GPLv2 license): 100% free to download, use and share
  • 37. CRAN, MRO, MRS Comparison
      • Data size
          • CRAN R: in-memory
          • Microsoft R Open: in-memory
          • Microsoft R Server: in-memory or disk-based
      • Speed of analysis
          • CRAN R: single-threaded
          • Microsoft R Open: multi-threaded
          • Microsoft R Server: multi-threaded, parallel processing on 1:N servers
      • Support
          • CRAN R: community
          • Microsoft R Open: community
          • Microsoft R Server: community + commercial
      • Analytic breadth & depth
          • CRAN R: 7500+ innovative analytic packages
          • Microsoft R Open: 7500+ innovative analytic packages
          • Microsoft R Server: 7500+ innovative packages + commercial parallel high-speed functions
      • Licence
          • CRAN R: open source
          • Microsoft R Open: open source
          • Microsoft R Server: commercial license; supported release with indemnity
  • 38. CRAN R compared to Microsoft R Open
      More efficient and multi-threaded math computation. Benefits math-intensive processing; no benefit to program logic or data transformation.
      • Matrix calculation: up to 27x faster
      • Matrix functions: up to 16x faster
      • Programming: 0x faster (no speed-up)
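
One way to see the difference between math-intensive work and program logic on your own machine is a small base R timing sketch like the one below. It is an illustration only. The matrix size and loop length are arbitrary assumptions, and it does not reproduce the 27x or 16x figures on the slide, which depend on hardware and on the math library R is linked against.

```r
# Sketch only: arbitrary sizes, base R functions only.
set.seed(7)
n <- 300
A <- matrix(rnorm(n * n), n, n)

# Math-intensive work: crossprod() is handed to the BLAS library, so a
# multi-threaded BLAS (such as Intel MKL) can speed it up.
t_matrix <- system.time(AtA <- crossprod(A))

# Program logic: an interpreted R loop, which a faster BLAS does not touch.
t_loop <- system.time({
  total <- 0
  for (i in seq_len(200000)) total <- total + i %% 7
})

# Elapsed seconds for each kind of work on this machine.
print(rbind(matrix = t_matrix, loop = t_loop)[, "elapsed"])
```

Running the same script under CRAN R and under Microsoft R Open would be expected to change the matrix timing much more than the loop timing, in line with the slide's claim.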