Create a Data Science Lab with Microsoft and Open Source tools
Marcel Franke, pmOne AG, Germany
About me – Marcel Franke
Practice Lead Advanced Analytics & Data Science
pmOne AG – Germany, Austria, Switzerland
>10 years of experience with large-scale
Data Warehouses based on SQL Server
Blog: dwjunkie.wordpress.com
What is data science?
The Definition
Data science incorporates varying
elements and builds on techniques and
theories from many fields, including
mathematics, statistics, data engineering,
pattern recognition and learning, advanced
computing, visualization, uncertainty
modeling, data warehousing, and high
performance computing with the goal of
extracting meaning from data and
creating data products.

Source: http://en.wikipedia.org/wiki/Data_science
A brief look into history
GAMBLING –
THAT’S WHERE
EVERYTHING
STARTED
The beginnings of gambling
Gambling has existed since 3000 BC
The first games were based on dice

Origins in China and Mesopotamia
* Source: Tiemeyer, E.; Zsifkovitis, H.: Information als Führungsmittel, München: Computerwoche Verlag 1995
Scientific foundations
17th century: the paradox of the
Chevalier de Méré
Pascal and Fermat discussed
the paradox in several letters
The beginning of probability
theory
* Source: http://de.wikipedia.org/wiki/De-M%C3%A9r%C3%A9-Paradoxon
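The paradox can be checked with a few lines of arithmetic. The sketch below assumes the textbook formulation of the two games (at least one six in 4 throws of one die, versus at least one double six in 24 throws of two dice); the slide does not spell out these numbers, and Python is used here only for illustration.

```python
from fractions import Fraction

def p_at_least_one(p_single: Fraction, throws: int) -> Fraction:
    """Probability of at least one success in `throws` independent throws."""
    return 1 - (1 - p_single) ** throws

# Game 1: at least one six in 4 throws of a single die.
p_one_die = p_at_least_one(Fraction(1, 6), 4)
# Game 2: at least one double six in 24 throws of two dice.
p_two_dice = p_at_least_one(Fraction(1, 36), 24)

print(f"one die, 4 throws:   {float(p_one_die):.4f}")   # 0.5177
print(f"two dice, 24 throws: {float(p_two_dice):.4f}")  # 0.4914
```

The first game is slightly favourable and the second slightly unfavourable, although both look like "two thirds of the possible outcomes" at first glance (4 of 6 versus 24 of 36).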
The science in Data Science
Calculate probabilities
Pattern recognition
Calculation of analytical variance
Machine Learning
Simulations
Predictions
BI, Data Mining & Prediction
WEATHER
FORECAST
What do companies do today?
Walmart – The pioneer of data analytics

Source: Data Unser – Dr. Bloching; images: walmart.com, yourdealz.de, squidoo.com, fuzzybrew.com
Visa

80% correct prediction of divorces
within the next 5 years
Reason: divorce is the biggest risk factor
for personal insolvency
Source: visa.de
Customers need to find the right case

What do consumers
really do?
Blonde looks
somehow different

The new washing powder is really great…
Data can be accessed easily…
… but it's hard to analyze.
Other areas of application
SOCIAL
MEDIA

PRODUCT RECOMMENDATION
RETARGETING

PREDICTIVE
MAINTENANCE

PREDICT RISKS

areas of
application
SALES PREDICTIONS

CUSTOMER ANALYSIS

DYNAMIC PRICING

DISPOSITION
How does this fit with Big Data?
Our starting point…
Structured data

Unstructured data

Harmonize and
generate information
(role of the “Data Scientist”)

“Big Data”
Volume, Variety, Velocity
Typical Big Data Architecture
Big Data Analytics

Excel

Big Data Advanced Analytics

PowerPivot
Big Data Preparation (SQL, MapReduce)

Unstructured data

Structured data
Massively Parallel Processing

Big Data Storage Platform
“[Facebook] started in the Hadoop world. We are now bringing in
relational to enhance that. We're kind of going [in] the other
direction.”
“We've been there, and [we] realized that using the wrong
technology for certain kinds of problems can be difficult. We
started at the end and we're working our way backwards, bringing
in both.”
Ken Rudin, Director of Analytics for Facebook
Source: http://tdwi.org/articles/2013/05/06/facebooks-relationalplatform.aspx
A few words about “R”
• R is a language and environment for statistical computing and graphics
• R is open source under the GNU General Public License
• Most widely used statistical software
• Everything happens in memory
• Comes with a package manager (~5,000 packages)
• Also provides graphics functionality
Samples of R
How to approach projects?
Starting Point
Problems that we already know from the BI world are further exacerbated by big data:
• The complexity of systems grows constantly
• The amount of data grows exponentially (= Big Data)
• Changes are needed more frequently and reach ever deeper into business rules
• Solutions can no longer be fully thought out in advance
Solution Option 1 – Classic Deterministic

Everything can be planned and
designed at the drawing board…
How do the products and components of a system, and the
relationships between them, behave together?

Source: Cesar Hidalgo
Solution Option 2 – Learn from “Mother Nature”
• How does nature deal with complex non-linear systems?
• Evolution – variation and selection – “trial and error”

“It is not the strongest of the species that
survives, nor the most intelligent, but the one
most responsive to change.” (Charles Darwin)
A candlestick?
45 Iterations

Technology helps to speed up iterations.
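The "variation and selection" loop can be sketched as a tiny evolutionary search. This is only an illustration under stated assumptions: the objective function `toy_fitness`, the starting design, and the mutation step size are hypothetical and not taken from the talk; only the 45 iterations echo the slide.

```python
import random

def evolve(fitness, start, iterations=45, step=0.5, seed=0):
    """Trial and error: mutate the current design, keep the mutant only if it is at least as fit."""
    rng = random.Random(seed)
    best, best_score = start, fitness(start)
    history = [best_score]
    for _ in range(iterations):
        candidate = [x + rng.gauss(0, step) for x in best]  # variation
        score = fitness(candidate)
        if score >= best_score:                              # selection
            best, best_score = candidate, score
        history.append(best_score)
    return best, history

def toy_fitness(design):
    """Hypothetical objective: designs closer to a hidden optimum score higher."""
    target = [3.0, -1.0]
    return -sum((d - t) ** 2 for d, t in zip(design, target))

best, history = evolve(toy_fitness, start=[0.0, 0.0])
print(f"fitness before iterating:    {history[0]:.3f}")
print(f"fitness after 45 iterations: {history[-1]:.3f}")
```

No step of the loop needs a model of why a design is good. It only needs a fast way to evaluate a candidate, which is where technology shortens each iteration.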
Laboratory & Factory
The laboratory

Trial & Error
Pattern Recognition
Analytical Apps
An efficient laboratory to experiment
Power Pivot
In-Memory

Microsoft Excel

Power View

Unstructured
Data

Power Query

Source Systems

Power Map

SQL Server

Structured
Data
OLE DB
OData

WebServer-Logs
Sensor-Data

Data Marketplace

SAP

Databases
The factory
Easy to consume
Integrated in the business process
Analyze mass data
Host it and run it
At enterprise scale
For the real-time enterprise
Stable Big Data Architecture
Prediction &
Data Science

Front-Ends &
Mobile
Windows
Azure

On-Premises

Source Systems

Unstructured
Data

WebServer-Logs
Sensor-Data

HDInsight

SQL Server PDW

Data Marketplace

Structured
Data

SAP

Databases
How do we scale?
The battle
How do we scale?
Relational data & compute

SQL Server 2012
Parallel Data
Warehouse
Half Rack

Infiniband

Analytical data &
compute

HP DL 385
40 Cores
2 TB RAM
Fusion-IO Card
What is Revolution Analytics?
• Founded in 2007
• Aim: evolve R for high performance
• Offers R packages for faster performance and
greater stability
• Enterprise & Community products
• Stand-alone, scale-out (HPC), on Hadoop
How do we handle our data?
R-ODBC: 10 MB/s

Flat file export: 80 MB/s

Data preparation

Data transfer

predictive scripts
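To get a feeling for what the two transfer rates mean, here is a rough back-of-the-envelope sketch. It uses the data volume reported in the results of this deck (30,000 customers, 50,000 rows each, 54 columns) and assumes, purely for illustration, that every cell takes 8 bytes and that the full data set is transferred; the real row width is not given in the slides.

```python
BYTES_PER_VALUE = 8  # assumption: every cell is stored as a 64-bit number

def transfer_hours(customers, rows_per_customer, columns, mb_per_s):
    """Hours needed to move the whole data set at a constant rate in MB/s."""
    total_bytes = customers * rows_per_customer * columns * BYTES_PER_VALUE
    seconds = total_bytes / (mb_per_s * 1_000_000)
    return seconds / 3600

for label, rate in [("R-ODBC", 10), ("Flat file export", 80)]:
    hours = transfer_hours(30_000, 50_000, 54, rate)
    print(f"{label}: {hours:.2f} h")
# R-ODBC: 18.00 h
# Flat file export: 2.25 h
```

Under these assumptions the choice of transfer path alone changes the run time by a factor of eight, before any predictive script has started.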
Results
• Generate predictions for 30,000 customers
– 50,000 rows per customer, 54 columns
– Customer goal: 5 minutes
– Our solution: 7,500 customers in 5 minutes
– Benchmark: 1 minute
• Revolution Analytics ODBC driver does not work with PDW
• Standard R ODBC driver reads data at 10 MB/s
• Workaround via flat file export
• RDS format is faster than CSV
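The gap between the measured result and the customer goal can be worked out directly from the figures above. The sketch assumes that throughput stays constant as the number of customers grows, which the slides do not confirm.

```python
def minutes_for_all(total_customers, customers_per_batch, minutes_per_batch):
    """Linear extrapolation from one measured batch to the full customer base."""
    return total_customers / customers_per_batch * minutes_per_batch

goal_minutes = 5
needed = minutes_for_all(30_000, 7_500, 5)
print(f"all customers: {needed:.0f} min, i.e. {needed / goal_minutes:.0f}x the customer goal")
# all customers: 20 min, i.e. 4x the customer goal

# Rows scored per second at the measured rate (7,500 customers x 50,000 rows in 5 minutes).
rows_per_second = 7_500 * 50_000 / (5 * 60)
print(f"throughput: {rows_per_second:,.0f} rows/s")
# throughput: 1,250,000 rows/s
```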
Other solutions?
• R in database
• R on Hadoop
– RHadoop
– Revolution Analytics RHadoop
Other solutions?
• Services & Cloud
THANK YOU!
• For attending this session and
PASS SQLRally Nordic 2013, Stockholm


Editor's Notes

  • #6: A lot of topics and skills are combined. Data warehousing is also a part of it. More statistics and mathematics skills are needed.
  • #7: Where does Data Science come from?
  • #8: When you do some research on that topic, you will automatically stumble upon gambling or games of chance.
  • #9: Dice cup
  • #10: Two scientists started thinking about gambling in a more scientific way. They wrote very long letters back and forth. The probability of winning differs depending on whether you play with one die or two.
  • #11: 1.) How big is the probability to win or lose, or to reach a certain goal? 2.) Is there any correlation between customer income and the sales amount? 5.) What happens if we change certain parameters like price? 6.) What is the sales amount of a certain product in the next quarter or year?
  • #12: How does this topic fit with BI?
  • #13: What can I do with it?
  • #14: So what do companies do with it? I consciously didn't use the term Big Data, but you all know that this new area is very hot in marketing and news. So what are the good examples & use cases?
  • #15: Kasse – cash desk; Belohnung – reward; Windel – nappy
  • #23: Emphasize the importance of R -> almost all vendors build on R. It is used a lot in the open source space.
  • #32: Injector for washing pellets. Waste, poor quality,
  • #36: Idea of a process model called Lab & Factory. Experimental approach. Iterative. Fast. Find new patterns.
  • #37: It is for the data scientist to experiment.
  • #40: If we find something interesting, we can deploy it to the factory. It's the place where we run our analytical code at enterprise scale.
  • #43: Most of the analytical tools have been out there for years, like databases, R, SAS, SPSS. We often hear about limitations in scalability & performance. DB -> MPP; R, SAS -> in-memory.
  • #44: POC on different analytic use cases with the big vendors: complex SQL queries, simulations, predictions with R.
  • #45: SQL -> we know how to scale. R -> scaling is difficult, hence Revolution.
  • #49: No stable market, many options.