No More Cumbersomeness:
Automatic Predictive Modeling
on Apache Spark
Masato Asahara and Ryohei Fujimaki
NEC Corporation
Jun/07/2017 @Spark Summit 2017
* 9LivesData is our partner for this work.
2 © NEC Corporation 2017
Who We Are
▌Masato Asahara (Ph.D.)
▌Researcher, NEC System Platform Research Laboratory
Masato currently leads the development of Spark-based machine learning
and data analytics systems that fully automate predictive modeling.
Masato received his Ph.D. degree from Keio University, and has worked at
NEC for 7 years as a researcher in the field of distributed computing
systems and computing resource management technologies.
▌Ryohei Fujimaki (Ph.D.)
▌Research Fellow, NEC Data Science Research Laboratory
Ryohei is a research fellow in the Data Science Research Laboratories at NEC
Corporation, a leading provider of advanced analytics technologies based on
artificial intelligence.
In addition to technology R&D, Ryohei is also heavily involved in
co-developing cutting-edge advanced analytics solutions with NEC’s global
business clients and partners.
Ryohei received his Ph.D. degree from the University of Tokyo, and
became the youngest research fellow ever in NEC Corporation’s 117-year
history.
3 © NEC Corporation 2017
Agenda
Predictive model
Prediction results
Training Data
Validate Data
Test Data
Yes
No Yes
4 © NEC Corporation 2017
Agenda
Input
NEC Automatic Predictive Modeling
High-speed
Generation
Highly accurate
predictive model
&
prediction results
Training Data
Validate Data
Test Data
5 © NEC Corporation 2017
Agenda
θ1
θ2
θ3
Automatic Predictive Modeling
7 © NEC Corporation 2017
Enterprise Applications of Predictive Analysis
Driver Risk
Assessment
Inventory
Optimization
Churn
Retention
Predictive
Maintenance
Product Price
Optimization
Sales
Optimization
Energy/Water
Operation Mgmt.
8 © NEC Corporation 2017
Predictive Model Design Is a “Black Art”
Tuning
Best Balance
Feature Selection
Algorithm Selection
Accuracy vs. Transparency
Prediction
Black box
Input
Data
Input
Data
White box
Prediction
Determine a Set of Features
Sales = f(Price, Location)
Sales = f(Price, Weather)
or
Takes time…
9 © NEC Corporation 2017
NEC Automatic Predictive Modeling
Tuning
Best Balance
Feature Selection
Algorithm Selection
Accuracy vs. Transparency
Prediction
Black box
Input
Data
Input
Data
White box
Prediction
Determine a Set of Features
Sales = f(Price, Location)
Sales = f(Price, Weather)
or
10 © NEC Corporation 2017
Explore Massive Modeling Possibilities
Algorithms
Yes
No Yes
Parameters
Data
Preprocessing
Strategies
Yes
No Yes
Feature
Selection!
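The search space here is the cross product of algorithms, hyperparameters, data preprocessing strategies, and feature-selection settings. As a rough illustration (the names and counts below are hypothetical, not NEC's actual search space), even a handful of choices per axis multiplies into dozens of candidate pipelines:

```python
from itertools import product

# Hypothetical search space: a few choices per axis already yields 81 candidates.
algorithms = ["logistic_regression", "decision_tree", "gradient_boosting"]
hyperparams = [{"reg": 0.01}, {"reg": 0.1}, {"reg": 1.0}]
preprocessing = ["standardize", "one_hot_encode", "log_transform"]
feature_counts = [50, 100, 500]

candidates = list(product(algorithms, hyperparams, preprocessing, feature_counts))
print(len(candidates))  # 3 * 3 * 3 * 3 = 81 pipelines to train and validate
```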
11 © NEC Corporation 2017
Automate and Accelerate by Apache Spark!
Algorithms
Yes
No Yes
Parameters
Data
Preprocessing
Strategies
Complete in hours!
Yes
No Yes
12 © NEC Corporation 2017
Modeling and Prediction Flow
Training
Data
Validate
Data
Train
models
Validate
models
Models
Test
Data
Predict
Best model
Validation
criteria
Best
Prediction
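The flow above is a train/validate/select loop: train many candidate models, score each against a validation criterion, and predict with the best one. A minimal, self-contained sketch using scikit-learn (which the evaluation section also uses for its manual baselines); the synthetic data and the two candidate models are purely illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for Training / Validate / Test Data.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

candidates = [LogisticRegression(max_iter=1000),
              RandomForestClassifier(n_estimators=200, random_state=0)]

# Train models -> validate models -> keep the best by the validation criterion.
best_model, best_score = None, float("-inf")
for model in candidates:
    model.fit(X_train, y_train)                                      # Train
    score = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])   # Validate
    if score > best_score:
        best_model, best_score = model, score

predictions = best_model.predict_proba(X_test)[:, 1]                 # Predict with best model
```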
Design Challenges and Solutions
14 © NEC Corporation 2017
3 Design Challenges
θ1
θ2
θ3
15 © NEC Corporation 2017
3 Design Challenges
θ1
θ2
θ3
16 © NEC Corporation 2017
Implementation Gap between Spark and ML engines
Spark:
▌Distributed memory architecture
▌MapReduce computation model
▌Scala on JVM
ML engines:
▌Single or shared memory architecture
▌Standard computing framework
▌High-speed native binary code
17 © NEC Corporation 2017
Machine Learning
(Map operation)
Convert to
Matrix
Data Preprocessing
(MapReduce)
Mini-gapped Integration
Smoothly bridge ‘distributed’ preprocessing and ‘parallel’ execution of ML engines
Training
Data
(Parquet)
HDFS
HDFS
Models
Yes
No Yes
Yes
No Yes
RDD
element
RDD
element
RDD
element
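A hedged sketch of the bridging pattern on this slide: preprocessing runs as distributed MapReduce over RDD elements, each element is then converted into an in-memory matrix, and a model is trained on it inside a single map operation. The element granularity, the random data, and the scikit-learn learner are assumptions for illustration; NEC's engine hands the matrix to high-speed native-binary ML code instead.

```python
import numpy as np
from pyspark import SparkContext
from sklearn.linear_model import LogisticRegression

sc = SparkContext(appName="bridge-sketch")

def make_candidate(seed):
    # Stand-in for the output of the MapReduce preprocessing stage:
    # one candidate configuration plus its preprocessed rows.
    rng = np.random.RandomState(seed)
    rows = rng.rand(1000, 11).tolist()   # last column plays the label
    return {"C": 10.0 ** -seed, "rows": rows}

candidates = sc.parallelize([make_candidate(s) for s in range(4)])   # RDD elements

def train(candidate):
    rows = np.asarray(candidate["rows"])
    X, y = rows[:, :-1], (rows[:, -1] > 0.5).astype(int)   # convert to matrix
    return LogisticRegression(C=candidate["C"], max_iter=1000).fit(X, y)

models = candidates.map(train).collect()                   # ML as a map operation
```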
18 © NEC Corporation 2017
Validation
(MapReduce)
Predict
(Map operation)
Convert to
Matrix
Data Preprocessing
(MapReduce)
Mini-gapped Integration
Smoothly bridge ‘distributed’ preprocessing and ‘parallel’ execution of ML engines
Validate
Data
(Parquet)
HDFS
HDFS
Best
Model
RDD
element
RDD
element
RDD
element
19 © NEC Corporation 2017
Predict
(Map operation)
Convert to
Matrix
Data Preprocessing
(MapReduce)
Mini-gapped Integration
Smoothly bridge ‘distributed’ preprocessing and ‘parallel’ execution of ML engines
Test Data
(Parquet)
HDFS
HDFS
Prediction
Results
(Parquet)
RDD
element
RDD
element
RDD
element
20 © NEC Corporation 2017
3 Design Challenges
θ1
θ2
θ3
21 © NEC Corporation 2017
Naive Implementation Requires Multiple Data Loads & Conversions
Load &
Convert
Wasted memory
Data loaded from other servers
Repeated data-to-matrix conversions
Parameter θ1
Parameter θ1
Parameter θ1
Matrix X1
Matrix X2
Matrix X3
22 © NEC Corporation 2017
Parameter-aware Scheduling
Efficient memory usage
Less data loaded from other hosts
Fewer data-to-matrix conversions
Parameter θ1
Parameter θ2
Parameter θ3
Matrix X1
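A minimal sketch of the idea behind parameter-aware scheduling (names are illustrative, and this is plain Python rather than the actual Spark scheduler): group the parameter settings that can share one in-memory matrix onto the same worker, so the data is loaded and converted once and then reused for θ1, θ2, θ3.

```python
from collections import defaultdict
import numpy as np

def group_by_data(tasks):
    """tasks: iterable of (data_id, parameter) pairs -> {data_id: [parameters]}."""
    groups = defaultdict(list)
    for data_id, theta in tasks:
        groups[data_id].append(theta)            # θ settings that share the same matrix
    return groups

def run_group(load_rows, data_id, thetas, train):
    X = np.asarray(load_rows(data_id))            # convert to matrix once ...
    return [train(X, theta) for theta in thetas]  # ... reuse it for every θ

# Three parameters on one dataset -> one load & convert instead of three.
tasks = [("X1", {"reg": 0.01}), ("X1", {"reg": 0.1}), ("X1", {"reg": 1.0})]
print(dict(group_by_data(tasks)))   # {'X1': [{'reg': 0.01}, {'reg': 0.1}, {'reg': 1.0}]}
```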
23 © NEC Corporation 2017
3 Design Challenges
θ1
θ2
θ3
24 © NEC Corporation 2017
Major Part of Computing is Machine Learning
Machine Learning
(Map operation)
Convert
to
Matrix
Data Preprocessing
(MapReduce)
Training
Data
(Parquet)
HDFS
HDFS
Major part of total
execution time!
Yes
No Yes
25 © NEC Corporation 2017
Bad Scheduling degrades Execution Efficiency
5 min 5 min
1 min 1 min
Wait 8 min…
Yes
No Yes
Yes
No Yes
26 © NEC Corporation 2017
Predictive Scheduling increases Performance Scalability
Balance the learning of complex models and simple ones
5 min 1 min
5 min 1 min
Yes
No Yes
Yes
No Yes
♪~
♪~
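One way to read this slide: predict how long each candidate will take to train and assign the longest jobs first, so long and short jobs are interleaved across executors (the classic longest-processing-time-first heuristic). The sketch below assumes predicted runtimes are already available; how NEC's engine predicts them is not described here.

```python
import heapq

def assign(jobs, n_executors):
    """jobs: list of (name, predicted_minutes) -> list of (load, executor_id, names)."""
    executors = [(0, i, []) for i in range(n_executors)]          # (load, id, jobs)
    heapq.heapify(executors)
    for name, minutes in sorted(jobs, key=lambda j: -j[1]):       # longest first
        load, i, assigned = heapq.heappop(executors)              # least-loaded executor
        assigned.append(name)
        heapq.heappush(executors, (load + minutes, i, assigned))
    return sorted(executors, key=lambda e: e[1])

jobs = [("complex_model_a", 5), ("complex_model_b", 5),
        ("simple_model_a", 1), ("simple_model_b", 1)]
for load, i, assigned in assign(jobs, 2):
    print(f"executor {i}: {assigned} -> {load} min")
# executor 0: ['complex_model_a', 'simple_model_a'] -> 6 min
# executor 1: ['complex_model_b', 'simple_model_b'] -> 6 min
```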
27 © NEC Corporation 2017
Spark on YARN Tips for Stable Execution
The default configuration sometimes fails execution even with enough memory…
We give plenty of memory to Spark, but it keeps failing.
Why!?
The Spark Web UI says…
28 © NEC Corporation 2017
Spark on YARN Tips for Stable Execution
… because JVM system memory suddenly spikes above the YARN limit (*)
YARN limit (6 GB)
Spike of JVM system memory usage
(Chart: memory (GB) over time)
(*) Shivnath and Mayuresh. “Understanding Memory Management In Spark For Fun And Profit,” Spark Summit 2016.
29 © NEC Corporation 2017
Spark on YARN Tips for Stable Execution
Tip: carefully configure ‘spark.yarn.*.memoryOverhead’
(http://spark.apache.org/docs/2.1.1/running-on-yarn.html)
A 15% overhead was required in our case…
Optimizing the overhead memory configuration is future work
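For reference, the relevant keys in the linked Spark 2.1.1 docs are spark.yarn.executor.memoryOverhead and spark.yarn.driver.memoryOverhead (off-heap memory per executor/driver, in MiB). Below is a minimal illustrative PySpark setup reserving roughly 15% of the executor heap as overhead; the exact numbers are examples, not the configuration used in the talk.

```python
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setMaster("yarn")
        .set("spark.executor.memory", "6g")
        # Roughly 15% of the 6 GB executor heap, in MiB, instead of the 10% default.
        .set("spark.yarn.executor.memoryOverhead", "920")
        .set("spark.yarn.driver.memoryOverhead", "920"))
sc = SparkContext(conf=conf)
```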
Evaluation
31 © NEC Corporation 2017
Evaluation Setup
• Compare accuracy with manual predictive modeling
• Measure execution time
▌Prediction problem
Targeting the top 10% of samples most likely to be positive
▌Manual predictive modeling
Done with scikit-learn v0.18.1
All parameters were set to their default values
• except RandomForest (n_estimators = 200)
▌Data set
KDDCUP 2014 competition data
• 557K records for training and validation data
• 62K records for test data
• Features: 500
KDDCUP 2015 competition data
• 108K records for training and validation data
• 12K records for test data
• Features: 500
IJCAI 2015 competition data
• 87K records for training, validation, and test data
• Features: 500
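As a reminder of what the top-10% precision numbers on the results slide mean (assuming the standard reading of the metric): rank the test samples by predicted score and measure precision among the top 10%. A small illustrative sketch:

```python
import numpy as np

def top_fraction_precision(y_true, y_score, fraction=0.10):
    y_true, y_score = np.asarray(y_true), np.asarray(y_score)
    k = max(1, int(len(y_score) * fraction))
    top_idx = np.argsort(-y_score)[:k]     # indices of the k highest-scored samples
    return y_true[top_idx].mean()          # share of true positives among them

# Tiny example: with 10 samples, the "top 10%" is the single highest-scored one.
y_true  = [0, 1, 0, 0, 1, 0, 0, 0, 1, 0]
y_score = [0.1, 0.9, 0.2, 0.3, 0.8, 0.1, 0.4, 0.2, 0.7, 0.3]
print(top_fraction_precision(y_true, y_score))   # 1.0
```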
32 © NEC Corporation 2017
Cluster Spec.
▌Size: 3U!
▌# Server modules: 34
▌CPU: 272 Intel Xeon-D 2.1GHz cores
(128 cores used in the evaluation)
▌MEM: 2TB
High memory performance: 16x the HiBench v5 TeraSort score of 3 x 1U servers!
▌Storage: 34TB SSD
▌Network: 10GbE for internal network
▌Spark v1.6.0, Hadoop v2.7.3
Scalable Modular Server
(DX2000)
http://www.nec.com/en/global/prod/dxseries/dx2000/product/index.html
33 © NEC Corporation 2017
Precision and Execution Time
NEC Automatic Predictive Modeling automatically produces competitively
accurate models in hours!
Top-10% Precision
Data         NEC    Logistic Regression  SVM    Random Forests
KDDCUP 2014  15.6%  13.5%                12.0%  14.8%
KDDCUP 2015  97.1%  95.5%                93.1%  97.2%
IJCAI 2015   8.2%   8.3%                 8.1%   8.2%

Execution time
Data         NEC
KDDCUP 2014  172 minutes
KDDCUP 2015  45 minutes
IJCAI 2015   36 minutes
34 © NEC Corporation 2017
Summary
θ1
θ2
θ3
35 © NEC Corporation 2017
Future work
Speed up with
FPGAs and vector processors
Extend to other models
(e.g., deep learning)
Reduce YARN
memory overhead
Appendix
38 © NEC Corporation 2017
Automation Engine
Software Stack
Computing Infrastructure
Data Store / Resource Management
In-Memory Distributed Computing
Web UI
Native-speed
Machine Learning
Libraries
39 © NEC Corporation 2017
Scalable Modular Server (DX2000) Spec.: Server Module
Form factor Server module that plugs into the Module Enclosure
Number of Processors 1
Processors Intel® Xeon® Processor D-1527(2.20GHz/4-core/6MB)
Intel® Xeon® Processor D-1541(2.10GHz/8-core/12MB)
Intel® Xeon® Processor D-1571(1.30GHz/16-core/24MB)
Memory type DDR4-2133 ECC SO-DIMM
Memory slots 4
Memory capacity 16 GB / 32 GB / 64 GB
Storage type M.2 SATA SSD
Internal storage capacity 128 GB / 256 GB / 512 GB / 1TB
Network 2 10GbE links to switch modules
2 additional 10GbE links to switch modules with an optional 10G LAN
module (Occupies one server module slot)
Systems management EXPRESSSCOPE Engine 3
Operating systems and
virtualization software
Red Hat® Enterprise Linux® 7.2 / 6.8
VMware ESXi™ 6.0
Microsoft® Windows® Server 2012 R2
CentOS 7.2
Ubuntu 14.04 LTS / 16.04 LTS
40 © NEC Corporation 2017
Scalable Modular Server (DX2000) Spec.: Module Enclosure
Form factor / height 3U Rack
Server module slots 44 (22 slots can be used for 10G LAN modules and 8 slots can be used
for PCIe cards)
* The number of installable modules may change according to
configuration.
Network switches 2 network switch modules (L2)
Redundant cooling fan Standard, hot plug
Power supplies 3 hot-plug power supplies, 1600 W
200-240 VAC ± 10% 50 / 60 Hz ± 3 Hz
Redundant power supply 2+1 redundant, hot plug
Temperature and humidity
conditions (non-condensing)
Operating: 10 to 35* °C / 50 to 95* °F, 20 to 80%
Non-operating: -10 to 55 °C / 14 to 131 °F, 20 to 80%
* In specific configurations, the operable ambient temperature is up to
40°C/104°F
Dimensions (W x D x H) and
maximum weight
448.0 x 769.0 x 130.0 mm / 17.6 x 30.3 x 5.1 in, 48 kg / 105.82 lbs
41 © NEC Corporation 2017
Scalable Modular Server (DX2000) Spec.: Network Switch Module (L2)
Form factor Network Switch Module that plugs into the Module
Enclosure
Network Up link: 8 40G QSFP+ ports plus 1 1000BASE-T for
management
Down link: 44 10GbE