Engineering Patterns for Implementing
Data Science Models on Big
Data Platforms
Hisham Arafat
Digital Transformation Lead Consultant
Solutions Architect, Technology Strategist & Researcher
Riyadh, KSA – 31 January 2017
Source: http://www.visualcapitalist.com/what-happens-internet-minute-2016/
Big Data…Practical Definition!
• Big Data is the challenge, not the solution
• Big Data technologies address that challenge
• Practically:
• Massive Streams
• Unstructured
• Complex Processing
Let’s Have a Use Case…Social Marketing
Social Marketing…Looks Simple!
Ingest Social Feeds
Build Corpus Metrics
Design Text Mining Model
Deploy All to a Big Data Platform
Application for Marketing Users
What are people saying about our new brand “LemaTea”?
It’s NOT as Easy as It Looks!
Not Only Building an Appropriate Model, but
More About
Designing a Solution…Engineering Factors
Engineering Factors
• Interfacing with sources: REST APIs, source HTML,… (text is assumed)
• Parsing to extract: queries, regular expressions,…
• Crawling frequency: every 1 minute, 1 hour, on event,…
• Document structure: post, post + comments, #, reach, retweets,…
• Metadata: time, date, source, tags, authoritativeness,…
• Transformations: canonicalization, weights, tokenization,…
- Size: average of 2 KB / doc
- Initial load: 1.5B docs
- Frequency: every 5 minutes
- Throughput: 2 KB * 60,000 docs = 120 MB / load
- Growth per day ~ 34 GB (288 loads * 120 MB)
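The ingestion sizing can be reproduced with a few lines of arithmetic. The following is a minimal sketch, assuming decimal units (1 MB = 1,000 KB), which is what makes 2 KB * 60,000 come out at 120 MB; the function name `ingestion_profile` is hypothetical and used only for illustration.

```python
KB = 1_000          # decimal units, so that 2 KB * 60,000 docs = 120 MB
MB = 1_000 * KB
GB = 1_000 * MB


def ingestion_profile(doc_size_bytes, docs_per_load, minutes_between_loads):
    """Derive per-load and per-day volumes from the three sizing inputs."""
    loads_per_day = 24 * 60 // minutes_between_loads
    bytes_per_load = doc_size_bytes * docs_per_load
    return {
        "loads_per_day": loads_per_day,
        "docs_per_day": docs_per_load * loads_per_day,
        "mb_per_load": bytes_per_load / MB,
        "gb_per_day": bytes_per_load * loads_per_day / GB,
    }


# Figures taken from the sizing list: 2 KB per doc, 60,000 docs per load,
# one load every 5 minutes.
profile = ingestion_profile(2 * KB, 60_000, 5)
for name, value in profile.items():
    print(f"{name}: {value}")
```

With these inputs the daily document count comes to roughly 17M, which is the growth figure used in the corpus sizing that follows.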
Engineering Factors
• Input format: text, encoded text,…
• Document representation: bag of words, ontology,…
• Corpus structures: indexes, inverted indexes,…
• Corpus metrics: document frequency, inverse document frequency,…
• Preprocessing: annotation, tagging,…
• Files structure: tables, text files, files-day,…
- No. of docs: 1.5B + 17M / day
- Processing window: 60K docs per 3 mins
- Processing rate: 20K docs per min
- Final doc size = 2 KB * 5 ~ 10 KB
- Scan rate: 20K docs * 10 KB per min ~ 200 MB/min
- Many overheads need to be added
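To make the corpus structures and metrics concrete, here is a minimal sketch of an inverted index with document frequency and inverse document frequency. It assumes naive whitespace tokenization and the common idf = log(N / df) form; the sample documents and function names are illustrative, not taken from the talk.

```python
import math
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():   # naive whitespace tokenization
            index[term].add(doc_id)
    return index


def inverse_document_frequency(index, term, n_docs):
    """idf = log(N / df); returns 0.0 for a term absent from the corpus."""
    df = len(index.get(term, ()))           # document frequency of the term
    return math.log(n_docs / df) if df else 0.0


# Illustrative three-document corpus (hypothetical posts)
docs = {
    1: "LemaTea has a sweet taste",
    2: "sweet tea is not my taste",
    3: "LemaTea launch event today",
}
index = build_inverted_index(docs)
idf_lematea = inverse_document_frequency(index, "lematea", len(docs))
print(sorted(index["lematea"]), round(idf_lematea, 4))
```

On a real platform the index would be partitioned across nodes and rebuilt or updated within each processing window, which is where the scan rate above becomes the limiting figure.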
Engineering Factors
• Dimensionality reduction: stemming, lemmatization, noisy words,…
• Type of applications: search/retrieval, sentiment analysis,…
• Modeling methods: classifiers, topic modeling, relevance,…
• Model efficiency: confusion matrix, precision, recall,…
• Overheads: intermediate processing, pre-aggregation,…
• Files structure: tables, text files, files-day,…
- No. of docs: 1.5B + 17M / day
- Search for “LemaTea sweet taste” (3 terms)
- No. of tf values to calculate ~ 1.5B * 3 ~ 4.5B
- No. of idf values to calculate ~ 1.5B
- Total calculations for 1 search ~ 6B
- Consider daily growth
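The search above can be sketched as a plain tf-idf relevance score. This is a minimal illustration under stated assumptions (whitespace tokenization, tf as a relative frequency, idf = log(N / df)), not the model used in the talk; the sample posts are hypothetical. It shows where the cost comes from: one tf value per (document, query term) pair plus a pass over the corpus for the idf values.

```python
import math
from collections import Counter


def tf_idf_scores(docs, query):
    """Score every document against the query with a plain tf * idf sum."""
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    terms = query.lower().split()

    # One idf value per query term, computed over the whole corpus.
    idf = {}
    for term in terms:
        df = sum(1 for tokens in tokenized.values() if term in tokens)
        idf[term] = math.log(n_docs / df) if df else 0.0

    # One tf value per (document, query term) pair.
    scores = {}
    for doc_id, tokens in tokenized.items():
        counts = Counter(tokens)
        length = len(tokens) or 1           # guard against empty documents
        scores[doc_id] = sum(counts[t] / length * idf[t] for t in terms)
    return scores


# Illustrative corpus (hypothetical posts)
docs = {
    1: "lematea has a sweet taste",
    2: "this tea is too sweet",
    3: "traffic is bad today",
}
scores = tf_idf_scores(docs, "LemaTea sweet taste")
best = max(scores, key=scores.get)
print(best, scores)
```

At 1.5B documents this loop is exactly the work that has to be distributed, pre-aggregated, or indexed away.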
Engineering Factors
• Files structure: tables, text files, files-day,…
• File formats: HDFS, Parquet, Avro,…
• Platform technology: Hadoop/YARN, Spark, Greenplum, Flink,…
• Model deployment: Java/Scala, Mahout, MLlib, MADlib, PL/R, FlinkML,…
• Data ingestion: Spring XD, Flume, Sqoop, G. Data Flow, Kafka/Streaming,…
• Ingestion pattern: real-time, micro-batches,…
- Overall storage
- Processing capacity per node
- No. of nodes
- Tables → Hive, HBase, Greenplum
- Individual files → Spark, Flink
- Files-day → Hadoop HDFS
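Overall storage and the number of nodes can be estimated from the document counts already quantified. The sketch below is a rough storage-only estimate: the replication factor of 3 and the 8 TB of usable storage per node are assumptions chosen for illustration (they are not from the slides), and processing capacity per node is ignored.

```python
import math

TB = 10**12  # decimal terabyte


def nodes_for_storage(n_docs, doc_size_bytes, replication, usable_bytes_per_node):
    """Smallest node count whose combined usable storage holds all replicas."""
    total_bytes = n_docs * doc_size_bytes * replication
    return math.ceil(total_bytes / usable_bytes_per_node)


# From the slides: 1.5B initial docs, ~10 KB final doc size, ~17M new docs/day.
INITIAL_DOCS = 1_500_000_000
DOCS_PER_DAY = 17_000_000
FINAL_DOC_SIZE = 10_000

# Assumptions for illustration only.
REPLICATION = 3
USABLE_PER_NODE = 8 * TB

initial_nodes = nodes_for_storage(
    INITIAL_DOCS, FINAL_DOC_SIZE, REPLICATION, USABLE_PER_NODE)
nodes_after_year = nodes_for_storage(
    INITIAL_DOCS + 365 * DOCS_PER_DAY, FINAL_DOC_SIZE, REPLICATION, USABLE_PER_NODE)
print(initial_nodes, nodes_after_year)
```

The point of the exercise is the growth between the two numbers: daily ingestion, not the initial load, drives platform expansion.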
Engineering Factors
• Workload: no. of requests, request size,…
• Application performance: response time, concurrent requests,…
• Application interfacing: REST APIs, native, messaging,…
• Application implementation: integration, model scoring,…
• Security model: application level, platform level,…
- For 3 search terms ~ 6B calculations
- For 5 search terms ~ 9B calculations
- For 10 concurrent requests ~ 75B
- Resource queuing / prioritization
- Search options like date range
- Access control model
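The workload figures follow from the same counting rule as the single search: one tf value per (document, term) pair plus about one idf-related calculation per document. A minimal sketch follows; the even mix of 3-term and 5-term searches used to arrive at 75B is an assumption, since the slides do not state how the 10 concurrent requests are composed.

```python
def calculations_per_search(n_docs, n_terms):
    """tf per (document, term) pair plus one idf-related pass over the corpus."""
    return n_docs * n_terms + n_docs


N_DOCS = 1_500_000_000

three_terms = calculations_per_search(N_DOCS, 3)
five_terms = calculations_per_search(N_DOCS, 5)

# Assumed mix (not stated on the slide): five 3-term and five 5-term searches.
concurrent_load = 5 * three_terms + 5 * five_terms

for label, value in [("3 terms", three_terms),
                     ("5 terms", five_terms),
                     ("10 concurrent", concurrent_load)]:
    print(f"{label}: {value / 1e9:.1f}B calculations")
```

Options such as a date range reduce `N_DOCS` per request, which is why they appear next to resource queuing as workload controls.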
Ongoing Process…Growing Requirements
What if?
• New sources are included
• Wider parsing criteria
• Advanced modeling: POS tagging, word co-occurrence, co-referencing, named entities, relationship extraction,…
• Better response time is needed
• More frequent ingestion
[Diagram: Dynamic Platform spanning Ingestion, Corpus Processing, Model Processing and Requests Processing]
• Larger number of docs
• Increased processing requirements
• Platform expansion
• Overall architecture reconsidered
Some Building Blocks
What is a Data Science Model?
• Type and format of input data
• Data ingestion
• Transformations and feature engineering
• Modeling methods and algorithms
• Model evaluation and scoring
• Application implementation considerations
• In-memory vs. in-database
Key Challenges for Data Science Models
Volume | Growth | Scale out Performance
Stationary | Streams | Data Flow Engines
Batches | Real-time | Event Processing
Structured | Unstructured | Complex Formats
Insights | Responsive | Perspective / Deep Models
Traditional Data Management Systems
• Shared I/O
• Shared Processing
• Limited Scalability
• Service Bottlenecks
• High Cost Factor
[Diagram: database cluster with a database service, shared buffers, data files, I/O paths and network]
Abstraction of Big Data Platforms
• Parallel Processing
• Shared Nothing
• Linear Scalability
• Distributed Services
• Lower Cost Factor
[Diagram: master nodes holding metadata and data nodes 1, 2, 3, …, n holding user data / replicas, each with its own I/O, linked by a network interconnect]
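The shared-nothing idea can be illustrated with a small scatter/gather sketch: documents are partitioned across data nodes, each node counts terms over its own partition only, and a master merges the partial results. This is a single-process simulation under stated assumptions (CRC32 hash placement, four nodes, hypothetical posts), not how any particular platform implements it.

```python
import zlib
from collections import Counter


def node_for(key, n_nodes):
    """Deterministic placement of a document on a data node by hashing its id."""
    return zlib.crc32(key.encode("utf-8")) % n_nodes


def scatter(docs, n_nodes):
    """Give every data node its own private partition of the documents."""
    partitions = [dict() for _ in range(n_nodes)]
    for doc_id, text in docs.items():
        partitions[node_for(doc_id, n_nodes)][doc_id] = text
    return partitions


def local_term_counts(partition):
    """Work done on one node, touching only that node's data."""
    counts = Counter()
    for text in partition.values():
        counts.update(text.lower().split())
    return counts


def gather(partials):
    """Master-side merge of the per-node partial results."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


# Illustrative documents (hypothetical posts)
docs = {
    "post-1": "LemaTea tastes sweet",
    "post-2": "sweet and cheap",
    "post-3": "LemaTea is everywhere",
    "post-4": "no opinion",
}
partitions = scatter(docs, n_nodes=4)
total = gather(local_term_counts(p) for p in partitions)
print(total.most_common(2))
```

Because no node reads another node's partition, adding nodes adds I/O and processing capacity together, which is the basis of the linear scalability claim.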
In a Nutshell
Source: http://dataconomy.com/2014/06/understanding-big-data-ecosystem/
• Very huge.
• Overlaps.
• Overloading.
• You need to start with a use case to get your solutions well engineered.
Engineered Systems
• Packaged: Hortonworks – Pivotal – Cloudera
• Appliances: EMC DCA – Dell DSSD – Dell VxRack
• Cloud offerings: Azure – AWS – IBM – Google Cloud
Engineering Patterns in
Implementation
Lambda Architecture…Social Marketing
• Generic, scalable and fault-tolerant data processing architecture.
• Keeps a master immutable dataset while serving low-latency requests.
• Aims at providing linear scalability.
Source: http://lambda-architecture.net/
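The interplay of the batch, speed and serving layers in the Lambda Architecture can be sketched in a few lines. This is a toy, single-process illustration with brand-mention counts as the view; the class and method names are hypothetical, and a real deployment would place each layer on its own technology.

```python
from collections import Counter


class LambdaSketch:
    """Toy model: immutable master data, a batch view and a speed view."""

    def __init__(self):
        self.master = []              # append-only master dataset
        self.batch_view = Counter()   # recomputed from scratch by the batch layer
        self.speed_view = Counter()   # incremental view of events since the last batch

    def ingest(self, brand):
        self.master.append(brand)     # records are only appended, never updated
        self.speed_view[brand] += 1   # speed layer keeps recent data queryable

    def run_batch(self):
        # Batch layer: rebuild the view from the full master dataset.
        self.batch_view = Counter(self.master)
        # Simplification: the batch run is treated as instantaneous, so the
        # speed view can be dropped entirely once it completes.
        self.speed_view.clear()

    def query(self, brand):
        # Serving layer: merge the batch view with the real-time view.
        return self.batch_view[brand] + self.speed_view[brand]


arch = LambdaSketch()
for mention in ["LemaTea", "LemaTea", "OtherTea"]:
    arch.ingest(mention)
before_batch = arch.query("LemaTea")   # answered from the speed view only
arch.run_batch()
arch.ingest("LemaTea")
after_batch = arch.query("LemaTea")    # batch view plus one recent event
print(before_batch, after_batch)
```

Since the batch view is always rebuilt from the master dataset, a bug in view logic can be fixed by recomputing, which is where the fault tolerance comes from.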
Social Marketing…Revisited
Ingest Social Feeds
Build Corpus Metrics
Design Text Mining Model
Deploy All to a Big Data Platform
Application for Marketing Users
What are people saying about our new brand “LemaTea”?
Lambda Architecture (cont.)
Source: https://speakerdeck.com/mhausenblas/lambda-architecture-with-apache-spark
Sequence Files
Apache Spark / MLlib
• In-memory distributed processing
• Scala, Python, Java and R
• Resilient Distributed Dataset (RDD)
• MLlib – machine learning algorithms
• SQL and DataFrames / Pipelines
• Streaming
• Big graph analytics
Spark Cluster Mesos HDFS/YARN
Apache Spark
• Supports different types of cluster managers
• HDFS / YARN, Mesos, Amazon S3, Standalone, HBase, Cassandra…
• Interactive vs. application mode
• Memory optimization
Source: https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-architecture.html
Apache Spark
Apache Spark MLlib
Apache Spark…The Big Picture
Source: https://www.datanami.com/2015/11/30/spark-streaming-what-is-it-and-whos-using-it/
Greenplum / MADlib
• Massively parallel processing
• Shared nothing
• Table distribution
• By key
• By round robin
• Massively parallel data loading
• Integration with Hadoop
• Native MapReduce
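The two table distribution policies can be compared with a small simulation. This is a sketch of the idea only, assuming CRC32 as the hash and three segments; it is not Greenplum's actual hashing. In Greenplum SQL the policies are declared with the `DISTRIBUTED BY (column)` and `DISTRIBUTED RANDOMLY` clauses of `CREATE TABLE`.

```python
import zlib
from itertools import cycle


def distribute_by_key(rows, key, n_segments):
    """Rows with the same key value always land on the same segment."""
    segments = [[] for _ in range(n_segments)]
    for row in rows:
        seg = zlib.crc32(str(row[key]).encode("utf-8")) % n_segments
        segments[seg].append(row)
    return segments


def distribute_round_robin(rows, n_segments):
    """Rows are dealt out in turn, giving an even spread regardless of content."""
    segments = [[] for _ in range(n_segments)]
    for seg, row in zip(cycle(range(n_segments)), rows):
        segments[seg].append(row)
    return segments


# Illustrative rows (hypothetical social posts)
rows = [{"source": s, "post_id": i}
        for i, s in enumerate(["twitter", "facebook", "twitter", "blog",
                               "twitter", "facebook", "blog"])]

by_key = distribute_by_key(rows, "source", n_segments=3)
round_robin = distribute_round_robin(rows, n_segments=3)
print([len(s) for s in by_key], [len(s) for s in round_robin])
```

Distribution by key keeps related rows together, so joins and aggregations on that key stay local to a segment, at the risk of skew; round robin balances the load but may require moving rows between segments at query time.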
Apache MADlib
Image Processing…Unusual Way
Massively Parallel, In-Database Image Processing
Source: https://content.pivotal.io/blog/data-science-how-to-massively-parallel-in-database-image-processing-part-1
Takeaways
• Data science is not just the algorithms; it includes an end-to-end solution.
• The implementation should consider engineering factors and quantify them so that appropriate components can be selected.
• The Big Data technology landscape is huge and growing; start with a solid use case to identify potential components.
• Abstracting away from specific technologies lets you weigh their pros and cons.
• Be creative in solution design, and select technology case by case.
• Technologies covered: Lambda Architecture, Spark, Spark MLlib, Spark Streaming, Spark SQL, Kafka, Hadoop / YARN, Greenplum, MADlib.
Q & A
Email: hiarafat@hotmail.com
Skype: hichawy
LinkedIn: https://eg.linkedin.com/in/hisham-arafat-a7a69230
Thank You

  • 39. Email: [email protected] Skype: hichawy LinkedIn: https://eg.linkedin.com/in/hisham-arafat-a7a69230 Thank You