ANALYZING BIG DATA IN R AND SCALA USING
APACHE SPARK
17-07-2019
By: Ahmed Elsayed
M.Sc. Information Systems.
Director of Software Applications Department - IT
Alexandria Petroleum Maintenance Co. "Petromaint"
AhmedElsayeddb@gmail.com
• DATA SCIENCE
• BIG DATA
• HADOOP
o INSTALL & LEARN
• R
o INSTALL & LEARN
• SPARK & SPARKR
o INSTALL & LEARN
• CASE STUDIES
o FIRST CASE: TITANIC TEST EXAMPLE
o SECOND CASE: CRIMES EXAMPLE
• SPARKR ON R USING RSTUDIO
o THIRD CASE: DELAY PREDICTION
• SCALA ON ZEPPELIN
• LEARN AND BE IBM CERTIFIED
MOTIVATION
• Make predictions from a dataset to learn not only what happened or why it happened, but also what will happen.
• Have the machine give the most accurate answer (class value or group) for new data entered by the user, learning from a bunch of historical data that is too big to be handled by a human or even by one machine's memory.
Data Science
• Machine learning (Big Data Analytics).
• Choosing a dataset (at least a 5 GB dataset file; selection in progress).
• Pre-processing the dataset to remove missing values or fill them in.
• Implementing classification and clustering.
Data Science is a combination of:
• Studies of managing, storing, and analyzing data.
• Mathematics, statistics, programming.
• Ways of capturing data that might not have been captured until now.
• Ability to look at things 'differently'.
• Activity of cleansing, preparing and aligning the data.
Machine learning
• A branch of computer science.
• Low-level algorithms that discover patterns implicit in the data.
• The more data, the more effective the learning, which is why machine learning and big data are intricately tied together.
Big Data
Why Big Data
• Used to process, analyze and store large amounts of data.
• Structured and unstructured.
• Comes from many sources (computers, mobile devices, satellites, cameras, images, etc.).
• Exceeds the processing capacity of traditional DBMS.
• Over 90% of the world's data was generated in the last two years.
• Scales up from single servers to thousands of machines.
• Big data that is valuable for an organization falls into two categories:
o Predicting new products based on product data history.
o Data sizes from TB to many PB in a single set of data.
• Hadoop is an open source framework that does all of the above.
Big Data Analytics
• Big data changes our entire way of thinking about predictive analytics, knowledge extraction and interpretation.
• The trial-and-error analysis approach becomes impossible when datasets are large and heterogeneous.
• Very few tools allow for processing large datasets in a reasonable amount of time.
• Traditional statistical solutions typically focus on static analytics limited to the analysis of samples that are frozen in time, which often results in outdated and unreliable conclusions.
Hadoop
• Is an open source software framework developed in Java.
• Processes and queries huge amounts of data on large clusters of commodity hardware.
• Divides massive data into smaller chunks and spreads them out over many machines.
• Each machine can process those chunks in parallel, so results can be obtained extremely fast.
• Apache Hadoop has two main components:
o HDFS.
o MapReduce.
Hadoop Distributed File System (HDFS)
• Derived from the concept of the Google File System (GFS).
• It is a data storage layer modeled on the UNIX file system.
• Creates multiple replicas of each data block.
• Distributes them on computers throughout a cluster to enable reliable and rapid access.
• Suitable for applications that have large data sets.
MapReduce
• Core component of Hadoop.
• Processes Big Data distributed over thousands of nodes.
• Processes chunks in parallel.
• Individual results are later combined to produce the final result.
• The whole processing is done in two phases: Map and Reduce.
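The two phases can be illustrated with a word count, the usual introductory MapReduce job. This is a single-process sketch in plain Python, not Hadoop code and not taken from the slides; the function names and the sample chunks are hypothetical, and on a real cluster each chunk would be mapped on a different node.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Group emitted values by key, as the framework does between the phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: combine the values of each key into a single result."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(chunks):
    pairs = []
    for chunk in chunks:  # each chunk stands in for one block of the input file
        pairs.extend(map_phase(chunk))
    return reduce_phase(shuffle(pairs))


# Hypothetical input split into three chunks.
counts = word_count(["big data spark", "spark on hadoop", "big cluster"])
print(counts)
```

The map step never looks at more than one chunk, which is what lets the framework run it in parallel; only the reduce step needs the grouped results.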
YARN
• YARN is Hadoop's cluster resource management platform.
• Responsible for managing computing resources in clusters.
• Uses them for scheduling users' applications.
• Resource manager (one per cluster).
• Node managers running on all the nodes in the cluster, to launch and monitor containers.
• MasterNode: responsible for storing data (HDFS) and running parallel computations (MapReduce).
• Slave/Worker Node: machines that do all the work assigned to them by the MasterNodes.
• NameNode: master of the HDFS system. It maintains all the directories and files, and manages the blocks present on the DataNodes.
• DataNode: the machines that provide the actual storage, acting as slaves of HDFS. They are responsible for serving read and write requests from clients.
• JobTracker: handles parallel processing of data using MapReduce. This process is assigned to interact with client applications.
• TaskTracker: process that executes the tasks assigned to it by the JobTracker, such as Map, Reduce and Shuffle.
Master-Slave architecture
• Load balancing, handling of node failures, cluster expansion, high fault tolerance.
• Typically a 128 MB block size; three copies (replicas) of each block are maintained:
o One on the same node.
o One on the same rack but on a different node.
o One on another rack, on a different node.
• Information about all these copies is maintained on the NameNode.
• The client accesses data directly from the DataNode.
• Allows moving processing to the data. High throughput.
• Suitable for applications with large data sets.
• Streaming access to file system data.
• Can be built out of commodity hardware.
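The block size and replica count quoted above translate directly into storage arithmetic. The following is a small plain-Python sketch (not part of the slides) that estimates how many blocks and how much raw disk space a file would need; the function name and the 1536 MB example size are assumptions chosen for illustration.

```python
import math

BLOCK_SIZE_MB = 128  # typical block size quoted on the slide
REPLICATION = 3      # three copies of each block


def hdfs_footprint(file_size_mb, block_size_mb=BLOCK_SIZE_MB, replication=REPLICATION):
    """Estimate the block count and raw storage for one file."""
    blocks = math.ceil(file_size_mb / block_size_mb)  # the last block may be partly filled
    return {
        "blocks": blocks,
        "block_replicas": blocks * replication,
        "raw_storage_mb": file_size_mb * replication,
    }


# Hypothetical 1536 MB (1.5 GB) file.
footprint = hdfs_footprint(1536)
print(footprint)
```

The sketch only counts replicas; deciding which node and rack hold each copy is the NameNode's job and is not modeled here.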
Multi-node cluster
Ecosystems
Install & learn
HADOOP MULTI NODE CLUSTER ON UBUNTU IN 30 MINUTES
HADOOP 2.7.0 MULTI NODE CLUSTER SETUP ON UBUNTU 15.04
HADOOP TUTORIAL FOR BIG DATA ENTHUSIASTS
R
Why R
• R is becoming the most popular language for data science.
• R is data analysis software: statistical analysis, data visualization, and predictive modeling.
• R is a programming language: an object-oriented language.
• R is an open-source software project: it integrates with other applications and systems.
• R is a community: thousands of contributors have created add-on packages. With two million users, R boasts a vibrant online community.
Why Should Cloud Users Learn More About R?
• The cloud is for bigger data, and R is the most popular way to analyze data.
• R is one of the fastest growing languages in the world.
• R has some of the best visualization in analytics software.
• It is open source and free, with more than 8,000 packages available.
• Supported by major vendors (Google, Oracle, Microsoft, SAP, SAS Institute, IBM, etc.), with GUI packages that make it easy to start analyzing data in R.
• Community ecosystem (conferences, help groups, books, startups, experienced companies).
• The RStudio IDE helps business users with faster project execution and an easier transition to the R platform.
RHadoop Example Code
Install & learn
INSTALL R, R STUDIO AND R PACKAGES IN SIMPLE STEPS
R TUTORIAL – OUTSTANDING INTRODUCTION TO R PROGRAMMING
FOR DATA SCIENCE!
SPARK & SPARKR
What is Spark?
• A unified, open source, parallel data processing framework for Big Data analytics.
• http://spark.apache.org/
Motivation to use Spark
• Speed
• Ease of use
• Generality
• Integrated with Hadoop
• Scalability
Origin
• Apache Spark is an open source cluster computing framework.
• Originally developed at the University of California, Berkeley's AMPLab.
RDD (Resilient Distributed Dataset)
Iterative Operations on MapReduce
Iterative Operations on Spark RDD
SPARKR
• Fast
• Scalable
• Flexible
• Statistical
• Interactive
• Packages
How does a SparkR cluster work?
SparkR architecture (since 2.0)
[Diagram: the Spark Driver runs an R process and a JVM connected through RBackend; each Worker runs a JVM alongside R processes; the Workers read from the data sources.]
Overview of SparkR API
• IO: read.df / write.df / createDataFrame / collect
• Caching: cache / persist / unpersist / cacheTable / uncacheTable
• SQL: sql / table / saveAsTable / registerTempTable / tables
• MLlib: glm / kmeans / Naïve Bayes / survival regression
• DataFrame API: select / subset / groupBy / head / avg / column / dim
• UDF functionality (since 2.0): spark.lapply / dapply / gapply / dapplyCollect
RStudio
Apache Zeppelin
Install & learn
INSTALL APACHE SPARK ON MULTI-NODE CLUSTER
SPARK TUTORIAL – LEARN SPARK PROGRAMMING
INSTALLING SPARKR
SPARKR AND R – DATAFRAME AND DATA.FRAME
INSTALLING R ON HADOOP CLUSTER TO RUN SPARKR
INSTALL R, R STUDIO AND R PACKAGES IN SIMPLE STEPS
SPARKR
BUILDING ZEPPELIN-WITH-R ON SPARK AND ZEPPELIN
CASE STUDIES
FIRST CASE
TITANIC TEST EXAMPLE
http://amunategui.github.io/databricks-spark-bayes/
My Work
SECOND CASE
CRIMES EXAMPLE
Cluster specs
6 machines specifications
Hdmaster:
Processor: AMD Phenom(tm) 8600B
Cores: 3
Memory: 8 GB
Hard disk: 120 GB
Network card: Gigabit
OS: Linux (Ubuntu 14)
System type: 64-bit
Hdslave1, Hdslave2, Hdslave3, Hdslave4 and Hdslave5:
Processor: Intel Core 2 Duo CPU E8400 3.00GHz
Cores: 2
Memory: 4 GB
Hard disk: 40 GB
Network card: Gigabit
OS: Linux (Ubuntu 14)
System type: 64-bit
The 6 machines are connected together on 1 Gigabit switch, with a speed of approximately 600 Mbit/s.
Hadoop Cluster: 1 master and 5 slaves
Spark Standalone Cluster: 1 driver and 6 workers
Dataset
https://catalog.data.gov/dataset/crimes-2001-to-present-398a4
Dataset downloaded 19-11-2016
SPARKR ON R USING RSTUDIO
Preprocessing
Initiating SparkR
Splitting Dataset
Split the dataset to capture only specific columns, rather than all columns, in a new dataset (Crimes2001topresent.csv: 1.5 GB, 6,208,265 rows).
> path1 <- file.path("hdfs://hdmaster:9000/user/ahmed/data/crimes/Crimes2001topresent.csv")
> system.time(path <- cache(read.df(path = path1, source = "com.databricks.spark.csv", inferSchema = "true", header = "true")))
[Stage 2:===================================================> (10 + 1) / 11]
   user  system elapsed
  0.076   0.064  35.736
> createOrReplaceTempView(path, "path")
> dssql <- sql("SELECT PrimaryType, LocationDescription, Arrest, Domestic, District, Beat, Year FROM path")
> system.time(write.df(repartition(dssql, 1), "hdfs://hdmaster:9000/user/ahmed/data/crimes/Crimes2", source = "csv", mode = "overwrite"))
[Stage 3:================================================> (11 + 2) / 13] 0.531
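The column selection done by the SQL statement can be pictured on a small scale with a plain CSV projection. This is a single-machine Python sketch, not the SparkR code used in the slides; the seven column names come from the query above, while the function name, the extra `ID` and `CaseNumber` columns and the sample row are hypothetical.

```python
import csv
import io

# Columns kept by the SELECT statement in the slides.
KEEP = ["PrimaryType", "LocationDescription", "Arrest", "Domestic",
        "District", "Beat", "Year"]


def project_columns(source, target, keep=KEEP):
    """Copy only the wanted columns from one CSV stream to another."""
    reader = csv.DictReader(source)
    # extrasaction="ignore" silently drops every column not listed in `keep`.
    writer = csv.DictWriter(target, fieldnames=keep, extrasaction="ignore")
    writer.writeheader()
    for row in reader:
        writer.writerow(row)


# Hypothetical two-line input with two extra columns that get dropped.
raw = io.StringIO(
    "ID,CaseNumber,PrimaryType,LocationDescription,Arrest,Domestic,District,Beat,Year\n"
    "1,X001,THEFT,STREET,false,false,11,1122,2015\n"
)
out = io.StringIO()
project_columns(raw, out)
lines = out.getvalue().splitlines()
print(lines)
```

Reading row by row keeps memory use flat, but it is still one machine doing the work; the SparkR version distributes the same projection over the cluster.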
Downloading output file
Spark dataframe for new file crimessplited.csv
Null values
Preparing main Spark dataframe from SQL
Naïve Bayes Algorithm
Learning phase
• Splitting the dataset into (Test, Train).
• Naïve Bayes and predicting algorithm from Spark MLlib.
• Apriori and samples of class weight for each column.
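To make the a priori probabilities and per-column class weights concrete, here is a minimal categorical Naïve Bayes in plain Python. It is a teaching sketch under stated assumptions, not the Spark MLlib implementation used in the slides: the function names, the Laplace smoothing value and the six training rows are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(rows, labels, smoothing=1.0):
    """Count class frequencies and per-column value frequencies for each class."""
    class_counts = Counter(labels)
    apriori = {c: n / len(labels) for c, n in class_counts.items()}
    counts = defaultdict(Counter)  # counts[(column, class)][value] -> rows seen
    values = defaultdict(set)      # values[column] -> distinct values seen
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            counts[(i, label)][value] += 1
            values[i].add(value)
    return {"apriori": apriori, "class_counts": class_counts,
            "counts": counts, "values": values, "smoothing": smoothing}


def predict(model, row):
    """Return the class with the highest log-probability for one row."""
    scores = {}
    for c, prior in model["apriori"].items():
        score = math.log(prior)
        for i, value in enumerate(row):
            seen = model["counts"][(i, c)][value]
            denom = model["class_counts"][c] + model["smoothing"] * len(model["values"][i])
            score += math.log((seen + model["smoothing"]) / denom)
        scores[c] = score
    return max(scores, key=scores.get)


# Hypothetical training rows: (PrimaryType, LocationDescription) -> Arrest.
rows = [("THEFT", "STREET"), ("THEFT", "STREET"), ("BATTERY", "RESIDENCE"),
        ("NARCOTICS", "STREET"), ("NARCOTICS", "SIDEWALK"), ("BATTERY", "APARTMENT")]
labels = ["false", "false", "false", "true", "true", "true"]
model = train_naive_bayes(rows, labels)
print(model["apriori"], predict(model, ("NARCOTICS", "SIDEWALK")))
```

Working in log space avoids numeric underflow when many columns are multiplied together, and the smoothing term keeps an unseen value from driving a class probability to zero.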
Prediction phase
• Convert the Spark dataframe to a local R dataframe for confusion matrix purposes.
• Confusion matrix.
• Predicting and suggesting.
Visualization
Graph
• Convert the Spark dataframe to a local R dataframe for ggplot2 graph purposes.
• Transforming prediction and arrest to be Boolean instead of nominal for plotting purposes.
THIRD CASE
DELAY PREDICTION
Dataset
• The dataset is made up of records of all USA domestic flights of major carriers.
• "Airline on-time performance", downloaded as a CSV file.
• Details of the arrival and departure of all commercial flights in the US, from October 1987 to April 2008.
• A total of nearly 123 million records stored in 12 gigabytes.
Variable descriptions (29 variables):
• Year: 1987-2008,
• Month: 1-12,
• DayofMonth: 1-31,
• DayOfWeek: 1 (Monday) - 7 (Sunday),
• DepTime: actual departure time,
• CRSDepTime: scheduled departure time,
• ArrTime: actual arrival time,
• CRSArrTime: scheduled arrival time,
• UniqueCarrier: unique carrier code,
• FlightNum: flight number,
• TailNum: plane tail number,
• ActualElapsedTime: in minutes,
• CRSElapsedTime: in minutes,
• AirTime: in minutes,
• ArrDelay: arrival delay, in minutes,
• DepDelay: departure delay, in minutes,
• Origin: origin IATA airport code,
• Dest: destination IATA airport code,
• Distance: in miles,
• TaxiIn: taxi in time, in minutes,
• TaxiOut: taxi out time, in minutes,
• Cancelled: was the flight cancelled?,
• CancellationCode: reason for cancellation (A = carrier, B = weather, C = NAS, D = security),
• Diverted: 1 = yes, 0 = no,
• CarrierDelay: in minutes,
• WeatherDelay: in minutes,
• NASDelay: in minutes,
• SecurityDelay: in minutes,
• LateAircraftDelay: in minutes.
Classification Algorithms Comparison
Class
• The class was built according to the U.S. Department of Transportation, Federal Aviation Administration (FAA).
• On-time binary class: if the departure delay is <15 then ‘yes’; if the delay is >15 or the flight is canceled then ‘no’.
• Criteria: instances selected from Jan-2004 (583.9K rows).
• 70% for training (407.7K rows) and 30% for the test (176.2K rows).
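The on-time class rule can be written as a small labeling function. This is a plain-Python sketch rather than the SparkR SQL used in the slides; the function name is hypothetical, and two details are assumptions because the slide does not state them: a delay of exactly 15 minutes is treated as delayed, and a missing departure delay is treated as 'no'.

```python
def on_time_label(dep_delay, cancelled=False):
    """Return 'yes' for an on-time departure, 'no' for a delayed or cancelled flight."""
    if cancelled or dep_delay is None:
        # Assumption: a missing departure delay is labeled like a cancellation.
        return "no"
    # Assumption: the slide gives "<15" and ">15"; exactly 15 counts as delayed here.
    return "yes" if dep_delay < 15 else "no"


# Hypothetical departure delays in minutes.
labels = [on_time_label(d) for d in (-3, 0, 14, 15, 42)]
print(labels)
```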
Performance classification comparison
The answer to the first question, “What is the best classification algorithm to use from SparkR MLlib?”, is shown in table (4).
Binary Class Test
Spark Cluster Over the Hadoop Cluster
Hadoop Cluster
Actual cluster: the picture shows the physical cluster machines.
Hadoop Cluster Specs: 6 machines specifications
Hdmaster:
Processor: AMD Phenom(tm) 8600B
Cores: 3
Memory: 8 GB
Hard disk: 120 GB
Network card: Gigabit
OS: Linux (Ubuntu 14)
System type: 64-bit
Hdslave (1, 2, 3, 4 and 5):
Processor: Intel Core 2 Duo CPU E8400 3.00GHz
Cores: 2
Memory: 4 GB
Hard disk: 40 GB
Network card: Gigabit
OS: Linux (Ubuntu 14)
System type: 64-bit
6 machines connected together.
Hadoop Version 2.6
Binary Class Test
• If the departure delay is <15 then on-time is ‘True’.
• If it is >15 or the flight is canceled then on-time is ‘False’.
Dividing Dataset
• The selected range of data is 15 years, with 91,449,659 instances.
• The full dataset is separated into:
o 70% as a training set with 64,020,457 instances.
o 30% as a testing set with 27,429,202 instances.
• The split and validation are done using the Holdout Validation Technique.
• The training and test sets are cached as Spark dataframes on the cluster.
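A holdout split of this kind can be sketched in a few lines. This is plain Python, not the Spark call used in the slides; the function name, the seed and the 100,000-row stand-in dataset are hypothetical. Each row is assigned at random, so the resulting sizes are close to 70/30 rather than exact, which is consistent with the reported counts (64,020,457 of 91,449,659 is about 70.006%).

```python
import random


def holdout_split(rows, train_fraction=0.7, seed=42):
    """Randomly assign each row to the training set or the test set."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_fraction else test).append(row)
    return train, test


# Hypothetical dataset of 100,000 row ids.
train, test = holdout_split(range(100_000))
print(len(train), len(test))
```

Because every row lands in exactly one of the two sets, the model is always evaluated on data it never saw during training.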
The Result
The test for both departure and arrival delay prediction.
Multinomial Class
Predicting the Departure and Arrival Flight Delays in One Process
Preprocessing Using SparkR SQL
• Many attributes (10 columns) were pruned because they lack data or are empty: AirTime, TailNum, TaxiIn, TaxiOut, CancellationCode, CarrierDelay, WeatherDelay, NASDelay, SecurityDelay and LateAircraftDelay.
• The rest of the columns were selected.
• The selected range of data is 15 years, with 91,449,659 instances.
Proposed multinomial class (On-time)
• When DepDelay <15 and ArrDelay <15 then ‘Both Ontime’.
• When DepDelay >15 and ArrDelay >15 then ‘Both Delayed’.
• When DepDelay >15 and ArrDelay <15 then ‘Origin Delay’.
• When DepDelay <15 and ArrDelay >15 then ‘Destination Delay’.
• When Cancelled is true then ‘Both Delayed’.
Features Selector (RFormula)
RFormula is used for the rest of the selected columns.
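The four-way class can be expressed as one function over the two delays. This is a plain-Python sketch, not the Spark SQL shown in the slides; the function name is hypothetical, and treating a delay of exactly 15 minutes as delayed is an assumption, since the rules above only cover "<15" and ">15".

```python
THRESHOLD_MINUTES = 15


def delay_class(dep_delay, arr_delay, cancelled=False):
    """Map departure and arrival delays to one of the four proposed classes."""
    if cancelled:
        return "Both Delayed"  # cancelled flights are grouped with 'Both Delayed'
    # Assumption: exactly 15 minutes counts as delayed.
    dep_late = dep_delay >= THRESHOLD_MINUTES
    arr_late = arr_delay >= THRESHOLD_MINUTES
    if dep_late and arr_late:
        return "Both Delayed"
    if dep_late:
        return "Origin Delay"
    if arr_late:
        return "Destination Delay"
    return "Both Ontime"


# Hypothetical (DepDelay, ArrDelay) pairs in minutes.
classes = [delay_class(d, a) for d, a in [(0, 5), (30, 40), (25, 3), (2, 50)]]
print(classes)
```

Predicting one combined label lets a single model answer both the departure and the arrival question.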
Dataset Splitting
The full dataset after the features selection process is separated into:
• 70% as a training set with 64,020,457 instances.
• 30% as a testing set with 27,429,202 instances.
The dividing and validation are done using the Holdout Validation Technique.
Learning phase: predicting the departure and arrival flight delays in one process
Predicting phase: predicting the departure and arrival flight delays in one process
Prediction & validation metrics
• Prediction instances and accuracy
• Prediction metrics
• Prediction confusion matrix
Shiny web page
DPDAD model interface
Example result: there are no delays at the origin and destination airports; they are ‘Both Ontime’ (95.4%).
Suggesting the top ten carriers and their probabilities:
• Running prediction with the stored ML model, using the whole dataset from Hadoop as a test set.
• Using Spark SQL to select the top ten carriers with the highest probabilities and a prediction class equal to “yes”.
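The top-ten selection can be sketched as a filter followed by a sort. This plain-Python version stands in for the Spark SQL query mentioned above and is not the author's code; the function name, the tuple layout and the sample carrier rows are hypothetical.

```python
def top_carriers(predictions, wanted_class="yes", n=10):
    """Return up to n (carrier, probability) pairs with the highest probability."""
    best = {}
    for carrier, predicted_class, probability in predictions:
        if predicted_class == wanted_class:
            # Keep the highest probability seen for each carrier.
            best[carrier] = max(probability, best.get(carrier, 0.0))
    ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
    return ranked[:n]


# Hypothetical prediction rows: (carrier code, predicted class, probability).
sample = [("AA", "yes", 0.91), ("DL", "yes", 0.95), ("UA", "no", 0.80),
          ("AA", "yes", 0.97), ("WN", "yes", 0.88)]
ranking = top_carriers(sample, n=2)
print(ranking)
```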
SCALA ON ZEPPELIN
Figure (1). The code for reading the dataset file from Hadoop.
Figure (2). The code for using Spark SQL to handle the dataset for:
• Missing data.
• Corrupted data.
• Time value ranges.
• Making the multinomial class.
And using RFormula as a feature selector.
Figure (3). The output sample for figure (2).
Figure (4). The code for splitting the dataset using the Holdout Validation Technique and caching it in Spark storage.
Figure (5). The counts for the training data and testing data.
Figure (6). Running the Naïve Bayes algorithm as the learning phase.
Figure (7). The prediction phase and a sample of the output.
Figure (8). The counts of the actual and predicted classes.
Figure (9). Calculating the confusion matrix and metrics.
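The confusion matrix and the metrics derived from it can be computed from two parallel lists of actual and predicted classes. This is a plain-Python sketch, not the Scala code shown in the figures; the function names and the six sample labels are hypothetical.

```python
from collections import Counter


def confusion_matrix(actual, predicted):
    """Count how often each (actual, predicted) pair occurs."""
    return Counter(zip(actual, predicted))


def accuracy(matrix):
    """Share of instances whose predicted class equals the actual class."""
    total = sum(matrix.values())
    correct = sum(n for (a, p), n in matrix.items() if a == p)
    return correct / total if total else 0.0


def precision_recall(matrix, label):
    """Precision and recall for one class of the matrix."""
    true_positive = matrix[(label, label)]
    predicted_as = sum(n for (a, p), n in matrix.items() if p == label)
    actually = sum(n for (a, p), n in matrix.items() if a == label)
    precision = true_positive / predicted_as if predicted_as else 0.0
    recall = true_positive / actually if actually else 0.0
    return precision, recall


# Hypothetical actual and predicted classes for six test instances.
actual = ["yes", "yes", "no", "no", "yes", "no"]
predicted = ["yes", "no", "no", "yes", "yes", "no"]
matrix = confusion_matrix(actual, predicted)
print(dict(matrix), accuracy(matrix), precision_recall(matrix, "yes"))
```

The same functions work for the multinomial class, since the matrix is keyed by label pairs rather than fixed to two classes.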
CERTIFICATES
IBM Badges
IBM - BIG DATA 101
IBM - HADOOP 101
IBM - SPARK FUNDAMENTALS I
Thank You


Editor's Notes

  • #6: REF. [4] https://github.com/RevolutionAnalytics/RHadoop/wiki
  • #7: REF. [4] https://github.com/RevolutionAnalytics/RHadoop/wiki
  • #8: REF. [4] https://github.com/RevolutionAnalytics/RHadoop/wiki
  • #9: REF. [4] https://github.com/RevolutionAnalytics/RHadoop/wiki
  • #11: REF. [3]
  • #12: REF. [3]
  • #13: REF. [3]
  • #14: REF. [4] https://github.com/RevolutionAnalytics/RHadoop/wiki
  • #15: REF. [4] https://github.com/RevolutionAnalytics/RHadoop/wiki
  • #16: REF. [4] https://github.com/RevolutionAnalytics/RHadoop/wiki
  • #18: REF. [4]
  • #19: REF. [4]
  • #20: REF. [4]
  • #21: REF. [4]
  • #22: REF. [4]
  • #23: REF. [5]
  • #24: REF. [5]
  • #25: REF. [4]
  • #26: REF. [5]
  • #27: REF. [5]
  • #28: REF. [5]
  • #29: REF. [5]
  • #30: REF. [5]
  • #31: REF. [5]
  • #32: REF. [5]
  • #33: REF. [5]
  • #35: REF. [6]
  • #36: REF. [6]
  • #37: REF. [4]
  • #38: REF. [4]
  • #51: However, there is one drawback: traditionally, the R runtime is single-threaded, and it is unclear how R programs can be written effectively and concisely to run on multiple machines. So, what if we can combine these two worlds? This is where SparkR comes in: it is a language binding that lets users write R programs equipped with nice statistics packages and have them run on top of Spark.
  • #53: Worker refers to a worker machine. Mention that all Spark data sources work.
  • #58: REF. [4]
  • #137: REF. [5]