An Introduction to Apache Spark with Scala
Contents
 Overview
 Layers and packages in Spark
 Download and Installation
 Spark Application Overview
 Simple Spark Application
 Introduction to Spark
 How a Spark Application Runs on a Cluster
 Spark Abstraction
 Scala introduction
 Getting started in Scala
 Example programs with Scala and Python
 Prediction with Regressions utilizing MLlib
Introduction to Spark
 Spark is a general-purpose distributed data processing engine that is suitable for use in a wide range of
circumstances. On top of the Spark core data processing engine, there are libraries for SQL, machine
learning, graph computation, and stream processing, which can be used together in an application.
 Programming languages supported by Spark include Java, Python, Scala, and R. Application developers
and data scientists incorporate Spark into their applications to rapidly query, analyze, and transform data
at scale.
 Tasks most frequently associated with Spark include ETL and SQL batch jobs across large data sets,
processing of streaming data from sensors, IoT devices, or financial systems, and machine learning tasks.
Overview
 Spark is a general-purpose engine for large-scale data processing.
 Spark SQL for querying structured data via SQL and Hive Query Language (HQL)
 Runs on Hadoop
 Provides high-level APIs that simplify executing workloads
 It has been claimed to be up to 100 times faster than Hadoop’s MapReduce
 It supports the Java, Python, R, and Scala programming languages
 Tasks most frequently associated with Spark include ETL and SQL batch jobs across
large data sets, processing of streaming data from sensors, IoT, or financial systems, and machine
learning tasks
Layers and packages in Spark
 Spark SQL allows for querying structured data via SQL and Hive Query Language (HQL).
 It has its own graph computation engine, called GraphX, that allows users to perform computations using
graphs.
 Spark Streaming mainly enables you to create analytical and interactive applications for live streaming data.
You can stream the data, and Spark can then run its operations on the streamed data itself.
 MLlib is a machine learning library built on top of Spark that supports many
machine learning algorithms.
 The key difference is that it runs almost 100 times faster than MapReduce-based implementations.
Download and Installation
 Download at http://spark.apache.org/downloads.html
 Prior to downloading, ensure that the Java JDK and Scala are installed on your machine.
 Spark requires Java to run, and Scala is used to implement Spark.
 Select package type as “Pre-built for Hadoop 2.7 and later” and download the compressed TAR file. Unpack
the tar file after downloading.
 Spark’s shell provides a simple way to learn the API, as well as a powerful tool to analyze data interactively.
 It is available in either Scala (which runs on the Java VM and is thus a good way to use existing Java libraries)
or Python.
 Open the Spark shell by typing “./bin/spark-shell” for the Scala version or “./bin/pyspark” for the Python version.
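Once the shell starts, you can try a small computation to confirm everything works; a minimal sketch (the shell creates the SparkContext for you as sc):
scala> val data = sc.parallelize(1 to 100)
scala> data.filter(_ % 2 == 0).count()
res1: Long = 50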
Download and Installation (contd.)
 Configure the environment according to your needs/preferences using options such as SparkConf or the
Spark shell tools. You may also be able to configure the settings through an installation wizard, depending on your
Spark distribution.
 Initialize a new SparkContext using your preferred language (i.e., Python, Java, Scala, or R). SparkContext sets up
services and connects to an execution environment for Spark applications.
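For a standalone application (rather than the shell), a minimal Scala sketch of this initialization step might look like the following; the application name is illustrative, and "local[*]" simply runs Spark locally on all cores:
import org.apache.spark.{SparkConf, SparkContext}

// Configure the application, then connect to an execution environment
val conf = new SparkConf()
  .setAppName("MyFirstSparkApp")  // hypothetical name
  .setMaster("local[*]")
val sc = new SparkContext(conf)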
Spark Application Overview
 Each Spark application is a self-contained computation that runs user-supplied code to compute a result,
like a MapReduce application. But Spark has many advantages over MapReduce.
 In MapReduce, the highest-level unit of computation is a job, while in Spark, the highest-level unit of
computation is an application.
 A Spark application can be used for a single batch job, an interactive session with multiple jobs, or a long-lived
server continually satisfying requests.
 Multiple tasks can run within the same executor. Both combine to enable extremely fast task startup times as
well as in-memory data storage, resulting in orders-of-magnitude faster performance than MapReduce.
 Spark application execution involves runtime concepts such as driver, executor, task, job, and stage.
Simple Spark Application
A Spark application can be developed in any of the three following supported languages:
1) Java 2) Python 3) Scala
 Spark provides primarily two abstractions in its applications:
 RDD (Resilient Distributed Dataset) --- (Dataset in the newest versions)
 Two types of shared variables for parallel operations:
 Broadcast variables: can be stored in the cache of each system
 Accumulators: can help with aggregation functions such as addition
Simple Spark Applications Continued
Accumulator: Stores a variable that can have additive and
cumulative functions performed on it. Safer than using a
“global” declaration.
Parallelize: Stores iterable data, such as a list, in a
distributed dataset to be sent to the cluster's nodes over the network.
Broadcast: Sends a variable to the other memory devices (nodes) in the
cluster. An efficient pre-built algorithm helps reduce
communication costs. Should only be used if the same
data is needed on all nodes.
Foreach (reducer): Runs a function on each element in the
dataset; the output is best kept in an accumulator object for
safe updates.
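A minimal Scala sketch tying these four together (assuming a SparkContext sc and the Spark 2.x accumulator API):
// Broadcast: ship a read-only value to every node once
val factor = sc.broadcast(10)

// Accumulator: a counter that workers add to and the driver reads back
val total = sc.longAccumulator("total")  // Spark 2.x API; 1.x used sc.accumulator(0)

// Parallelize: distribute a local collection as an RDD
val nums = sc.parallelize(1 to 5)

// Foreach: run a function on each element; accumulator updates are the safe output
nums.foreach(n => total.add(n * factor.value))

println(total.value)  // 150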
How a Spark Application Runs on a Cluster
 A Spark application runs as independent processes,
coordinated by the SparkSession object in the driver
program.
 The resource or cluster manager assigns tasks to
workers, one task per partition.
 A task applies its unit of work to the dataset in its
partition and outputs a new partition dataset. Because
iterative algorithms apply operations repeatedly to data,
they benefit from caching datasets across iterations.
 Results are sent back to the driver application or can be
saved to disk.
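As a rough illustration of why caching matters for iterative work, a sketch (assuming a SparkContext sc):
// Cache a dataset so repeated passes read memory instead of recomputing
val data = sc.parallelize(1 to 1000000).cache()
var sum = 0L
for (i <- 1 to 10) {
  sum += data.map(_.toLong).reduce(_ + _)  // after the first pass, reads cached partitions
}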
Spark Abstraction
Resilient Distributed Dataset (RDD)
 It is the fundamental abstraction in Apache Spark and its basic data structure. An RDD in Apache
Spark is an immutable collection of objects computed across the different nodes of the cluster.
 Resilient: i.e., fault-tolerant, so able to recompute missing or damaged partitions due to node
failures.
 Distributed: since data resides on multiple nodes.
 Dataset: represents the records of the data you work with. The user can load the data set externally,
from a JSON file, CSV file, text file, or database via JDBC, with no specific data
structure required.
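For example, a sketch of both loading styles (the HDFS path is elided as in the slides):
// Load an RDD from an external text file -- no schema is imposed
val lines = sc.textFile("hdfs://...")
// Or build one from a local collection
val words = sc.parallelize(Seq("json", "csv", "text", "jdbc"))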
Spark Abstraction (contd.)
 Data Frame
A DataFrame can be described as a Dataset organized into columns. DataFrames are similar to a table in a relational
database or a data frame in R/Python (see the sketch after this list).
 Spark Streaming
It is a core extension of Spark that allows real-time stream processing from several sources. Batch and streaming
sources work together to offer a unified, continuous DataFrame abstraction that can be used for interactive and
batch queries. It offers scalable, high-throughput, and fault-tolerant processing.
 GraphX
It is one more example of a specialized data abstraction. It enables developers to analyze social networks and other
graphs, alongside Excel-like two-dimensional data.
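A minimal DataFrame sketch in Scala (assuming a sqlContext as in the later examples; the file name is hypothetical):
// Read a JSON file into a DataFrame of named columns
val people = sqlContext.read.json("people.json")
people.printSchema()
people.select("name").show()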
What is Scala?
 Scala is a general purpose language that can be used to develop solutions for any software problem.
 Scala combines object-oriented and functional programming in one concise, high-level language.
 Completely compatible with Java and consequently runs on the JVM.
 Scala offers a toolset to write scalable concurrent applications in a simple way with more confidence in their
correctness.
 Scala is an excellent base for parallel, distributed, and concurrent computing, which is widely considered a
very big challenge in software development; Scala's unique combination of features meets this
challenge.
Why Scala?
 Apps written in Scala are less costly to maintain and easier to evolve, because Scala is a functional and
object-oriented programming language (the foundation of Lightbend's reactive platform) that helps developers write
code that's more concise than other options.
 Scala is used outside of its killer-app domain as well, of course, and certainly for a while there was hype
about the language that meant that even if the problem at hand could easily be solved in Java, Scala would
still be the preference, as the language was seen as a future replacement for Java.
 It reduces the amount of code developers have to write.
Getting Started in Scala:
•scala
–Runs compiled Scala code
–Or, run without arguments, acts as an interpreter (REPL)!
•scalac – compiles
•fsc – compiles faster! (uses a background server to minimize startup time)
•Go to scala-lang.org for downloads/documentation
•Read Scala: A Scalable Language
(see http://www.artima.com/scalazine/articles/scalable-language.html)
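For instance, a minimal program to exercise the toolchain (the file name is arbitrary):
// HelloWorld.scala
object HelloWorld {
  def main(args: Array[String]): Unit = println("Hello, Scala!")
}
Compile with scalac HelloWorld.scala (or fsc HelloWorld.scala), then run with scala HelloWorld.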
Example programs with Scala and Python
WordCount: In this example we use a few transformations to build a dataset of (String,
Int) pairs called counts and save it to a file.
Python:
# Read data from file
text_file = sc.textFile("hdfs://...")
counts = text_file.flatMap(lambda line: line.split(" ")) \
                  .map(lambda word: (word, 1)) \
                  .reduceByKey(lambda a, b: a + b)
# Save output into the file
counts.saveAsTextFile("hdfs://...")
Scala:
val textFile = sc.textFile("hdfs://...")
val counts = textFile.flatMap(line => line.split(" "))
                     .map(word => (word, 1))
                     .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")
Prediction with Regressions utilizing MLlib
Python:
from pyspark.ml.classification import LogisticRegression

# Read data into a DataFrame
df = sqlContext.createDataFrame(data, ["label", "features"])
lr = LogisticRegression(maxIter=10)
# Fit the model to the data
model = lr.fit(df)
# Predict the results using the model
model.transform(df).show()
Scala:
import org.apache.spark.ml.classification.LogisticRegression

// Read data into a DataFrame
val df = sqlContext.createDataFrame(data).toDF("label", "features")
val lr = new LogisticRegression().setMaxIter(10)
val model = lr.fit(df)
val weights = model.weights  // called coefficients in Spark 2.0 and later
// Predict the results using the model (mirroring the Python example above)
model.transform(df).show()
THANK YOU