Spark Presentation
Overview
• Apache Spark is a fast, general-purpose engine for large-scale data
processing.
• Speed
• Ease of use
• Generality
• Runs everywhere
Overview
• Spark SQL for querying structured data via SQL and the Hive Query Language (HQL)
• Spark Streaming enables you to create analytical and interactive applications over live
streaming data; Spark runs its operations directly on the data as it is streamed in.
• MLlib is a machine learning library built on top of Spark that supports many machine
learning algorithms; it can run up to 100 times faster than equivalent MapReduce jobs.
• Spark has its own graph computation engine, called GraphX.
• The Spark Core engine lets you write raw Spark programs in Scala or Java and launch
them; all of the components above execute on top of it.
Downloading and Getting started
• Download at http://spark.apache.org/downloads.html
• Select package type “Pre-built for Hadoop 2.7 and later”, download the compressed TAR
file, and unpack it.
• Spark’s shell provides a simple way to learn the API, as well as a powerful tool to analyze data
interactively.
• It is available in either Scala (which runs on the Java VM and is thus a good way to use existing
Java libraries) or Python.
• The Spark shell can be opened by typing “./bin/spark-shell” for the Scala version or
“./bin/pyspark” for the Python version.
• Spark’s primary abstraction is a distributed collection of items called a Dataset. Datasets can be
created from Hadoop InputFormats (such as HDFS files) or by transforming other
Datasets. (Before Spark 2.0, the primary abstraction was the RDD.)
Spark Application Overview
• Like MapReduce applications, each Spark application is a self-contained computation that runs
user-supplied code to compute a result; however, Spark has many advantages over MapReduce.
• In MapReduce, the highest-level unit of computation is a job, while in Spark the highest-level unit
of computation is an application.
• A Spark application can be used for a single batch job, an interactive session with multiple jobs, or
a long-lived server continually satisfying requests.
• MapReduce starts a new process for each task. In contrast, a Spark application can have processes
running on its behalf even when it is not running a job.
• Multiple tasks can run within the same executor. These properties combine to enable extremely fast
task startup as well as in-memory data storage, resulting in orders-of-magnitude faster
performance than MapReduce.
• Spark application execution involves runtime concepts such as the driver, executor, task, job, and
stage.
Simple Spark Applications
Spark supports several programming languages. It is written in Scala, so it can be programmed in
Scala natively, but applications can also be written in Java or Python.
• Spark provides primarily two abstractions in its applications:
• RDD (Resilient Distributed Dataset), called a Dataset in the newest versions
• Two types of shared variables for parallel operations:
• Broadcast variables, which can be cached on each machine
• Accumulators, which help with aggregation functions such as addition
• To initiate a Spark application, the user creates a SparkConf object to provide information about the
Spark instance, for example the location of the nodes (local or remote) and, if local, whether to
use a single thread or multiple threads
• Several map and reduce style functions are readily available, including sample, which takes a random
sample, join, and sortByKey
Simple Spark Applications Continued
Accumulator: Stores a variable that supports
additive, cumulative updates. Safer than using a
global variable.
Parallelize: Distributes iterable data, such as a
list, as a distributed dataset sent to the clusters
on the network.
Broadcast: Sends a read-only variable to the other
machines in the cluster, using an efficient
pre-built algorithm that reduces communication
costs. Should only be used when the same data is
needed on all nodes.
Foreach (reducer): Runs a function on each
element in the dataset; the output is best written
to an accumulator object for safe updates.
PySpark
● PySpark is an interactive shell for Spark in Python.
● The bin/pyspark script launches a Python interpreter that is configured to run PySpark
applications.
● PySpark supports Python 2.6 and above.
● The Python shell can be used to explore data interactively and is a simple way to learn the API.
PySpark
● By default, the bin/pyspark shell creates a SparkContext that runs applications.
● In PySpark, RDDs (Resilient Distributed Datasets) support the same methods as their Scala
counterparts, but take Python functions and return Python collection types.
● Spark revolves around the concept of the resilient distributed dataset (RDD), a fault-tolerant
collection of elements that can be operated on in parallel. There are two ways to create
RDDs: parallelizing an existing collection in the driver program, or referencing a dataset in
external storage such as HDFS.
Introduction to Scala
• A scalable programming language influenced by Haskell and Java
• Can use any Java code in Scala, making it almost as fast as Java but with much shorter
code
• Allows fewer errors – no null-pointer errors
• More flexible – every Scala function is a value, and every value is an object
• The Scala interpreter is an interactive shell for writing expressions
Scala
• Scala smoothly integrates object-oriented and functional programming
• Every value is an object
• Functions are first-class values
• Runs on the JVM
• Scala is compatible with Java: Java libraries and frameworks can be used without glue code or
additional declarations.
• Many design patterns are natively supported
• It allows decomposition of objects by pattern matching
• Patterns and expressions are generalized to support the natural treatment of XML documents
