Spark Presentation
Overview
• Apache Spark is a fast, general-purpose engine for large-scale data
processing.
• Speed
• Ease of use
• Generality
• Runs everywhere
Overview
• Spark SQL for querying structured data via SQL and the Hive Query Language (HQL)
• Spark Streaming enables you to create analytical and interactive applications over live
streaming data; Spark runs its operations directly on the data as it is streamed in.
• MLlib is a machine learning library built on top of Spark that supports many machine
learning algorithms; it can run up to 100 times faster than equivalent MapReduce jobs.
• Spark has its own graph computation engine, called GraphX.
• The Spark Core engine lets you write raw Spark programs in Scala or Java and launch
them; all of the components above execute on top of it.
Downloading and Getting started
• Download at http://spark.apache.org/downloads.html
• Select package type “Pre-built for Hadoop 2.7 and later”, download the compressed TAR
file, and unpack it.
• Spark’s shell provides a simple way to learn the API, as well as a powerful tool to analyze data
interactively.
• It is available in either Scala (which runs on the Java VM and is thus a good way to use existing
Java libraries) or Python.
• The Spark shell can be opened by typing “./bin/spark-shell” for the Scala version or
“./bin/pyspark” for the Python version.
• Spark’s primary abstraction is a distributed collection of items called a Dataset. Datasets can be
created from Hadoop InputFormats (such as HDFS files) or by transforming other
Datasets. (Before Spark 2.0, the primary abstraction was the RDD.)
Spark Application Overview
• Like MapReduce applications, each Spark application is a self-contained computation that runs
user-supplied code to compute a result; however, Spark has many advantages over MapReduce.
• In MapReduce, the highest-level unit of computation is a job, while in Spark the highest-level unit
of computation is an application.
• A Spark application can be used for a single batch job, an interactive session with multiple jobs, or
a long-lived server continually satisfying requests.
• MapReduce starts a new process for each task. In contrast, a Spark application can have processes
running on its behalf even when it is not running a job.
• Multiple tasks can run within the same executor. These properties combine to enable extremely fast
task startup as well as in-memory data storage, resulting in orders-of-magnitude faster
performance than MapReduce.
• Spark application execution involves runtime concepts such as the driver, executor, task, job, and
stage.
Simple Spark Applications
Spark supports several programming languages. It is written in Scala, so it can be programmed in
Scala natively, but applications can also be written in Java or Python.
• Spark provides primarily two abstractions in its applications:
• RDD (Resilient Distributed Dataset), called a Dataset in the newest versions
• Two types of shared variables for parallel operations:
• Broadcast variables, which can be cached on each machine
• Accumulators, which help with aggregation functions such as addition
• To initiate a Spark application, the user creates a SparkConf object to provide information about the
Spark instance, for example the location of the nodes (local or remote) and, if local, whether to
use a single thread or multiple threads
• Several map and reduce style functions are readily available, including sample, which takes a random
sample, join, and sortByKey
Simple Spark Applications Continued
Accumulator: Stores a variable that supports
additive, cumulative updates. Safer than using a
global variable.
Parallelize: Distributes iterable data, such as a
list, as a distributed dataset sent to the clusters
on the network.
Broadcast: Sends a read-only variable to the other
machines in the cluster, using an efficient
pre-built algorithm that reduces communication
costs. Should only be used when the same data is
needed on all nodes.
Foreach (reducer): Runs a function on each
element in the dataset; the output is best written
to an accumulator object for safe updates.
PySpark
● PySpark is an interactive shell for Spark in Python.
● The bin/pyspark script launches a Python interpreter that is configured to run PySpark
applications.
● PySpark supports Python 2.6 and above.
● The Python shell can be used to explore data interactively and is a simple way to learn the API.
PySpark
● By default, the bin/pyspark shell creates a SparkContext that runs applications.
● In PySpark, RDDs (Resilient Distributed Datasets) support the same methods as their Scala
counterparts, but take Python functions and return Python collection types.
● Spark revolves around the concept of the resilient distributed dataset (RDD), a fault-tolerant
collection of elements that can be operated on in parallel. There are two ways to create
RDDs: parallelizing an existing collection in the driver program, or referencing a dataset in
external storage such as HDFS.
Introduction to Scala
• A scalable programming language influenced by Haskell and Java
• Can use any Java code in Scala, making it almost as fast as Java but with much shorter
code
• Allows fewer errors – no null-pointer errors
• More flexible – every Scala function is a value, and every value is an object
• The Scala interpreter is an interactive shell for writing expressions
Scala
• Scala smoothly integrates object-oriented and functional programming
• Every value is an object
• Functions are first-class values
• Runs on the JVM
• Scala is compatible with Java: Java libraries and frameworks can be used without glue code or
additional declarations.
• Many design patterns are natively supported
• It allows decomposition of objects by pattern matching
• Patterns and expressions are generalized to support the natural treatment of XML documents
