Introduction to Apache Spark
Spark - Intro
 A fast and general engine for large-scale data processing.
 Scalable architecture
 Works with a cluster manager (such as YARN)
Spark Context
 SparkContext is the entry point to any Spark functionality.
 When a Spark application runs, a driver program starts; it executes the main function, and the SparkContext is initialized there.
 The driver program runs operations inside executors on the worker nodes.
Spark Context
 SparkContext uses Py4J to launch a JVM and create a JavaSparkContext.
 Spark supports Scala, Java, and Python. PySpark is the library to install for writing Spark programs in Python.
 The PySpark shell provides a default SparkContext (sc), which makes it easy to read a local file from the system and process it with Spark.
Sample program
SparkShell
 A simple interactive REPL (Read-Eval-Print Loop).
 Provides a simple way to connect to Spark and analyze data interactively.
 Started with the pyspark or spark-shell command in a terminal; the former runs Python-based programs, the latter Scala-based programs.
Features
 Runs programs up to 100x faster than Hadoop MapReduce in memory, or up to 10x faster on disk.
 DAG engine: a directed acyclic graph of operations is built and optimized before execution.
 Big players such as Amazon, eBay, and NASA's Deep Space Network use Spark.
 Built around one main concept: the Resilient Distributed Dataset (RDD).
Components of Spark
RDD – Resilient Distributed Datasets
 The core object around which Spark revolves, including Spark SQL, MLlib, etc.
 Conceptually similar to a pandas DataFrame.
 RDDs can run on standalone systems or on a cluster.
 Created by the SparkContext object.
Creating RDDs
 nums = sc.parallelize([1, 2, 3, 4])
 sc.textFile("file:///users/....txt")
 or from s3n:// or hdfs:// URIs
 hiveCtx = HiveContext(sc)
 Can also be created from
 JDBC, HBase, JSON, CSV, etc.
Operations on RDDs (transformations)
 map
 filter
 distinct
 sample
 union, intersection, subtract, cartesian
RDD actions
 collect
 count
 countByValue
 reduce
 etc.
 Nothing actually happens in the driver program until an action is called: lazy evaluation.
