Apache Spark SQL - Installing Spark
MODULE 2
Installing Spark
• High-level physical cluster architecture
• Software architecture of a standalone cluster
• Installing Spark standalone locally
• Running the Spark Shell
• Running in python shell and ipython shell
• Running sample Spark code
• Using the Spark Session and Spark Context
• Creating a parallelized collection
Installing Spark
Module 2. Installing Spark 2
What we’ll cover:
Spark Cluster Components
Module 2. Installing Spark 3
[Cluster diagram] A Driver Program (hosting the Spark Session and its Spark Context) talks to a Cluster Manager, which allocates Worker Nodes; each Worker Node runs an Executor holding a Cache and one or more Tasks.
• Driver Node on the Cluster Manager, aka Master Node
• Driver Node in a Spark Cluster
• Driver Program
• Driver Process
“Driver”
Module 2. Installing Spark 4
• Cluster Mode
o submit application to cluster manager
• Client Mode
o run the driver outside the cluster, with executors in the cluster
• Standalone (Local) Mode
o everything runs on one machine (see the sketch after this slide)
Execution Modes
Module 2. Installing Spark 5
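A minimal PySpark sketch of how these modes differ from the application's point of view. This is not from the slides: "local[*]" is the standard local-mode master string, while spark://master-host:7077 is a hypothetical standalone-cluster master URL used only for illustration.

from pyspark.sql import SparkSession

# Standalone/local mode: driver and executors all run on this one machine.
spark_local = SparkSession.builder \
    .master("local[*]") \
    .appName("local-mode-sketch") \
    .getOrCreate()

# Client mode against a cluster: the driver runs here, executors on worker nodes.
# (Commented out because it needs a real cluster manager to connect to.)
# spark_client = SparkSession.builder \
#     .master("spark://master-host:7077") \
#     .appName("client-mode-sketch") \
#     .getOrCreate()

# Cluster mode is normally chosen at submission time, e.g. with
# spark-submit --deploy-mode cluster, rather than from inside the script.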
Standalone Spark Cluster
Module 2. Installing Spark 6
[Cluster diagram] In a standalone local install, the same components (Driver Program with its Spark Session and Spark Context, Cluster Manager, and a single Worker Node running an Executor with a Cache and Tasks) all run on one machine.
• Ensure that you have Java installed, and verify your version
• Most recent version you can install with Java 7: version 2.1.x
• If you have Java 8+ installed you can install version 2.2.x
Installing Local Standalone
Module 2. Installing Spark 7
• Work within a virtual environment or container
• If you are already familiar with this you can skip the next three slides
Recommendation for Success
Module 2. Installing Spark 8
• This is not required but is strongly recommended.
• a virtual environment creates an isolated environment for a project
o each project can have its own dependencies and not affect other projects
o you can easily and safely remove this environment when you are done with it
• Examples include Virtualenv and Anaconda, but there are many other good options
• You may also use a container-based environment, such as Docker.
• Use your preferred approach or organization’s recommended practice
Creating a virtual environment
Module 2. Installing Spark 9
Example : creating a virtual environment.
Anaconda
$ conda create -n my-env python
$ source activate my-env
# Here is the same thing, but specifying Python 2.7
$ conda create -n my-env python=2.7
$ source activate my-env
# Here is the same thing, but with Python 3.6
$ conda create -n my-env python=3.6
$ source activate my-env
Module 2. Installing Spark 10
Example : creating a virtual environment.
Virtualenv
$ cd /path/to/my/venvs
$ virtualenv ./my_venv
$ source ./my_venv/bin/activate
Module 2. Installing Spark 11
• The most recent version of Pyspark you can install with Java 7 is version 2.1.x
• If you have Java 8+ installed you can install Pyspark 2.2.x
Install Java
Module 2. Installing Spark 12
The simple way to install Pyspark
Recommended for Spark 2.2.x onwards
$ pip install pyspark
Module 2. Installing Spark 13
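A quick, optional sanity check after the pip install (a minimal sketch; the version string shown is only an example of what pip might resolve):

>>> import pyspark
>>> pyspark.__version__
'2.2.1'
>>> from pyspark.sql import SparkSession
>>> spark = SparkSession.builder.appName("pip-install-check").getOrCreate()
>>> spark.range(3).count()
3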
• Download source and Build
o Browse to https://spark.apache.org/downloads.html
o Choose the following settings :
Installing Pyspark 2.1.2
Module 2. Installing Spark 14
o Click on spark-2.1.2.tgz
Installing Pyspark 2.1.2
Module 2. Installing Spark 15
o Save this to /ext/spark
Installing Pyspark 2.1.2
Recommended for following along in this course
$ curl http://apache.mirrors.lucidnetworks.net/spark/spark-2.1.2/spark-2.1.2.tgz -o spark-2.1.2.tgz
# Confirm that the file size is as expected (~13MiB)
$ ls -lh spark-2.1.2.tgz
# Extract the contents
$ tar -xf spark-2.1.2.tgz
$ cd /ext/spark/spark-2.1.2
Module 2. Installing Spark 16
• Set Java home environment variable
o This is necessary only if Java home is not already set.
Installing Pyspark 2.1.2
Module 2. Installing Spark 17
Installing Pyspark 2.1.2
Set $JAVA_HOME
$ echo $JAVA_HOME
# If the above returns empty, then you’ll need to set it. On Mac OS, you can use the result of:
$ echo `/usr/libexec/java_home`
like so :
$ export JAVA_HOME=`/usr/libexec/java_home`
Module 2. Installing Spark 18
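If you prefer to do this check from Python rather than the shell, here is a minimal sketch. It assumes macOS, where /usr/libexec/java_home (the same helper used in the shell commands above) prints the default JDK path:

import os
import subprocess

java_home = os.environ.get("JAVA_HOME")
if not java_home:
    # macOS only: ask the system for the default JDK location.
    java_home = subprocess.check_output(["/usr/libexec/java_home"]).decode().strip()
    os.environ["JAVA_HOME"] = java_home
print(java_home)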
• Get the build instructions
o It is good practice to vet these yourself before running.
o Understand what you are going to do before doing it
o Spark evolves rapidly, so know how to upgrade!
• Browse to: https://spark.apache.org/documentation.html
o Scroll down to your version, click into that link
o For our case, this is https://spark.apache.org/docs/2.1.2/building-spark.html
o “Building Spark 2.2.1 using Maven requires Maven 3.3.9 or newer and Java 7+.”
o “Note that support for Java 7 was removed as of Spark 2.2.0.”
• Now to build it
Installing Pyspark 2.1.2
Module 2. Installing Spark 19
Installing Pyspark 2.1.2
Build it
$ export MAVEN_OPTS="-Xmx2g -XX:ReservedCodeCacheSize=512m"
$ ./build/mvn -DskipTests clean package
[INFO] Total time: 11:11 min
$ sudo pip install py4j
Module 2. Installing Spark 20
Installing Pyspark 2.1.2
Pro tip: turn down overly verbose logs (copy the template, then change log4j.rootCategory from INFO to WARN in the copy)
$ cd /ext/spark/spark-2.1.2
$ cp conf/log4j.properties.template conf/log4j.properties
Module 2. Installing Spark 21
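You can also lower verbosity at runtime from inside PySpark, without touching the config files. A minimal sketch (WARN is just one reasonable level):

# Run inside a pyspark shell, or any script that already has a SparkSession named spark.
spark.sparkContext.setLogLevel("WARN")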
• Pro tip: ensure the following lines are in ~/.bash_profile
o export SPARK_HOME="/ext/spark/spark-2.1.2"
o export PATH=$SPARK_HOME/bin:$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.10.4-src.zip:$PATH
• If you altered ~/.bash_profile, remember to run the following:
o $ source ~/.bash_profile
Done Installing Spark
Module 2. Installing Spark 22
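If you would rather not manage PATH entries by hand, the optional third-party findspark package (not used elsewhere in this course) can locate the build for you. A minimal sketch, assuming pip install findspark and the install path used above:

import findspark
findspark.init("/ext/spark/spark-2.1.2")   # puts pyspark and py4j on sys.path

import pyspark   # now importable from a plain python shell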
Running Pyspark Shell
$ pyspark
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.1.2
      /_/
SparkSession available as 'spark'.
>>>
Module 2. Installing Spark 23
Running Pyspark Shell
$ pyspark
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.1.2
      /_/
SparkSession available as 'spark'.
>>> spark
<pyspark.sql.session.SparkSession object at 0x1050f8d50>
>>>
Module 2. Installing Spark 24
Running Pyspark Shell
$ pyspark
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.1.2
      /_/
SparkSession available as 'spark'.
>>> spark
<pyspark.sql.session.SparkSession object at 0x1050f8d50>
>>>
Module 2. Installing Spark 25
Running Pyspark Shell
>>> sc = spark.sparkContext
>>> rdd = sc.parallelize(["Hello", "World"])
>>> print(rdd.count())
2
>>> print(rdd.collect())
['Hello', 'World']
>>> print(rdd.take(2))
['Hello', 'World']
Module 2. Installing Spark 26
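A couple of optional follow-up steps on the same RDD, not from the slides, just to show that transformations are lazy and only execute when an action is called:

>>> upper = rdd.map(lambda s: s.upper())          # transformation: nothing runs yet
>>> upper.collect()                               # action: triggers the job
['HELLO', 'WORLD']
>>> rdd.filter(lambda s: s.startswith("H")).collect()
['Hello']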
Running Pyspark in python shell
$ python
>>>
Module 2. Installing Spark 27
Running Pyspark in python shell
We’ll simplify this later -- for now let’s ensure we’re using what we just installed
$ python
>>> import sys
>>> import os
>>> SPARK_HOME = '/ext/spark/spark-2.1.2'
>>> SPARK_PY4J = "python/lib/py4j-0.10.4-src.zip"
>>> sys.path.insert(0, os.path.join(SPARK_HOME, "python")) # precede pre-existing
>>> sys.path.insert(0, os.path.join(SPARK_HOME, SPARK_PY4J))
>>> os.environ["SPARK_HOME"] = SPARK_HOME
>>>
Module 2. Installing Spark 28
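The same bootstrapping, wrapped in a small reusable helper so you can paste one call into future scripts. init_local_spark_paths is a hypothetical name, not part of Spark, and it assumes the install path and py4j version shown above:

import os
import sys

def init_local_spark_paths(spark_home, py4j_zip="python/lib/py4j-0.10.4-src.zip"):
    """Put the local Spark build's python/ dir and bundled py4j ahead of anything else."""
    sys.path.insert(0, os.path.join(spark_home, "python"))
    sys.path.insert(0, os.path.join(spark_home, py4j_zip))
    os.environ["SPARK_HOME"] = spark_home

init_local_spark_paths("/ext/spark/spark-2.1.2")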
Running Pyspark in python shell
>>> from pyspark.sql import SparkSession
>>> spark = SparkSession \
...     .builder \
...     .appName("Python Spark SQL basic example") \
...     .config("spark.some.config.option", "some-value") \
...     .getOrCreate()
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
>>> spark
<pyspark.sql.session.SparkSession object at 0x1112e4650>
Module 2. Installing Spark 29
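One detail worth knowing (a small sketch, not from the slides): getOrCreate() is idempotent, so running the builder again returns the existing session rather than starting a second one:

>>> spark2 = SparkSession.builder.getOrCreate()
>>> spark2 is spark
True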
Running Pyspark in python shell
>>> sc = spark.sparkContext
>>> rdd = sc.parallelize(["Hello", "World"])
>>> print(rdd.count())
2
>>> print(rdd.collect())
['Hello', 'World']
>>> print(rdd.take(2))
['Hello', 'World']
Module 2. Installing Spark 30
Running Pyspark in ipython shell
Repeat the steps we did in the python shell :
$ ipython
In [1]: import os
...: import sys
...: SPARK_HOME = '/ext/spark/spark-2.1.2'
...: SPARK_PY4J = "python/lib/py4j-0.10.4-src.zip"
...: sys.path.insert(0, os.path.join(SPARK_HOME, "python"))
...: sys.path.insert(0, os.path.join(SPARK_HOME, SPARK_PY4J)) # must precede IDE's py4j
...:
Module 2. Installing Spark 31
Running Pyspark in ipython shell
In [2]: java_home = "/Library/Java/JavaVirtualMachines/jdk1.7.0_80.jdk/Contents/Home"
...: os.environ["JAVA_HOME"] = java_home
...: os.environ["SPARK_HOME"] = SPARK_HOME
...: from pyspark.sql import SparkSession
...:
Module 2. Installing Spark 32
Running Pyspark in ipython shell
In [3]: spark = SparkSession \
   ...:     .builder \
   ...:     .appName("Python Spark SQL basic example") \
   ...:     .config("spark.some.config.option", "some-value") \
   ...:     .getOrCreate()
...:
Module 2. Installing Spark 33
Running Pyspark in ipython shell
In [4]: spark?
Type: SparkSession
String form: <pyspark.sql.session.SparkSession object at 0x1111f5f10>
File: /ext/spark/spark-2.1.2/python/pyspark/sql/session.py
Docstring:
The entry point to programming Spark with the Dataset and DataFrame API.
A SparkSession can be used create :class:`DataFrame`, register :class:`DataFrame` as
tables, execute SQL over tables, cache tables, and read parquet files.
To create a SparkSession, use the following builder pattern:
[...]
Module 2. Installing Spark 34
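A tiny illustration of what that docstring describes: creating a DataFrame, registering it as a temporary view, and querying it with SQL. This is a sketch added here for context, not part of the original slides:

df = spark.createDataFrame([("Hello",), ("World",)], ["word"])
df.createOrReplaceTempView("words")
spark.sql("SELECT word FROM words WHERE word LIKE 'H%'").show()
# +-----+
# | word|
# +-----+
# |Hello|
# +-----+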
Running Pyspark in ipython shell
Running the example code that we ran previously
In [5]: sc = spark.sparkContext
...: rdd = sc.parallelize(["Hello", "World"])
...: print(rdd.count())
...: print(rdd.collect())
...: print(rdd.take(2))
...:
2
['Hello', 'World']
['Hello', 'World']
Module 2. Installing Spark 35
Running Pyspark in ipython shell
In [6]: sc?
Type: SparkContext
String form: <pyspark.context.SparkContext object at 0x1111702d0>
File: /ext/spark/spark-2.1.2/python/pyspark/context.py
Docstring:
Main entry point for Spark functionality. A SparkContext represents the
connection to a Spark cluster, and can be used to create L{RDD} and
broadcast variables on that cluster.
Init docstring:
Create a new SparkContext. At least the master and app name should be set,
either through the named parameters here or through C{conf}.
Module 2. Installing Spark 36
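The docstring mentions broadcast variables; here is a minimal sketch using the rdd created earlier (not from the slides):

lookup = sc.broadcast({"Hello": 1, "World": 2})   # read-only value shipped to executors
rdd.map(lambda s: lookup.value[s]).collect()      # -> [1, 2]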
Running Pyspark in ipython shell
In [7]: rdd?
Type: RDD
String form: ParallelCollectionRDD[0] at parallelize at PythonRDD.scala:475
File: /ext/spark/spark-2.1.2/python/pyspark/rdd.py
Docstring:
A Resilient Distributed Dataset (RDD), the basic abstraction in Spark.
Represents an immutable, partitioned collection of elements that can be
operated on in parallel.
Module 2. Installing Spark 37
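The "partitioned" part of that description is easy to see directly. A minimal sketch; the first number depends on how many local cores Spark detected on your machine:

rdd.getNumPartitions()                                      # e.g. 8 on an 8-core laptop
sc.parallelize(["Hello", "World"], 2).getNumPartitions()    # explicitly request 2 partitions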
Running Pyspark in ipython shell
Tab completion
In [8]: rdd.
rdd.aggregate rdd.coalesce rdd.context rdd.countByValue
rdd.aggregateByKey rdd.cogroup rdd.count rdd.ctx
rdd.cache rdd.collect rdd.countApprox rdd.distinct
rdd.cartesian rdd.collectAsMap rdd.countApproxDistinct rdd.filter
rdd.checkpoint rdd.combineByKey rdd.countByKey rdd.first
Module 2. Installing Spark 38
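Trying one of the completions listed above (a small sketch, not from the slides; countByValue is an action that returns a plain dict-like result on the driver):

rdd.countByValue()        # -> defaultdict(<class 'int'>, {'Hello': 1, 'World': 1})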
Running Pyspark in ipython shell
Magic functions
In [9]: %whos
Variable Type Data/Info
----------------------------------------
SPARK_HOME str /ext/spark/spark-2.1.2
SPARK_PY4J str python/lib/py4j-0.10.4-src.zip
SparkSession type <class 'pyspark.sql.session.SparkSession'>
java_home str /Library/Java/JavaVirtual<...>.7.0_80.jdk/Contents/Home
rdd RDD ParallelCollectionRDD[0] <...>ze at PythonRDD.scala:475
sc SparkContext <pyspark.context.SparkCon<...>xt object at 0x1111702d0>
spark SparkSession <pyspark.sql.session.Spar<...>on object at 0x1111f5f10>
Module 2. Installing Spark 39
• We learned several ways of installing Pyspark
• We ran the Spark Shell
• We showed how to run our new install in python and ipython
• We tested Pyspark by running sample code in all three of these shells
• We used two important objects: the Spark Session, and the Spark Context
• We created an RDD and inspected its contents.
What we covered
Module 2. Installing Spark 40
• Next time we’ll cover more ways of running Spark.
• We’ll show it in an IDE and in a notebook.
• We’ll get our first view of the Spark UI
Next time: Running Spark
Module 2. Installing Spark 41
Editor's Notes

  • #2: Section Beginning (Dark Color Option )
  • #3: This module is geared toward getting your own local standalone version of Spark running. Several exercises get you working with your new Spark instance. This module is important as you will need an environment on which to complete the hands-on exercises in later modules. copyright Mark E Plutowski 2018
  • #4: https://spark.apache.org/docs/2.1.2/cluster-overview.html The cluster manager has its own abstractions which are separate from Spark but can use the same names ‘driver’ node, sometimes called its ‘master’ node This is different from the Spark Driver and the Driver Program ‘worker’ nodes These are tied to physical machines, whereas in Spark they are tied to processes. A Spark Application requests resources from the Cluster Manager depending on the application, this could include a place to run the Spark Driver Program or, just resources for running the Executors Here a Worker Node contains a single Executor. It can have more. How many executor processes run for each worker node in spark? If using Dynamic Allocation, Spark will decide this for you. Or, you can stipulate this When and How this is done is outside the scope of this course, however, this course will prepare you to dig deeper into the answers to this question. copyright Mark E Plutowski 2018
  • #5: The Cluster Manager has its own notion of Driver Node, not to be confused with Spark’s Driver Node The Driver Node in a cluster is the one that is running the Driver Program Driver Program The Driver Program is sometimes used interchangeably to refer to the Spark Application, and to code being executed within the Driver Process created within a Spark Session. Later in this module, we’ll illustrate this by visualizing the Spark Driver in the Spark UI The Driver Program declares the transformations and actions on data The main() method in the Spark application runs on the Driver, This is similar to but distinct from an Executor See the Cluster Overview documentation for reference http://spark.apache.org/docs/latest/cluster-overview.html Pro tip: to avoid confusion, when it isn’t apparent from context be clear what you mean when you say “Driver” when referring to the lines of code that comprise the Spark Application, I say Spark Application code or script when referring to the Driver object that is created by the Spark Session, I say Driver Process when referring to the server in a cluster that is running the driver, I say Master Node in writing, when necessary to disambiguate how you are using the term “Spark Driver” or “Driver Program”, provide a link to Apache Spark documentation for the particular usage you intend. We will see the Spark Driver Process in action using the Spark UI later in this module copyright Mark E Plutowski 2018
  • #6: We cover Execution Modes in more depth in Module 3 -- I’ll touch on them briefly now. Cluster mode: the cluster manager places the driver on a worker node and the executors on other worker nodes; the application is provided as a Jar, Egg, or application script (Python, Scala, R). Client mode: same as cluster mode except that the driver stays on the client machine that submitted the application; you may be running the application from a machine not colocated with the workers in the cluster, aka a “gateway machine” or “edge node”. Local: runs everything on a single machine. You’ve seen how to create a Spark application that runs just as a traditional executable, but you’ve also used spark-submit --- which is a primary way to launch jobs on a Spark cluster. By “job” here I mean executing a Driver Program. copyright Mark E Plutowski 2018
  • #7: In the standalone mode we install today these components will run on your single laptop or server. https://spark.apache.org/docs/2.1.2/cluster-overview.html copyright Mark E Plutowski 2018
  • #8: You can also use the Databricks Community Edition, which I will demonstrate in the next module. There are many excellent guides https://blog.sicara.com/get-started-pyspark-jupyter-guide-tutorial-ae2fe84f594f http://sigdelta.com/blog/how-to-install-pyspark-locally/ You have a lot of flexibility here. There is a quick and easy installation On the other end of this scale is a full blown cluster configuration. What we will install here is a happy medium : fairly easy with ample flexibility copyright Mark E Plutowski 2018
  • #9: the main purpose of a virtual environment is to create an isolated environment for a project. This means that each project can have its own dependencies, regardless of what dependencies every other project has. You can use Virtualenv, Anaconda (also referred to as conda, because that is what it uses on the command line to invoke commands), I leave that choice up to you. It is not required that you create a new virtual environment for this project, but it is highly recommended. There are many quick and easy tutorials for learning how to install virtualenv, Anaconda, or other virtual environment. You may also use a container-based environment, such as Docker. Rather than impose one choice I leave that to you as typically each organization has their own choice for this and set of best practices. copyright Mark E Plutowski 2018
  • #10: The main purpose of a virtual environment is to create an isolated environment for a project. This means that each project can have its own dependencies, regardless of what dependencies every other project has. You can use Virtualenv or Anaconda (also referred to as conda, because that is what it uses on the command line to invoke commands); I leave that choice up to you. It is not required that you create a new virtual environment for this project, but it is highly recommended. There are many quick and easy tutorials for learning how to install virtualenv, Anaconda, or another virtual environment. You may also use a container-based environment, such as Docker. Rather than impose one choice I leave that to you, as typically each organization has its own choice for this and its own set of best practices. I emphasize this recommendation because of the potentially wide audience, with apologies to the many of you who already know this. As a software engineer you should be aware of dependencies between projects. Also, running the exact version that I am demonstrating makes it much easier to debug your setup if you run into issues with your install. copyright Mark E Plutowski 2018
  • #11: copyright Mark E Plutowski 2018
  • #12: copyright Mark E Plutowski 2018
  • #13: In the following, bulleted notes are instructions for what to do in a browser, whereas black backgrounded slides contain command lines to be run from in a terminal console. copyright Mark E Plutowski 2018
  • #14: It can be that easy. However, depending on your particular environment and the version you need, this could end up being limiting or more involved. If this works for you, great! You can skip the next 10 slides. If not, continue on -- it will take less than 15 minutes. In what follows I am going to provide instructions for installing from source. This allows complete access to the source, example code, and example datasets. If you want to be able to reproduce exactly what I do, please follow these steps. Otherwise, you may skip ahead. copyright Mark E Plutowski 2018
  • #15: This page gives instruction to download, build, and configure Spark 2.1.2 in standalone mode from source. I encourage you to use the latest, 2.2.x ; however, version 2.1.2 is well tested. It also runs in Java 7, whereas 2.2.x requires Java 8+. Also, if you want to ensure that your examples run as closely to mine as possible, use the setup instructions offered here. Standalone mode runs on a single server and is a breeze to use within python shell, iPython, IDE or from command line. These directions work on Ubuntu or Mac OS. copyright Mark E Plutowski 2018
  • #16: Save this to /ext/spark
  • #17: The first line is unnecessary if you did this as a save-as from within the browser. You could also use your File Browser to inspect the size instead of using the second line. This is to ensure that you didn’t mistakenly download the wrong file when doing save as from the browser. Extract the contents and change directory.
  • #18: This page gives instruction to download, build, and configure Spark 2.1.2 in standalone mode from source. Standalone mode runs on a single server and is a breeze to use within python shell, iPython, IDE or from command line. These directions work on Ubuntu or Mac OS. Why not just upgrade to Java 8+? If you are applying this to a production environment where it is still relying on Java 7, then doing so could delay getting your application into production, if it utilizes features that depend on Java 8+. Check the versions of Java being utilized in your deployment environment, respectively as well for Python, Scala, or R, if you are going to be using those for scripting your Spark application. copyright Mark E Plutowski 2018
  • #19: copyright Mark E Plutowski 2018
  • #20: Always Read the Notes. You might need to perform an additional installation to proceed, depending on your operating system, your environment, the version you chose, etc. copyright Mark E Plutowski 2018
  • #21: The build/mvn line will take ten to fifteen minutes. copyright Mark E Plutowski 2018
  • #22: The base install uses logging settings that are overly verbose. To enable settings that are less verbose, do this step. This also points you to where other logging configuration settings can be made.
  • #23: That is probably the trickiest part of this course. Once you have Spark installed, you should be able to follow along from here. If you have Pyspark 2.1.2 installed, then all subsequent code examples should work identically for you. Of course, there are always exceptions, depending upon your particular setup; however, we are probably over the hardest part. Hopefully it wasn’t too tricky for any of you. For many of you this should have been pretty straightforward. If you chose the quick and easy installation path, and are rejoining us now, welcome back! Do make sure that you know where to find the code examples and example datasets that we’ll be referring to subsequently, which were downloaded along with the source. copyright Mark E Plutowski 2018
  • #24: That lowest line is important -- let’s see what this means. Enter spark at the prompt, like so ... copyright Mark E Plutowski 2018
  • #25: This is your Spark Session -- this provides the point of entry for interacting with Spark. This also provides access to the RDD Context (commonly referred to using the variable sc) It also provides access to the Sql Context (commonly referred to using the variable sqlContext) copyright Mark E Plutowski 2018
  • #26: This is your Spark Session -- this provides the point of entry for interacting with Spark. This also provides access to the RDD Context (commonly referred to using the variable sc) It also provides access to the Sql Context (commonly referred to using the variable sqlContext) copyright Mark E Plutowski 2018
  • #27: We obtained a handle to the Spark context from the Spark session Created an RDD, which is a type of parallelized collection. Confirmed that it contains two rows. Displayed its contents. Displayed its contents in a different way. copyright Mark E Plutowski 2018
  • #28: I’ll show you a way to use the version of pyspark that you just installed. If you already have your environment variables and path settings configured, you can skip this step. copyright Mark E Plutowski 2018
  • #29: This helps ensure that python is using the version we installed. If you set your environment variables and path settings properly, you can skip this step; however, this is a way to ensure that you are using the version we installed. We’ll simplify this in a later module. Note that we inserted the path instead of appending it … this avoids very weird bugs that can arise, especially in an IDE already having a different version of Py4j installed. copyright Mark E Plutowski 2018
  • #30: Note that the Spark Shell gives an already instantiated Spark Session variable, ‘spark’. Here, we need to create that ourselves. copyright Mark E Plutowski 2018
  • #31: This runs the same example we ran previously in the Spark shell. Note that it is exactly the same. Once you get your development environment configured, you can typically run things one place or the other without modification. We obtained a handle to the Spark context from the Spark session Created an RDD, which is a type of parallelized collection. Confirmed that it contains two rows. Displayed its contents. Displayed its contents in a different way. copyright Mark E Plutowski 2018
  • #32: Repeat the steps we used in the python shell. copyright Mark E Plutowski 2018
  • #33: Repeat the steps we used in the python shell. copyright Mark E Plutowski 2018
  • #34: Repeat the steps we used in the python shell. copyright Mark E Plutowski 2018
  • #35: Repeat the steps we used in the python shell. Note that we can now inspect the object using object? notation. I clipped 20+ lines of output it generated, which gives additional tips on how to use this object. copyright Mark E Plutowski 2018
  • #36: We of course get the identical results using the code snippet we used to test it in the Spark shell and the python shell. copyright Mark E Plutowski 2018
  • #37: This tells us more about the sc variable we created. This is the Spark Context object. copyright Mark E Plutowski 2018
  • #38: This tells us more about the rdd variable we created, which is an RDD. copyright Mark E Plutowski 2018
  • #39: And of course we get the other niceties of the ipython shell, such as tab completion. copyright Mark E Plutowski 2018
  • #40: And of course we get the other niceties of the ipython shell, magic functions copyright Mark E Plutowski 2018
  • #41: copyright Mark E Plutowski 2018
  • #42: Next module shows how to run a Spark application. It shows more ways you can work with Spark, including within an IDE and notebook. We’ll get our first view of the Spark UI, using it to inspect a running application. copyright Mark E Plutowski 2018