INTRO TO PYSPARK
Jon Haddad, Technical Evangelist, DataStax
@rustyrazorblade
WHAT TOOLS ARE YOU ALREADY USING FOR DATA ANALYSIS?
NumPy / SciPy
Pandas
iPython Notebooks
scikit-learn
hdf5
pybrain
WHAT'S THE PROBLEM?
GREAT TOOLS
BUT NOT BUILT FOR BIG DATA SETS
And not real time...
LIMITED TO 1 MACHINE
What if we have a lot of data?
What if we use Cassandra?
We need distributed computing
WHAT IS SPARK?
Fast and general purpose cluster computing system
Use when we have more data than fits on a single machine
LANGUAGES
Scala
Java
R (Spark 1.4 and later)
Python
WHAT CAN I DO WITH IT?
Read and write data in bulk to and from Cassandra
Batch processing
Stream processing
Machine Learning
Distributed SQL
BATCH PROCESSING
Operate on the entire dataset (or at least a big chunk of it)
RDD
Resilient Distributed Dataset (it's a big list)
Use functional concepts like map, filter, reduce
Caveat: you always pay a serialization penalty moving data between the JVM and Python
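The functional style mentioned above can be tried on an ordinary Python list before touching a cluster. This is a minimal local sketch, not Spark code: the `numbers` list is made-up sample data standing in for the elements of an RDD, and Python's built-in `map` and `filter` plus `functools.reduce` play the role of the RDD methods with the same names.

```python
from functools import reduce

# Hypothetical sample data standing in for the elements of an RDD.
numbers = [1, 2, 3, 4, 5, 6]

# map: apply a function to every element.
squares = list(map(lambda x: x * x, numbers))

# filter: keep only the elements that satisfy a predicate.
even_squares = list(filter(lambda x: x % 2 == 0, squares))

# reduce: fold the remaining elements into a single value.
total = reduce(lambda x, y: x + y, even_squares)

print(squares)
print(even_squares)
print(total)
```

On an RDD the same lambdas would be passed to the RDD's own map, filter and reduce methods, with the work spread across the cluster instead of running in one process.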
DATA MIGRATIONS
USERS
name     favorite_food
jon      bacon
luke     pie
patrick  pizza
rachel   pizza
SET UP OUR KEYSPACE
create KEYSPACE demo WITH replication =
{'class': 'SimpleStrategy', 'replication_factor': 1};
use demo ;
CREATE OUR DEMO USER TABLE
create TABLE user ( name text PRIMARY KEY,
favorite_food text );
insert into user (name, favorite_food) values ('jon', 'bacon');
insert into user (name, favorite_food) values ('luke', 'pie');
insert into user (name, favorite_food) values ('patrick', 'pizza');
insert into user (name, favorite_food) values ('rachel', 'pizza');
create table favorite_foods ( food text, name text,
primary key (food, name));
MAPPING FOODS TO USERS
from pyspark_cassandra import CassandraSparkContext, Row
from pyspark import SparkContext, SparkConf
# Parentheses let the chained calls span several lines
conf = (SparkConf()
        .setAppName("User Food Migration")
        .setMaster("spark://127.0.0.1:7077")
        .set("spark.cassandra.connection.host", "127.0.0.1"))
sc = CassandraSparkContext(conf=conf)
users = sc.cassandraTable("demo", "user")
favorite_foods = users.map(lambda x:
{"food":x['favorite_food'],
"name":x['name']} )
favorite_foods.saveToCassandra("demo", "favorite_foods")
MIGRATION RESULTS
cqlsh:demo> select * from favorite_foods ;
 food  | name
-------+---------
 pizza | patrick
 pizza | rachel
 pie   | luke
 bacon | jon
(4 rows)
cqlsh:demo> select * from favorite_foods where food = 'pizza';
 food  | name
-------+---------
 pizza | patrick
 pizza | rachel
AGGREGATIONS
u = sc.cassandraTable("demo", "user")
# Emit (food, 1) per user, then sum the 1s for each food
(u.map(lambda x: (x['favorite_food'], 1))
  .reduceByKey(lambda x, y: x + y).collect())
[(u'bacon', 1), (u'pie', 1), (u'pizza', 2)]
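What reduceByKey does in the aggregation above can be sketched in plain Python on one machine. This is only an illustration of the semantics, not how Spark implements it: `reduce_by_key` is a hypothetical helper, the `users` list copies the rows from the USERS slide, and the result is sorted purely so the output order is predictable.

```python
def reduce_by_key(pairs, func):
    """Combine the values of each key with func (single-machine illustration)."""
    combined = {}
    for key, value in pairs:
        if key in combined:
            combined[key] = func(combined[key], value)
        else:
            combined[key] = value
    # Sorting is an addition for readable output; Spark gives no ordering guarantee.
    return sorted(combined.items())


# Rows copied from the USERS slide.
users = [
    {"name": "jon", "favorite_food": "bacon"},
    {"name": "luke", "favorite_food": "pie"},
    {"name": "patrick", "favorite_food": "pizza"},
    {"name": "rachel", "favorite_food": "pizza"},
]

# Same two steps as the Spark job: emit (food, 1), then sum per food.
pairs = [(u["favorite_food"], 1) for u in users]
counts = reduce_by_key(pairs, lambda x, y: x + y)
print(counts)
```

In Spark the combining function is applied within each partition first and then across partitions, which is why it should be associative and commutative, as addition is here.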
RDDS ARE COOL
And very powerful
But kind of annoying
DATAFRAMES
From R language
Available in Python via Pandas
DataFrames allow for optimized filters, sorting, grouping
With Spark, all the data stays in the JVM
With Cassandra it's still expensive due to JVM <> Python
But it can be fixed
DATAFRAMES EXAMPLE
from pyspark_cassandra import CassandraSparkContext, Row
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext
sql = SQLContext(sc) # creating the SQLContext is what adds toDF() to RDDs
users = sc.cassandraTable("demo", "user").toDF()
food_count = (users.select("favorite_food")
              .groupBy("favorite_food").count())
food_count.collect()
[Row(favorite_food=u'bacon', count=1),
Row(favorite_food=u'pizza', count=2),
Row(favorite_food=u'pie', count=1)]
SPARKSQL
Register dataframes as tables
JOIN, GROUP BY
SPARKSQL IN ACTION
sql = SQLContext(sc)
users = sc.cassandraTable("demo", "user").toDF()
users.registerTempTable("users")
sql.sql("""select favorite_food, count(favorite_food)
from users group by favorite_food """).collect()
[Row(favorite_food=u'bacon', c1=1),
Row(favorite_food=u'pizza', c1=2),
Row(favorite_food=u'pie', c1=1)]
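The logic of the GROUP BY query can be checked without a cluster by running it against an in-memory SQLite database. This is a stand-in for illustration only: `sqlite3` from the Python standard library is not Spark SQL, the table is filled by hand with the rows from the USERS slide, and the `order by` clause is an addition so the result order is deterministic.

```python
import sqlite3

# In-memory database used as a local stand-in for the registered "users" table.
conn = sqlite3.connect(":memory:")
conn.execute("create table users (name text primary key, favorite_food text)")
conn.executemany(
    "insert into users (name, favorite_food) values (?, ?)",
    [("jon", "bacon"), ("luke", "pie"), ("patrick", "pizza"), ("rachel", "pizza")],
)

# Same aggregation as the slide, plus an ORDER BY for a stable result.
rows = conn.execute(
    """select favorite_food, count(favorite_food)
       from users
       group by favorite_food
       order by favorite_food"""
).fetchall()
conn.close()

print(rows)
```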
STREAMING
Operate on batch windows
Each batch is a small RDD
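The batch-window idea can be sketched in plain Python: events are bucketed by arrival time, and each bucket is then handled as one small collection. This is a simplified illustration, not Spark Streaming itself. `micro_batches` is a hypothetical helper, the `events` list is made-up page-view data, and windows that receive no events are simply skipped here.

```python
from collections import defaultdict


def micro_batches(events, window_seconds):
    """Group (timestamp, payload) events into consecutive fixed-size time windows."""
    batches = defaultdict(list)
    for timestamp, payload in events:
        # Every event with a timestamp in [k * w, (k + 1) * w) lands in window k.
        window_index = int(timestamp // window_seconds)
        batches[window_index].append(payload)
    # Return the non-empty windows in time order.
    return [batches[i] for i in sorted(batches)]


# Hypothetical page-view events: (seconds since start, page).
events = [
    (0.1, "/home"),
    (0.7, "/about"),
    (1.2, "/home"),
    (2.5, "/home"),
    (2.9, "/pricing"),
]

for batch in micro_batches(events, 1):
    # Each batch is a small list, processed like a small RDD would be.
    print(len(batch), batch)
```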
PRETTY PICTURE
STREAMING
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
stream = StreamingContext(sc, 1) # 1 second window
kafka_stream = KafkaUtils.createStream(stream, 
"localhost:2181", 
"raw-event-streaming-consumer",
{"pageviews":1})
# kafka_stream is a DStream; each batch arrives as a small RDD
stream.start()
stream.awaitTermination()
MACHINE LEARNING
Supervised learning
Unsupervised learning
SUPERVISED LEARNING
When we know the inputs and outputs
Example: Real estate prices
Take existing knowledge about houses and prices
Build a model to predict the future
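The real estate example can be made concrete with a one-variable least-squares fit in plain Python. This is a toy sketch, not Spark's machine learning library: `fit_line` is a hypothetical helper, and the sizes and prices are made-up numbers chosen so the relationship is exactly linear.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up training data: known inputs (size) and known outputs (price).
sizes = [50, 80, 100, 120]      # square metres
prices = [150, 240, 300, 360]   # thousands

slope, intercept = fit_line(sizes, prices)

# Use the fitted model to predict the price of a house not in the data.
predicted = slope * 90 + intercept
print(slope, intercept, predicted)
```

The known prices are what make this supervised: the model is fitted against outputs we already have, then applied to an input we have not seen.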
UNSUPERVISED LEARNING
When we don't know the output
Popular usage: discover groups
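Discovering groups can be illustrated with a tiny one-dimensional k-means in plain Python. This is a toy sketch, not Spark's machine learning library: `kmeans_1d` is a hypothetical helper, the `ages` list is made-up data, and the starting centers are picked by hand rather than randomly.

```python
def kmeans_1d(values, centers, iterations=10):
    """Tiny 1-D k-means: assign each value to its nearest center, then move the centers."""
    centers = list(centers)
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            groups[nearest].append(v)
        # Move each center to the mean of its group; keep it if the group is empty.
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers, groups


# Made-up data with two obvious clusters and no labels attached.
ages = [21, 22, 23, 60, 61, 62]

centers, groups = kmeans_1d(ages, centers=[0, 100])
print(centers)
print(groups)
```

No output column is supplied anywhere: the two groups come only from how the values sit relative to each other.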
INTERACTIVE IPYTHON NOTEBOOKS
Iterate quickly
Visualize your data
GET STARTED!
Open Source:
Download Cassandra
Download Spark
Cassandra PySpark Repo:
https://github.com/TargetHolding/pyspark-cassandra
Integrated solution
Download DataStax Enterprise
