Scalable Machine Learning
with PySpark
Ladle Patel
Life Cycle of Data Science Project
Data is Growing So Fast
Sources of Big Data
Working Individually
What Is the Solution?
Teamwork
Problem in Teamwork
What Do You Need?
Team Hierarchy
Individual vs. Teamwork
Master-Slave Architecture
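The teamwork analogy above maps onto a coordinator that splits the work and workers that each handle one piece. The following is a minimal sketch of that pattern using only the Python standard library, not Spark itself; the function names (split_into_partitions, worker_task, distributed_mean) are hypothetical and chosen for illustration, and threads stand in for the separate machines a real cluster would use.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(data, num_partitions):
    """Split a list into roughly equal contiguous chunks."""
    size, extra = divmod(len(data), num_partitions)
    partitions, start = [], 0
    for i in range(num_partitions):
        # The first `extra` partitions take one additional element.
        end = start + size + (1 if i < extra else 0)
        partitions.append(data[start:end])
        start = end
    return partitions


def worker_task(partition):
    """Work done independently on one partition: a partial sum and a count."""
    return sum(partition), len(partition)


def distributed_mean(data, num_workers=4):
    """The coordinator hands out partitions, then merges the partial results."""
    partitions = split_into_partitions(data, num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partials = list(pool.map(worker_task, partitions))
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count
```

Calling distributed_mean(list(range(1, 101))) computes the mean from four partial sums rather than one pass over the whole list. The key design point is that each worker returns a small summary, so only summaries travel back to the coordinator.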
What Is Spark?
Apache Spark is an open-source, distributed, general-purpose cluster-computing framework with an in-memory data processing engine. It handles ETL, machine learning, and graph processing on large volumes of data, either at rest (batch processing) or in motion (stream processing), and offers high-level APIs for Scala, Python, Java, and R.
Real Time Spark Cluster
Life Cycle of Big Data - Data Science Project
Spark
dataframe
Hands-on
Educational Materials and Tutorials
https://docs.databricks.com/spark/latest/training/index.html
https://spark.apache.org/
https://github.com/lp-dataninja
https://github.com/databricks/Spark-The-Definitive-Guide
https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/5722190290795989/875048944749694/8175309257345795/latest.html
Join our team: we are hiring
ladle.patel@genpact.com
ladlepatelr@gmail.com
Datasets for Building Scalable Machine Learning Models
https://www.kaggle.com/benhamner/competitions-with-largest-datasets
https://www.quora.com/Where-can-I-find-large-datasets-open-to-the-public