Getting Started with Apache Spark
Presented By
Manish Mishra
Pradyuman Pratap Singh
Lack of etiquette and manners is a huge turn-off.
KnolX Etiquettes
• Punctuality
Join the session 5 minutes prior to the session start time. We start on
time and conclude on time!
• Feedback
Make sure to submit constructive feedback for all sessions, as it is very
helpful for the presenter.
• Silent Mode
Keep your mobile devices in silent mode. Feel free to step out of the session
if you need to take an urgent call.
• Avoid Disturbance
Avoid unwanted chit-chat during the session.
1. Introduction to Big Data and Apache Spark
   • What is Big Data?
   • What is Apache Spark?
   • Features of Apache Spark
2. Overview of Spark Architecture
3. Spark Components
4. Spark Basics & Programming Model
   • Spark Context
   • Spark Session
   • RDD
   • DataFrame
   • RDD vs. DataFrame
5. Advantages of Apache Spark
6. Disadvantages of Apache Spark
7. Demo
What is Big Data?
Big Data refers to data sets so large, fast-moving, and complex that
traditional data-processing systems cannot handle them efficiently. It
includes a wide variety of data types from many sources.
It is characterized by the 5 Vs:
• Volume: Massive amounts of data.
• Velocity: The speed at which data is generated and processed.
• Variety: Different types of data (structured, semi-structured, unstructured).
• Veracity: The quality and accuracy of the data.
• Value: The value the data provides.
What is Apache Spark?
• Apache Spark is an open-source engine for large-scale distributed data
processing, analytics, and machine learning. It handles both batch and
near-real-time (streaming) workloads.
• It builds on the ideas of Hadoop MapReduce and generalizes that model to
support more types of computation efficiently, including interactive queries
and stream processing. Spark does not require Hadoop, although it can run on a
Hadoop cluster and read data from HDFS.
• The main feature of Spark is in-memory computing: intermediate data can stay
in memory between operations instead of being written to disk after each step,
which increases the processing speed of an application.
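The in-memory point can be illustrated with a small sketch. It assumes a local run (`local[*]` master) and uses a made-up data set of squares; the object and method names are hypothetical and not part of the original deck.

```scala
import org.apache.spark.sql.SparkSession

object InMemorySketch {
  // Returns (sum of squares, largest square, whether the RDD was marked for memory storage).
  def run(spark: SparkSession): (Long, Long, Boolean) = {
    val squares = spark.sparkContext
      .parallelize(1 to 1000)
      .map(n => n.toLong * n)

    // cache() only marks the RDD for in-memory storage; nothing is computed yet.
    squares.cache()

    val total = squares.reduce(_ + _)        // first action: computes and caches the partitions
    val largest = squares.max()              // second action: can reuse the cached partitions
    val inMemory = squares.getStorageLevel.useMemory

    squares.unpersist()                      // release the cached partitions
    (total, largest, inMemory)
  }

  def main(args: Array[String]): Unit = {
    // local[*] is an assumption for running on a single machine.
    val spark = SparkSession.builder().appName("in-memory-sketch").master("local[*]").getOrCreate()
    println(run(spark))
    spark.stop()
  }
}
```

Without `cache()`, each action would recompute the `map` from the source collection; with it, the second action can read the partitions kept in memory.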
Features of Apache Spark
• In-memory computation
• Speed
• Support for different cluster managers
• Distributed processing
• Fault tolerance
• Lazy evaluation
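Lazy evaluation is the least obvious feature in this list, so here is a minimal sketch of it. It assumes a local run and uses a `LongAccumulator` only as a counter to observe when the `map` function actually runs; the object name is hypothetical.

```scala
import org.apache.spark.sql.SparkSession

object LazyEvaluationSketch {
  // Returns how many times the map function had run before and after the action.
  def run(spark: SparkSession): (Long, Long) = {
    val sc = spark.sparkContext
    val calls = sc.longAccumulator("map-calls")

    // Transformation: recorded in the lineage, but not executed yet.
    val doubled = sc.parallelize(1 to 5).map { n => calls.add(1); n * 2 }
    val before = calls.sum

    // Action: triggers the actual computation of the map above.
    doubled.count()
    val after = calls.sum

    (before, after)
  }

  def main(args: Array[String]): Unit = {
    // local[*] is an assumption for running on a single machine.
    val spark = SparkSession.builder().appName("lazy-sketch").master("local[*]").getOrCreate()
    println(run(spark))
    spark.stop()
  }
}
```

Accumulator updates made inside transformations can be applied more than once if a task is retried, so this counter is a demonstration aid rather than a pattern for production code.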
02
Apache Spark Architecture
03
Spark Components
Spark Core (the Spark engine)
Libraries:
• Spark SQL
• Spark Streaming (real-time processing)
• MLlib (machine learning)
• GraphX (graph processing)
Supported languages: Scala, Java, Python, R
04
Spark Basics
1. SparkContext: SparkContext is the original entry point to Spark functionality.
When a Spark application runs, a driver program starts. It contains the main
function, and the SparkContext is created there. The driver then schedules the
operations as tasks that run inside executors on the worker nodes.
2. SparkSession: SparkSession, introduced in Spark 2.0, is the unified entry point
for Spark applications. It gives access to Spark's underlying functionality,
including RDDs (through its SparkContext), DataFrames, and Datasets, and provides
a single interface for structured data processing.
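The relationship between the two entry points can be sketched as follows. The application name and the `local[*]` master are assumptions for a local run, and the object name is hypothetical.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession

object EntryPointsSketch {
  // Counts three elements once through the SparkContext and once through the SparkSession.
  def counts(spark: SparkSession): (Long, Long) = {
    // The SparkContext is reachable from the session; there is no need to create one separately.
    val sc: SparkContext = spark.sparkContext

    val viaContext = sc.parallelize(Seq(1, 2, 3)).count()  // RDD API
    val viaSession = spark.range(3).count()                // Dataset API
    (viaContext, viaSession)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("entry-points-sketch")
      .master("local[*]")   // assumption: run locally using all cores
      .getOrCreate()

    println(s"master: ${spark.sparkContext.master}")
    println(counts(spark))
    spark.stop()
  }
}
```

In code written for Spark 2.0 or later, creating a SparkSession and reaching the SparkContext through it is usually enough.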
RDD
• A Resilient Distributed Dataset (RDD) is the fundamental data structure of Spark:
an immutable, distributed collection of objects. Each RDD is divided into logical
partitions, which may be computed on different nodes of the cluster.
• There are two ways to create an RDD: parallelizing an existing collection in your
driver program, or referencing a dataset in an external storage system, such as a
shared file system, HDFS, HBase, or any data source offering a Hadoop InputFormat.
RDD operations:
o Transformations (for example map or filter), which lazily define a new RDD
o Actions (for example count or collect), which trigger computation and return a result
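Both ways of creating an RDD, followed by a transformation and an action, can be sketched like this. The temporary text file stands in for an external data set and is an assumption of the example, as are the object and method names.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

object RddSketch {
  // Returns (sum of squares of the even numbers in 1..10, number of lines in the file).
  def run(spark: SparkSession): (Int, Long) = {
    val sc = spark.sparkContext

    // 1. Parallelize an existing collection held by the driver program.
    val numbers = sc.parallelize(1 to 10, numSlices = 4)
    val evens = numbers.filter(_ % 2 == 0)                   // transformation (lazy)
    val sumOfSquares = evens.map(n => n * n).reduce(_ + _)   // reduce is an action

    // 2. Reference a dataset in external storage (here, a local temp file).
    val path = Files.createTempFile("rdd-sketch", ".txt")
    Files.write(path, "spark\nscala\nrdd\n".getBytes("UTF-8"))
    val lines = sc.textFile(path.toUri.toString)
    val lineCount = lines.count()                            // action

    (sumOfSquares, lineCount)
  }

  def main(args: Array[String]): Unit = {
    // local[*] is an assumption for running on a single machine.
    val spark = SparkSession.builder().appName("rdd-sketch").master("local[*]").getOrCreate()
    println(run(spark))
    spark.stop()
  }
}
```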
DataFrame
• In Spark, a DataFrame is a distributed collection of data organized into rows
and named columns. Each column in a DataFrame has a name and an associated type.
DataFrames are like traditional database tables: structured and concise.
• Conceptually, a DataFrame is similar to a table in a relational database, with
richer optimizations applied under the hood.
• Spark DataFrames can be created from various sources, such as Hive tables, log
tables, external databases, or existing RDDs. DataFrames allow the processing of
huge amounts of data.
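As a rough sketch of the points above, the following builds a DataFrame from an existing RDD and runs a simple aggregation. The `Employee` case class and its rows are made-up sample data, and the object and method names are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.avg

object DataFrameSketch {
  // Hypothetical record type; its fields become the named, typed columns.
  final case class Employee(name: String, dept: String, salary: Double)

  // Builds a DataFrame from an existing RDD of case-class instances.
  def build(spark: SparkSession): DataFrame = {
    import spark.implicits._
    val rdd = spark.sparkContext.parallelize(Seq(
      Employee("Asha", "eng", 120.0),
      Employee("Ben", "eng", 100.0),
      Employee("Chen", "ops", 90.0)
    ))
    rdd.toDF()
  }

  // Average salary per department, collected back to the driver as a Map.
  def avgSalaryByDept(df: DataFrame): Map[String, Double] =
    df.groupBy("dept")
      .agg(avg("salary").as("avg_salary"))
      .collect()
      .map(row => row.getString(0) -> row.getDouble(1))
      .toMap

  def main(args: Array[String]): Unit = {
    // local[*] is an assumption for running on a single machine.
    val spark = SparkSession.builder().appName("dataframe-sketch").master("local[*]").getOrCreate()
    val df = build(spark)
    df.printSchema()   // shows each column's name and type
    println(avgSalaryByDept(df))
    spark.stop()
  }
}
```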
RDD vs. DataFrame
Feature: Data format
  RDD: Structured and unstructured data.
  DataFrame: Structured and semi-structured data.
Feature: APIs
  RDD: Provides a low-level API that requires more code to perform
  transformations and actions on data.
  DataFrame: Provides a high-level API that makes it easier to perform
  transformations and actions on data.
Feature: Schema enforcement
  RDD: Does not have an explicit schema, and is often used for unstructured data.
  DataFrame: Has an explicit schema that describes the data and its types.
  The schema is enforced at runtime.
Feature: Optimization
  RDD: No built-in optimization engine is available.
  DataFrame: Uses the Catalyst optimizer.
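To make the comparison concrete, here is the same aggregation written against both APIs. The sales data and all names are hypothetical; the call to `explain()` prints the physical plan that Spark produced for the DataFrame query.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

object RddVsDataFrameSketch {
  // Made-up sample data: (category, amount).
  val sales: Seq[(String, Int)] = Seq(("books", 10), ("games", 5), ("books", 7))

  // Low-level API: we spell out how values are combined per key.
  def totalsWithRdd(spark: SparkSession): Map[String, Int] =
    spark.sparkContext
      .parallelize(sales)
      .reduceByKey(_ + _)
      .collect()
      .toMap

  // High-level API: we declare what we want and let Spark plan the execution.
  def totalsWithDataFrame(spark: SparkSession): Map[String, Long] = {
    import spark.implicits._
    val totals = sales.toDF("category", "amount")
      .groupBy("category")
      .agg(sum("amount").as("total"))

    totals.explain()   // prints the optimized physical plan
    totals.collect().map(row => row.getString(0) -> row.getLong(1)).toMap
  }

  def main(args: Array[String]): Unit = {
    // local[*] is an assumption for running on a single machine.
    val spark = SparkSession.builder().appName("rdd-vs-df-sketch").master("local[*]").getOrCreate()
    println(totalsWithRdd(spark))
    println(totalsWithDataFrame(spark))
    spark.stop()
  }
}
```

Both versions give the same totals. The RDD version works on untyped tuples by position, while the DataFrame version names its columns and goes through the optimizer.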
05
Advantages of Apache Spark
• In-memory computation
• Speed
• Ease of use
• Advanced analytics
• Fault tolerance
• Multi-language support
06
Disadvantages of Apache Spark
• Small files issue
• No built-in file management system (Spark relies on external storage such as HDFS)
• No automatic optimization for RDD-based code
• Fewer algorithms (in MLlib)
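One common way to limit the number of small output files, which the slides do not cover, is to reduce the number of partitions before writing, since each partition is typically written as its own part file. The sketch below assumes a local run and only shows the partition counts; the object name is hypothetical.

```scala
import org.apache.spark.sql.SparkSession

object SmallFilesSketch {
  // Returns the partition count before and after coalescing.
  def partitionCounts(spark: SparkSession): (Int, Int) = {
    val many = spark.sparkContext.parallelize(1 to 100, numSlices = 8)

    // coalesce merges partitions without a full shuffle; writing `fewer`
    // would produce at most 2 part files instead of 8.
    val fewer = many.coalesce(2)

    (many.getNumPartitions, fewer.getNumPartitions)
  }

  def main(args: Array[String]): Unit = {
    // local[*] is an assumption for running on a single machine.
    val spark = SparkSession.builder().appName("small-files-sketch").master("local[*]").getOrCreate()
    println(partitionCounts(spark))
    spark.stop()
  }
}
```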
07