Apache Spark Usage in the Open Source Ecosystem

Jun 9, 2016

The document discusses the usage statistics and library integration of Apache Spark among users of Databricks, highlighting user demographics and language preferences. Users have extensively adopted external libraries for ETL, visualization, and advanced analytics across Python, Scala, and R. The summary indicates a strong trend towards mixing languages and packages, showcasing the flexibility of Apache Spark in an open-source ecosystem.

Apache Spark Usage in the Open Source Ecosystem
Hossein Falaki
@mhfalaki

About me
• Software Engineer / part-time Data Scientist at Databricks
• I have been using Apache Spark since version 0.6
• Developed the first version of the Apache Spark CSV data source
• Worked on SparkR and R notebooks at Databricks

Apache Spark Philosophy
1. Unified engine: support end-to-end applications
2. High-level APIs: easy to use, rich optimizations
3. Integrate broadly: storage systems, libraries, etc.
SQL, Streaming, ML, Graph, …

Databricks Community Edition
• In February, Databricks launched a free version of its cloud-based platform in beta
• Since then, more than 8,000 users have registered
• Users have created over 61,000 notebooks in different languages
• This is an analysis of the third-party libraries that our beta users imported to complement Apache Spark in Scala, Python, and R
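The slides do not say how imports were extracted from notebooks. As a minimal sketch of one possible approach for Python cells, the standard `ast` module can list the top-level modules a piece of source imports. The function names, the `SPARK_PACKAGES` set, and the rule that anything other than `pyspark` counts as "external" are assumptions made for illustration only.

```python
import ast

# Assumption: every module other than Spark's own Python package is "external".
SPARK_PACKAGES = {"pyspark"}


def imported_modules(source):
    """Return the top-level module names imported by a piece of Python source."""
    modules = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            # "import a.b as c" contributes the top-level package "a"
            for alias in node.names:
                modules.add(alias.name.split(".")[0])
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # "from a.b import c" contributes "a"; relative imports are skipped
            modules.add(node.module.split(".")[0])
    return modules


def external_libraries(source):
    """Return the imported modules that are not part of Spark itself."""
    return imported_modules(source) - SPARK_PACKAGES


# Hypothetical notebook cell, not taken from the analysed data.
cell = """
import re
import numpy as np
from pyspark.sql import functions as F
from matplotlib import pyplot as plt
"""

print(sorted(external_libraries(cell)))
```

Note that `ast.parse` only accepts source that is valid for the Python version running the analysis, so cells written for another Python version would need separate handling.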

What % of users use other libraries
Language   % of users importing external libs   Average # libs   Median # libs
Python     75%                                  9                2
Scala      55%                                  3                1
R          57%                                  6                1
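In every row the average is well above the median, which is what a skewed distribution looks like: most users import a few libraries and a small number import many. The sketch below illustrates the effect with made-up per-user counts (not the real data), chosen so that the result happens to match the Python row.

```python
from statistics import mean, median


def summarize(libs_per_user):
    """Return (average, median) of the number of libraries each user imports."""
    return mean(libs_per_user), median(libs_per_user)


# Hypothetical counts: seven typical users and one heavy user.
libs_per_user = [1, 1, 2, 2, 2, 3, 4, 57]

avg, med = summarize(libs_per_user)
print("average:", avg, "median:", med)
```

A single user with 57 libraries is enough to pull the average up to 9 while the median stays at 2.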

What are these? (Python)
ETL
• re
• datetime
• pandas
• json
• csv
• string
• math / operator
• urllib / urllib2
Visualization
• matplotlib
• ggplot
• seaborn
Advanced analytics
• numpy
• sklearn
• graphframes
• tensorflow
• scipy
Other
• test_helper
• os
• md5
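Several of the ETL modules listed above (`re`, `datetime`, `json`) tend to appear together in small record-parsing helpers. The following is a sketch of that pattern only; the log format, the `LINE` pattern, and the `parse_line` name are hypothetical and do not come from the analysed notebooks.

```python
import json
import re
from datetime import datetime

# Hypothetical line format: "<date> <time> <LEVEL> <json payload>"
LINE = re.compile(r"^(?P<ts>\S+ \S+) (?P<level>[A-Z]+) (?P<payload>\{.*\})$")


def parse_line(line):
    """Parse one log line into a dict, or return None if it does not match."""
    match = LINE.match(line.strip())
    if match is None:
        return None
    return {
        "timestamp": datetime.strptime(match.group("ts"), "%Y-%m-%d %H:%M:%S"),
        "level": match.group("level"),
        # json.loads raises ValueError if the payload is not valid JSON
        "payload": json.loads(match.group("payload")),
    }


record = parse_line('2016-06-09 10:15:00 INFO {"user": 42, "action": "import"}')
print(record)
```

In a Spark job, a helper like this would typically be applied to each input line, for example through an RDD `map`, with unparseable lines filtered out afterwards.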

What are these? (Scala)
ETL
• java/scala util
• scala.collection
• scala.math
• java.{io, nio}
• java.text
• org.apache.commons
• kafka
• twitter4j
Visualization
• ?
Advanced analytics
• spark.ml
• graphframes
Other
• java.net
• scala.sys

What are these? (R)
ETL
• dplyr
• plyr
• reshape2
• jsonlite
• tidyr
• lubridate
• httr
• data.table
Visualization
• ggplot2
• beanplot
• plotly
• ...
Advanced analytics
• sparkr
• h2o
• caret
• e1071
Other
• devtools
• magrittr

Languages have unique features
[Figure: language combinations used together (Scala / Python / R)]
• 25% of users use multiple languages
• 3% of notebooks mix different languages
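The slides do not describe how mixed-language notebooks were identified. In Databricks notebooks a cell can switch language with a magic command such as `%python`, `%scala`, or `%r` on its first line, so one possible approach is to collect the languages seen across a notebook's cells. The sketch below assumes a notebook is available as a default language plus a list of cell sources; that representation and the function names are hypothetical.

```python
# Magic commands that switch a cell's language. Only the three languages
# analysed in the slides are tracked here (an assumption of this sketch).
LANGUAGE_MAGICS = {"%python": "python", "%scala": "scala", "%r": "r"}


def languages_used(default_language, cells):
    """Return the set of tracked languages used across a notebook's cells."""
    languages = set()
    for cell in cells:
        tokens = cell.split(None, 1)
        if not tokens:
            continue  # empty cell
        first = tokens[0]
        if first in LANGUAGE_MAGICS:
            languages.add(LANGUAGE_MAGICS[first])
        elif not first.startswith("%"):
            # no magic command: the cell runs in the notebook's default language
            languages.add(default_language)
        # any other magic (for example %md or %sh) is ignored
    return languages


def mixes_languages(default_language, cells):
    """Return True if the notebook uses more than one tracked language."""
    return len(languages_used(default_language, cells)) > 1


# Hypothetical notebook: Python by default, with one R cell and one Markdown cell.
cells = [
    "df = spark.table('events')",
    "%r\nlibrary(ggplot2)",
    "%md Notes about the plot",
]

print(sorted(languages_used("python", cells)))
print(mixes_languages("python", cells))
```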

Summary
• Spark users extensively mix it with other packages in different languages
– One of the goals of the Spark project is to work well with other projects
• ETL-related libraries are the most popular category
– Opportunities for new data sources
• Notebooks are being used for “small data” as well as “big data”
• Languages and their ecosystems have diverse capabilities. Users seem to be mixing languages to their advantage
– Scala is missing visualization libraries

Try your favorite library in Databricks
http://databricks.com/ce
Try the latest version of Apache Spark and a preview of Spark 2.0

More Related Content

What's hot (20)

Performance Optimization Case Study: Shattering Hadoop's Sort Record with Spa...Databricks

Large-Scale Data Science in Apache Spark 2.0Databricks

Spark Summit San Francisco 2016 - Ali Ghodsi KeynoteDatabricks

Spark Meetup at UberDatabricks

Distributed ML in Apache SparkDatabricks

Tuning and Monitoring Deep Learning on Apache SparkDatabricks

Not your Father's Database: Not Your Father’s Database: How to Use Apache® Sp...Databricks

Spark Under the Hood - Meetup @ Data Science LondonDatabricks

Jump Start with Apache Spark 2.0 on DatabricksAnyscale

Enabling exploratory data science with Spark and RDatabricks

Jump Start with Apache Spark 2.0 on DatabricksDatabricks

Project Hydrogen: State-of-the-Art Deep Learning on Apache SparkDatabricks

Spark Summit EU talk by Shay Nativ and Dvir VolkSpark Summit

A look under the hood at Apache Spark's API and engine evolutionsDatabricks

What's New in Apache Spark 2.3 & Why Should You CareDatabricks

Apache® Spark™ 1.6 presented by Databricks co-founder Patrick WendellDatabricks

Operational Tips for Deploying SparkDatabricks

Jump Start on Apache® Spark™ 2.x with Databricks Databricks

Extending Apache Spark APIs Without Going Near Spark Source or a Compiler wi...Databricks

Spark Summit EU talk by Tim HunterSpark Summit

Performance Optimization Case Study: Shattering Hadoop's Sort Record with Spa...Databricks

Large-Scale Data Science in Apache Spark 2.0Databricks

Spark Summit San Francisco 2016 - Ali Ghodsi KeynoteDatabricks

Spark Meetup at UberDatabricks

Distributed ML in Apache SparkDatabricks

Tuning and Monitoring Deep Learning on Apache SparkDatabricks

Not your Father's Database: Not Your Father’s Database: How to Use Apache® Sp...Databricks

Spark Under the Hood - Meetup @ Data Science LondonDatabricks

Jump Start with Apache Spark 2.0 on DatabricksAnyscale

Enabling exploratory data science with Spark and RDatabricks

Jump Start with Apache Spark 2.0 on DatabricksDatabricks

Project Hydrogen: State-of-the-Art Deep Learning on Apache SparkDatabricks

Spark Summit EU talk by Shay Nativ and Dvir VolkSpark Summit

A look under the hood at Apache Spark's API and engine evolutionsDatabricks

What's New in Apache Spark 2.3 & Why Should You CareDatabricks

Apache® Spark™ 1.6 presented by Databricks co-founder Patrick WendellDatabricks

Operational Tips for Deploying SparkDatabricks

Jump Start on Apache® Spark™ 2.x with Databricks Databricks

Extending Apache Spark APIs Without Going Near Spark Source or a Compiler wi...Databricks

Spark Summit EU talk by Tim HunterSpark Summit

Viewers also liked (20)

RISELab: Enabling Intelligent Real-Time Decisions keynote by Ion StoicaSpark Summit

Trends for Big Data and Apache Spark in 2017 by Matei ZahariaSpark Summit

Introduction to Apache Spark EcosystemBojan Babic

This document introduces Apache Spark, an open-source cluster computing system that provides fast, general execution engines for large-scale data processing. It summarizes key Spark concepts including resilient distributed datasets (RDDs) that let users spread data across a cluster, transformations that operate on RDDs, and actions that return values to the driver program. Examples demonstrate how to load data from files, filter and transform it using RDDs, and run Spark programs on a local or cluster environment.

"Spark Summit 2016: Trends & Insights" -- Zurich Spark Meetup, July 2016René Pfitzner

The document discusses trends and insights from the Spark Summit 2016, highlighting advancements in Spark 2.0 including improvements in RDDs, DataFrames, and Datasets. It also covers emerging trends such as Streaming 2.0, GraphFrames, and the growing importance of deep-learning technologies. Additionally, it shares best practices for utilizing Spark for real-time computations and the importance of community resources.

Introduction to HiveUday Vakalapudi

This document provides an introduction to Hive, including: - What Hive is and why it is used to run SQL queries on Hadoop data as MapReduce jobs. - Hive's logical table/physical location/data format architecture. - An overview of Hive's architecture and metastore configuration. - A comparison of Hive's schema-on-read approach versus traditional databases' schema-on-write. - Descriptions of Hive's data types and table types, including managed and external tables.

Apache spark sneha challa- google pittsburgh-aug 25thSneha Challa

The document is a presentation about Apache Spark given on August 25th, 2015 in Pittsburgh by Sneha Challa. It introduces Spark as a fast and general cluster computing engine for large-scale data processing. It discusses Spark's Resilient Distributed Datasets (RDDs) and transformations/actions. It provides examples of Spark APIs like map, reduce, and explains running Spark on standalone, Mesos, YARN, or EC2 clusters. It also covers Spark libraries like MLlib and running machine learning algorithms like k-means clustering and logistic regression.

Spark is going to replace Apache Hadoop! Know Why?Edureka!

The document discusses how Spark is emerging to replace Hadoop for big data processing. It notes that Hadoop MapReduce is limited to batch processing and is not fast enough for real-time processing needs. In contrast, Spark is up to 100 times faster than Hadoop MapReduce, supports both batch and real-time processing, and stores data in memory for faster analysis. A survey is cited showing increasing adoption of Spark over Hadoop in industries handling large volumes of data. The document concludes that while Hadoop will still be used, Spark will replace Hadoop MapReduce as the primary framework for big data applications due to its ability to support real-time processing demands.

Big data spain keynote nov 2016alanfgates

The document outlines the evolution and advancements in the Apache Hadoop ecosystem from its inception in 2006 to 2016, highlighting significant projects like Apache Hive and the improvements in performance and scalability through features like LLAP. It discusses the transition of Hadoop into cloud environments, focusing on architecture, resource management, data governance, and the advantages of cloud storage. Additionally, it emphasizes the importance of enhancing performance through caching and managing workloads effectively within cloud infrastructure.

Hive ACID Apache BigData 2016alanfgates

The document discusses the integration of ACID (Atomicity, Consistency, Isolation, Durability) properties into Apache Hive, highlighting its historical limitations and the need for concurrent data updates. It outlines the SQL changes for managing transactions, data ingestion from streams, and the advantages of using Hive over HBase for transactional processing. Future improvements and unresolved issues are also presented, emphasizing a transition towards more efficient data handling and user experience.

Taboola's experience with Apache Spark (presentation @ Reversim 2014)tsliwowicz

Taboola utilizes Apache Spark to process terabytes of data daily for real-time content recommendations and analytics, handling over 3 billion daily recommendations. The company benefits from Spark's in-memory computing capabilities, facilitating faster data processing compared to traditional Hadoop methods. Key technologies employed include Mesos for resource management, Cassandra for data storage, and various monitoring tools to optimize Spark's performance.

Apache Spark 101Abdullah Çetin ÇAVDAR

This document provides an overview of Apache Spark, including its goal of providing a fast and general engine for large-scale data processing. It discusses Spark's programming model, components like RDDs and DAGs, and how to initialize and deploy Spark on a cluster. Key aspects covered include RDDs as the fundamental data structure in Spark, transformations and actions, and storage levels for caching data in memory or disk.

2016 spark surveyAbhishek Choudhary

The 2016 Apache Spark Survey highlights continued growth in the Spark community with increased user adoption across various industries, particularly in cloud deployments, streaming, and machine learning. Over 900 organizations participated, showcasing the growing importance of Spark for building complex solutions and real-time applications. Key trends include higher usage of multiple components, programming languages, and a shift towards public cloud implementations for Spark deployments.

Big data Processing with Apache Spark & ScalaEdureka!

Big Data Trend with Open PlatformJongwook Woo

The document is a presentation by Jongwook Woo from the High-Performance Information Computing Center (HiPIC) at California State University Los Angeles given on February 25, 2017 at the SWRC conference in San Diego, CA. It discusses big data trends with open platforms and provides information on Spark, Hadoop, open data, use cases, and the future of big data. Specifically, it summarizes Jongwook Woo's background and experience, describes what big data is and how Spark improves on Hadoop MapReduce, discusses how Spark can integrate with Hadoop ecosystems, and provides examples of analyzing local business data using Spark.

Data Science with Apache Spark - Crash Course - HS16SJDataWorks Summit/Hadoop Summit

The document provides an overview of machine learning concepts and techniques using Apache Spark. It discusses supervised and unsupervised learning methods like classification, regression, clustering and collaborative filtering. Specific algorithms like k-means clustering, decision trees and random forests are explained. It also introduces Apache Spark MLlib and how to build machine learning pipelines and models with Spark ML APIs.

PySpark Best PracticesCloudera, Inc.

This document discusses best practices for using PySpark. It covers: - Core concepts of PySpark including RDDs and the execution model. Functions are serialized and sent to worker nodes using pickle. - Recommended project structure with modules for data I/O, feature engineering, and modeling. - Writing testable, serializable code with static methods and avoiding non-serializable objects like database connections. - Tips for testing like unit testing functions and integration testing the full workflow. - Best practices for running jobs like configuring the Python environment, managing dependencies, and logging to debug issues.

Big Data Day LA 2016 Keynote - Reynold Xin/ DatabricksData Con LA

Apache Spark Usage in the Open Source Ecosystem

1. Apache Spark Usage in the Open Source Ecosystem
Hossein Falaki
@mhfalaki

2. About me
• Software Engineer / part-time Data Scientist at Databricks
• I have used Apache Spark since version 0.6
• Developed the first version of the Apache Spark CSV data source
• Worked on SparkR and R notebooks at Databricks

3. Stack Overflow 2016 trending tech

4. Apache Spark Philosophy
1) Unified engine: support end-to-end applications
2) High-level APIs: easy to use, rich optimizations
3) Integrate broadly: storage systems, libraries, etc.
Components: SQL, Streaming, ML, Graph, …

5. Databricks Community Edition
• In February, Databricks launched a free version of its cloud-based platform in beta
• Since then, more than 8,000 users have registered
• Users have created over 61,000 notebooks in different languages
• This is an analysis of the third-party libraries that our beta users imported to complement Apache Spark in Scala, Python, and R

6. What % of users use other libraries

Language   % users importing external libs   Average # libs   Median # libs
Python     75%                               9                2
Scala      55%                               3                1
R          57%                               6                1
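The slides do not say how these figures were computed. As a rough illustration only, the sketch below shows one way per-user statistics of this kind could be derived from Python notebook source. Everything in it is an assumption: the regex-based import extraction, the `SPARK_PACKAGES` set, the function names, and the choice to compute the average and median over only those users who import at least one external library.

```python
import re
from statistics import mean, median

# Packages treated as part of Spark itself rather than external (assumption).
SPARK_PACKAGES = {"pyspark"}

# Captures the top-level package of "import x" / "from x import y" lines.
# Simplification: for "import a, b" only "a" is captured.
IMPORT_RE = re.compile(r"^\s*(?:import|from)\s+([A-Za-z_]\w*)", re.MULTILINE)


def external_libs(source):
    """Return the set of non-Spark top-level packages imported in the source."""
    return {name for name in IMPORT_RE.findall(source) if name not in SPARK_PACKAGES}


def usage_stats(notebooks_by_user):
    """Summarize library usage; input maps a user id to a list of notebook sources."""
    counts = []
    for sources in notebooks_by_user.values():
        libs = set()
        for src in sources:
            libs |= external_libs(src)
        counts.append(len(libs))

    importing = [c for c in counts if c > 0]
    if not importing:
        return {"pct_users_importing": 0.0, "average_libs": 0.0, "median_libs": 0.0}
    return {
        "pct_users_importing": 100.0 * len(importing) / len(counts),
        "average_libs": mean(importing),
        "median_libs": median(importing),
    }


# Made-up example data, not taken from the talk.
demo = {
    "u1": ["import pandas as pd\nfrom pyspark.sql import functions as F\nimport numpy"],
    "u2": ["from pyspark import SparkContext"],
    "u3": ["import re\n", "import re\nimport json\nimport os.path\nimport math"],
    "u4": ["import matplotlib.pyplot as plt"],
}
stats = usage_stats(demo)
```

An average well above the median, as in the Python row (9 versus 2), is what a skewed distribution produces: a few users importing many libraries pull the mean up while the typical user imports only one or two.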

7. Installing libraries is easy

8. Python Packages

9. Most popular Python packages

10. What is test_helper?

11. What are these?
ETL
• re
• datetime
• pandas
• json
• csv
• string
• math / operator
• urllib / urllib2
Visualization
• matplotlib
• ggplot
• seaborn
Advanced analytics
• numpy
• sklearn
• graphframes
• tensorflow
• scipy
Other
• test_helper
• os
• md5
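The grouping above can be expressed as a simple lookup table. The following is a minimal sketch, assuming a list of imported package names is already available; the category assignments are copied from the slide, while the `Uncategorized` fallback and the function name are hypothetical additions for illustration.

```python
from collections import Counter

# Category assignments as listed on the slide.
CATEGORIES = {
    "ETL": ["re", "datetime", "pandas", "json", "csv", "string",
            "math", "operator", "urllib", "urllib2"],
    "Visualization": ["matplotlib", "ggplot", "seaborn"],
    "Advanced analytics": ["numpy", "sklearn", "graphframes", "tensorflow", "scipy"],
    "Other": ["test_helper", "os", "md5"],
}

# Invert to a package -> category lookup.
PACKAGE_TO_CATEGORY = {pkg: cat for cat, pkgs in CATEGORIES.items() for pkg in pkgs}


def category_counts(imported_packages):
    """Count imports per category; packages not on the slide are 'Uncategorized'."""
    return Counter(PACKAGE_TO_CATEGORY.get(p, "Uncategorized") for p in imported_packages)


# Made-up example input.
counts = category_counts(["pandas", "numpy", "re", "seaborn", "requests"])
```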

12. Python package categories

13. What packages go together?

14. Scala Packages

15. Most popular Scala libraries

16. What are these?
ETL
• java/scala util
• scala.collection
• scala.math
• java.{io, nio}
• java.text
• o.a.commons
• kafka
• twitter4j
Visualization
• ?
Advanced analytics
• spark.ml
• graphframes
Other
• java.net
• scala.sys

17. Scala package categories

18. What libraries go together?

19. R Packages

20. Most popular R packages

21. What are these?
ETL
• dplyr
• plyr
• reshape2
• jsonlite
• tidyr
• lubridate
• httr
• data.table
Visualization
• ggplot2
• beanplot
• plotly
• ...
Advanced analytics
• sparkr
• h2o
• caret
• e1071
Other
• devtools
• magrittr

22. R package categories

23. Comparing Python, Scala & R

24. Languages have unique features
[Figure labels: Scala / Python / R; R / Python; Scala / Python / R]
• 25% of users use multiple languages
• 3% of notebooks mix different languages

25. Summary
• Spark users extensively mix it with other packages in different languages
  – One of the goals of the Spark project is to work well with other projects
• ETL-related libraries are the most popular category
  – Opportunities for new data sources
• Notebooks are being used for “small data” as well as “big data.”
• Languages and their ecosystems have diverse capabilities. Users seem to be mixing languages to their advantage
  – Scala is missing visualization libraries

26. Try your favorite library in Databricks
http://databricks.com/ce
Try the latest version of Apache Spark and a preview of Spark 2.0

27. Thank you!

28. What packages are used together?