Koalas: Unifying Spark and pandas APIs
Spark + AI Summit Europe 2019
Tim Hunter
Takuya Ueshin
About
Takuya Ueshin, Software Engineer at Databricks
• Apache Spark committer and PMC member
• Focusing on Spark SQL and PySpark
• Koalas committer
Tim Hunter, Software Engineer at Databricks
• Co-creator of the Koalas project
• Contributes to the Apache Spark MLlib, GraphFrames, TensorFrames and Deep Learning Pipelines libraries
• Ph.D. in Machine Learning from Berkeley, M.S. in Electrical Engineering from Stanford
Outline
● pandas vs Spark at a high level
● why Koalas (combine everything in one package)
○ key differences
● current status & new features
● demo
● technical topics
○ InternalFrame
○ Operations on different DataFrames
○ Default Index
● roadmap
Typical journey of a data scientist
Education (MOOCs, books, universities) → pandas
Analyze small data sets → pandas
Analyze big data sets → DataFrame in Spark
pandas
Authored by Wes McKinney in 2008
The standard tool for data manipulation and analysis in Python
Deeply integrated into Python data science ecosystem, e.g. numpy, matplotlib
Handles a wide range of situations, including:
- basic statistical analysis
- handling missing data
- time series, categorical variables, strings
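A tiny illustration of two of the situations above, missing data and time series, runnable with stock pandas (the data here is invented for the example):

```python
import pandas as pd
import numpy as np

# Handling missing data: NaN-aware fill.
s = pd.Series([1.0, np.nan, 3.0])
filled = s.fillna(0.0)

# Time series: a daily DatetimeIndex.
ts = pd.date_range("2019-01-01", periods=3, freq="D")
```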
Apache Spark
De facto unified analytics engine for large-scale data processing
(Streaming, ETL, ML)
Originally created at UC Berkeley by Databricks’ founders
PySpark API for Python; also API support for Scala, R and SQL
pandas DataFrame vs Spark DataFrame

|                | pandas DataFrame           | Spark DataFrame                                        |
|----------------|----------------------------|--------------------------------------------------------|
| Column         | df['col']                  | df['col']                                              |
| Mutability     | Mutable                    | Immutable                                              |
| Add a column   | df['c'] = df['a'] + df['b'] | df.withColumn('c', df['a'] + df['b'])                 |
| Rename columns | df.columns = ['a','b']     | df.select(df['c1'].alias('a'), df['c2'].alias('b'))    |
| Value count    | df['col'].value_counts()   | df.groupBy(df['col']).count().orderBy('count', ascending=False) |
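For instance, the value-count row: pandas' value_counts already returns counts sorted in descending order, which the Spark groupBy/count/orderBy chain has to reproduce explicitly (illustrative data):

```python
import pandas as pd

df = pd.DataFrame({'col': ['a', 'b', 'a', 'a', 'c']})
# Sorted descending by count, matching orderBy('count', ascending=False) in Spark.
counts = df['col'].value_counts()
```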
A short example

pandas:

```python
import pandas as pd
df = pd.read_csv("my_data.csv")
df.columns = ['x', 'y', 'z1']
df['x2'] = df.x * df.x
```

The same in PySpark:

```python
df = (spark.read
      .option("inferSchema", "true")
      .option("comment", True)
      .csv("my_data.csv"))
df = df.toDF('x', 'y', 'z1')
df = df.withColumn('x2', df.x * df.x)
```
Koalas
Announced April 24, 2019
A pure Python library
Aims to provide the pandas API on top of Apache Spark:
- unifies the two ecosystems with a familiar API
- seamless transition between small and large data
Quickly gaining traction
Bi-weekly releases!
- > 500 patches merged since the announcement
- > 20 significant contributors outside of Databricks
- > 8k daily downloads
A short example

pandas:

```python
import pandas as pd
df = pd.read_csv("my_data.csv")
df.columns = ['x', 'y', 'z1']
df['x2'] = df.x * df.x
```

Koalas:

```python
import databricks.koalas as ks
df = ks.read_csv("my_data.csv")
df.columns = ['x', 'y', 'z1']
df['x2'] = df.x * df.x
```
Koalas
- Provides discoverable APIs for common data science tasks (i.e., follows pandas)
- Unifies the pandas API and Spark API, but pandas first
- Implements the pandas APIs that are appropriate for distributed datasets
- Easy conversion from/to a pandas DataFrame or numpy array
Key Differences
Spark is lazier by nature:
- most operations only happen when displaying or writing a DataFrame
Spark does not maintain row order
Performance characteristics differ when working at scale
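Spark's laziness can be pictured with a plain Python generator; this is only an analogy, not Spark itself:

```python
def transform(rows):
    # Nothing executes when the generator is created...
    for r in rows:
        yield r * 2

pipeline = transform(range(3))   # builds a "plan"; no work done yet
result = list(pipeline)          # the "action": computation happens now
```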
Current status
Bi-weekly releases, very active community with daily changes
The most common functions have been implemented:
- 60% of the DataFrame / Series API
- 60% of the DataFrameGroupBy / SeriesGroupBy API
- 15% of the Index / MultiIndex API
- to_datetime, get_dummies, …
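As a reminder of what the last two functions do, here are the pandas versions, which the Koalas versions mirror (illustrative inputs):

```python
import pandas as pd

parsed = pd.to_datetime("2019-10-15")                  # string -> Timestamp
dummies = pd.get_dummies(pd.Series(['a', 'b', 'a']))   # one-hot encoding
```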
New features
- 80% of the plot functions (0.16.0-)
- Spark related functions (0.8.0-)
- IO: to_parquet/read_parquet, to_csv/read_csv,
to_json/read_json, to_spark_io/read_spark_io,
to_delta/read_delta, ...
- SQL
- cache
- Support for multi-index columns (90%) (0.16.0-)
- Options to configure Koalas’ behavior (0.17.0-)
Demo
How Virgin Hyperloop One reduced processing time from hours to minutes with Koalas
- Challenge: increasing scale and complexity of data operations
- Struggling with the "Spark switch" from pandas
- Result: more than 10X faster with less than 1% code changes
InternalFrame
Internal immutable metadata:
- holds the current Spark DataFrame
- manages the mapping from Koalas column names to Spark column names
- manages the mapping from Koalas index names to Spark column names
- converts between Spark DataFrame and pandas DataFrame
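A drastically simplified sketch of the idea; the names and structure here are hypothetical, not Koalas' actual implementation. Immutable metadata maps Koalas-level names onto an underlying frame, and API calls copy it rather than mutate it:

```python
class SimpleInternalFrame:
    """Toy stand-in for Koalas' InternalFrame (illustrative only)."""

    def __init__(self, sdf, column_index, index_map):
        self.sdf = sdf                    # stands in for the Spark DataFrame
        self.column_index = column_index  # Koalas column name -> storage column
        self.index_map = index_map        # Koalas index name -> storage column

    def copy(self, **updates):
        # API calls never mutate in place; they return a copy with new state.
        state = {"sdf": self.sdf,
                 "column_index": self.column_index,
                 "index_map": self.index_map}
        state.update(updates)
        return SimpleInternalFrame(**state)

frame = SimpleInternalFrame({"c0": [1, 2]}, {"x": "c0"}, {})
# A set_index-style call: only the metadata changes; the data does not move.
reindexed = frame.copy(column_index={}, index_map={"x": "c0"})
```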
InternalFrame
[Diagram: a Koalas DataFrame holds an InternalFrame (column_index, index_map), which wraps the underlying Spark DataFrame.]
InternalFrame
[Diagram: an API call copies the InternalFrame with new state (new column_index, index_map, and Spark DataFrame), and the copy backs a new Koalas DataFrame.]
InternalFrame
[Diagram: kdf.set_index(...) copies the InternalFrame with a new index_map but keeps the same Spark DataFrame; the API call only updates metadata.]
InternalFrame
[Diagram: with inplace=True, e.g. kdf.dropna(..., inplace=True), the copied InternalFrame with its new state is re-attached to the same Koalas DataFrame.]
Operations on different DataFrames
We only allow Series derived from the same DataFrame by default.

OK:
- df.a + df.b
- df['c'] = df.a * df.b

Not OK:
- df1.a + df2.b
- df1['c'] = df2.a * df2.b
Operations on different DataFrames
We only allow Series derived from the same DataFrame by default. What would the equivalent Spark code be?

OK:
- df.a + df.b → sdf.select(sdf['a'] + sdf['b'])

Not OK:
- df1.a + df2.b → sdf1.select(sdf1['a'] + sdf2['b']) raises AnalysisException!
Operations on different DataFrames
After ks.set_option('compute.ops_on_diff_frames', True), both cases are OK; under the hood, Koalas joins the underlying Spark DataFrames on the index:

- df.a + df.b → sdf.select(sdf['a'] + sdf['b'])
- df1.a + df2.b → sdf1.join(sdf2, on="_index_").select('a * b')
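pandas performs the same index alignment natively, which makes the effect of the hidden join easy to see (illustrative data):

```python
import pandas as pd

df1 = pd.DataFrame({'a': [1, 2, 3]})
df2 = pd.DataFrame({'b': [10, 20, 30]})

# pandas aligns the two Series on their shared index automatically,
# conceptually the same join on the index that Koalas performs when
# compute.ops_on_diff_frames is enabled.
result = df1['a'] * df2['b']
```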
Default Index
Koalas manages a group of columns as the index, and the index behaves the same as in pandas.
If no index is specified when creating a Koalas DataFrame, a "default index" is attached automatically.
Each "default index" type has pros and cons.
Default Indexes
Configurable via the option "compute.default_index_type".
See also: https://koalas.readthedocs.io/en/latest/user_guide/options.html#default-index-type

| default index type   | requires collecting data into a single node | requires shuffle | continuous increments |
|----------------------|---------------------------------------------|------------------|-----------------------|
| sequence             | YES                                         | YES              | YES                   |
| distributed-sequence | NO                                          | YES / NO         | YES / NO              |
| distributed          | NO                                          | NO               | NO                    |
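A rough sketch of why a distributed-sequence style index can stay continuous without collecting the data to one node: only per-partition sizes need to be gathered centrally. This is a toy model of the idea, not Koalas' implementation:

```python
# Rows split across three "partitions" (toy stand-ins for Spark partitions).
partitions = [[10, 20], [30], [40, 50, 60]]

# Step 1: only the partition *sizes* travel to the driver, not the rows.
sizes = [len(p) for p in partitions]
offsets = [sum(sizes[:i]) for i in range(len(sizes))]

# Step 2: each partition numbers its own rows from its offset, in parallel,
# producing a globally continuous 0..n-1 index without a full collect.
indexed = [(offset + i, row)
           for offset, part in zip(offsets, partitions)
           for i, row in enumerate(part)]
```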
What to expect?
• Improve pandas API coverage
- rolling/expanding
• Support categorical data types
• More time-series related functions
• Improve performance
- Minimize the overhead at Koalas layer
- Optimal implementation of APIs
Getting started
pip install koalas
conda install koalas
Docs: https://koalas.readthedocs.io/en/latest/
Updates: github.com/databricks/koalas
A 10-minute tutorial in a live Jupyter notebook is available from the docs.
Do you have suggestions or requests?
Submit requests at github.com/databricks/koalas/issues
It is very easy to contribute:
koalas.readthedocs.io/en/latest/development/contributing.html
Koalas Sessions
Koalas: Pandas on Apache Spark (Tutorial)
- 14:30 - @ROOM: G104
AMA: Koalas
- 16:00 - @EXPO HALL
Thank you! Q&A?
Get Started at databricks.com/try