Pandas & Cloudera: Scaling the Python Data Experience

  • 1. Ibis: Scaling the Python Data Experience
       Wes McKinney, Marcel Kornacker, Justin Erickson, Silvius Rus
       © Cloudera, Inc. All rights reserved.
  • 2. Wes McKinney
       • A key person in building today’s open source Python data community
       • Creator of pandas, a standard Python data wrangling and analytics toolkit used by data scientists
       • Author of best-selling canonical text Python for Data Analysis (2012)
       • Formerly Founder/CEO of DataPad (acquired by Cloudera in 2014)
  • 3. Python is popular…
       • Python has become a standard language of data science
       • Why is it popular?
           • Maximizes productivity for data engineers and data scientists
           • Build robust software and do interactive data analysis with 100% Python code
           • Easy-to-learn and makes happy and productive data teams
           • Large, diverse open source development community
           • Comprehensive libraries: data wrangling, ML, visualization, etc.
       • Main use case: data science & engineering Swiss Army knife on small-to-medium size data
  • 4. …but Python does not scale today
       • Python ecosystem confined to single-node analysis
           • Great for smaller data sets
           • Requires sampling or aggregations for larger data
           • Distributed tools compromise in various ways
       • Extracting samples or aggregations for larger data means:
           • “Scales” by losing more fidelity
           • Additional ETL overhead to extract samples/aggregations
           • Loss of productivity with multiple languages, tools, etc.
           • Blocks certain analysis and use cases
  • 5. Ibis: Same Python, now at scale
       • Target user:
           • Data scientists and data engineers (“Python data users”)
       • Goals:
           • Mirrors single-node Python experience
           • Scales to any node and data size
           • No compromise in functionality or usability
           • Interactive experience at native hardware speeds
  • 6. What’s announced?
       • First public release of Ibis
           • http://ibis-project.org
       • Beta release to Cloudera Labs
       • Inviting usage and community development
       • Apache-licensed open-source
  • 7. Ibis’s Vision
       • Uncompromised Python experience
           • 100% Python end-to-end user workflows
           • Enable integration with the existing Python data ecosystem (pandas, scikit-learn, NumPy, etc.)
       • Interactive at big data scale
           • Full-fidelity analysis without extractions
           • Scalability for big data
           • Native hardware speeds for a broad set of use cases
  • 9. Advantages of our approach
       • Analyze big data 100% in Python, with the same ease as small/medium data on the local filesystem
       • Full-fidelity data access
       • Familiar Python experience and integration with existing Python data libraries
       • Provide a means for Python high performance computing tools to be leveraged at Hadoop-scale
  • 10. Beta 0.3 release
       • High level Python API for describing analytics and ETL that can be executed by Impala
           • Familiar API for users of pandas
           • Comprehensive coverage of operations expressible as relational data flows
       • Integrated tools for managing data in HDFS
       • Simple workflows to query data files in several formats (Parquet, Avro, Text)
       • pandas data interchange
  • 11. Ibis/Impala Joint Roadmap
       • More natural data modeling
           • Complex types support
       • Integration with full Python data ecosystem
           • Advanced analytics + machine learning
           • Enable use of performance computing tools
       • User extensibility with native performance
           • In-memory columnar format
           • Python-to-LLVM IR compilation
       • Workflow and usability tools
  • 12. Benefits of Ibis
       • Maximize developer productivity
           • Mirrors single-node Python experience
           • Solve big data problems without leaving Python
           • Leverage Python skills, ecosystem, and tools
       • Python as first-class language for Hadoop
           • Full-fidelity analysis without extractions
           • Python analysis at any scale
           • Native hardware speeds for a broad set of use cases
  • 13. Thank you
       wes@cloudera.com
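
The Beta 0.3 slide describes a high-level Python API for "describing" analytics that are then executed by Impala. One way to picture that idea is a deferred expression: Python method calls build up a description of the query, and nothing runs until the description is compiled to SQL for the engine. The sketch below is a toy illustration of that pattern only. It is not the Ibis API; the class `TableExpr`, its methods, and the `sales` table with its columns are all hypothetical names invented for this example.

```python
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class TableExpr:
    """Hypothetical deferred query description; NOT the real Ibis API."""

    name: str
    filters: Tuple[str, ...] = ()
    group_keys: Tuple[str, ...] = ()
    aggregates: Tuple[str, ...] = ()

    def filter(self, predicate: str) -> "TableExpr":
        # Return a new expression; the original is left unchanged.
        return replace(self, filters=self.filters + (predicate,))

    def group_by(self, *keys: str) -> "TableExpr":
        return replace(self, group_keys=self.group_keys + keys)

    def aggregate(self, **metrics: str) -> "TableExpr":
        # Each keyword is an output alias, each value a SQL aggregate.
        named = tuple(f"{expr} AS {alias}" for alias, expr in metrics.items())
        return replace(self, aggregates=self.aggregates + named)

    def compile(self) -> str:
        # Only here is the description turned into SQL text that an
        # engine such as Impala could execute.
        columns = list(self.group_keys) + list(self.aggregates)
        sql = f"SELECT {', '.join(columns) or '*'} FROM {self.name}"
        if self.filters:
            sql += " WHERE " + " AND ".join(f"({p})" for p in self.filters)
        if self.group_keys:
            sql += " GROUP BY " + ", ".join(self.group_keys)
        return sql


# Hypothetical table and column names, for illustration only.
expr = (
    TableExpr("sales")
    .filter("year = 2015")
    .group_by("region")
    .aggregate(total="SUM(amount)")
)
print(expr.compile())
```

Because the expression is only a description, the data never has to be pulled onto the local machine to build the query, which is the property the slides refer to as full-fidelity analysis without extractions.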