© Hortonworks Inc. 2011 – 2016.
All Rights Reserved
Enterprise ready Hadoop
clusters on the cloud
Hadoop Summit, Tokyo
October 2016
Hemanth Yamijala, Hortonworks
Agenda
• Overview
– Hortonworks Data Cloud
– Architecture
• Improving enterprise readiness
– Cloud storage
– Governance
– Reliability and fault tolerance
HORTONWORKS DATA CLOUD -
DEMO
Architecture
[Diagram, labelled Amazon Web Services]
• Cloud controller (aka Cloudbreak)
– Cloudbreak services, Cloudbreak DB, Cloudbreak Deployer
– Connectors: AWS, GCE, Azure, OpenStack
– Access tools: Shell, REST API, Web UI
• HDP Cluster: ETL / EDW
– Master group: Hive, Spark
– Ambari, slave group, Blueprint
• HDP Cluster: Analytics
– Master group: LLAP, Zeppelin
– Ambari, slave group, Blueprint
• S3aFileSystem (shown twice in the diagram)
Hortonworks Data Cloud - Summary
• Launch and manage clusters by
workload type
– ETL / EDW, Data science, Business
analytics
• Use highly scalable, durable storage for
data (S3) & metadata (RDS)
• Share data and metadata among
multiple ephemeral clusters
• Scale up and down at the click of a
button
• Secure clusters with IAM roles, security
groups, etc.
Matching Hadoop with the Cloud
Datacenter
• Data Locality
• Consistent Storage
• Single cluster administration
Cloud
• Scalable storage
• Customizability
• Cost effective compute
Goal (both combined)
• Scalable storage with performance and consistency
• Customizability with ease of administration
• Cost effective compute with SLA policies
Cloud Storage access facts
[Diagram: interaction models. Labels: Application, HDFS, Input, Output, tmp, Copy]
• Cloud storage optimizes for scale
– S3 data is replicated for high-scale access and durability
• Data access is remote
– No data locality
– Costlier metadata operations (e.g. hadoop fs -mv is actually a copy and delete)
• Eventual consistency
– The effect of a modification takes time to propagate to all copies
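To make the cost of a move concrete, here is a minimal sketch of a rename emulated as copy and delete. ToyObjectStore and rename_prefix are hypothetical names for an in-memory stand-in, not the S3 or S3AFileSystem API; the request counter only illustrates how the number of remote calls grows with the number of objects.

```python
class ToyObjectStore:
    """In-memory stand-in for an object store with a flat key namespace."""

    def __init__(self):
        self.objects = {}   # key -> bytes
        self.requests = 0   # counts simulated remote calls

    def put(self, key, data):
        self.requests += 1
        self.objects[key] = data

    def copy(self, src, dst):
        self.requests += 1
        self.objects[dst] = self.objects[src]

    def delete(self, key):
        self.requests += 1
        del self.objects[key]

    def list_prefix(self, prefix):
        self.requests += 1
        return sorted(k for k in self.objects if k.startswith(prefix))


def rename_prefix(store, src_prefix, dst_prefix):
    """Emulate a directory rename: copy every object, then delete the original."""
    for key in store.list_prefix(src_prefix):
        new_key = dst_prefix + key[len(src_prefix):]
        store.copy(key, new_key)
        store.delete(key)


store = ToyObjectStore()
for i in range(3):
    store.put(f"tmp/part-{i}", b"data")

store.requests = 0  # only count the calls made by the rename itself
rename_prefix(store, "tmp/", "output/")
print(store.requests)  # 1 listing + 3 copies + 3 deletes
```

On a file system with real directories the same move is a single metadata update, which is why pipelines that write to a temporary location and then rename are costlier on object storage.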
Performance with Scalability
• General strategy: Optimize by workload types
• ETL workloads
– Typical pipeline: Bring in data => Transform => Repair
partitions => Compute statistics
– Multiple metadata calls: Batched and issued in parallel
for performance gains
• DistCp
– Optimized buffer management for transferring large files
– Randomize input to DistCp to avoid hot-spotting S3 nodes
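The two ideas above, issuing metadata calls in parallel and randomizing the copy order, can be sketched as follows. This is an illustration under stated assumptions, not the Hive or DistCp implementation: fetch_status is a hypothetical stand-in for one remote metadata call, and the bucket name and paths are made up.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def fetch_status(path):
    """Hypothetical stand-in for one remote metadata call (e.g. a file status lookup)."""
    return {"path": path, "length": len(path)}


def fetch_statuses_parallel(paths, max_workers=8):
    """Issue the metadata calls concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch_status, paths))


def shuffled_copy_list(paths, seed=None):
    """Return the copy list in random order so neighbouring keys are not copied together."""
    shuffled = list(paths)
    random.Random(seed).shuffle(shuffled)
    return shuffled


# Made-up paths that share a long common prefix, as date-partitioned data often does.
paths = [f"s3a://example-bucket/logs/2016-10-{day:02d}/part-0" for day in range(1, 11)]

statuses = fetch_statuses_parallel(paths)
copy_order = shuffled_copy_list(paths, seed=42)
```

Threads suit this kind of work because each call mostly waits on the network. Shuffling spreads consecutive requests across key prefixes instead of sending a run of adjacent keys to the same place.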
Performance with Scalability
• Analytics workloads – ORC file related optimizations
– Support fast random access reads (both directions) by
avoiding tearing down S3 HTTP connections
– Pass index information to compute tasks as part of split data
to avoid re-computation
• Ref: http://hortonworks.github.io/hdp-aws/s3-performance/index.html
• Status: Available, but performance optimizations never stop
Correctness with strong consistency
• A read that follows a write may not see the result of that write
– Causes issues for data pipelines, multi-stage jobs, etc.
• S3Guard project: Intermediate, consistent metadata store
• Write calls from S3AFileSystem update both S3 and metadata store
• S3AFileSystem automatically tries to reconcile metadata between
S3 and metadata store on subsequent reads
– Inconsistencies are handled based on policy
• Ref: https://issues.apache.org/jira/browse/HADOOP-13345
• Status: In progress
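The idea of a consistent metadata store sitting beside an eventually consistent object store can be illustrated with a toy model. This is not the S3Guard code: LaggingStore and GuardedClient are hypothetical classes, and the reconciliation policy used here (trust the metadata store for keys the listing has not caught up on) is only one assumed policy.

```python
class LaggingStore:
    """Toy object store whose listings lag behind writes until settle() is called."""

    def __init__(self):
        self._objects = {}
        self._listed = set()

    def put(self, key, data):
        self._objects[key] = data          # the object is stored right away...

    def settle(self):
        self._listed = set(self._objects)  # ...but only listed once the store settles

    def list_keys(self):
        return set(self._listed)


class GuardedClient:
    """Client that records every write in a consistent metadata store as well."""

    def __init__(self, store):
        self.store = store
        self.metadata = set()  # stand-in for the consistent metadata store

    def put(self, key, data):
        self.store.put(key, data)
        self.metadata.add(key)

    def list_keys(self):
        listed = self.store.list_keys()
        # Assumed policy: add keys the metadata store knows about but the listing lacks.
        return sorted(listed | self.metadata)


store = LaggingStore()
client = GuardedClient(store)
client.put("stage1/part-0", b"rows")

raw_listing = sorted(store.list_keys())  # what a second job stage would see unguarded
guarded_listing = client.list_keys()     # what it sees through the guarded client
```

In this model the unguarded listing misses the file just written, which is the failure mode for multi-stage jobs, while the guarded listing includes it.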
Securing data access via IAM Roles
• Integration with cloud
provider
• Provide an IAM role as
instance profile for a cluster
• Attach policies for accessing
S3 to the role
– E.g. Read-only access for BI
cluster to specific buckets
• Status: Available
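As an illustration of the read-only example, the sketch below builds an IAM policy document that allows listing and reading specific buckets. The helper name and the bucket name example-bi-data are made up for this example; the actions a real cluster needs may differ, so treat the action list as an assumption.

```python
import json


def read_only_s3_policy(bucket_names):
    """Build an IAM policy document granting read-only access to the given buckets."""
    resources = []
    for name in bucket_names:
        resources.append(f"arn:aws:s3:::{name}")    # the bucket itself, needed for listing
        resources.append(f"arn:aws:s3:::{name}/*")  # the objects in it, needed for reads
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:ListBucket"],
                "Resource": resources,
            }
        ],
    }


# Hypothetical bucket for a BI cluster that should only read data.
policy = read_only_s3_policy(["example-bi-data"])
print(json.dumps(policy, indent=2))
```

A document like this would be attached to the role used as the cluster's instance profile, so no access keys need to be stored on the nodes.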
Data governance in Hadoop
• Apache Ranger
– Fine-grained, role-based access policies for data
• File system level access control
• Granularity for Hive columns
– Audit access information
• Apache Atlas
– Discover & index metadata
– Track data lineage
Data governance technical architecture – On Premise
[Diagram: On Premise HDP Cluster. Labels: Ranger Admin (Policy), Atlas Admin (Metadata), Governed HDP Component (e.g. Hive) with Ranger Plugin and Atlas Plugin, LDAP / AD, Data Steward]
Data Governance in the Cloud:
Ease of administration with flexibility
• No longer a single compute cluster generating / accessing
data
• Data & Metadata are still single and shared
• Evolve Atlas and Ranger to be data-lake centric rather than cluster centric
– Shared long running Admin components
– Ephemeral plugins on compute clusters
• Ref: https://github.com/hortonworks/hdc-cli/blob/master/shared_cluster.md
• Status: Available as a Tech Preview
Shared Ranger / Atlas admin services
Available in Tech Preview in Hortonworks Data Cloud
[Diagram]
• ETL-EDW Cluster: Governed HDP Component (e.g. Hive) with Ranger Plugin and Atlas Plugin
• Data Analytics Cluster: Governed HDP Component (e.g. Hive) with Ranger Plugin and Atlas Plugin
• Shared Enterprise Services: Ranger Admin (Policy), Atlas Admin (Metadata)
• Also shown: LDAP / AD, Cloud Controller, Data Steward
HDP Cloud Compute nodes on AWS
• Regular EC2 instances
• Can attach EBS volumes or ephemeral storage disks
• Grouped according to functionality / access
requirements
• Opportunistic provisioning – spot instances (work in
progress)
[Diagram: HDP Cluster. Labels: Master Group, Group #1, Group #2, Gateway node: Ambari, Cloud Controller]
Reliability with cost benefits
• HDP host instances could become unhealthy
– Unreliable underlying infrastructure
– Spot instances are transient, dependent on bid price
– SLA impact for workloads
• Automatically replace unhealthy nodes
– No costs incurred if node is not functional
– Replace unhealthy instances to maintain a desired capacity
• Status: Work in progress
Auto-recovery of slave nodes
• Use Ambari to detect unhealthy status & notify
Cloudbreak
• Decommission and terminate unhealthy instances
• Provision new instances and add to cluster
[Diagram: HDP Cluster. Labels: Master Group, Group #1, Group #2, Gateway node: Ambari, Cloud Controller]
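The recovery steps on this slide (detect, decommission and terminate, provision) can be sketched as a small reconcile step that restores a desired capacity. ToyCluster and recover are hypothetical names; this models the flow only and does not call Ambari, Cloudbreak or any AWS API.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class ToyCluster:
    """In-memory model of a worker group with a desired capacity."""
    desired_capacity: int
    nodes: dict = field(default_factory=dict)  # node id -> "HEALTHY" or "UNHEALTHY"
    _ids: count = field(default_factory=lambda: count(1))

    def provision(self):
        """Add a new healthy node and return its id."""
        node_id = f"worker-{next(self._ids)}"
        self.nodes[node_id] = "HEALTHY"
        return node_id

    def decommission_and_terminate(self, node_id):
        """Remove a node from the cluster."""
        del self.nodes[node_id]


def recover(cluster):
    """Replace unhealthy nodes so the cluster returns to its desired capacity."""
    unhealthy = [n for n, state in cluster.nodes.items() if state == "UNHEALTHY"]
    for node_id in unhealthy:
        cluster.decommission_and_terminate(node_id)
    replaced = []
    while len(cluster.nodes) < cluster.desired_capacity:
        replaced.append(cluster.provision())
    return unhealthy, replaced


cluster = ToyCluster(desired_capacity=3)
for _ in range(3):
    cluster.provision()

cluster.nodes["worker-2"] = "UNHEALTHY"  # simulated health report
removed, added = recover(cluster)
```

Driving the loop from a desired capacity rather than from individual failures means the same step also covers nodes that disappear on their own, as a reclaimed spot instance would.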
THANK YOU! QUESTIONS?
