Moving towards enterprise ready Hadoop clusters on the cloud

© Hortonworks Inc. 2011 – 2016.
All Rights Reserved
Enterprise ready Hadoop clusters on the cloud
Hadoop Summit, Tokyo
October 2016
Hemanth Yamijala, Hortonworks

Agenda
• Overview
– Hortonworks Data Cloud
– Architecture
• Improving enterprise readiness
– Cloud storage
– Governance
– Reliability and fault tolerance

HORTONWORKS DATA CLOUD - DEMO

Architecture
[Diagram: deployment on Amazon Web Services]
• Cloud controller (aka Cloudbreak): Cloudbreak services, Cloudbreak DB, Cloudbreak Deployer
• Connectors: AWS, GCE, Azure, OpenStack
• Access tools: Shell, REST API, Web UI
• HDP Cluster: ETL / EDW
– Master Group: Hive, Spark
– Ambari, Slave Group, Blueprint, S3AFileSystem
• HDP Cluster: Analytics
– Master Group: LLAP, Zeppelin
– Ambari, Slave Group, Blueprint, S3AFileSystem

Hortonworks Data Cloud - Summary
• Launch and manage clusters by workload type
– ETL / EDW, Data science, Business analytics
• Use highly scalable, durable storage for data (S3) & metadata (RDS)
• Share data and metadata among multiple ephemeral clusters
• Scale up and down at the click of a button
• Secure clusters with IAM roles, security groups, etc.

Matching Hadoop with the Cloud
Datacenter
• Data locality
• Consistent storage
• Single cluster administration
Cloud
• Scalable storage
• Customizability
• Cost effective compute
Matching the two
• Scalable storage with performance and consistency
• Customizability with ease of administration
• Cost effective compute with SLA policies

Cloud Storage access facts
[Diagram: interaction models between an application and HDFS for input, output, and tmp data, including a copy step]
• Cloud storage optimizes for scale
– S3 data is replicated for high-scale access and durability
• Data access is remote
– No data locality
– Costlier metadata operations (e.g. hadoop fs -mv is actually a copy and delete)
• Eventual consistency
– It takes time for the effect of modification operations to propagate to all copies
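The cost of a rename on an object store can be illustrated with a small in-memory sketch. ToyObjectStore and its methods are hypothetical names invented for this illustration; they are not part of Hadoop or any AWS SDK. The point is only that a "move" touches every byte under the prefix.

```python
class ToyObjectStore:
    """In-memory stand-in for an object store: flat keys, no real directories."""

    def __init__(self):
        self.objects = {}      # key -> bytes
        self.bytes_copied = 0  # total bytes moved by copy operations

    def put(self, key, data):
        self.objects[key] = data

    def copy(self, src, dst):
        data = self.objects[src]
        self.objects[dst] = data
        self.bytes_copied += len(data)

    def delete(self, key):
        del self.objects[key]

    def rename_prefix(self, src_prefix, dst_prefix):
        """Emulate a directory 'mv': copy every object, then delete the original."""
        keys = [k for k in self.objects if k.startswith(src_prefix)]
        for key in keys:
            self.copy(key, dst_prefix + key[len(src_prefix):])
            self.delete(key)
        return len(keys)


store = ToyObjectStore()
store.put("tmp/job1/part-0", b"a" * 1000)
store.put("tmp/job1/part-1", b"b" * 2000)
moved = store.rename_prefix("tmp/job1/", "output/")
print(moved, store.bytes_copied, sorted(store.objects))
```

On HDFS the same rename would be a single metadata update; here the work grows with the amount of data under the prefix.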

Performance with Scalability
• General strategy: Optimize by workload type
• ETL workloads
– Typical pipeline: Bring in data => Transform => Repair partitions => Compute statistics
– Multiple metadata calls: Batched and issued in parallel for performance gains
• DistCp
– Optimized buffer management for transferring large files
– Randomize input to DistCp to avoid hot-spotting S3 nodes
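The two ideas on this slide, issuing metadata calls in parallel and randomizing the copy order, can be sketched as follows. This is only an illustration of the general techniques, not the code used in Hive or DistCp; fake_stat, stat_in_parallel, and randomized_copy_order are hypothetical helpers, and the paths are made up.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def fake_stat(path):
    """Hypothetical stand-in for one remote metadata call (returns path and a fake size)."""
    return path, len(path)


def stat_in_parallel(paths, workers=8):
    """Issue the metadata calls concurrently instead of one after another."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(fake_stat, paths))


def randomized_copy_order(paths, seed=None):
    """Shuffle the copy list so neighbouring keys are not all processed together."""
    shuffled = list(paths)
    random.Random(seed).shuffle(shuffled)
    return shuffled


paths = [f"data/2016/10/{day:02d}/part-{n}" for day in range(1, 4) for n in range(3)]
sizes = stat_in_parallel(paths)
order = randomized_copy_order(paths, seed=42)
print(len(sizes), order[:3])
```

Sorted listings tend to send consecutive requests to keys that share a prefix; shuffling spreads the load across prefixes at the cost of a less predictable copy order.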

Performance with Scalability
• Analytics workloads – ORC file related optimizations
– Support fast random access reads (both directions) by avoiding tearing down S3 HTTP connections
– Pass index information to compute tasks as part of split data to avoid re-computation
• Ref: http://hortonworks.github.io/hdp-aws/s3-performance/index.html
• Status: Available, but performance optimizations never stop

Correctness with strong consistency
• A write operation followed by a read may not return correct results
– Issues for data pipelines, multi-stage jobs, etc.
• S3Guard project: Intermediate, consistent metadata store
• Write calls from S3AFileSystem update both S3 and the metadata store
• S3AFileSystem automatically tries to reconcile metadata between S3 and the metadata store on subsequent reads
– Inconsistencies are handled based on policy
• Ref: https://issues.apache.org/jira/browse/HADOOP-13345
• Status: In progress
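The idea of a consistent metadata store in front of an eventually consistent listing can be shown with a toy model. This is not S3Guard's implementation: LaggingStore and GuardedClient are hypothetical classes, and the reconciliation policy used here (take the union of both views) is an assumption chosen for simplicity.

```python
class LaggingStore:
    """Toy object store whose listings only show writes after sync() is called."""

    def __init__(self):
        self.visible = set()  # keys that listings already return
        self.pending = set()  # keys written but not yet visible in listings

    def put(self, key):
        self.pending.add(key)

    def list(self):
        return set(self.visible)

    def sync(self):
        self.visible |= self.pending
        self.pending.clear()


class GuardedClient:
    """Records every write in a consistent metadata set and merges it into listings."""

    def __init__(self, store):
        self.store = store
        self.metadata = set()

    def write(self, key):
        self.store.put(key)     # update the object store
        self.metadata.add(key)  # update the metadata store as well

    def list(self):
        # Assumed policy: trust the union of the store listing and the metadata.
        return sorted(self.store.list() | self.metadata)


store = LaggingStore()
client = GuardedClient(store)
client.write("out/part-0")
raw_listing = sorted(store.list())
guarded_listing = client.list()
print(raw_listing, guarded_listing)
```

A second stage of a pipeline that lists through the guarded client sees the file immediately, while a raw listing would miss it until the store catches up.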

Securing data access via IAM Roles
• Integration with cloud provider
• Provide an IAM role as instance profile for a cluster
• Attach policies for accessing S3 to the role
– E.g. Read-only access for BI cluster to specific buckets
• Status: Available
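A read-only policy of the kind mentioned above could look roughly like the document built below. The bucket name example-bi-reports and the helper read_only_s3_policy are hypothetical; the actions shown are a minimal assumption, and a real policy may need more (or fewer) permissions depending on the workload.

```python
import json


def read_only_s3_policy(buckets):
    """Build an IAM policy document granting list and read access to the given buckets."""
    resources = []
    for bucket in buckets:
        resources.append(f"arn:aws:s3:::{bucket}")    # the bucket itself (for listing)
        resources.append(f"arn:aws:s3:::{bucket}/*")  # the objects inside it (for reads)
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:ListBucket"],
                "Resource": resources,
            }
        ],
    }


policy = read_only_s3_policy(["example-bi-reports"])  # hypothetical bucket name
print(json.dumps(policy, indent=2))
```

Because the policy is attached to the role used as the cluster's instance profile, no access keys need to be stored on the cluster nodes.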

Data governance in Hadoop
• Apache Ranger
– Fine-grained, role-based access policies to data
• File system level access control
• Granularity for Hive columns
– Audit access information
• Apache Atlas
– Discover & index metadata
– Track data lineage

Data governance technical architecture – On Premise
[Diagram: On Premise HDP Cluster]
• Ranger Admin: Policy
• Atlas Admin: Metadata
• Governed HDP component (e.g. Hive) with Ranger plugin and Atlas plugin
• LDAP / AD
• Data Steward

Data Governance in the Cloud:
Ease of administration with flexibility
• No longer a single compute cluster generating / accessing data
• Data & metadata are still single and shared
• Evolve Atlas and Ranger to be data lake centric rather than cluster centric
– Shared long running Admin components
– Ephemeral plugins on compute clusters
• Ref: https://github.com/hortonworks/hdc-cli/blob/master/shared_cluster.md
• Status: Available as a Tech Preview

Shared Ranger / Atlas admin services
Available in Tech Preview in Hortonworks Data Cloud
[Diagram]
• Shared Enterprise Services
– Ranger Admin: Policy
– Atlas Admin: Metadata
• ETL-EDW Cluster: Governed HDP component (e.g. Hive) with Ranger plugin and Atlas plugin
• Data Analytics Cluster: Governed HDP component (e.g. Hive) with Ranger plugin and Atlas plugin
• Cloud Controller
• LDAP / AD
• Data Steward

HDP Cloud Compute nodes on AWS
• Regular EC2 instances
• Can attach EBS volumes or ephemeral storage disks
• Grouped according to functionality / access requirements
• Opportunistic provisioning – spot instances (work in progress)
[Diagram: Cloud Controller and an HDP Cluster made up of a Master Group (gateway node: Ambari), Group #1, and Group #2]

Reliability with cost benefits
• HDP host instances could become unhealthy
– Unreliable underlying infrastructure
– Spot instances are transient, dependent on bid price
– SLA impact for workloads
• Automatically replace unhealthy nodes
– No costs incurred if node is not functional
– Replace unhealthy instances to maintain a desired capacity
• Status: Work in progress

Auto-recovery of slave nodes
• Use Ambari to detect unhealthy status & notify Cloudbreak
• Decommission and terminate unhealthy instances
• Provision new instances and add to cluster
[Diagram: Cloud Controller and an HDP Cluster made up of a Master Group (gateway node: Ambari), Group #1, and Group #2]
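The recovery steps on this slide (detect, decommission and terminate, provision replacements) can be sketched as a small loop. This is a simulation only, not Ambari or Cloudbreak code: recover_cluster, provision_node, and the node names are hypothetical, and the set of unhealthy nodes is passed in directly where a real system would receive a notification.

```python
import itertools

_new_ids = itertools.count(1)


def provision_node():
    """Hypothetical stand-in for asking the cloud provider for a new instance."""
    return f"new-worker-{next(_new_ids)}"


def recover_cluster(nodes, unhealthy, desired_capacity, provision):
    """Remove unhealthy nodes, then add new ones until capacity is restored."""
    healthy = set(nodes) - set(unhealthy)  # decommission and terminate
    while len(healthy) < desired_capacity:
        healthy.add(provision())           # provision and add to the cluster
    return healthy


cluster = {"worker-1", "worker-2", "worker-3"}
reported_unhealthy = {"worker-2"}  # in practice this would come from monitoring
recovered = recover_cluster(cluster, reported_unhealthy, 3, provision_node)
print(sorted(recovered))
```

Expressing the goal as a desired capacity, rather than a one-for-one swap, also covers the case where several nodes fail between two checks.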
