Fault Tolerance in AWS Distributed Cloud Computing

Fault Tolerance in Amazon Web Service (AWS)
Caner Kaya
Computers and System Engineering
Tallinn University of Technology
Tallinn, Estonia
canerkaya89@gmail.com
Abstract- Cloud computing delivers information technology
solutions by using virtual machines to share resources and
provide them on demand, and its complexity makes it an
increasingly attractive research area. With the rapid
development of these technologies, fault tolerance has
become one of the most important topics in cloud computing,
because cloud systems must be reliable and available at all
times. This case study reviews the techniques that protect
cloud computing and user systems from process faults. As the
following sections indicate, cloud computing is prone to
faults. The main goals of fault tolerance are to prevent
financial losses and to restore the system after a failure.
The case study reviews scenarios in which repeated faults
can be handled by checkpoints and backups. Amazon Web
Services (AWS) is presented as an example of fault
tolerance.
Keywords- Cloud Computing; Fault Tolerance;
Dependability; Availability; Redundancy; Human
Factor; Replication; Amazon Web Services.
I. INTRODUCTION
The rapid development of processing and storage technologies
and the success of the Internet have made computing
resources more affordable, more powerful and easier to access
than in the past. This technological progress has enabled a
new computing model called cloud computing. In cloud
computing, resources such as CPU and storage are offered to
users as general services that can be leased over the
Internet on demand. In a cloud computing environment, the
conventional role of the service provider is divided in two:
infrastructure providers manage cloud platforms and lease
resources according to a usage-based pricing model, and
service providers rent resources from one or more
infrastructure providers to serve end users.
In recent years, the development of cloud computing has
changed the Information Technology (IT) industry
considerably. Major companies such as Google, Amazon and
Microsoft strive to provide more powerful and reliable cloud
platforms at lower cost, and business
enterprises want to restructure their existing business models
to benefit from this new paradigm. As listed below, cloud
computing offers several properties that attract business
owners.
• No up-front investment: Cloud computing uses a
pay-as-you-go pricing model. A service provider does
not need to invest in infrastructure to start
benefiting from cloud computing. It simply leases
the resources it needs from the cloud and pays only
for the usage.
• Lower operating cost: In a cloud environment,
resources can be allocated and released repeatedly
according to demand. Capacity no longer has to be
provisioned for the peak load, and when demand is
low, resources can be released to lower operating
costs.
• Highly scalable: Infrastructure providers pool large
amounts of resources from data centers and make them
easily accessible. A service can expand to large
scales to handle rapid increases in service demand
(e.g., the flash-crowd effect). This model is often
referred to as surge computing [1].
• Easy access: Clouds normally host web-based
services, so any device with an Internet connection
can access them easily, including smartphones and
PDAs as well as desktop and laptop computers.
• Reducing business risks and maintenance expenses:
By outsourcing the service infrastructure to the
cloud, a service provider shifts business risks
such as hardware failures to infrastructure
providers, who have more expertise and equipment
and are more likely to manage these risks. The
business owner can also reduce hardware maintenance
and personnel training costs.
On the other hand, while cloud computing offers many
opportunities to the IT industry, it also brings
difficulties unique to its structure that must be handled
carefully. This study examines cloud computing
research, covering key concepts, architectural
properties, state-of-the-art practices and
drawbacks.

II. TECHNOLOGY OVERVIEW
A. Definition of Cloud Computing
Today, cloud computing has become a widespread
phenomenon. It can be defined as a new way of using
information technology as an Internet-based service. Views
on cloud computing differ, however, and people working on
different aspects of it may define it differently. The
following are definitions of cloud computing given by
major companies, organizations and individuals.
The National Institute of Standards and Technology (NIST)
gives a more precise and technical definition: “Cloud
computing is a model for enabling ubiquitous, convenient, on-
demand network access to a shared pool of configurable
computing resources (e.g., networks, servers, storage,
applications, and services) that can be rapidly provisioned and
released with minimal management effort or service provider
interaction.” [2]
Irving Wladawsky-Berger (IBM) gives the following definition:
"When virtualizing applications to be used by people who care
nothing about computers or technology - as is mostly the case
with Clouds - the key thing we want to virtualize or hide from
the user is complexity. Most people want to deal with an
application or a service, not software. ... The more intelligent
we want [computers and computer applications] to be - that
is, intuitive, exhibiting common sense and not making us have
to constantly take care of them - the more smart software it
will take. But with cloud computing, our expectation is that all
that software will be virtualized or hidden from us and taken
care of by systems and/or professionals that are somewhere
else - out there in The Cloud." [3]
B. Cloud Computing Service Models
Figure 1: Three main cloud service models
There are three main service models:
• SaaS – Software-as-a-Service
The National Institute of Standards and Technology
(NIST) definition: “The capability provided to the
consumer is to use the provider’s applications
running on a cloud infrastructure. The applications
are accessible from various client devices through
either a thin client interface, such as a web browser
(e.g., web-based email), or a program interface. The
consumer does not manage or control the underlying
cloud infrastructure including network, servers,
operating systems, storage, or even individual
application capabilities, with the possible exception
of limited user specific application configuration
settings.” [5]
• IaaS – Infrastructure-as-a-Service:
IaaS is a model in which a third-party provider hosts
hardware, software, servers and other infrastructure
components on behalf of its users. IaaS providers
also host users’ applications and take over tasks
such as system maintenance, backup and resiliency
planning. IaaS platforms offer highly scalable
resources that can be adjusted on demand, which
makes IaaS well suited to workloads that are
temporary, experimental or subject to sudden
changes. Other properties of IaaS environments
include automation of management tasks, dynamic
scaling, desktop virtualization and policy-based
services. IaaS is billed by the hour, week or month
according to usage. Some providers also charge
customers based on the amount of virtual machine
space they use. This pay-as-you-go pricing model
removes the capital expense of in-house hardware and
software. On the other hand, if users do not
supervise their IaaS environments carefully, they
may be charged for unauthorized services.
Amazon Web Services (AWS), Windows Azure,
Google Compute Engine, Rackspace Open Cloud,
and IBM SmartCloud Enterprise are among the major
IaaS providers.
• PaaS – Platform-as-a-Service is a cloud computing
model that delivers applications over the Internet.
In the PaaS model, a cloud provider delivers the
hardware and software tools required for application
development to its users. The hardware and software
are hosted on the PaaS provider’s infrastructure, so
PaaS users can develop and run applications without
in-house hardware and software.
In addition, several variants are derived from the three main
service models:

• DaaS – Desktop-as-a-Service has a multi-tenant
architecture, and the service is purchased on a
subscription basis. In the DaaS delivery model, the
service provider manages the back-end
responsibilities of data storage, backup, security and
upgrades. Typically, the customer's personal data is
copied to and from the virtual desktop during
logon/logoff, and access to the desktop is device,
location and network independent.
• XaaS – Everything-as-a-Service began as
software-as-a-service (SaaS) and has expanded over
time to include services such as
infrastructure-as-a-service, platform-as-a-service,
storage-as-a-service, desktop-as-a-service,
disaster-recovery-as-a-service, and even nascent
offerings like marketing-as-a-service and
healthcare-as-a-service.
• CaaS – Communication-as-a-Service allows the
consumer to use enterprise-level VoIP, VPNs,
PBX and unified communications without the
expense of purchasing them.
• MaaS – Monitoring-as-a-Service provides online state
monitoring: it continuously tracks the state of
applications, networks, systems, instances or any
other element deployable within the cloud.
C. Definition of Virtualization
Intel defines virtualization as follows: “Virtualization abstracts compute
resources—typically as virtual machines (VMs)—with
associated storage and networking connectivity. The cloud
determines how those virtualized resources are allocated,
delivered, and presented. Virtualization is not necessary to
create a cloud environment, but it enables rapid scaling of
resources in a way that nonvirtualized environments find
hard to achieve.” [6]
D. Service Attributes
• On-demand self-service
Consumers can access cloud resources such as servers or
storage through websites or web service interfaces whenever
they need them. They can order, customize and pay for
resources automatically, without interacting with the cloud
provider’s personnel. [7]
• Broad network access
Because cloud computing services are web-based, a consumer
can obtain resources over the Internet using standard
mechanisms from heterogeneous operating systems and from
thick and thin client platforms such as laptops and
smartphones [8]. Cloud computing is therefore device
independent. [9]
• Resource pooling
Cloud computing supports multi-tenant resource usage. In
other words, a pool of the provider’s computing resources is
shared among a large number of consumers. These virtual
resources are allocated and reallocated according to
consumer demand [2].
Resource pooling in cloud computing is based on abstraction:
the exact location of resources (e.g., VMs, processing,
memory, storage or connectivity) is not exposed to the
consumer. However, the consumer may be able to specify the
location at a higher level of abstraction, such as country,
state or datacenter. [2]
• Rapid elasticity
Resources can be provisioned and released quickly and
elastically, so that consumers can scale resources up or
down, automatically or manually, in various quantities and
at any time [7]. This is comparable to drawing more or less
electricity from the power grid as needed.
Figure 2: Automated elasticity in AWS [32]
• Measured service
A metering system bills consumers on a pay-per-use basis,
for example storage billed by the day and processing power
billed by the hour, as well as network I/O bandwidth and the
number of transactions. [10]
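The pay-per-use metering described above can be illustrated with a minimal Python sketch. The resource names, units and unit prices are hypothetical values chosen for illustration and are not actual provider tariffs.

```python
from dataclasses import dataclass

# Hypothetical unit prices (illustrative only, not real provider tariffs).
PRICES = {
    "storage_gb_day": 0.001,    # per GB stored per day
    "compute_hour": 0.05,       # per instance-hour
    "network_gb": 0.02,         # per GB of network I/O
    "transactions_1k": 0.004,   # per 1,000 transactions
}


@dataclass
class UsageRecord:
    """One metered measurement: which resource and how much was consumed."""
    resource: str
    quantity: float


def compute_bill(records):
    """Sum the cost of all metered usage records, grouped by resource."""
    bill = {}
    for rec in records:
        if rec.resource not in PRICES:
            raise KeyError(f"unknown metered resource: {rec.resource}")
        cost = rec.quantity * PRICES[rec.resource]
        bill[rec.resource] = bill.get(rec.resource, 0.0) + cost
    return bill


usage = [
    UsageRecord("storage_gb_day", 100 * 30),  # 100 GB kept for 30 days
    UsageRecord("compute_hour", 2 * 24),      # 2 instances for one day
    UsageRecord("network_gb", 10),
]
bill = compute_bill(usage)
total = sum(bill.values())
```

The consumer pays only for the quantities that were actually metered, so releasing resources lowers the bill directly.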
E. Deployment models
• Private cloud (“internal cloud”)
A private cloud is built for the exclusive use of a single
organization or business, which gives it full control over
data, maintenance, security and quality of service. A
business may build and manage its own private cloud
in-house, or the operation may be handled entirely by a
third-party provider on the premises [12]. Enterprises can
thus share a pool of computing resources across
applications, departments, business units and projects.
Unlike the public cloud, this model requires considerable
up-front costs, continuous maintenance, hardware,
software, a datacenter and internal expertise [8].
Private clouds are therefore considered the secure
style of IaaS. [9]
• Public cloud
Services and utilities are available to the general public
and are consumed on a pay-per-use basis. Public clouds are
run by third-party providers, and their physical
infrastructure and resources are usually hosted off the
consumers’ premises. Such clouds reduce consumer risk and
cost by providing an elastic, even temporary, extension to
enterprise infrastructure [11]. Common examples of public
clouds are Amazon AWS (for example EC2 and S3), Microsoft
Azure and Rackspace Cloud Suite.
• Community cloud
Several organizations that serve a common function or
purpose, or that share similar concerns such as security
requirements, policies, missions and regulatory compliance
needs, share the cloud infrastructure [12]. A community
cloud can be managed by one of the member organizations or
by a third party. [13]
• Hybrid cloud
Combining multiple public and private cloud models through
standardized technology in order to distribute
applications across them is known as a hybrid cloud.
Hybrid clouds can move applications and data from one
cloud to another [7]. This model suits enterprises that
use a public cloud for general computing while keeping
customers’ data within a private cloud [8]. Popular
providers of hybrid cloud services include Amazon Virtual
Private Cloud and Skytap Virtual Lab.
• Virtual Private Cloud
A Virtual Private Cloud (VPC) is an alternative that
addresses the limitations of both public and private
clouds. A VPC is essentially a platform running on top of
public clouds. The main difference is that a VPC leverages
virtual private network (VPN) technology, which allows
service providers to customize the topology and security
settings such as firewall rules. Because a VPC virtualizes
the underlying communication network as well as servers
and applications, it is a more holistic design. In
addition, the virtualized network layer gives most
companies a smooth transition from a proprietary service
infrastructure to a cloud-based infrastructure.
III. HISTORY OF CLOUD COMPUTING
• 1950s – During this decade, the word ‘cloud’ still
refers to a visible mass of condensed water vapor
floating in the atmosphere. The mainframe and time
sharing are born, introducing the concept of shared,
centralized compute resources.
• 1969 – The first working prototype of ARPANET is
launched, linking four geographically dispersed
computers over what is now known as the Internet.
• Late 1970s – The term ‘client-server’ comes into use,
defining the computing model in which clients access
data and applications from a central server over a
local area network.
• 1995 – Pictures of clouds start showing up in network
diagrams, denoting anything too complicated for
non-technical people to understand.
• 1999 – Salesforce.com launches, becoming the first
company to make enterprise applications available
from a website.
• 1999 – Google launches a fledgling search service
that returns impressive results.
• 2003 – Web 2.0 is born, characterized by rich
multimedia, user-generated content and dynamic
interfaces.
• 2006 – Amazon launches Amazon Web Services
(AWS), giving users a new way to store data offsite
and rent compute cycles as a service.
• 2007 – Netflix launches its streaming service and
binge-watching is born.
• 2008 – The concept of the private cloud emerges,
viewed by enterprises as a more secure version of the
so-called ‘public cloud’.
• 2008 – Dropbox launches as a personal cloud storage
service.
• 2009 – Browser-based cloud enterprise applications
like Google Apps are introduced, revolutionizing the
market for productivity applications.
• 2010 – Open-source cloud platforms such as
OpenStack launch.
• 2011 – Hybrid cloud emerges, combining public and
private cloud environments to the delight of
trigger-shy IT departments.
• 2011 – Microsoft’s ‘to the cloud’ commercials
launch, attempting to explain how the cloud can
benefit mere mortals.
• 2011 – Apple launches iCloud, letting people
automatically back up all the content on their phones.
• 2012 – Google launches Google Drive with free cloud
storage for digital packrats.
• 2012 – In Turkey, people start to learn about cloud
computing from a Turkish politician, Binali Yildirim,
then Minister of Transport, Maritime and Communication
of Turkey. He said: “Nowadays, something called the
‘cloud system’ has emerged. Everybody drops something
in it and takes what is required from there. That is how
I understand it; it might be a different thing. There is
no systematic ‘thing’ anymore. You stack everything in
it, everybody takes what he/she needs, yet
nothing gets mixed up. You find whatever you want.
That information technology… If you ponder too
much, then you’ll get crazy! You’ll use it and benefit
from it for your work.” [34]

IV. DEPENDABILITY TREE IN CLOUD COMPUTING
Figure 3: Dependability Tree
A. Attributes
To assure high availability of cloud services, data centers
should provide the following:
• Failure-isolated zones: These are called
Availability Zones in Amazon EC2. IaaS users know
where their application instances are placed. The
geographical locations of cloud datacenters are
known as zones, and a failure in one zone is
isolated from the others. Distributing a user’s
application instances across multiple zones can
therefore increase availability.
• Automatic scale-up: This function automatically
starts and stops instances depending on load. The
customer must still specify how instances should
scale as demand changes. The function supports high
availability because failed server processes can be
replaced, and it is also sensible in financial
terms.
• Configurable load balancer: Dynamically configuring
the load balancer to distribute requests to a
different zone helps to achieve high
availability. [14]
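The benefit of spreading instances across failure-isolated zones can be quantified with a simple model. The sketch below assumes that zones fail independently and that the application stays available as long as at least one zone is up. Both are simplifying assumptions, and the 99% per-zone figure is illustrative rather than a published value.

```python
def combined_availability(zone_availability, zones):
    """Availability of a service replicated over `zones` independent zones.

    The service is down only if every zone is down at the same time,
    so the combined unavailability is (1 - a) ** n.
    """
    if not 0.0 <= zone_availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    if zones < 1:
        raise ValueError("at least one zone is required")
    return 1.0 - (1.0 - zone_availability) ** zones


# Compare one, two and three zones at an assumed 99% per-zone availability.
for n in (1, 2, 3):
    print(n, round(combined_availability(0.99, n), 6))
```

Under these assumptions each additional zone multiplies the expected downtime by the per-zone unavailability, which is why multi-zone deployment is the first tactic recommended for high availability.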
B. Threats
• Provider-inner faults: Common methods of recovering
services from such failures are redundancy, backup,
and stopping and restarting services.
• Provider-user faults: Faulty nodes may result from
network congestion, hacker attacks, browser crashes,
request timeouts or malicious behavior.
• User-across faults: Sharing critical resources among
users may cause chaos in a cloud computing system
because of unsafe access to the resources. [15]
• Datacenter hardware failures: These affect
processors, hard disk drives, integrated circuit
sockets and memory [16].
• Datacenter software faults: These lead to application
failures.
• Crash faults: These occur when system components
either stop functioning or fail to return to a
correct state (e.g., a hard disk crash).
• Byzantine faults: These malevolent faults make system
components behave arbitrarily or maliciously and
produce incorrect and inconsistent output
values.
C. Means
Fault avoidance aims to prevent faults from occurring
in the operational system. It limits introduction of
faults during system construction. It includes fault
prevention, fault removal, and fault forecasting [3].
Removing any possible faults creeping into a system
before it goes operational is the function of fault
prevention. Fault removal attempts to find and
remove the causes of errors. Hence, fault avoidance
contributes to the improvement of the quality of both
the components and the systems. Fault forecasting
evaluates, estimates or ranks the system behavior
during fault activation.
V. FAULT TAXONOMY IN CLOUD COMPUTING
Many types of faults exist, and the cloud is prone to most
of them. Fault-resolving mechanisms include various fault
tolerance techniques at either the task level or the
workflow level [17].
Figure 4: Fault Tolerance Taxonomy

A. Proactive fault tolerance
Proactive fault tolerance predicts faults before they
occur and replaces the components suspected of causing
them with properly working ones, so that faults and errors
are avoided before recovery becomes necessary. Preemptive
migration and software rejuvenation, among others, follow
this policy.
• Software rejuvenation: The system is scheduled for
periodic reboots, and after every reboot it starts
from a clean state. Applications that run for a long
time without stopping accumulate a risk of failure
and lose reliability and performance, a phenomenon
known as software aging. Rejuvenation is a low-cost,
proactive fault management technique that
periodically stops the running software and restarts
it in order to clean and refresh its internal
state. [18]
• Self-healing: Failures of an application instance
running on multiple virtual machines are handled
automatically.
• Preemptive migration: Preemptive migration is a
failure avoidance technique based on a feedback-loop
control mechanism in which an application is
constantly monitored and preventive action is taken
to avoid application failure. However, not all
failures can be anticipated and covered by
preemptive measures, so a combination of proactive
and reactive fault tolerance provides a more
sophisticated mechanism.
[19]
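The feedback loop behind preemptive migration can be sketched as follows. This is a minimal illustration under stated assumptions: the health metric is a normalised reading in [0, 1], the threshold and window size are arbitrary, and `migrate` is a hypothetical callback standing in for whatever mechanism actually moves the application.

```python
def should_migrate(samples, threshold=0.9, window=3):
    """Return True if the last `window` readings all exceed `threshold`.

    Requiring several consecutive bad readings avoids reacting to a
    single transient spike.
    """
    if len(samples) < window:
        return False
    return all(s > threshold for s in samples[-window:])


def monitor(readings, migrate, threshold=0.9, window=3):
    """Observe each reading and trigger migration before a failure occurs.

    `readings` is any iterable of metric values; `migrate` is a
    hypothetical callback that moves the application to a healthy node.
    Returns the index of the reading that triggered migration, or None
    if the application stayed healthy.
    """
    history = []
    for i, value in enumerate(readings):
        history.append(value)
        if should_migrate(history, threshold, window):
            migrate()
            return i
    return None


events = []
trigger = monitor([0.5, 0.95, 0.6, 0.92, 0.95, 0.97, 0.99],
                  migrate=lambda: events.append("migrated"))
```

In this run the isolated spike at the second reading is ignored, and migration is triggered only once three consecutive readings exceed the threshold.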
B. Reactive fault tolerance
Reactive fault tolerance techniques reduce the impact of
failures on a system after the failures have actually
occurred. They make the system more robust. Techniques
based on this policy include checkpoint/restart and
retry [25].
• User-defined exception handling: Users can specify
exceptions for handling task-specific failures,
depending on the context of the task, and can
define custom procedures for handling these
errors.
• Task resubmission: Whenever a failed task is
detected, the same task is submitted again for
execution during the workflow execution phase. The
task may be resubmitted either to the same resource
or to a different one at run time, without
interrupting the workflow of the system. [21]
• Retry: The failed task is tried repeatedly on the
same cloud resource. [22]
• S-Guard: A fault tolerance technique for
distributed stream processing engines (SPEs), with
applications such as email, online games,
e-commerce, instant messaging and search. It relies
on rollback recovery: the state of stream
processing nodes is checkpointed periodically, and
failed nodes are restarted from the last
checkpoint. [23]
• Replication: The availability of replicated
resources is a key requirement for building fault
tolerant systems in the cloud. Simply put,
replication means that several copies of an
application with the same input set are executed
simultaneously on alternative sites [24]. For
instance, proxy servers and web browser caching can
be considered forms of replication. The main goal
of replication is to guarantee that at least one
replica completes the task correctly if others
fail [33]. Several replication mechanisms, such as
active, passive and semi-active replication, have
been used in cloud computing.
• Job migration: The state of a job is moved from one
machine (node) to another when the node cannot
complete the execution of its tasks. Tasks can be
migrated using the HAProxy tool and load
balancing. [25, 26]
• Checkpoint/Restart (C/R) recovery: C/R is the
typical technique for tolerating failures on
unreliable systems. A snapshot of the running
application is saved periodically to stable storage
so that the application can be restarted from the
latest checkpoint image after a crash. [27]
Researchers have examined many checkpoint
strategies, but only three are widely used in
cloud computing.
- Full checkpoint: a traditional mechanism that
periodically saves the entire state of the
application or the system to a storage
platform. Its drawbacks are the time needed to
take a snapshot of the whole system and the
large amount of storage needed to save the
whole running state. [28]
- Incremental checkpoint: The first checkpoint
is full, while subsequent checkpoints save
only the pages that have been modified. This
procedure produces a large recovery overhead
because the system must recover starting from
the first checkpoint. [29]
- Hybrid checkpoint: a combination of the full
and incremental checkpoint strategies. It aims
to strike a balance between the checkpoint
overhead and the fault recovery overhead.
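The three checkpoint strategies can be contrasted with a small sketch. It is an illustration under stated assumptions, not the mechanism of any particular system: the application state is a plain dictionary whose keys are only added or modified, stable storage is simulated by an in-memory list, and the `full_every` parameter is a hypothetical knob for the hybrid policy.

```python
import copy

_MISSING = object()  # marks keys that were absent at the previous checkpoint


class CheckpointStore:
    """In-memory stand-in for stable storage (an assumption of this sketch)."""

    def __init__(self, full_every=3):
        self.full_every = full_every  # hybrid policy: every n-th save is full
        self.checkpoints = []         # list of ("full" | "incr", data)
        self._last_state = {}

    def save(self, state):
        """Store a full snapshot, or only the keys changed since the last save."""
        if len(self.checkpoints) % self.full_every == 0:
            self.checkpoints.append(("full", copy.deepcopy(state)))
        else:
            delta = {k: copy.deepcopy(v) for k, v in state.items()
                     if self._last_state.get(k, _MISSING) != v}
            self.checkpoints.append(("incr", delta))
        self._last_state = copy.deepcopy(state)

    def restore(self):
        """Rebuild the latest state: last full snapshot plus later increments."""
        if not self.checkpoints:
            raise RuntimeError("no checkpoint available")
        last_full = max(i for i, (kind, _) in enumerate(self.checkpoints)
                        if kind == "full")
        state = copy.deepcopy(self.checkpoints[last_full][1])
        for _, delta in self.checkpoints[last_full + 1:]:
            state.update(copy.deepcopy(delta))
        return state


store = CheckpointStore(full_every=3)
state = {"step": 0, "total": 0}
for step in range(1, 6):
    state["step"] = step
    state["total"] += step
    store.save(state)
recovered = store.restore()
kinds = [kind for kind, _ in store.checkpoints]
```

Setting `full_every=1` gives a pure full-checkpoint policy, while a very large value approaches pure incremental checkpointing, where recovery has to replay every increment since the first snapshot. Intermediate values trade checkpoint size against recovery time.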
VI. FAILURE DETECTOR
A failure detector is an application or subsystem used to
detect node failures or crashes. Failure detectors are
classified as reliable or unreliable according to the
results they yield. The correctness properties of failure
detectors are:
• Completeness: Every process failure should be
detected by at least one non-faulty process.
Completeness describes the ability of a failure
detector to suspect every failed process permanently.
• Accuracy: Failure detections should be correct. The
fewer the false positives, the higher the accuracy.
100% accuracy is hardly feasible: real-life failure
detectors guarantee completeness, while their
accuracy is only partial or probabilistic. There is
therefore always a trade-off between completeness
and accuracy.
• Speed: The time to detect a failure should be as
short as possible.
• Scale: The load should be low and evenly
distributed, so that each process in a group incurs
a low overall network load.
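A heartbeat-based detector is one common way to realise these properties, and the following minimal sketch illustrates the trade-off between speed and accuracy. The class name, the timeout value and the explicit `now` parameter (used instead of a real clock so that the example is deterministic) are assumptions of this illustration.

```python
class HeartbeatFailureDetector:
    """Suspect a process if no heartbeat arrived within `timeout` seconds."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.last_seen = {}  # process id -> time of the last heartbeat

    def heartbeat(self, process_id, now):
        """Record that `process_id` was alive at time `now`."""
        self.last_seen[process_id] = now

    def suspected(self, now):
        """Return the set of processes that have been silent for too long."""
        return {p for p, t in self.last_seen.items()
                if now - t > self.timeout}


fd = HeartbeatFailureDetector(timeout=2.0)
fd.heartbeat("a", now=0.0)
fd.heartbeat("b", now=0.0)
fd.heartbeat("a", now=1.5)
fd.heartbeat("a", now=3.0)
suspects_at_3 = fd.suspected(now=3.0)  # "b" has been silent for 3.0 s

# A late heartbeat shows that "b" was only slow: a false positive.
fd.heartbeat("b", now=3.5)
suspects_at_4 = fd.suspected(now=4.0)
```

A shorter timeout detects real crashes faster but suspects slow processes more often, while a longer timeout improves accuracy at the cost of speed. A crashed process never sends another heartbeat, so it stays suspected, which corresponds to completeness.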
VII. OVERVIEW OF AMAZON WEB SERVICES FEATURES
Figure 5: AWS structure [31]
A. Definition of AWS features
The following service definitions are taken from [31] and
other sources and are quoted here:
• Amazon Elastic Compute Cloud (Amazon EC2) is a
web service that provides resizable compute capacity
in the cloud. You can bundle the operating system,
application software and associated configuration
settings into an Amazon Machine Image (AMI). AMIs
can then be used to provision multiple virtualized
instances, as well as to decommission them, using
simple web service calls to scale capacity up and
down quickly as your capacity requirements change.
You can purchase On-Demand Instances, which are
paid for by the hour; Reserved Instances, for which
you make a low, one-time payment and receive a
lower usage rate than with an On-Demand Instance; or
Spot Instances, where you bid for unused capacity
to reduce cost further. Instances can be launched in
one or more geographical regions. Each region has
multiple Availability Zones. Availability Zones are
distinct locations that are engineered to be insulated
from failures in other Availability Zones and provide
inexpensive, low-latency network connectivity to
other Availability Zones in the same Region. [31]
• Amazon CloudWatch is a monitoring service for
AWS cloud resources and the applications you run on
AWS. It can be used to collect and track metrics,
collect and monitor log files, set alarms, and
automatically react to changes in your AWS
resources. It can monitor AWS resources such as
Amazon EC2 instances, Amazon DynamoDB tables, and
Amazon RDS DB instances, as well as custom metrics
generated by your applications and services, and
any log files your applications generate.
Amazon CloudWatch can be used to gain system-wide
visibility into resource utilization, application
performance, and operational health. These insights
can be used to react and keep the application
running smoothly. [31]
• Amazon Virtual Private Cloud (Amazon VPC) gives you
complete control over your virtual networking
environment within your logically isolated network,
including selection of your own IP address range,
creation of subnets, and configuration of route
tables and network gateways. [31]
• Amazon Relational Database Service (Amazon RDS)
provides an easy way to set up, operate and scale a
relational database in the cloud. You can launch a
DB Instance and get access to a full-featured MySQL
database without worrying about common database
administration tasks such as backups and patch
management, which the service handles [31].
• Amazon Simple Queue Service (Amazon SQS) is a
reliable, highly scalable, hosted distributed queue
that computers and other components of the system
can use to store messages [31].
• Amazon Simple Notification Service (Amazon SNS)
provides a simple way to notify applications or
people from the cloud by creating Topics and using a
publish-subscribe protocol [31].
B. AWS-specific tactics for implementing best practices
1. Fail over carefully using Elastic IPs: Using an
Elastic IP, we can quickly and dynamically remap
and fail over to another group of servers so
that our traffic is routed to the new servers.
This is useful when upgrading from an old
version to a new one or when a failure
occurs [31].
2. Use multiple Availability Zones: Availability
Zones are conceptually like logical
datacenters. Deploying across them ensures
high availability of our data [31].
3. Use Amazon RDS Multi-AZ deployment
functionality to directly replicate database
updates across multiple Availability Zones [31].
4. Maintain an Amazon Machine Image so that you
can restore and clone environments very easily
in a different Availability Zone, and maintain
multiple database slaves across Availability
Zones [31].
5. Use Amazon CloudWatch to gain more
visibility and take appropriate action in case of
performance degradation or hardware failure. Set
up an Auto Scaling group to maintain a fixed
fleet size so that unhealthy Amazon EC2
instances are replaced by new ones [31].
6. Use Amazon EBS and set up a time-based job
scheduler (cron) so that incremental snapshots
are automatically uploaded to Amazon S3 and the
data is fully independent of your instances [31].
7. Use Amazon RDS and set the retention period
for backups so that it performs backups
automatically [31].
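Tactics 1 and 6 can be sketched in Python. The sketch assumes an EC2 client in the style of the boto3 SDK, which is not mentioned in [31]. The client is passed in as a parameter so that the functions can be exercised without AWS credentials. `RecordingClient` is a hypothetical stand-in whose return values are illustrative, and all resource identifiers are placeholders.

```python
def fail_over_elastic_ip(ec2, allocation_id, standby_instance_id):
    """Re-map an Elastic IP to a standby instance (tactic 1).

    `ec2` is expected to behave like a boto3 EC2 client, for example
    the object returned by boto3.client("ec2").
    """
    return ec2.associate_address(
        AllocationId=allocation_id,
        InstanceId=standby_instance_id,
        AllowReassociation=True,  # move the address even if it is in use
    )


def snapshot_volume(ec2, volume_id, description="scheduled backup"):
    """Request an EBS snapshot (tactic 6); meant to be run from a cron job."""
    return ec2.create_snapshot(VolumeId=volume_id, Description=description)


class RecordingClient:
    """Hypothetical stand-in for the real client; it only records calls."""

    def __init__(self):
        self.calls = []

    def associate_address(self, **kwargs):
        self.calls.append(("associate_address", kwargs))
        return {"AssociationId": "eipassoc-EXAMPLE"}

    def create_snapshot(self, **kwargs):
        self.calls.append(("create_snapshot", kwargs))
        return {"SnapshotId": "snap-EXAMPLE", "VolumeId": kwargs["VolumeId"]}


client = RecordingClient()
association = fail_over_elastic_ip(client, "eipalloc-EXAMPLE", "i-STANDBY-EXAMPLE")
snapshot = snapshot_volume(client, "vol-EXAMPLE")
```

With a real client, the first call redirects traffic to the standby server without any DNS change, and scheduling the second call periodically keeps the data recoverable independently of the instance.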
CCC... Specific details of AWS
As stated in [29,30,31,35,36] AWS includes the following
services and features and environments (following italics are
direct citations):
 Virtual computing environments, known as instances
Preconfigured templates for your instances, known as
Amazon Machine Images (AMIs), that package the
bits you need for your server (including the operating
system and additional software)
• Network firewalls built into Amazon VPC, and web
application firewall capabilities in AWS WAF let you
create private networks, and control access to your
instances and applications.
• Encryption in transit with TLS across all services.
• Connectivity options that enable private, or
dedicated, connections from your office or
on-premises environment.
• Ability to deploy DDoS mitigation technologies as
part of your auto-scaling or content delivery strategy.
• AWS Identity and Access Management (IAM) lets you
define individual user accounts with permissions
across AWS resources.
• Data encryption capabilities available in AWS
storage and database services, such as EBS, S3,
Glacier, Oracle RDS, SQL Server RDS, and Redshift.
• Flexible key management options, including AWS
Key Management Service, allowing you to choose
whether to have AWS manage the encryption keys or
enable you to keep complete control over your keys.
• Dedicated, hardware-based cryptographic key
storage using AWS CloudHSM, allowing you to
satisfy compliance requirements.
• A security assessment service, Amazon Inspector,
that automatically assesses applications for
vulnerabilities or deviations from best practices,
including impacted networks, OS, and attached
storage.
• Deployment tools to manage the creation and
decommissioning of AWS resources according to
organization standards.
• Inventory and configuration management tools,
including AWS Config, that identify AWS resources
and then track and manage changes to those
resources over time.
• Template definition and management tools, including
AWS CloudFormation to create standard,
preconfigured environments.
• Deep visibility into API calls through AWS
CloudTrail, including who, what, when, and from
where calls were made.
• AWS complies with many assurance programs and
standards: SOC 1/ISAE 3402, SOC 2, SOC 3, FISMA,
DIACAP, FedRAMP, PCI DSS Level 1, ISO 9001,
ISO 27001, and ISO 27018.
• Log aggregation options, streamlining investigations
and compliance reporting.
• Alert notifications through Amazon CloudWatch
when specific events occur or thresholds are
exceeded.
• There are several purchasing options to suit
customer requirements: On-Demand Instances,
Reserved Instances, Spot Instances, and
Dedicated Hosts.
• AWS Multi-Factor Authentication for privileged
accounts, including options for hardware-based
authenticators.
• AWS Directory Service allows you to integrate and
federate with corporate directories to reduce
administrative overhead and improve end-user
experience.
• With AWS Direct Connect, you can establish a
dedicated network connection between AWS and your
datacenter, office, or colocation environment. In
many cases, this can provide both lower costs and a
higher level of service than Internet-based
connections.
• Amazon S3 and Amazon Glacier automatically
replicate data across multiple data centers and are
designed to deliver 99.999999999% durability.
• Various configurations of CPU, memory, storage,
and networking capacity for your instances, known
as instance types.
• Secure login information for your instances using key
pairs (AWS stores the public key, and you store the
private key in a secure place).
• Storage volumes for temporary data that's deleted
when you stop or terminate your instance, known as
instance store volumes.
• Persistent storage volumes for your data using
Amazon Elastic Block Store (Amazon EBS), known as
Amazon EBS volumes.
• Each AZ has independent infrastructure (power,
cooling, network, and security), so the zones are
isolated from one another. Therefore, a failure in
one AZ will not affect the others.
• Multiple physical locations for your resources, such
as instances and Amazon EBS volumes, known as
regions and Availability Zones.
Figure 6: Availability Zones in AWS [36]
• A firewall that enables you to specify the protocols,
ports, and source IP ranges that can reach your
instances using security groups.
• Static IP addresses for dynamic cloud computing,
known as Elastic IP addresses.
• Metadata, known as tags, that you can create and
assign to your Amazon EC2 resources.
• Virtual networks you can create that are logically
isolated from the rest of the AWS cloud, and that you
can optionally connect to your own network, known
as virtual private clouds (VPCs).
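The 99.999999999% durability figure quoted in the list above for Amazon S3 and Amazon Glacier can be made concrete with a short calculation. The sketch below is illustrative only: it assumes the figure is an annual, per-object durability and that object losses are independent, and the function names are hypothetical. Exact fractions are used to avoid floating-point rounding at this scale.

```python
from fractions import Fraction

# Assumption: 99.999999999% is an annual, per-object durability and
# object losses are independent of each other.
ANNUAL_DURABILITY = Fraction(99_999_999_999, 100_000_000_000)


def expected_annual_losses(object_count: int) -> Fraction:
    """Expected number of objects lost per year for a given object count."""
    return object_count * (1 - ANNUAL_DURABILITY)


def years_per_expected_loss(object_count: int) -> Fraction:
    """Average number of years over which one object loss is expected."""
    return 1 / expected_annual_losses(object_count)


if __name__ == "__main__":
    print(float(expected_annual_losses(10_000)))
    print(int(years_per_expected_loss(10_000)))
```

Under these assumptions, a customer storing 10,000 objects would expect to lose one object roughly every ten million years, which shows why backups kept in such a store are treated as independent of instance failures.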
CONCLUSION
Cloud computing offers small and medium-sized enterprises many
benefits, such as cost-effective infrastructure resources, managed
infrastructure, availability, and scalability. In technical terms,
however, cloud services are commercial products, and they do not
guarantee the continuity of the customer's applications. What
they guarantee is the availability of the infrastructure and
components offered to the customer. Therefore, to ensure
the continuity of the customer's system, fault tolerance should be
deployed in the cloud. Our eventual goal is to analyze these FT
methods and understand their limitations in order to build an FT
method that handles all fault types. In this study, we examined
the properties of Amazon Web Services and showed how they
tolerate faults.
REFERENCES
[1] M. Armbrust, (2009), Above the clouds: a Berkeley view of cloud
computing, UC Berkeley Technical Report [online], Available from:
https://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-28.pdf/.
[2] P. Mell, T. Grance, (2011), The NIST Definition of Cloud
Computing [online], Available from:
http://faculty.winthrop.edu/domanm/csci411/Handouts/NIST.pdf/.
[3] J. Geelan, (2009), Twenty One Experts Define Cloud
Computing [online], Available from:
http://virtualization.sys-con.com/node/612375/.
[4] Diversity Limited, (2011), Revolution Not Evolution How Cloud
Computing Different from Traditional IT and Why it Matters [online],
Available from:
http://userpages.umbc.edu/~dgorin1/451/cloud/Revolution_Not_Evolution-Whitepaper.pdf/.
[5] L. Youseff, M. Butrico, D. Da Silva, Toward a Unified Ontology of
Cloud Computing [online], Available from:
https://storagemadeeasy.com/files/8f047da34a2d3a3528136ba8b59a465d.pdf/.
[6] Intel, (2013), Virtualization and Cloud Computing [online], Available
from:
http://www.intel.com/content/dam/www/public/us/en/documents/guides/cloud-computing-virtualization-building-private-iaas-guide.pdf/.
[7] L. Krutz, R. Vines, (2010), Cloud Security: a Comprehensive Guide to
Secure Cloud Computing [online], Available from:
https://drive.google.com/file/d/0B-W0l4MahMzLVVp0UVgyNnh5bnM/.
[8] M. Williams, (2010), A Quick Start Guide to Cloud Computing
[online], Available from:
https://23510310jarinfo.files.wordpress.com/2011/09/a-quick-start-guide-to-cloud-computing.pdf/.
[9] C. Barnatt, (2010), A Brief Guide to Cloud Computing [online],
Available from:
http://www.explainingcomputers.com/cloud/BGT_Cloud_Computing_Extract.pdf/.
[10] B. Furht, A. Escalante, (2011), Handbook of Cloud Computing
[online], Available from:
https://studytm.files.wordpress.com/2014/03/hand-book-of-cloud-computing.pdf/.
[11] Sun Microsystems, (2009), Introduction to Cloud Computing
Architecture [online], Available from:
https://java.net/jira/secure/attachment/29265/CloudComputing.pdf/.
[12] Dialogic Corporation, (2010), Introduction to Cloud Computing [online],
Available from:
http://www.dialogic.com/~/media/products/docs/whitepapers/12023-cloud-computing-wp.pdf/.
[13] B. Sosinsky, (2011), Cloud Computing Bible [online], Available from:
http://cs.ecust.edu.cn/~yhq/course_files/cloud/Cloud%20Computing%20Bible.pdf/.
[14] F. Machida, E. Andrade, D. S. Kim, K. S. Trivedi, (2011), Candy:
Component-based Availability Modeling Framework for Cloud Service
Management Using SysML [online], Available from:
http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=6076779/.
[15] C. Gong, J. Liu, Q. Zhang, H. Chen, Z. Gong, (2010), The
Characteristics of Cloud Computing [online], Available from:
http://www.postdm.post.ir/_ITCenter/Documents/TheCharacteristicsofCloudComputing_20140722_154207.pdf/.
[16] A. Amal Ganesh, M. Sandhya, S. Shankar, (2014), Study on Fault
Tolerance methods in Cloud Computing [online], Available from:
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6779432/.
[17] Z. Amin, N. Sethi, H. Singh, (2015), Review on Fault Tolerance
Techniques in Cloud Computing [online], Available from:
http://research.ijcaonline.org/volume116/number18/pxc3902768.pdf
[18] F. Machida, D. S. Kim, J. S. Park, K. S. Trivedi, (2008), Toward
Optimal Virtual Machine Placement and Rejuvenation Scheduling in a
Virtualized Data Center [online], Available from:
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5355515/.
[19] C. Engelmann, G. R. Vallee, T. Naughton, S. L. Scott, (2009),
Proactive Fault Tolerance Using Preemptive Migration [online],
Available from:
http://www.christian-engelmann.info/publications/engelmann09proactive.pdf/.
[20] K. Plankensteiner, R. Prodan, T. Fahringer, (2009), A New Fault
Tolerance Heuristic for Scientific Workflows in Highly Distributed
Environments Based on Resubmission Impact [online], Available from:
[21] A. Bala, I. Chana, (2012), Fault Tolerance- Challenges, Techniques and
Implementation in Cloud Computing [online], Available from:
https://www.researchgate.net/publication/266525159_Fault_Tolerance-Challenges_Techniques_and_Implementation_in_Cloud_Computing/.
[22] Y. Kwon, M. Balazinska, A. Greenberg, (2008), Fault Tolerant
Stream Processing using a Distributed, Replicated File System [online],
Available from: http://goo.gl/vzhK6l/.
[23] S. M. Ghoreyshi, (2013), Energy-Efficient Resource Management of
Cloud Datacenters Under Fault Tolerance Constraints [online], Available
from:
[24] S. Lin, M. Huang, K. Lai, K. Huang, (2008), Design and
Implementation of Job Migration Policies in P2P Grid Systems [online],
Available from:
[25] P. K. Patra, H. Singh, G. Singh, (2013), Fault Tolerance Techniques
and Comparative Implementation in Cloud Computing [online], Available
from:
https://www.researchgate.net/publication/258789870_Fault_Tolerance_Techniques_and_Comparative_Implementation_in_Cloud_Computing/.
[26] Y. M. Essa, (2016), A Survey of Cloud Computing Fault Tolerance:
Techniques and Implementation [online], Available from:
http://www.ijcaonline.org/research/volume138/number13/essa-2016-ijca-909055.pdf/.
[27] S. Gokuldev, M. Valarmathi, (2013), Fault Tolerant System for
Computational and Service Grid [online], Available from:
http://www.ijeit.com/vol%202/Issue%2010/IJEIT1412201304_47.pdf/.
[28] R. Garg, A. K. Singh, (2011), Fault Tolerance in Grid Computing:
State of the Art and Open Issues [online], Available from:
http://airccse.org/journal/ijcses/papers/0211cses07.pdf/.
[29] Amazon, (2015), Amazon Web Services: Overview of Security
Processes [online], Available from:
https://d0.awsstatic.com/whitepapers/aws-security-whitepaper.pdf/.
[30] Amazon, (2015), Amazon Web Services: Overview of Amazon Web
Services [online], Available from:
https://d0.awsstatic.com/whitepapers/aws-overview.pdf/.
[31] Amazon, (2011), Architecting for the Cloud: Best Practices [online],
Available from:
https://media.amazonwebservices.com/AWS_Cloud_Best_Practices.pdf
[32] L. Youseff, M. Butrico, D. Da Silva, Toward a Unified Ontology of
Cloud Computing [online], Available from:
https://storagemadeeasy.com/files/8f047da34a2d3a3528136ba8b59a465d.pdf/.
[33] P. Latchoumy, P. S. A. Khader, (2011), Survey on Fault Tolerance in
Grid Computing [online], Available from:
http://airccse.org/journal/ijcses/papers/1111ijcses07.pdf/.
[34] Minister of Transport, Maritime and Communication of Turkey (Binali
Yildirim), speech about cloud, (2012), 'the cloud', Available from:
https://www.youtube.com/watch?v=10UZwW6563E/
[35] Amazon documents website, http://docs.aws.amazon.com/.
[36] Amazon AWS website, http://aws.amazon.com/.
