Parallel and Distributed Computing Chapter 12

PARALLEL AND DISTRIBUTED COMPUTING
FAULT TOLERANT DISTRIBUTED COMPUTING

FAULT TOLERANCE
 The ability of a system to continue operating without interruption
despite the failure of one or more of its components
 Describes how an OS responds to, and tolerates, malfunctions and failures
 Aims to guarantee no break in service
 Recovers from failure completely and transparently

FAULT TOLERANCE
 Every gain in fault tolerance comes with a drawback
somewhere else
 The system will be slower, take more disk space, use more
machines, and incur other costs
 Therefore fault tolerance is always a trade-off between cost and
the degree of fault tolerance.

FAILURE VS ERROR
 Failure: the system's behavior differs from its expected behavior
 A failure might involve the system being unreachable or
producing incorrect output
 Error: an incorrect state of the system that may lead to a failure.
 Errors do not necessarily cause failures; they can be detected in the
system before they produce a failure.

FAULT TOLERANCE
 Fault tolerance usually runs through several phases.
 Error detection: the error has to be detected in order to avoid a failure.
 Damage confinement: the error must be prevented from spreading
to other components
 Error recovery: the error must be removed, otherwise the system would
run into a failure
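
The three phases above can be sketched in a few lines. This is only an illustration under stated assumptions: the `Component` class, the CRC checksum used for detection, the `quarantined` flag used for confinement, and the `good_copy` field used for recovery are all hypothetical choices, not something the slides prescribe.

```python
import zlib

class Component:
    """Hypothetical component: a payload, its checksum, and a known-good copy."""
    def __init__(self, name, payload):
        self.name = name
        self.payload = payload
        self.checksum = zlib.crc32(payload)
        self.good_copy = payload      # assumed last known-good state
        self.quarantined = False

def detect(c):
    # Error detection: the stored checksum no longer matches the payload.
    return zlib.crc32(c.payload) != c.checksum

def confine(c):
    # Damage confinement: mark the component so others stop using it.
    c.quarantined = True

def recover(c):
    # Error recovery: restore the known-good state and release the component.
    c.payload = c.good_copy
    c.checksum = zlib.crc32(c.payload)
    c.quarantined = False

def handle(c):
    """Run the three phases in order; return True if an error was handled."""
    if detect(c):
        confine(c)
        recover(c)
        return True
    return False
```

Calling `handle` on a component whose payload was corrupted restores the good copy; calling it on a healthy component does nothing.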

PROCESSOR FAULT
 Occurs when a processor behaves in an unexpected manner. Processor
faults may be classified into three kinds.
1. Fail-stop: the processor has failed completely and will never respond;
neighboring processors can detect the failed processor
2. Slowdown: the processor might run in a degraded form or might
fail completely
3. Byzantine: the processor can fail, run in a degraded fashion for some
time, or execute at normal speed while trying to make the computation fail
NETWORK FAULTS
 Occur when processors are prevented from communicating with each
other. Link faults can cause new kinds of problems, such as:
 One-way links: one processor can send messages to the other but
cannot receive messages from it.
 Network partition: a portion of the network is completely isolated
from the rest
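
To make the two link faults concrete, the sketch below groups processors into partitions using only links that work in both directions. The function name `partitions` and the representation of links as `(sender, receiver)` pairs are assumptions made for this example.

```python
def partitions(nodes, links):
    """Group nodes that can still talk to each other in both directions.

    links: set of (sender, receiver) pairs that currently deliver messages.
    A one-way link appears as (a, b) without the matching (b, a).
    """
    two_way = {n: set() for n in nodes}
    for a, b in links:
        if (b, a) in links:           # keep only links usable in both directions
            two_way[a].add(b)
    seen, groups = set(), []
    for start in nodes:
        if start in seen:
            continue
        group, stack = [], [start]
        seen.add(start)
        while stack:                  # depth-first walk over two-way links
            n = stack.pop()
            group.append(n)
            for m in two_way[n]:
                if m not in seen:
                    seen.add(m)
                    stack.append(m)
        groups.append(sorted(group))
    return groups
```

If the only link between two groups is one-way, the groups are reported as separate partitions, because request and reply cannot both get through.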

ATTRIBUTES OF FAULT TOLERANT SYSTEM
A fault-tolerant system is a dependable system, which requires the following
attributes:
1. Availability: the system is in a ready state and ready to deliver its
functions. A highly available system is most likely working at any given instant in time.
2. Reliability: the ability of a computer to run continuously without failure. It is
defined over a time interval rather than at an instant. A reliable system works
continuously without interruption.
3. Safety: when the system fails to carry out its processes correctly and its
operations are incorrect, nothing disastrous happens and other systems
are not made faulty
4. Maintainability: failures can be noticed and fixed easily.
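
The difference between availability (an instant) and reliability (an interval) can be shown with two standard textbook formulas. The slides do not give these formulas; the sketch assumes steady-state availability MTTF / (MTTF + MTTR) and an exponential failure model for reliability, and the function names are illustrative.

```python
import math

def availability(mttf_hours, mttr_hours):
    # Assumed steady-state model: fraction of time the system is up.
    return mttf_hours / (mttf_hours + mttr_hours)

def reliability(t_hours, mttf_hours):
    # Assumed exponential failure model: probability of running
    # for t_hours without any failure.
    return math.exp(-t_hours / mttf_hours)
```

Under these assumptions a system that fails briefly but often can have high availability and still low reliability over a long interval.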

CLASSIFICATION OF FAILURE
Transient:
Intermittent:
Permanent:

FAULT TOLERANCE MECHANISM IN DISTRIBUTED SYSTEM
 Replication based fault tolerance technique
 Process level redundancy technique
 Fusion based redundancy technique

REPLICATION BASED FAULT TOLERANCE TECHNIQUE
 Replicate the data on another machine, so that a single failure will not
cause the whole system to stop.
 Replicate the data on different servers.

 Problems of replication
 Consistency: the major problem of replication is consistency,
because any client may update the data. Consistency of data is
ensured by a model such as the sequential or causal memory
consistency model
 Degree of replication: a large number of replicas is needed in
order to achieve high fault tolerance.
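
The link between the degree of replication and fault tolerance can be estimated with a small calculation. The sketch assumes that replicas fail independently with the same probability, so the data is lost only if every replica fails; the function name and parameters are hypothetical.

```python
def replicas_needed(p_fail, target_availability):
    """Smallest number of replicas so that at least one survives
    with the target probability, assuming independent failures."""
    if not 0.0 < p_fail < 1.0:
        raise ValueError("p_fail must be strictly between 0 and 1")
    if not 0.0 < target_availability < 1.0:
        raise ValueError("target_availability must be strictly between 0 and 1")
    n = 1
    # Probability that all n replicas fail is p_fail ** n.
    while 1.0 - p_fail ** n < target_availability:
        n += 1
    return n
```

Each extra replica raises the survival probability, but also adds storage cost and makes consistency harder to maintain.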

PROCESS LEVEL REDUNDANCY TECHNIQUES
 Faults that disappear without anything being done are called transient
faults. This type of fault is hard to identify
 To handle transient faults, software based fault tolerance techniques
are used
 PLR compares processes to ensure correct execution
 Checkpoint and rollback is a popular technique in which the
current state of the system is saved.
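
Both ideas on this slide can be sketched briefly. The example assumes that the outputs of redundant processes are compared by majority vote, and that a checkpoint is an in-memory deep copy of the state; the names `vote` and `Checkpointed` are hypothetical, and real PLR systems compare processes at a much lower level.

```python
import copy
from collections import Counter

def vote(results):
    """Compare the outputs of redundant processes and return the majority value."""
    value, count = Counter(results).most_common(1)[0]
    if count <= len(results) // 2:
        raise RuntimeError("no majority among redundant results")
    return value

class Checkpointed:
    """Hypothetical wrapper that saves state and can roll back to it."""
    def __init__(self, state):
        self.state = state
        self._saved = copy.deepcopy(state)

    def checkpoint(self):
        # Save the current state as the new recovery point.
        self._saved = copy.deepcopy(self.state)

    def rollback(self):
        # Discard changes made since the last checkpoint.
        self.state = copy.deepcopy(self._saved)
```

With three redundant runs, one transient fault that corrupts a single result is outvoted by the other two.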

FUSION BASED TECHNIQUE
 Replication: the downside is multiple backups, which increase cost
 This problem is addressed by the fusion based technique because it
requires fewer backups
 Backup machines are fused to a given set of systems (NP-Problem)
 The fusion based technique has very high overhead during the recovery
process, so it is acceptable only when the probability of a fault in the
system is low.
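
A very simple form of fusion is a single parity backup shared by several primaries. The sketch assumes integer state and XOR as the fusion function, which is only one possible choice and is not taken from the slides. It shows both properties mentioned above: one backup covers many primaries, but recovery has to read every surviving primary.

```python
from functools import reduce
from operator import xor

def fuse(values):
    # One fused backup for all primaries: the XOR of their states.
    return reduce(xor, values, 0)

def recover(fused, survivors):
    # Rebuild the single lost value by XORing the backup with
    # every surviving primary (this is the costly part).
    return reduce(xor, survivors, fused)
```

For primaries holding 5, 9 and 12, losing the second one is repaired with `recover(fuse([5, 9, 12]), [5, 12])`, whereas replication would have needed one backup per primary.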
