Fault Tolerance in Parallel Systems
Majority Voting
Explanation: Majority voting runs the same operation on multiple redundant components and accepts the output that most of them agree on, so a fault in a minority of components is masked. Example: Triple Modular Redundancy (TMR), where three identical components perform the same operation and the output produced by at least two of them is taken.
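A minimal sketch of a TMR-style voter, assuming each replica's output is a hashable value that can be compared for equality. The function name `majority_vote` is illustrative, not taken from any library.

```python
from collections import Counter

def majority_vote(outputs):
    """Return the value produced by a strict majority of replicas.

    Raises ValueError when no value is produced by more than half of them.
    """
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise ValueError("no strict majority among replica outputs")
    return value

# Three replicas; the third one produced a faulty result and is outvoted.
result = majority_vote([42, 42, 17])
```

Requiring a strict majority, rather than just the most common value, means the voter reports a failure instead of guessing when the replicas disagree too widely.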
Redundancy
Explanation: Redundancy involves duplicating critical components or functions of a system with the intention of increasing reliability. Example: In a server cluster, having multiple servers running the same applications so that if one fails, others can take over.
Error Correction Codes
Explanation: Error correction codes (ECC) add redundant bits to data so that errors can be detected and corrected. Example: ECC memory, which typically corrects single-bit errors and detects (but does not correct) double-bit errors.
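To make single-bit correction concrete, here is a small sketch of the classic Hamming(7,4) code, which protects 4 data bits with 3 parity bits. It is a teaching example only; real ECC memory uses wider codes, and the function names are illustrative.

```python
def hamming74_encode(d):
    """Encode 4 data bits as a 7-bit codeword [p1, p2, d1, p4, d2, d3, d4]."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4  # covers codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # covers codeword positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # covers codeword positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]

def hamming74_decode(codeword):
    """Correct at most one flipped bit and return the 4 data bits."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s4  # 1-based position of the bad bit, 0 if none
    if syndrome:
        c[syndrome - 1] ^= 1
    return [c[2], c[4], c[5], c[6]]

data = [1, 0, 1, 1]
received = hamming74_encode(data)
received[4] ^= 1  # simulate a single-bit error in transit or storage
recovered = hamming74_decode(received)
```

The syndrome is zero for a clean codeword and otherwise names the position of the flipped bit, which is why one error can be repaired. Two flipped bits would be miscorrected by this code; detecting them requires an extra overall parity bit.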
Checkpointing
Explanation: Checkpointing is the process of saving the state of a system periodically so that it can restart from the last saved state in case of failure. Example: In distributed computing, storing the state of a computation every N minutes.
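A minimal sketch of periodic checkpointing for a single-process computation, assuming the state fits in a small dictionary and can be pickled to a local file. The helper names, the `crash_at` parameter, and the file name are all hypothetical and exist only to simulate a failure.

```python
import os
import pickle
import tempfile

def save_checkpoint(path, state):
    # Write to a temporary file first, then rename, so a crash during the
    # write never leaves a half-written checkpoint behind.
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, path)

def load_checkpoint(path):
    if not os.path.exists(path):
        return {"step": 0, "total": 0}  # no checkpoint yet: start from scratch
    with open(path, "rb") as f:
        return pickle.load(f)

def run(path, n_steps, every, crash_at=None):
    """Sum 0..n_steps-1, checkpointing after every `every` completed steps."""
    state = load_checkpoint(path)
    while state["step"] < n_steps:
        if crash_at is not None and state["step"] == crash_at:
            raise RuntimeError("simulated crash")
        state["total"] += state["step"]
        state["step"] += 1
        if state["step"] % every == 0:
            save_checkpoint(path, state)
    return state["total"]

path = os.path.join(tempfile.mkdtemp(), "job.ckpt")
try:
    run(path, n_steps=10, every=5, crash_at=7)
except RuntimeError:
    pass  # work done after the last checkpoint (steps 5 and 6) is lost
resumed_from = load_checkpoint(path)["step"]
total = run(path, n_steps=10, every=5)  # restarts from step 5, not step 0
```

The checkpoint interval is a trade-off: saving more often costs more I/O but loses less work on failure.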
Replication
Explanation: Replication involves creating copies of data or services to ensure that failure of a single component does not result in data loss. Example: Database replication to multiple nodes to prevent total data loss during a node failure.
N-Modular Redundancy (NMR)
Explanation: N-Modular Redundancy runs N copies of a component in parallel, with a voter selecting the majority output. N is usually odd, and the system can mask up to (N-1)/2 faulty copies. Example: Quintuple modular redundancy (5MR), where five components vote and up to two faulty outputs are outvoted.
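The benefit of NMR can be estimated with a short calculation. The sketch below assumes a perfect voter and independent component failures, each component working with probability r; both assumptions are simplifications, and the function names are illustrative.

```python
from math import comb

def max_faults_masked(n):
    """Number of faulty copies an n-copy majority voter can outvote."""
    return (n - 1) // 2

def nmr_reliability(n, r):
    """Probability that a majority of n independent copies work correctly."""
    need = n // 2 + 1  # smallest number of copies that forms a majority
    return sum(comb(n, k) * r**k * (1 - r) ** (n - k) for k in range(need, n + 1))

tmr = nmr_reliability(3, 0.9)  # equals 3r^2 - 2r^3 for r = 0.9
```

For r = 0.9, TMR gives about 0.972, better than a single component. The same formula shows that redundancy only helps when r is above 0.5; below that, voting makes the system less reliable than one copy.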
Hot Swapping
Explanation: Hot swapping allows replacement or addition of components to a system without shutting it down. Example: Replacing a failed hard drive in a RAID configuration without turning off the server.
Heartbeat Mechanism
Explanation: A heartbeat mechanism is a periodic signal sent between components to verify operation and connectivity. Example: Two servers sending 'I'm alive' messages to each other to confirm they are still operational.
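A minimal sketch of the receiving side of a heartbeat scheme, assuming a node is suspected of failure once no heartbeat has arrived within a fixed timeout. The class name and the injectable `clock` parameter are illustrative; the clock is replaced by a fake one here so the example runs instantly.

```python
import time

class HeartbeatMonitor:
    def __init__(self, timeout, clock=time.monotonic):
        self.timeout = timeout
        self.clock = clock
        self.last_seen = {}  # node name -> time of most recent heartbeat

    def beat(self, node):
        """Record an 'I'm alive' message from a node."""
        self.last_seen[node] = self.clock()

    def suspected(self):
        """Return nodes whose last heartbeat is older than the timeout."""
        now = self.clock()
        return sorted(n for n, t in self.last_seen.items() if now - t > self.timeout)

now = [0.0]  # fake clock, advanced by hand
monitor = HeartbeatMonitor(timeout=3.0, clock=lambda: now[0])
monitor.beat("server-a")
monitor.beat("server-b")
now[0] = 2.0
monitor.beat("server-a")  # server-b stays silent from here on
now[0] = 4.0
down = monitor.suspected()
```

A missed heartbeat only means the node is suspected: a slow network produces the same symptom as a crash, so the timeout has to balance detection speed against false alarms.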
Graceful Degradation
Explanation: Graceful degradation allows a system to continue operating at a reduced level of functionality when parts of the system fail. Example: A web service that disables non-critical features when it is under heavy load or has suffered a partial failure.
Rollback Recovery
Explanation: Rollback recovery reverts a system to a previously known good state following an error. Example: Using a database's transaction log to undo the changes of a transaction that was interrupted by a crash, restoring the state that existed before it began.
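A minimal sketch of rollback using an in-memory undo log, assuming a simple key-value store. The `Store` class is hypothetical; a real database writes its log to durable storage so that it survives the crash itself.

```python
_MISSING = object()  # marks "key did not exist before this write"

class Store:
    def __init__(self):
        self.data = {}
        self.undo_log = []  # (key, previous value) for each uncommitted write

    def put(self, key, value):
        # Log the old value before overwriting it.
        self.undo_log.append((key, self.data.get(key, _MISSING)))
        self.data[key] = value

    def commit(self):
        # Changes are now the known good state; nothing left to undo.
        self.undo_log.clear()

    def rollback(self):
        # Undo uncommitted writes in reverse order.
        while self.undo_log:
            key, old = self.undo_log.pop()
            if old is _MISSING:
                self.data.pop(key, None)
            else:
                self.data[key] = old

store = Store()
store.put("balance", 100)
store.commit()
store.put("balance", 40)    # transaction starts...
store.put("pending", True)  # ...and fails before it can commit
store.rollback()
```

After the rollback, the store again holds only the committed state, `{"balance": 100}`.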
Failover
Explanation: Failover is the process of transferring services and operations to a standby system when the primary system fails. Example: Automatic switching to a backup server when the main server crashes.
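A minimal sketch of client-side failover, assuming backends are tried in priority order and that a failed backend signals this by raising `ConnectionError`. The function and backend names are illustrative.

```python
def call_with_failover(backends, request):
    """Try each (name, handler) pair in order; return the first success."""
    errors = []
    for name, handler in backends:
        try:
            return name, handler(request)
        except ConnectionError as exc:
            errors.append((name, exc))  # remember the failure, try the next one
    raise RuntimeError(f"all backends failed: {errors}")

def primary(request):
    raise ConnectionError("primary is down")  # simulated crash

def standby(request):
    return f"handled {request}"

served_by, response = call_with_failover(
    [("primary", primary), ("standby", standby)], "ping"
)
```

In production systems failover is usually driven by a health check or heartbeat rather than by the failing request itself, but the control flow is the same: detect the failure, then redirect work to the standby.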
Task Rescheduling
Explanation: Task rescheduling involves dynamically reassigning tasks to available resources when some fail. Example: In a grid computing environment, reassigning tasks from an unresponsive node to a functional one.
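A minimal sketch of task rescheduling, assuming tasks are handed out round-robin and that an unresponsive node shows up as a `ConnectionError`. The scheduler function and node names are hypothetical.

```python
from collections import deque

def run_with_rescheduling(tasks, workers):
    """Run every task; tasks from a failed worker are requeued for the others."""
    pending = deque(tasks)
    healthy = dict(workers)  # worker name -> callable
    results = {}
    turn = 0
    while pending:
        if not healthy:
            raise RuntimeError("no healthy workers left")
        task = pending.popleft()
        name = list(healthy)[turn % len(healthy)]  # round-robin choice
        turn += 1
        try:
            results[task] = (name, healthy[name](task))
        except ConnectionError:
            del healthy[name]     # stop sending work to the failed node
            pending.append(task)  # reassign its task to a working node
    return results

def dead_node(task):
    raise ConnectionError("node unresponsive")

results = run_with_rescheduling(
    [1, 2, 3],
    {"node-a": lambda t: t * 2, "node-b": dead_node},
)
```

Task 2 is first assigned to `node-b`, fails, and is completed by `node-a` on a later turn. Rerunning a task like this is only safe when tasks are idempotent or their partial effects can be discarded.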
Self-healing Systems
Explanation: Self-healing systems are capable of detecting and fixing problems automatically. Example: A distributed system that automatically redistributes tasks if a node fails.
Software Redundancy
Explanation: Software redundancy includes implementing additional software services that can take over functionality if the primary service fails. Example: Multiple DNS servers that provide the same naming service to ensure uninterrupted hostname resolution.
Rejuvenation
Explanation: Rejuvenation entails proactively restarting components at intervals to clear error conditions that accumulate over time (software aging), such as leaked memory, before they cause a failure. Example: Rebooting servers during low-traffic periods so that memory leaks never grow large enough to exhaust available memory.
© Hypatia.Tech. 2024 All rights reserved.