Fault-tolerant communication in supercomputing


Embedded supercomputing is becoming indispensable for complex, computing-intensive scientific and industrial applications, and parallel systems are supplanting traditional uniprocessor platforms. Dependability and fault tolerance thus become critical to the performance of parallel systems. Failures are no longer just undesirable situations; depending on the application, they can be hazardous or even catastrophic.

Migrating to parallel systems offers application developers new prospects but also exposes them to new dangers. Multiprocessor cooperation—a parallel system’s most powerful feature—can also be its fatal weakness. More processors means more faults, and failure of a single processor can crash the whole system.

A major factor is the communication system. Interprocessor communication, which coordinates processors and enhances their power, is key to a successful parallel system. Distributed-memory multiprocessor systems rely on message communication between nodes. Message-passing applications use either synchronous (blocking) or asynchronous (nonblocking) communication to keep parallel tasks coherent. In the synchronous mode, problems arise when communication links or communicating threads are in an erroneous state (broken links, threads in infinite loops, and so on). When such errors occur, the communicating threads remain blocked, since communication cannot be initiated or completed.
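To make the blocking problem concrete, here is a minimal sketch in Python of one common countermeasure: putting a timeout on a blocking receive so that a silent peer is detected instead of stalling the receiver forever. It uses the standard `queue` and `threading` modules to stand in for a communication link. The names `receive`, `run`, `healthy_sender`, `faulty_sender` and `LinkTimeout` are made up for illustration and do not come from any particular system.

```python
import queue
import threading


class LinkTimeout(Exception):
    """Raised when no message arrives within the allowed time."""


def receive(channel, timeout_s):
    # Blocking receive that gives up after timeout_s seconds
    # instead of waiting forever on a dead link or a stuck peer.
    try:
        return channel.get(timeout=timeout_s)
    except queue.Empty:
        raise LinkTimeout(f"no message within {timeout_s} s") from None


def healthy_sender(channel):
    # A well-behaved peer: sends its message and returns.
    channel.put("result")


def faulty_sender(channel):
    # Stands in for a broken link or a thread stuck in a loop:
    # the message is never sent.
    pass


def run(sender, timeout_s=0.5):
    # Start the peer, then wait for its message with a bounded delay.
    channel = queue.Queue()
    peer = threading.Thread(target=sender, args=(channel,), daemon=True)
    peer.start()
    try:
        return receive(channel, timeout_s)
    except LinkTimeout:
        return "peer presumed faulty"
```

Calling `run(healthy_sender)` returns the message, while `run(faulty_sender)` returns after the timeout rather than blocking. A timeout only detects the fault; what to do next (retry, reroute, restart the peer) is a separate decision.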

Likewise, problems arise in asynchronous communication when communicating threads are in an erroneous state, or when the mailbox mechanisms supporting asynchronous communication malfunction. Clearly, fault-tolerant communication mechanisms are key factors in parallel system dependability and can unlock a system’s full potential.

The approach to more dependable systems involves taking fault tolerance (FT) measures at two levels:
· The operating system level
· The application level.

Fortunately, there is a middle way. Developer-proposed solutions often stem from common requirements. These requirements can be categorized and addressed in a framework that lies between the application and the operating system. An application developer can then select the desired FT level and tailor the FT mechanisms to the application, thereby reducing development effort and shortening the time to market.
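As a rough illustration of what "selecting an FT level" could look like, the sketch below wraps a send operation in a thin layer that sits between the application and the underlying transport. It is only an assumption about how such a framework might be shaped: the three levels, the `ft_send` wrapper, the `CommError` exception and the `make_flaky` test helper are all hypothetical names, not part of any real product.

```python
from enum import Enum


class FTLevel(Enum):
    NONE = 0    # failures propagate straight to the application
    DETECT = 1  # failures are caught and reported, no recovery
    RETRY = 2   # failed sends are retried a bounded number of times


class CommError(Exception):
    """Hypothetical error raised by the underlying transport."""


def ft_send(send, message, level, max_attempts=3):
    """Call send(message) under the chosen FT level.

    Returns (delivered, attempts). With FTLevel.NONE the transport
    error is re-raised instead of being reported.
    """
    limit = max_attempts if level is FTLevel.RETRY else 1
    attempts = 0
    while attempts < limit:
        attempts += 1
        try:
            send(message)
            return True, attempts
        except CommError:
            if level is FTLevel.NONE:
                raise
    return False, attempts


def make_flaky(failures):
    # Test helper: a transport that fails `failures` times, then works.
    state = {"left": failures}
    delivered = []

    def send(message):
        if state["left"] > 0:
            state["left"] -= 1
            raise CommError("link down")
        delivered.append(message)

    return send, delivered
```

The application code stays the same across levels; only the `level` argument changes. That is the appeal of a middle layer: the recovery policy is chosen per application without rewriting either the application logic or the operating system.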

