Savulimedu Veeravalli, Varadan. Diagnosis and error correction for a fault-tolerant arithmetic and logic unit for medical microprocessors. Retrieved from https://doi.org/doi:10.7282/T3TQ61T7
Description: We present a fault-tolerant Arithmetic and Logic Unit (ALU) for medical systems. Real-time medical systems have stringent fault-tolerance requirements because faulty hardware could jeopardize human life. In such systems, checkers are employed so that incorrect data never leaves the faulty module and recovery time from faults is minimal. We investigated information, hardware, and time redundancy. After analyzing the hardware, delay, and power overheads, we chose time redundancy as the fault-tolerance method for the ALU.
The original contribution of this thesis is single stuck-fault error correction in an ALU using recomputing with swapped operands (RESWO). Here, we divide the 32-bit data path into 3 equally-sized segments of 11 bits each, and then we swap the bit positions of the data in chunks of 11 bits. This requires multiplexing hardware to ensure that carries propagate correctly. We operate the ALU twice for each data path operation: once normally and once swapped. If the two results disagree, then either a bit position or a carry propagation circuit is broken, and we diagnose the ALU using diagnosis vectors. First, we test the bit slices without generating carries; this requires three or four patterns to exercise each bit slice for stuck-at-0 and stuck-at-1 faults. We then test the carry chain for stuck-at faults and diagnose their location; this requires two patterns, one to propagate a rising transition down the carry chain and another to propagate a falling transition. Knowing the faulty bit slice and the fault in the carry path makes error correction possible by reconfiguring MUXes. It may be necessary to swap a third time and recompute to obtain enough data for full error correction.
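The recompute-and-compare step described above can be illustrated with a small software model. This is only a sketch under stated assumptions, not the thesis hardware: the swap is modeled as a rotation by one whole chunk, the word is modeled as 3 x 11 = 33 bits so that the three chunks are equal in size, the operation is a carry-free XOR (so the carry multiplexing is not modeled), and the names `rotate_chunks`, `faulty_xor`, and `reswo_check` are hypothetical.

```python
CHUNK_BITS = 11                    # chunk size quoted in the abstract
NUM_CHUNKS = 3                     # number of chunks quoted in the abstract
WIDTH = CHUNK_BITS * NUM_CHUNKS    # 33-bit toy word (assumption of this sketch)
MASK = (1 << WIDTH) - 1


def rotate_chunks(word, k=1):
    """Rotate the word left by k whole chunks (assumed swap scheme)."""
    shift = (k % NUM_CHUNKS) * CHUNK_BITS
    if shift == 0:
        return word & MASK
    return ((word << shift) | (word >> (WIDTH - shift))) & MASK


def faulty_xor(a, b, stuck_bit=None, stuck_value=0):
    """Bitwise XOR whose output bit slice `stuck_bit` is stuck at `stuck_value`."""
    result = (a ^ b) & MASK
    if stuck_bit is not None:
        if stuck_value:
            result |= 1 << stuck_bit       # stuck-at-1 on that slice
        else:
            result &= ~(1 << stuck_bit)    # stuck-at-0 on that slice
    return result


def reswo_check(a, b, stuck_bit=None, stuck_value=0):
    """Run the operation normally and with swapped chunks, then compare."""
    normal = faulty_xor(a, b, stuck_bit, stuck_value)
    swapped = faulty_xor(rotate_chunks(a), rotate_chunks(b),
                         stuck_bit, stuck_value)
    # Rotating by the remaining chunks restores the original bit order.
    recomputed = rotate_chunks(swapped, NUM_CHUNKS - 1)
    return normal, recomputed, normal != recomputed


if __name__ == "__main__":
    a, b = MASK, 0                          # every output bit should be 1
    normal, recomputed, mismatch = reswo_check(a, b, stuck_bit=5)
    print(mismatch, bin(normal ^ recomputed))
```

In this model the same physical slice corrupts a different logical bit in each run, so the two results differ in exactly the positions that passed through the faulty slice. The comparison alone only detects the fault; locating it is the job of the diagnosis vectors described in the text.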
For the 64-bit ALU, the hardware overhead of the RESWO approach combined with the reconfiguration mechanism (one spare chunk for every sixteen chunks) is 78%. The delay overhead of the 64-bit ALU with our fault-tolerance mechanism is 110.96%.