Adaptive end-to-end dependability for generic applications in a network environment
In this dissertation, a provably effective and affordable dependability and security management solution is constructed for applications running in a networked environment, and its performance is studied using analytical models and simulation tools. The solution is built around a component-based approach that integrates existing infrastructure such as high-speed networks, spare capacity in cluster servers, runtime program security checks, and mature solutions in checkpointing, voting, and recovery. This approach leads to the synthesis of a new solution from known good components, a paradigm often referred to as COTS (Commercial Off-The-Shelf), in which dependability and security are harnessed through software-implemented fault tolerance. A new dependability paradigm is developed that provides adaptive fault tolerance ranging between that of a simplex system and that of a processor-pair system. The adaptability is achieved using a tunable replication mechanism that replicates a subset of the tasks and checks them for faults. The approach consists of a main processor and an auxiliary processor available for job execution, and a comparator module that compares the results of the two processors. As applications execute on the main processor, the state of the running application is saved at certain execution points, determined by compile-time and runtime specifications. Saving the state at such a point is termed taking a snapshot. The code between two snapshots is transferred to the auxiliary processor and executed there. At the end of every snapshot interval, the states at the main processor and the auxiliary processor are compared for validity. As the code executes on the auxiliary processor, security checks are also performed on it. Because replication is selective rather than complete, a lower-end processor can serve as the auxiliary processor instead of a full replica.
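The snapshot-and-compare cycle described above can be illustrated with a minimal Python sketch. It is not the dissertation's implementation: the names `run_selectively_replicated`, `plain_executor`, `faulty_main`, and the `should_replicate` policy are hypothetical, program state is modeled as a plain dictionary, and both "processors" are ordinary function calls. A fault is injected into the main stream during the second interval so that the comparator has something to detect.

```python
from typing import Callable, Dict, List, Tuple

State = Dict[str, int]
Segment = Callable[[State], State]            # code between two snapshots
Executor = Callable[[int, Segment, State], State]


def plain_executor(index: int, segment: Segment, snapshot: State) -> State:
    """Run a segment on a private copy of the snapshot (fault-free)."""
    return segment(dict(snapshot))


def run_selectively_replicated(
    segments: List[Segment],
    initial_state: State,
    should_replicate: Callable[[int], bool],
    main_exec: Executor = plain_executor,
    aux_exec: Executor = plain_executor,
) -> Tuple[State, List[int]]:
    """Execute all segments on the main stream; re-execute the selected
    ones on the auxiliary stream and compare the resulting states."""
    state = dict(initial_state)
    mismatched: List[int] = []
    for index, segment in enumerate(segments):
        snapshot = dict(state)                # snapshot at the interval start
        state = main_exec(index, segment, snapshot)
        if should_replicate(index):           # tunable replication policy
            aux_state = aux_exec(index, segment, snapshot)
            if aux_state != state:            # comparator module
                mismatched.append(index)
    return state, mismatched


def add_one(state: State) -> State:
    state["x"] += 1
    return state


def double(state: State) -> State:
    state["x"] *= 2
    return state


def faulty_main(index: int, segment: Segment, snapshot: State) -> State:
    """Main stream with an injected single-bit error in interval 1."""
    result = segment(dict(snapshot))
    if index == 1:
        result["x"] ^= 1
    return result


segments = [add_one, double, add_one]

# Full replication: every interval is checked, so the fault is detected.
final_full, mismatches_full = run_selectively_replicated(
    segments, {"x": 1}, lambda i: True, main_exec=faulty_main)

# Selective replication: only even intervals are checked, so it is missed.
final_partial, mismatches_partial = run_selectively_replicated(
    segments, {"x": 1}, lambda i: i % 2 == 0, main_exec=faulty_main)
```

In this toy run, full replication flags interval 1, while the policy that checks only even-numbered intervals misses the fault. This is the trade-off between cost and coverage that a tunable replication mechanism exposes.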
The scheme provides end-to-end dependability by taking snapshots, running them redundantly, comparing the results, and checking for security violations. The auxiliary processor can be a homogeneous system similar to the main processor (as in the processor-pair approach) or a heterogeneous system such as an instruction co-processor or a distributed system. Using a heterogeneous processor for the auxiliary stream increases the security of applications because the code layout and instruction set differ, which makes it difficult for the same malicious code to compromise two different code bases. A prototype implementation and analytical models are developed to study the performability and reliability improvements. Generalized Stochastic Petri Nets (GSPNs) are used to model the system components because of their ability to capture system dynamics succinctly. When the auxiliary processor is heterogeneous with respect to the main processor, the snapshot state and the code between two snapshots have to be converted from one architectural specification to another. Binary translation is used to convert the code and the system state. The frequently executed code sections are identified using a knapsack-based scheme, and only those sections are converted to native code for the auxiliary system. The remaining code sections are interpreted on the auxiliary system. During translation, the code is optimized using a register allocation scheme that reduces the number of register spills. Additionally, during translation, security checks are inserted into the code to detect flaws such as bounds overflow, return address mismatch, and malicious code execution. Although the proposed technique offers somewhat reduced fault tolerance compared to complete replication, it provides significant assurance at a nominal cost for a single stream of computation. Valgrind and Sharpe are used for prototype development and for running the analytical models.
The SpecCPU_2000 integer benchmarks are used to obtain the performance results.
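The abstract does not spell out the knapsack formulation used to pick code sections for native translation, so the following is only one plausible reading, written as a minimal Python sketch. It assumes each code section has a translation cost and a benefit (here, an execution count), that a fixed translation budget applies, and that selection is a standard 0/1 knapsack solved by dynamic programming. The function `select_sections`, the section names, and all numbers are hypothetical.

```python
from typing import List, Tuple

Section = Tuple[str, int, int]   # (name, translation cost, execution count)


def select_sections(sections: List[Section], budget: int) -> Tuple[List[str], int]:
    """0/1 knapsack: choose sections to translate natively so that the
    total execution count is maximal and total cost stays within budget."""
    n = len(sections)
    # best[i][b] = best benefit using the first i sections with budget b
    best = [[0] * (budget + 1) for _ in range(n + 1)]
    for i, (_, cost, benefit) in enumerate(sections, start=1):
        for b in range(budget + 1):
            best[i][b] = best[i - 1][b]                 # skip section i
            if cost <= b:
                candidate = best[i - 1][b - cost] + benefit
                if candidate > best[i][b]:
                    best[i][b] = candidate              # take section i

    # Walk the table backwards to recover which sections were taken.
    chosen: List[str] = []
    b = budget
    for i in range(n, 0, -1):
        if best[i][b] != best[i - 1][b]:
            name, cost, _ = sections[i - 1]
            chosen.append(name)
            b -= cost
    chosen.reverse()
    return chosen, best[n][budget]


# Hypothetical profile data: two hot loops and two rarely executed sections.
profile = [
    ("loop_a", 4, 900),
    ("init", 3, 10),
    ("loop_b", 5, 700),
    ("error_path", 2, 5),
]
translated, covered = select_sections(profile, budget=10)
interpreted = [name for name, _, _ in profile if name not in translated]
```

Under these assumed numbers, the two hot loops are selected for translation and the remaining sections are left to be interpreted, mirroring the split between native and interpreted code described in the abstract.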