
Showing papers in "IEEE Transactions on Dependable and Secure Computing in 2005"


Journal ArticleDOI
TL;DR: Remote physical device fingerprinting is introduced, or fingerprinting a physical device, as opposed to an operating system or class of devices, remotely, and without the fingerprinted device's known cooperation, by exploiting small, microscopic deviations in device hardware: clock skews.
Abstract: We introduce the area of remote physical device fingerprinting, or fingerprinting a physical device, as opposed to an operating system or class of devices, remotely, and without the fingerprinted device's known cooperation. We accomplish this goal by exploiting small, microscopic deviations in device hardware: clock skews. Our techniques do not require any modification to the fingerprinted devices. Our techniques report consistent measurements when the measurer is thousands of miles, multiple hops, and tens of milliseconds away from the fingerprinted device and when the fingerprinted device is connected to the Internet from different locations and via different access technologies. Further, one can apply our passive and semipassive techniques when the fingerprinted device is behind a NAT or firewall, and also when the device's system time is maintained via NTP or SNTP. One can use our techniques to obtain information about whether two devices on the Internet, possibly shifted in time or IP addresses, are actually the same physical device. Example applications include: computer forensics; tracking, with some probability, a physical device as it connects to the Internet from different public access points; counting the number of devices behind a NAT even when the devices use constant or random IP IDs; remotely probing a block of addresses to determine if the addresses correspond to virtual hosts, e.g., as part of a virtual honeynet; and unanonymizing anonymized network traces.
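
The clock-skew measurement the abstract describes can be illustrated with a minimal sketch: given pairs of (local receive time, remote timestamp value), the skew appears as the slope of the remote clock's offset over time. The paper fits a linear lower bound via linear programming; the sketch below uses synthetic data and an ordinary least-squares fit as a simplified stand-in.

```python
# Minimal sketch of estimating a remote device's clock skew from timestamp
# samples. Synthetic data; the paper fits a linear *lower bound* (via linear
# programming), whereas this sketch uses ordinary least squares for brevity.
import random

def estimate_skew_ppm(samples):
    """samples: list of (local_receive_time_s, remote_timestamp_s) pairs.
    Returns the estimated skew of the remote clock in parts per million."""
    t0, ts0 = samples[0]
    xs = [t - t0 for t, _ in samples]                    # elapsed local time
    ys = [(ts - ts0) - (t - t0) for t, ts in samples]    # remote-minus-local offset
    n = len(samples)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
            sum((x - mx) ** 2 for x in xs)
    return slope * 1e6

if __name__ == "__main__":
    true_skew = 38e-6                                    # 38 ppm quartz drift
    samples = []
    for i in range(200):
        t = i * 10.0                                     # one observation every 10 s
        jitter = random.uniform(0.0, 0.005)              # one-sided queueing delay
        samples.append((t + jitter, t * (1 + true_skew)))
    print("estimated skew: %.1f ppm" % estimate_skew_ppm(samples))
```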

770 citations


Journal ArticleDOI
TL;DR: The most relevant concepts underlying the notion of database security are surveyed and the most well-known techniques are summarized, and access control systems are described, namely, the discretionary and mandatory access control models, and the role-based access control (RBAC) model.
Abstract: As organizations increase their reliance on, possibly distributed, information systems for daily business, they become more vulnerable to security breaches even as they gain productivity and efficiency advantages. Though a number of techniques, such as encryption and electronic signatures, are currently available to protect data when transmitted across sites, a truly comprehensive approach for data protection must also include mechanisms for enforcing access control policies based on data contents, subject qualifications and characteristics, and other relevant contextual information, such as time. It is well understood today that the semantics of data must be taken into account in order to specify effective access control policies. Also, techniques for data integrity and availability specifically tailored to database systems must be adopted. In this respect, over the years, the database security community has developed a number of different techniques and approaches to assure data confidentiality, integrity, and availability. However, despite such advances, the database security area faces several new challenges. Factors such as the evolution of security concerns, the "disintermediation" of access to data, new computing paradigms and applications, such as grid-based computing and on-demand business, have introduced both new security requirements and new contexts in which to apply and possibly extend current approaches. In this paper, we first survey the most relevant concepts underlying the notion of database security and summarize the most well-known techniques. We focus on access control systems, to which a large body of research has been devoted, and describe the key access control models, namely, the discretionary and mandatory access control models, and the role-based access control (RBAC) model. We also discuss security for advanced data management systems, and cover topics such as access control for XML. We then discuss current challenges for database security and some preliminary approaches that address some of these challenges.

434 citations


Journal ArticleDOI
TL;DR: This paper describes how to include faults attributed to software aging in the framework of Gray's software fault classification (deterministic and transient), and builds a semi-Markov reward model based on workload and resource usage data collected from the UNIX operating system.
Abstract: Recently, the phenomenon of software aging, one in which the state of the software system degrades with time, has been reported. This phenomenon, which may eventually lead to system performance degradation and/or crash/hang failure, is the result of exhaustion of operating system resources, data corruption, and numerical error accumulation. To counteract software aging, a technique called software rejuvenation has been proposed, which essentially involves occasionally terminating an application or a system, cleaning its internal state and/or its environment, and restarting it. Since rejuvenation incurs an overhead, an important research issue is to determine optimal times to initiate this action. In this paper, we first describe how to include faults attributed to software aging in the framework of Gray's software fault classification (deterministic and transient), and study the treatment and recovery strategies for each of the fault classes. We then construct a semi-Markov reward model based on workload and resource usage data collected from the UNIX operating system. We identify different workload states using statistical cluster analysis and estimate transition probabilities and sojourn time distributions from the data. Corresponding to each resource, a reward function is then defined for the model based on the rate of resource depletion in each state. The model is then solved to obtain estimated times to exhaustion for each resource. The results from the semi-Markov reward model are then fed into a higher-level availability model that accounts for failure followed by reactive recovery, as well as proactive recovery. This comprehensive model is then used to derive optimal rejuvenation schedules that maximize availability or minimize downtime cost.
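
The core calculation behind the rejuvenation schedule, estimating a time to resource exhaustion from workload states, sojourn times, and per-state depletion rates, can be sketched numerically. The states, rates, and transition probabilities below are made up for illustration; the paper estimates them from measured UNIX workload data and feeds the result into a higher-level availability model.

```python
# Toy semi-Markov reward calculation: expected time until a resource is
# exhausted. All numbers are illustrative placeholders, not measured data.
import numpy as np

P = np.array([[0.0, 0.7, 0.3],            # embedded-chain transition probabilities
              [0.4, 0.0, 0.6],
              [0.5, 0.5, 0.0]])
sojourn = np.array([120.0, 300.0, 60.0])  # mean sojourn time in each state (s)
deplete = np.array([0.5, 2.0, 8.0])       # reward: memory depletion rate (MB/s)

# Stationary distribution of the embedded chain: pi P = pi, sum(pi) = 1.
A = np.vstack([P.T - np.eye(3), np.ones(3)])
b = np.array([0.0, 0.0, 0.0, 1.0])
pi, *_ = np.linalg.lstsq(A, b, rcond=None)

# Time-weighted state probabilities and the long-run depletion rate.
p = pi * sojourn / np.dot(pi, sojourn)
rate = np.dot(p, deplete)

free_memory_mb = 4096.0
print("expected time to exhaustion: %.0f s" % (free_memory_mb / rate))
```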

257 citations


Journal ArticleDOI
TL;DR: D-WARD is proposed, a source-end DDoS defense system that achieves autonomous attack detection and surgically accurate response, thanks to its novel traffic profiling techniques, the adaptive response and the source-end deployment.
Abstract: Defenses against flooding distributed denial-of-service (DDoS) commonly respond to the attack by dropping the excess traffic, thus reducing the overload at the victim. The major challenge is the differentiation of the legitimate from the attack traffic, so that the dropping policies can be selectively applied. We propose D-WARD, a source-end DDoS defense system that achieves autonomous attack detection and surgically accurate response, thanks to its novel traffic profiling techniques, the adaptive response and the source-end deployment. Moderate traffic volumes seen near the sources, even during the attacks, enable extensive statistics gathering and profiling, facilitating high response selectiveness. D-WARD inflicts extremely low collateral damage on the legitimate traffic, while quickly detecting and severely rate-limiting outgoing attacks. D-WARD has been extensively evaluated in a controlled testbed environment and in real network operation. Results of selected tests are presented in the paper.
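
The source-end idea, observe your own outgoing flows, compare them to a model of normal two-way traffic, and rate-limit flows that look one-way, can be sketched as follows. The reply-ratio heuristic and all thresholds here are placeholders, not D-WARD's actual traffic models or response policy.

```python
# Generic illustration of source-end, profile-driven rate limiting in the
# spirit of D-WARD; the reply-ratio rule and thresholds are placeholders,
# not the paper's classification rules.
from collections import defaultdict

class SourceEndLimiter:
    def __init__(self, min_reply_ratio=0.2, attack_rate_pps=10):
        self.min_reply_ratio = min_reply_ratio   # below this looks one-way / flooding
        self.attack_rate_pps = attack_rate_pps   # severe rate limit for flagged flows
        self.sent = defaultdict(int)
        self.replies = defaultdict(int)

    def record(self, dst, outgoing, reply):
        if outgoing:
            self.sent[dst] += 1
        if reply:
            self.replies[dst] += 1

    def allowed_rate(self, dst, normal_rate_pps=1000):
        """Return the packets-per-second budget for traffic toward dst."""
        sent = self.sent[dst]
        if sent < 100:                           # not enough evidence yet
            return normal_rate_pps
        ratio = self.replies[dst] / sent
        return normal_rate_pps if ratio >= self.min_reply_ratio \
               else self.attack_rate_pps

if __name__ == "__main__":
    lim = SourceEndLimiter()
    for _ in range(1000):                        # mostly unanswered outgoing packets
        lim.record("203.0.113.9", outgoing=True, reply=False)
    print(lim.allowed_rate("203.0.113.9"))       # flagged flow -> 10 pps
```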

211 citations


Journal ArticleDOI
TL;DR: An overview of key-distribution methods in sensor networks and their salient features are presented to provide context for understanding key and node revocation and define basic properties that distributed sensor-node revocation protocols must satisfy.
Abstract: Key management has two important aspects: key distribution, which describes how to disseminate secret information to the principals so that secure communications can be initiated, and key revocation, which describes how to remove secrets that may have been compromised. Key management in sensor networks faces constraints of large scale, lack of a priori information about deployment topology, and limitations of sensor node hardware. While key distribution has been studied extensively in recent works, the problem of key and node revocation in sensor networks has received relatively little attention. Yet, revocation protocols that function correctly in the presence of active adversaries pretending to be legitimate protocol participants via compromised sensor nodes are essential. In their absence, an adversary could take control of the sensor network's operation by using compromised nodes which retain their network connectivity for extended periods of time. In this paper, we present an overview of key-distribution methods in sensor networks and their salient features to provide context for understanding key and node revocation. Then, we define basic properties that distributed sensor-node revocation protocols must satisfy and present a protocol for distributed node revocation that satisfies these properties under general assumptions and a standard attacker model.
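
A common building block for distributed revocation is threshold voting: a node is revoked once enough distinct neighbors accuse it. The sketch below shows only that generic tally; the paper's protocol additionally binds votes to predistributed keys and is designed to withstand adversarial voters.

```python
# Generic threshold-voting sketch for distributed node revocation; the
# paper's protocol adds key binding and defenses against malicious voters.
class RevocationTally:
    def __init__(self, threshold):
        self.threshold = threshold
        self.votes = {}                   # target_id -> set of distinct voter ids

    def cast_vote(self, voter_id, target_id):
        self.votes.setdefault(target_id, set()).add(voter_id)

    def is_revoked(self, target_id):
        return len(self.votes.get(target_id, set())) >= self.threshold

if __name__ == "__main__":
    tally = RevocationTally(threshold=5)
    for voter in ["n1", "n2", "n3", "n4", "n5"]:
        tally.cast_vote(voter, "n9")      # five distinct neighbors accuse n9
    print(tally.is_revoked("n9"))         # True
```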

207 citations


Journal ArticleDOI
TL;DR: These models demonstrate that fingerprints embedded by the watermarking scheme are detectable and robust against a wide variety of attacks including collusion attacks.
Abstract: In this paper, we present a technique for fingerprinting relational data by extending Agrawal et al.'s watermarking scheme. The primary new capability provided by our scheme is that, under reasonable assumptions, it can embed and detect arbitrary bit-string marks in relations. This capability, which is not provided by prior techniques, permits our scheme to be used as a fingerprinting scheme. We then present quantitative models of the robustness properties of our scheme. These models demonstrate that fingerprints embedded by our scheme are detectable and robust against a wide variety of attacks including collusion attacks.
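
A rough sketch of the embedding idea, in the spirit of Agrawal et al.'s scheme extended to arbitrary bit strings: a keyed hash of each tuple's primary key decides whether the tuple is marked, which fingerprint bit it carries, and which low-order bit of a numeric attribute is overwritten; detection recomputes the hashes and takes a majority vote per fingerprint bit. The key, marking fraction, and bit positions below are illustrative, not the paper's parameters.

```python
# Sketch of embedding an arbitrary bit-string fingerprint into numeric
# attributes of a relation. Parameters are illustrative placeholders.
import hmac, hashlib

KEY = b"secret-owner-key"
FRACTION = 4          # roughly 1 out of every FRACTION tuples is marked
LSB = 0               # which low-order bit of the attribute gets overwritten

def _h(key, *parts):
    msg = b"|".join(str(p).encode() for p in parts)
    return int.from_bytes(hmac.new(key, msg, hashlib.sha256).digest()[:8], "big")

def embed(rows, fingerprint_bits):
    """rows: list of (primary_key, value) pairs; value is an int."""
    marked = []
    for pk, val in rows:
        if _h(KEY, pk) % FRACTION == 0:                   # tuple selected for marking
            bit = fingerprint_bits[_h(KEY, pk, "pos") % len(fingerprint_bits)]
            val = (val & ~(1 << LSB)) | (bit << LSB)      # overwrite one low-order bit
        marked.append((pk, val))
    return marked

def detect(rows, fingerprint_length):
    votes = [[0, 0] for _ in range(fingerprint_length)]
    for pk, val in rows:
        if _h(KEY, pk) % FRACTION == 0:
            pos = _h(KEY, pk, "pos") % fingerprint_length
            votes[pos][(val >> LSB) & 1] += 1
    return [int(v1 > v0) for v0, v1 in votes]             # majority vote per bit

if __name__ == "__main__":
    rows = [(pk, 1000 + pk) for pk in range(5000)]
    fp = [1, 0, 1, 1, 0, 0, 1, 0]
    print(detect(embed(rows, fp), len(fp)) == fp)         # True
```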

168 citations


Journal ArticleDOI
Jian Yuan, K. Mills
TL;DR: This paper proposes a method for early attack detection that can monitor the macroscopic effect of DDoS flooding attacks and shows that such monitoring enables DDoS attack detection without any traffic observation in the victim network.
Abstract: Creating defenses against flooding-based, distributed denial-of-service (DDoS) attacks requires real-time monitoring of network-wide traffic to obtain timely and significant information. Unfortunately, continuously monitoring network-wide traffic for suspicious activities presents difficult challenges because attacks may arise anywhere at any time and because attackers constantly modify attack dynamics to evade detection. In this paper, we propose a method for early attack detection. Using only a few observation points, our proposed method can monitor the macroscopic effect of DDoS flooding attacks. We show that such macroscopic-level monitoring might be used to capture shifts in spatial-temporal traffic patterns caused by various DDoS attacks and then to inform more detailed detection systems about where and when a DDoS attack possibly arises in transit or source networks. We also show that such monitoring enables DDoS attack detection without any traffic observation in the victim network.
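
The paper correlates spatial-temporal traffic patterns across a few observation points; as a much simpler stand-in, the sketch below only flags a large deviation of the aggregated rate from an exponentially weighted baseline. The thresholds and warm-up length are arbitrary.

```python
# Generic change-detection sketch for aggregate traffic seen at a few
# observation points; not the paper's spatial-temporal correlation method.
class MacroscopicMonitor:
    def __init__(self, alpha=0.1, k=4.0, warmup=5):
        self.alpha, self.k, self.warmup = alpha, k, warmup
        self.n, self.mean, self.var = 0, 0.0, 0.0

    def observe(self, total_pkts_per_s):
        """Feed the summed rate over all observation points; returns True
        when the sample deviates k sigma above the running baseline."""
        self.n += 1
        if self.n == 1:
            self.mean = total_pkts_per_s
            return False
        dev = total_pkts_per_s - self.mean
        sigma = self.var ** 0.5
        alarm = self.n > self.warmup and dev > self.k * max(sigma, 0.01 * self.mean)
        if not alarm:                    # keep attack traffic out of the baseline
            self.mean += self.alpha * dev
            self.var = (1 - self.alpha) * (self.var + self.alpha * dev * dev)
        return alarm

if __name__ == "__main__":
    mon = MacroscopicMonitor()
    for rate in [1000, 1010, 990, 1005, 995, 5000]:   # last sample: flood begins
        print(mon.observe(rate))
```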

106 citations


Journal ArticleDOI
TL;DR: A novel fault-tolerant clock synchronization scheme is proposed for clusters of nodes in sensor networks, where the nodes in each cluster can communicate through broadcast, and it guarantees an upper bound on the clock difference between any two nonfaulty nodes in a cluster.
Abstract: Wireless sensor networks have received a lot of attention recently due to their wide applications, such as target tracking, environment monitoring, and scientific exploration in dangerous environments. It is usually necessary to have a cluster of sensor nodes share a common view of a local clock time, so that all these nodes can coordinate in some important applications, such as time slotted MAC protocols, power-saving protocols with sleep/listen modes, etc. However, all the clock synchronization techniques proposed for sensor networks assume benign environments; they cannot survive malicious attacks in hostile environments. Fault-tolerant clock synchronization techniques are potential candidates to address this problem. However, existing approaches are all resource consuming and suffer from message collisions in most cases. This paper presents a novel fault-tolerant clock synchronization scheme for clusters of nodes in sensor networks, where the nodes in each cluster can communicate through broadcast. The proposed scheme guarantees an upper bound on the clock difference between any two nonfaulty nodes in a cluster, provided that the malicious nodes are no more than one third of the cluster. Unlike the traditional fault-tolerant clock synchronization approaches, the proposed technique does not introduce collisions between synchronization messages, nor does it require costly digital signatures.
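
The classic way to see why roughly one third faulty nodes is the limit is fault-tolerant averaging: sort the readings reported by all cluster nodes, discard the f largest and f smallest, and average the rest. This is a generic illustration, not the paper's broadcast-based, signature-free algorithm.

```python
# Classic fault-tolerant averaging sketch: discard the f largest and f
# smallest reported clock values and average the remainder (needs n > 3f).
def fault_tolerant_average(readings, f):
    """readings: clock values reported by all n cluster nodes (n > 3f)."""
    if len(readings) <= 3 * f:
        raise ValueError("need n > 3f readings to tolerate f faulty nodes")
    trimmed = sorted(readings)[f:len(readings) - f]
    return sum(trimmed) / len(trimmed)

if __name__ == "__main__":
    # Seven nodes, at most two faulty; the faulty ones report wild values.
    readings = [100.01, 100.02, 99.99, 100.00, 100.03, 5000.0, -300.0]
    print(fault_tolerant_average(readings, f=2))   # close to 100.01
```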

84 citations


Journal ArticleDOI
TL;DR: A set of certificate management protocols that allow trading protocol overhead for certificate freshness or the other way around, and a combination of threshold and identity-based cryptosystems to guarantee the security, availability, and scalability of the certification function.
Abstract: Securing ad hoc networks is notoriously challenging, notably due to the lack of an online infrastructure. In particular, key management is a problem that has been addressed by many researchers but with limited results. In this paper, we consider the case where an ad hoc network is under the responsibility of a mother certification authority (mCA). Since the nodes can frequently be collectively isolated from the mCA (e.g., for a remote mission) but still need access to a certification authority, the mCA preassigns a special role to several nodes (called servers) that constitute a distributed certification authority (dCA) during the isolated period. We propose a solution, called DICTATE (DIstributed CerTification Authority with probabilisTic frEshness), to manage the dCA. This solution ensures that the dCA always processes a certificate update (or query) request in a finite amount of time and that an adversary cannot forge a certificate. Moreover, it guarantees that the dCA responds to a query request with the most recent version of the queried certificate with a certain probability; this probability can be made arbitrarily close to 1, but at the expense of higher overhead. Our contribution is twofold: 1) a set of certificate management protocols that allow trading protocol overhead for certificate freshness or the other way around, and 2) a combination of threshold and identity-based cryptosystems to guarantee the security, availability, and scalability of the certification function. We describe DICTATE in detail and, by security analysis and simulations, we show that it is robust against various attacks.

80 citations


Journal ArticleDOI
TL;DR: It is shown that the rich functionality of most modern general-purpose processors facilitates an automated, generic attack which defeats self-hashing, suggesting that self-hashing is not a viable strategy for high-security tamper resistance on modern computer systems.
Abstract: Self-hashing has been proposed as a technique for verifying software integrity. Appealing aspects of this approach to software tamper resistance include the promise of being able to verify the integrity of software independent of the external support environment, as well as the ability to integrate code protection mechanisms automatically. In this paper, we show that the rich functionality of most modern general-purpose processors (including UltraSparc, x86, PowerPC, AMD64, Alpha, and ARM) facilitates an automated, generic attack which defeats such self-hashing. We present a general description of the attack strategy and multiple attack implementations that exploit different processor features. Each of these implementations is generic in that it can defeat self-hashing employed by any user-space program on a single platform. Together, these implementations defeat self-hashing on most modern general-purpose processors. The generality and efficiency of our attack suggest that self-hashing is not a viable strategy for high-security tamper resistance on modern computer systems.
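
For context, "self-hashing" simply means a program hashing its own code and comparing the digest against an expected value, as in the toy check below. The attack the paper describes operates beneath this level, arranging for data reads of the code to see a clean copy while the processor executes tampered instructions; that part cannot be reproduced in a few lines of portable code.

```python
# Toy illustration of a self-hashing integrity check. EXPECTED_DIGEST would
# normally be baked in at build time; it is left as None here so the script
# runs standalone.
import hashlib, sys

EXPECTED_DIGEST = None

def self_hash():
    with open(sys.argv[0], "rb") as f:       # read this script's own bytes
        return hashlib.sha256(f.read()).hexdigest()

def integrity_ok():
    digest = self_hash()
    return EXPECTED_DIGEST is None or digest == EXPECTED_DIGEST

if __name__ == "__main__":
    print("own code digest:", self_hash())
    print("integrity check passed:", integrity_ok())
```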

79 citations


Journal ArticleDOI
TL;DR: This work proposes several integrated security architectures for distributed client-server group communication systems and discusses performance and accompanying trust issues of each proposed architecture and presents experimental results that demonstrate the superior scalability of an integrated architecture.
Abstract: Group communication systems are high-availability distributed systems providing reliable and ordered message delivery, as well as a membership service, to group-oriented applications. Many such systems are built using a distributed client-server architecture where a relatively small set of servers provide service to numerous clients. In this work, we show how group communication systems can be enhanced with security services without sacrificing robustness and performance. More specifically, we propose several integrated security architectures for distributed client-server group communication systems. In an integrated architecture, security services are implemented in servers, in contrast to a layered architecture, where the same services are implemented in clients. We discuss performance and accompanying trust issues of each proposed architecture and present experimental results that demonstrate the superior scalability of an integrated architecture.

Journal ArticleDOI
TL;DR: An efficient method based on the ordered binary decision diagram (OBDD) is presented for evaluating multistate system reliability and Griffith's importance measures, which can be regarded as the importance of a system-component state of a multistate system subject to imperfect fault-coverage with various performance requirements.
Abstract: Algorithms for evaluating the reliability of a complex system such as a multistate fault-tolerant computer system have become more important. They are designed to obtain the complete results quickly and accurately even when there exist a number of dependencies such as shared loads (reconfiguration), degradation, and common-cause failures. This paper presents an efficient method based on the ordered binary decision diagram (OBDD) for evaluating the multistate system reliability and Griffith's importance measures, which can be regarded as the importance of a system-component state of a multistate system subject to imperfect fault-coverage with various performance requirements. This method combined with the conditional probability methods can handle the dependencies among the combinatorial performance requirements of system modules and find solutions for the multistate imperfect coverage model. The main advantage of the method is that its time complexity is equivalent to that of the methods for the perfect coverage model and it is very helpful for the optimal design of a multistate fault-tolerant system.
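
The evaluation style the paper builds on can be illustrated with a Shannon decomposition that caches shared subproblems, here for a plain 2-out-of-3 structure function rather than a multistate imperfect-coverage model; the component probabilities are made up.

```python
# BDD-flavored reliability evaluation by Shannon decomposition with shared
# subproblems. Toy 2-out-of-3 system; probabilities are illustrative only.
from functools import lru_cache

p_up = [0.9, 0.8, 0.95]       # probability that each component is working

@lru_cache(maxsize=None)
def prob(index=0, working=0):
    """Shannon expansion on component `index`; `working` counts components
    already known to be up (sufficient state for a k-out-of-n structure)."""
    if index == len(p_up):
        return 1.0 if working >= 2 else 0.0          # 2-out-of-3 structure function
    return p_up[index] * prob(index + 1, working + 1) + \
           (1 - p_up[index]) * prob(index + 1, working)

if __name__ == "__main__":
    print("system reliability: %.4f" % prob())        # 0.9670 for these numbers
```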

Journal ArticleDOI
TL;DR: In this article, two Byzantine asynchronous consensus protocols using two types of oracles, namely, a common coin that provides processes with random values and a failure detector oracle, are presented.
Abstract: This paper is on the consensus problem in asynchronous distributed systems where (up to f) processes (among n) can exhibit a Byzantine behavior, i.e., can deviate arbitrarily from their specification. One way to solve the consensus problem in such a context consists of enriching the system with additional oracles that are powerful enough to cope with the uncertainty and unpredictability created by the combined effect of Byzantine behavior and asynchrony. This paper presents two kinds of Byzantine asynchronous consensus protocols using two types of oracles, namely, a common coin that provides processes with random values and a failure detector oracle. Both allow the processes to decide in one communication step in favorable circumstances. The first is a randomized protocol for an oblivious scheduler model that assumes n > 6f. The second one is a failure detector-based protocol that assumes n > 5f. These protocols are designed to be particularly simple and efficient in terms of communication steps, the number of messages they generate in each step, and the size of messages. So, although they are not optimal in the number of Byzantine processes that can be tolerated, they are particularly efficient when we consider the number of communication steps they require to decide and the number and size of the messages they use. In that sense, they are practically appealing.

Journal ArticleDOI
Michael Backes, Birgit Pfitzmann
TL;DR: The relation between symbolic and cryptographic secrecy properties for cryptographic protocols is investigated and a general secrecy preservation theorem under reactive simulatability, the cryptographic notion of secure implementation, is shown.
Abstract: We investigate the relation between symbolic and cryptographic secrecy properties for cryptographic protocols. Symbolic secrecy of payload messages or exchanged keys is arguably the most important notion of secrecy shown with automated proof tools. It means that an adversary restricted to symbolic operations on terms can never get the entire considered object into its knowledge set. Cryptographic secrecy essentially means computational indistinguishability between the real object and a random one, given the view of a much more general adversary. In spite of recent advances in linking symbolic and computational models of cryptography, no relation for secrecy under active attacks is known yet. For exchanged keys, we show that a certain strict symbolic secrecy definition over a specific Dolev-Yao-style cryptographic library implies cryptographic key secrecy for a real implementation of this cryptographic library. For payload messages, we present the first general cryptographic secrecy definition for a reactive scenario. The main challenge is to separate secrecy violations by the protocol under consideration from secrecy violations by the protocol users in a general way. For this definition, we show a general secrecy preservation theorem under reactive simulatability, the cryptographic notion of secure implementation. This theorem is of independent cryptographic interest. We then show that symbolic secrecy implies cryptographic payload secrecy for the same cryptographic library as used in key secrecy. Our results thus enable formal proof techniques to establish cryptographically sound proofs of secrecy for payload messages and exchanged keys.

Journal ArticleDOI
TL;DR: A software infrastructure that unifies transactions and replication in three-tier architectures and provides data consistency and high availability for enterprise applications is described.
Abstract: In this paper, we describe a software infrastructure that unifies transactions and replication in three-tier architectures and provides data consistency and high availability for enterprise applications. The infrastructure uses transactions based on the CORBA object transaction service to protect the application data in databases on stable storage, using a roll-backward recovery strategy, and replication based on the fault tolerant CORBA standard to protect the middle-tier servers, using a roll-forward recovery strategy. The infrastructure replicates the middle-tier servers to protect the application business logic processing. In addition, it replicates the transaction coordinator, which renders the two-phase commit protocol nonblocking and, thus, avoids potentially long service disruptions caused by failure of the coordinator. The infrastructure handles the interactions between the replicated middle-tier servers and the database servers through replicated gateways that prevent duplicate requests from reaching the database servers. It implements automatic client-side failover mechanisms, which guarantee that clients know the outcome of the requests that they have made, and retries aborted transactions automatically on behalf of the clients.

Journal ArticleDOI
TL;DR: An analysis of the expressiveness of the constructs provided by the generalized temporal role-based access control model shows that there is a subset of GTRBAC constraints that is sufficient to express all the access constraints that can be expressed using the full set.
Abstract: The generalized temporal role-based access control (GTRBAC) model provides a comprehensive set of temporal constraint expressions which can facilitate the specification of fine-grained time-based access control policies. However, the issue of the expressiveness and usability of this model has not been previously investigated. In this paper, we present an analysis of the expressiveness of the constructs provided by this model and illustrate that its constraint set is not minimal. We show that there is a subset of GTRBAC constraints that is sufficient to express all the access constraints that can be expressed using the full set. We also illustrate that a nonminimal GTRBAC constraint set can provide better flexibility and lower complexity of constraint representation. Based on our analysis, a set of design guidelines for the development of GTRBAC-based security administration is presented.

Journal ArticleDOI
TL;DR: A new input vector monitoring concurrent BIST technique for combinational circuits is presented which is shown to be significantly more efficient than the input vector monitoring techniques proposed to date with respect to the concurrent test latency and hardware overhead trade-off, for low values of the hardware overhead.
Abstract: Built-in self-test (BIST) techniques constitute an attractive and practical solution to the difficult problem of testing VLSI circuits and systems. Input vector monitoring concurrent BIST schemes can circumvent problems appearing separately in online and in offline BIST schemes. An important measure of the quality of an input vector monitoring concurrent BIST scheme is the time required to complete the concurrent test, termed concurrent test latency. In this paper, a new input vector monitoring concurrent BIST technique for combinational circuits is presented which is shown to be significantly more efficient than the input vector monitoring techniques proposed to date with respect to concurrent test latency and hardware overhead trade-off, for low values of the hardware overhead.

Journal ArticleDOI
TL;DR: This paper presents a family of fair exchange protocols for two participants which make use of the presence of a trusted third party, under a variety of assumptions concerning participant misbehavior, message delays, and node reliability.
Abstract: Fair exchange protocols play an important role in application areas such as e-commerce where protocol participants require mutual guarantees that a transaction involving exchange of items has taken place in a specific manner. A protocol is fair if no protocol participant can gain any advantage over an honest participant by misbehaving. In addition, such a protocol is fault-tolerant if the protocol can ensure that an honest participant does not suffer any loss of fairness despite any failures of the participant's node. This paper presents a family of fair exchange protocols for two participants which make use of the presence of a trusted third party, under a variety of assumptions concerning participant misbehavior, message delays, and node reliability. The development is systematic, beginning with the strongest set of assumptions and gradually weakening the assumptions to the weakest set. The resulting protocol family exposes the impact of a given set of assumptions on solving the problem of fair exchange. Specifically, it highlights the relationships that exist between fairness and assumptions on the nature of participant misbehavior, communication delays, and node crashes. The paper also shows that the restrictions assumed on a dishonest participant's misbehavior can be realized through the use of smartcards and smartcard-based protocols.

Journal ArticleDOI
TL;DR: The somewhat unexpected result is presented that, in general, the problem of synthesizing failsafe fault-tolerant distributed programs from their fault-intolerant version is NP-complete in the state space of the program.
Abstract: We focus on the problem of synthesizing failsafe fault-tolerance where fault-tolerance is added to an existing (fault-intolerant) program. A failsafe fault-tolerant program satisfies its specification (including safety and liveness) in the absence of faults. However, in the presence of faults, it satisfies its safety specification. We present a somewhat unexpected result that, in general, the problem of synthesizing failsafe fault-tolerant distributed programs from their fault-intolerant version is NP-complete in the state space of the program. We also identify a class of specifications, monotonic specifications, and a class of programs, monotonic programs, for which the synthesis of failsafe fault-tolerance can be done in polynomial time (in program state space). As an illustration, we show that the monotonicity restrictions are met for commonly encountered problems, such as Byzantine agreement, distributed consensus, and atomic commitment. Furthermore, we evaluate the role of these restrictions in the complexity of synthesizing failsafe fault-tolerance. Specifically, we prove that if only one of these conditions is satisfied, the synthesis of failsafe fault-tolerance is still NP-complete. Finally, we demonstrate the application of the monotonicity property in enhancing the fault-tolerance of (distributed) nonmasking fault-tolerant programs to masking.

Journal ArticleDOI
TL;DR: A discrete optimization model is proposed to allocate redundancy to critical IT functions for disaster recovery planning and a solution procedure based on probabilistic dynamic programming is presented along with two examples.
Abstract: A discrete optimization model is proposed to allocate redundancy to critical IT functions for disaster recovery planning. The objective is to maximize the overall survivability of an organization's IT functions by selecting their appropriate redundancy levels. A solution procedure based on probabilistic dynamic programming is presented along with two examples.
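
A sketch of the kind of dynamic program involved: choose a redundancy level for each IT function under a budget so as to maximize overall survivability, modeled here, as a simplifying assumption, as the product of per-function survival probabilities. All costs and probabilities are invented for illustration.

```python
# Toy dynamic program for allocating redundancy levels under a budget.
# Options and budget are made-up illustrative numbers.
from functools import lru_cache

# For each function: list of (redundancy level, cost, survival probability).
OPTIONS = [
    [(0, 0, 0.70), (1, 3, 0.90), (2, 5, 0.97)],   # e-mail
    [(0, 0, 0.60), (1, 4, 0.85), (2, 7, 0.95)],   # order processing
    [(0, 0, 0.80), (1, 2, 0.92), (2, 4, 0.98)],   # payroll
]
BUDGET = 9

@lru_cache(maxsize=None)
def best(i=0, remaining=BUDGET):
    """Maximum joint survivability for functions i.. with `remaining` budget,
    together with the chosen redundancy levels."""
    if i == len(OPTIONS):
        return 1.0, ()
    value, choice = 0.0, ()
    for level, cost, p in OPTIONS[i]:
        if cost <= remaining:
            sub_value, sub_choice = best(i + 1, remaining - cost)
            if p * sub_value > value:
                value, choice = p * sub_value, (level,) + sub_choice
    return value, choice

if __name__ == "__main__":
    value, levels = best()
    print("redundancy levels per function:", levels)
    print("overall survivability: %.3f" % value)
```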

Journal ArticleDOI
TL;DR: This paper proposes an efficient index-based CIC protocol, called NMMP, which is almost as efficient as FI in some typical computational environments and demonstrates that the two protocols have the same behavior over a tree communication network.
Abstract: Communication-induced checkpointing (CIC) protocols can be used to prevent the domino effect. Protocols that belong to the index-based category have been shown to perform better. In this paper, we propose an efficient index-based CIC protocol. The fully informed (FI) protocol proposed in the literature has been known to be the best index-based CIC protocol that one can achieve since the optimal protocol needs to acquire future information. We discover that the enhancement adopted by such a protocol rarely takes effect in practice. By discarding this enhancement, we obtain a new protocol, called NMMP. Simulation results show that our protocol is almost as efficient as FI in some typical computational environments. Especially, we demonstrate that the two protocols have the same behavior over a tree communication network. Surprisingly, NMMP only has to piggyback control information of constant size on each message, regardless of the number of processes.
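
The constant-size piggyback that the abstract highlights is characteristic of index-based CIC protocols: each message carries a single checkpoint index, and a receiver whose index is smaller takes a forced checkpoint before delivery. The sketch below shows that classic rule (in the style of Briatico-Ciuffoletti-Simoncini), not NMMP's specific refinement.

```python
# Sketch of the classic index-based CIC rule: piggyback one integer per
# message and force a checkpoint before delivering a larger index.
class Process:
    def __init__(self, name):
        self.name = name
        self.index = 0                            # current checkpoint index
        self.checkpoints = []

    def take_checkpoint(self, forced=False):
        self.checkpoints.append((self.index, "forced" if forced else "basic"))

    def basic_checkpoint(self):                   # taken independently, e.g. on a timer
        self.index += 1
        self.take_checkpoint()

    def send(self, payload):
        return (payload, self.index)              # constant-size control information

    def receive(self, message):
        payload, sender_index = message
        if sender_index > self.index:             # forced checkpoint before delivery
            self.index = sender_index
            self.take_checkpoint(forced=True)
        return payload

if __name__ == "__main__":
    p, q = Process("p"), Process("q")
    p.basic_checkpoint()                          # p's index becomes 1
    q.receive(p.send("m1"))                       # q is forced to checkpoint at index 1
    print(q.checkpoints)                          # [(1, 'forced')]
```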

Journal ArticleDOI
TL;DR: This work proposes a framework for the controlled revocation of unintended digital signatures, and proposes a solution with a special emphasis on privacy issues.
Abstract: Human users need trusted computers when they want to generate digital signatures. In many applications, in particular, if the users are mobile, they need to carry their trusted computers with themselves. Smart cards are easy to use, easy to carry, and relatively difficult to tamper with, but they do not have a user interface; therefore, the user still needs a terminal for authorizing the card to produce digital signatures. If the terminal is malicious, it can mislead the user and obtain a digital signature on an arbitrary document. In order to mitigate this problem, we propose a solution based on conditional signatures. More specifically, we propose a framework for the controlled revocation of unintended digital signatures. We also propose a solution with a special emphasis on privacy issues.

Journal ArticleDOI
TL;DR: Fault tolerance error-detecting capabilities for the major subsystems that constitute a JPEG 2000 standard are developed and the design strategies have been tested using Matlab programs and simulation results are presented.
Abstract: The JPEG 2000 image compression standard is designed for a broad range of data compression applications. The new standard is based on wavelet technology and layered coding in order to provide a rich feature compressed image stream. The implementations of the JPEG 2000 codec are susceptible to computer-induced soft errors. One situation requiring fault tolerance is remote-sensing satellites, where high energy particles and radiation produce single event upsets corrupting the highly susceptible data compression operations. This paper develops fault tolerance error-detecting capabilities for the major subsystems that constitute a JPEG 2000 standard. The nature of the subsystem dictates the realistic fault model where some parts have numerical error impacts whereas others are properly modeled using bit-level variables. The critical operations of subunits such as discrete wavelet transform (DWT) and quantization are protected against numerical errors. Concurrent error detection techniques are applied to accommodate the data type and numerical operations in each processing unit. On the other hand, the embedded block coding with optimal truncation (EBCOT) system and the bitstream formation unit are protected against soft-error effects using binary decision variables and cyclic redundancy check (CRC) parity values, respectively. The techniques achieve excellent error-detecting capability at only a slight increase in complexity. The design strategies have been tested using Matlab programs and simulation results are presented.
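
The bitstream-formation protection mentioned above amounts to carrying a CRC parity value alongside each formed segment and rechecking it, as in this sketch; the CRC polynomial (CRC-32 via zlib) and segment contents are arbitrary choices, not the paper's.

```python
# Sketch of protecting a formed bitstream segment with a CRC parity value.
import zlib

def protect(segment: bytes):
    return segment, zlib.crc32(segment)

def verify(segment: bytes, parity: int) -> bool:
    return zlib.crc32(segment) == parity

if __name__ == "__main__":
    seg, crc = protect(b"\x12\x34\x56\x78" * 64)
    corrupted = bytearray(seg)
    corrupted[10] ^= 0x01                                  # single-bit soft error
    print(verify(seg, crc), verify(bytes(corrupted), crc)) # True False
```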


Journal ArticleDOI
TL;DR: A new benchmark tool, SPEK (storage performance evaluation kernel module), for evaluating the performance of block-level storage systems in the presence of faults as well as under normal operations is introduced.
Abstract: This paper introduces a new benchmark tool, SPEK (storage performance evaluation kernel module), for evaluating the performance of block-level storage systems in the presence of faults as well as under normal operations. SPEK can work on both direct attached storage (DAS) and block-level networked storage systems such as storage area networks (SAN). SPEK consists of a controller, several workers, one or more probers, and several fault injection modules. Since it runs at kernel level and eliminates skews and overheads caused by file systems, SPEK is highly accurate and efficient. It allows a storage architect to generate configurable workloads to a system under test and to inject different faults into various system components such as network devices, storage devices, and controllers. Available performance measurements under different workloads and faulty conditions are dynamically collected and recorded in SPEK over a spectrum of time. To demonstrate its functionality, we apply SPEK to evaluate the performance of two direct attached storage systems and two typical SANs under Linux with different fault injections. Our experiments show that SPEK is highly efficient and accurate in measuring the performance of block-level storage systems.

Journal ArticleDOI
TL;DR: This paper develops two particular schemes for self-repairing array structures (SRAS) with less hardware overhead cost than higher-level redundancy and without the per-error performance penalty of existing low-cost techniques that combine error detection with pipeline flushes for backward error recovery.
Abstract: To achieve high reliability despite hard faults that occur during operation and to achieve high yield despite defects introduced at fabrication, a microprocessor must be able to tolerate hard faults. In this paper, we present a framework for autonomic self-repair of the array structures in microprocessors (e.g., reorder buffer, instruction window, etc.). The framework consists of three aspects: 1) detecting/diagnosing the fault, 2) recovering from the resultant error, and 3) mapping out the faulty portion of the array. For each aspect, we present design options. Based on this framework, we develop two particular schemes for self-repairing array structures (SRAS). Simulation results show that one of our SRAS schemes adds some performance overhead in the fault-free case, but that both of them mask hard faults 1) with less hardware overhead cost than higher-level redundancy (e.g., IBM mainframes) and 2) without the per-error performance penalty of existing low-cost techniques that combine error detection with pipeline flushes for backward error recovery (BER). When hard faults are present in arrays, due to operational faults or fabrication defects, SRAS schemes outperform BER due to not having to frequently flush the pipeline.
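
The "mapping out" step of the framework can be pictured as a free list that never hands out entries diagnosed as faulty, as in the toy model below; the detection, diagnosis, and recovery machinery are hardware mechanisms that this sketch does not model.

```python
# Toy model of mapping out faulty entries of a microarchitectural array.
class SelfRepairingArray:
    def __init__(self, size):
        self.size = size
        self.faulty = set()
        self.free = list(range(size))

    def mark_faulty(self, row):
        self.faulty.add(row)
        self.free = [r for r in self.free if r != row]

    def allocate(self):
        return self.free.pop(0) if self.free else None   # None = structurally full

if __name__ == "__main__":
    rob = SelfRepairingArray(8)        # e.g., a tiny reorder buffer
    rob.mark_faulty(3)                 # diagnosed hard fault in entry 3
    print([rob.allocate() for _ in range(8)])   # entry 3 never appears
```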

Journal ArticleDOI
TL;DR: It is shown that testing of identical circuits by output comparison can be done effectively even when the input vectors applied to the circuits are not identical, which allows concurrent online testing even when they are not driven from the same source during functional operation.
Abstract: Current designs may contain several identical copies of the same circuit (or functional unit). Such circuits can be tested by comparing the output vectors they produce under identical input vectors. This alleviates the need to observe the output response, and facilitates online testing. We show that testing of identical circuits by output comparison can be done effectively even when the input vectors applied to the circuits are not identical. This allows concurrent online testing even when the circuits are not driven from the same source during functional operation. We investigate several issues related to this observation. We investigate the use of both structural and functional analysis to identify situations where nonidentical input vectors can be used for fault detection based on output comparison. We also consider the use of observation points to improve the fault coverage. We present experimental results to support the discussion and the use of nonidentical input vectors for concurrent online testing of identical circuits.

Journal ArticleDOI
TL;DR: It is argued that automated synthesis of fault-tolerant programs is likely to be more successful if one focuses on problems where safety can be represented in the BT model, since under the BP model the problem of adding masking fault tolerance to high atomicity programs is NP-complete.
Abstract: In this paper, we investigate the effect of the representation of safety specification on the complexity of adding masking fault tolerance to programs - where, in the presence of faults, the program 1) recovers to states from where it satisfies its (safety and liveness) specification and 2) preserves its safety specification during recovery. Specifically, we concentrate on two approaches for modeling the safety specifications: 1) the bad transition (BT) model, where safety is modeled as a set of bad transitions that should not be executed by the program, and 2) the bad pair (BP) model, where safety is modeled as a set of finite sequences consisting of at most two successive transitions. If the safety specification is specified in the BT model, then it is known that the complexity of automatic addition of masking fault tolerance to high atomicity programs (where processes can read/write all program variables in an atomic step) is polynomial in the state space of the program. However, for the case where one uses the BP model to specify safety specification, we show that the problem of adding masking fault tolerance to high atomicity programs is NP-complete. Therefore, we argue that automated synthesis of fault-tolerant programs is likely to be more successful if one focuses on problems where safety can be represented in the BT model.