The capacity of a system to keep functioning in the face of hardware or software failure is called fault tolerance.

Functional Analysis and Allocation Practice

Richard F. Schmidt, in Software Engineering, 2013

11.2.7 Identify failure conditions

Every functional transaction must be evaluated to identify situations or conditions that may cause failures. Each identified failure state must be resolved by stipulating the data integrity criteria that must be checked to detect the failure state and the actions to be taken when that state arises so that the data processing transaction can still be completed.

Some failure conditions may result in a state that cannot be resolved via automation and requires human intervention. The functional analysis effort must then address the manner in which the software will continue to operate in a degraded mode, if possible. For example, if an ATM’s supply of money has been depleted, then the withdrawal function must be temporarily suspended until the money supply is restocked. The software functions for detecting failure conditions and operating in a degraded mode must be included within the functional architecture. This includes identifying how the software product can be “informed” of the current state of data processing or system resources and how the state indicators are managed.
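To make the degraded-mode idea concrete, here is a minimal, hypothetical Python sketch (the class, function, and state names are illustrative, not taken from the chapter): a withdrawal function consults a resource-state indicator and suspends itself when the cash supply is depleted, while other functions continue to operate.

from enum import Enum, auto


class CashState(Enum):
    AVAILABLE = auto()
    DEPLETED = auto()   # triggers degraded mode


class ATM:
    """Illustrative sketch of degraded-mode operation, not a real ATM design."""

    def __init__(self, cash_on_hand: int):
        self.cash_on_hand = cash_on_hand

    @property
    def cash_state(self) -> CashState:
        # State indicator the software uses to "inform" itself of resource status.
        return CashState.AVAILABLE if self.cash_on_hand > 0 else CashState.DEPLETED

    def withdraw(self, amount: int) -> str:
        # Degraded mode: suspend withdrawals until the cash supply is restocked.
        if self.cash_state is CashState.DEPLETED:
            return "Withdrawals temporarily unavailable; other services remain online."
        if amount > self.cash_on_hand:
            return "Requested amount exceeds available cash; please try a smaller amount."
        self.cash_on_hand -= amount
        return f"Dispensed {amount}."

    def balance_inquiry(self, account_balance: int) -> str:
        # Unaffected function: still works while withdrawals are suspended.
        return f"Your balance is {account_balance}."


atm = ATM(cash_on_hand=0)
print(atm.withdraw(100))           # degraded-mode message
print(atm.balance_inquiry(2500))   # continues to operate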

Potential data processing failure modes and effects must be analyzed to determine how the software product should behave in response to each failure condition. Failure modes and effects analysis (FMEA) is an engineering procedure that enables the design team to classify potential failure modes by their severity (consequences) and likelihood, resulting in improved product quality and dependability. Dependability is a term better suited to software products than reliability. Throughout the engineering community, reliability deals with predicting the mean time between failures (MTBF) of hardware components during normal operation and provides an estimate of a component’s expected service life. Software does not break down or wear out over time with use. Dependability, therefore, refers to a software component’s ability to perform its function as expected under all circumstances, and it is the more suitable term for software products given the nature of the material of which they are composed. If a software component fails, it is because the software design was not resilient to unexpected circumstances or situations.
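As a rough illustration of how failure modes might be classified by severity and likelihood, the Python sketch below assigns each candidate failure mode ordinal ratings and sorts by a simple risk score (severity times likelihood). The rating scales, field names, and example failure modes are invented for illustration; many FMEA variants also include a detectability rating, which is omitted here.

from dataclasses import dataclass


@dataclass
class FailureMode:
    description: str
    severity: int      # 1 (negligible) .. 10 (catastrophic) -- illustrative scale
    likelihood: int    # 1 (rare) .. 10 (frequent) -- illustrative scale

    @property
    def risk(self) -> int:
        # Simple risk score; variants of FMEA also factor in detectability.
        return self.severity * self.likelihood


candidate_modes = [
    FailureMode("Withdrawal requested while cash supply is depleted", severity=4, likelihood=6),
    FailureMode("Transaction log write fails mid-transaction", severity=8, likelihood=2),
    FailureMode("External authorization service unreachable", severity=6, likelihood=4),
]

# Rank failure modes so the design effort focuses on the highest-risk ones first.
for mode in sorted(candidate_modes, key=lambda m: m.risk, reverse=True):
    print(f"risk={mode.risk:3d}  {mode.description}")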

Software FMEA should be used to identify potential failure modes, determine their effects on the operation of the system or business process, and design response mechanisms that prevent the failure from occurring or mitigate the impact of the failure on operational performance. While anticipating every failure mode may not be possible, the development team should formulate an extensive list of potential failure modes in the following manner:

1. Develop software product requirements that minimize the likelihood that potential operational failures will arise.

2. Evaluate the requirements obtained from stakeholders in the software performance and post-development processes to ensure that those requirements do not introduce complicated failure conditions or situations.

3. Identify design characteristics that contribute to failure detection and minimize failure propagation throughout a data processing transaction.

4. Develop software test scenarios and procedures designed to exercise the software behaviors associated with failure detection, isolation, and recovery.

5. Identify, track, and manage potential design risks to ensure that product dependability is predictable and substantiated via the software test effort.

6. Ensure that any failures that could occur will not result in personal injury or seriously impact the operation of the system or operational processes.

Properly used, software FMEA provides the development team with several benefits, including:

1. Improved software dependability and quality.

2. Increased customer and stakeholder satisfaction.

3. Early identification and elimination of potential software failure modes, when such design challenges can still be addressed cost-effectively.

4. An emphasis on failure detection and preventive measures.

5. A focus for improved software test coverage.

6. Fewer late design changes and their associated cost and schedule impacts.

7. Improved teamwork and idea exchange among development team members.

A complete FMEA for a software product should contend with failures arising from the computing environment hardware, external systems, and data processing transactions, and their effects on the final system or operational processes. The software FMEA procedures should adhere to the following steps, adapted from IEC 60812:

1. Define the software boundaries for analysis (accomplished during computational requirements analysis).

2. Understand the software requirements, functionality, and performance.

3. Develop the functional architecture representations (hierarchical decomposition and behavioral views).

4. Identify functional failure modes and summarize failure effects.

5. Develop criteria for successful failure detection, isolation, and recovery.

6. Report findings.

URL: https://www.sciencedirect.com/science/article/pii/B9780124077683000112

System Recovery

Philip A. Bernstein, Eric Newcomer, in Principles of Transaction Processing (Second Edition), 2009

Software

This brings us to software failures. The most serious type of software failure is an operating system crash, since it stops the entire computer system. Because many software problems are transient, a reboot often repairs the problem. This involves rebooting the operating system, running software that repairs disk state that might have become inconsistent due to the failure, recovering communications sessions with other systems in a distributed system, and restarting all the application programs. These steps all increase the MTTR and therefore reduce availability, so they should be made as fast as possible. The requirement for faster recovery inspired operating system vendors in the 1990s to incorporate fast file system recovery procedures, since file system recovery had been a major component of operating system boot time. Some operating systems are carefully engineered for fast boot. For example, highly available communication systems have operating systems that reboot in under a minute, worst case. Taking this goal to the extreme, if the repair time were zero, then failures wouldn’t matter, since the system would recover instantaneously and the user would never know the difference. Clearly, reducing the repair time can have a big impact on availability.
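The effect of repair time on availability can be made concrete with the standard steady-state formula, availability = MTBF / (MTBF + MTTR). The numbers in the short Python sketch below are purely illustrative:

def availability(mtbf_hours: float, mttr_hours: float) -> float:
    # Steady-state availability: fraction of time the system is up.
    return mtbf_hours / (mtbf_hours + mttr_hours)


mtbf = 30 * 24  # assume one failure every 30 days (illustrative)
for mttr_minutes in (30, 5, 1):
    a = availability(mtbf, mttr_minutes / 60)
    print(f"MTTR = {mttr_minutes:2d} min -> availability = {a:.5%}")
# Shrinking MTTR toward zero pushes availability toward 100%.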

Some software failures only degrade a system’s capabilities rather than causing it to fail outright. For example, consider an application that offers functions that require access to a remote service. When the remote service is unavailable, those functions stop working. However, through careful application design, other application functions can still be operational. That is, the system degrades gracefully when parts of it stop working. A real example we know of is an application that used a TP database and a data warehouse, where the latter was nice to have but not mission-critical. The application was not designed to degrade gracefully, so when the data warehouse failed, the entire application became unavailable, which caused a large and unnecessary loss of revenue.
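A minimal sketch of the graceful-degradation pattern described above, assuming a hypothetical query_data_warehouse call for the nice-to-have analytics and a core_order_lookup call for the mission-critical path (both names are invented for illustration):

import logging


def core_order_lookup(order_id: str) -> dict:
    # Mission-critical path: must keep working (stubbed here for illustration).
    return {"order_id": order_id, "status": "shipped"}


def query_data_warehouse(order_id: str) -> dict:
    # Nice-to-have enrichment; may raise when the warehouse is down.
    raise ConnectionError("data warehouse unavailable")


def order_details(order_id: str) -> dict:
    details = core_order_lookup(order_id)
    try:
        details["history"] = query_data_warehouse(order_id)
    except ConnectionError:
        # Degrade gracefully: serve the core result without the enrichment
        # instead of failing the whole request.
        logging.warning("warehouse unavailable; returning core data only")
        details["history"] = None
    return details


print(order_details("A-1001"))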

When an application process or database system does fail, the failure must be detected and the application or database system process must be recovered. This is where TP-specific techniques become relevant.

URL: https://www.sciencedirect.com/science/article/pii/B978155860623400007X

A Deep Dive into NoSQL Databases: The Use Cases and Applications

Siddhartha Duggirala, in Advances in Computers, 2018

2.5.3 Commit Processing

To guard against software failures or power failures, it is essential to back up the database; this preserves the durability and consistency guarantees. Because backups are taken only periodically, it is prudent to keep a log of transaction activity as well. Owing to the volatility of main memory, the log must be stored on stable storage. Logging data to persistent storage affects response times, since each transaction has to wait until its log records are committed to storage, and throughput also suffers if the log becomes a bottleneck. Logging is the only disk operation required by IMDBs [34,35]. The following are a few solutions that have been suggested, some of which are used by major IMDBs:

Use stable storage for logging: Stable storage is used to hold a portion of the log. As a transaction commits, its log information is written to stable storage, and a dedicated processor asynchronously writes the data to the log disks. Response time improves even though the log bottleneck is still present. A modern variation of this approach is to use nonvolatile RAM, which offers lower read/write latency than disk storage.

Precommitting transactions: A transaction’s locks are released as soon as its log record is placed in the log. This reduces blocking delays for other concurrent transactions.

Group commits: A log record is not sent to disk as soon as its transaction commits. Instead, the log records of several transactions are accumulated in memory and flushed to the log disk in a single disk operation (see the sketch after this list). This reduces the number of disk operations and relieves the log bottleneck and blocking waits. The downside is that, in the case of a software failure, some transactions might be lost.

Asynchronous commits: The log records are flushed to disk asynchronously, so transactions need not wait for the log to be committed to disk. As with group commit, certain transactions might be lost in the case of a failure.
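The following Python sketch illustrates the group-commit idea under simplifying assumptions (an append-only log file, a fixed batch size, and no concurrency control); it is not the mechanism of any particular IMDB. Log records of several transactions are accumulated in memory and made durable with a single write and fsync.

import os


class GroupCommitLog:
    """Illustrative group commit: batch log records, flush with one disk operation."""

    def __init__(self, path: str, batch_size: int = 4):
        self.file = open(path, "ab")
        self.batch_size = batch_size
        self.pending: list[bytes] = []

    def commit(self, txn_id: int, record: bytes) -> None:
        # Accumulate the record in memory instead of forcing it to disk immediately.
        self.pending.append(b"%d:%s\n" % (txn_id, record))
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        if not self.pending:
            return
        self.file.write(b"".join(self.pending))  # one write for the whole batch
        self.file.flush()
        os.fsync(self.file.fileno())             # one forced disk operation
        self.pending.clear()                     # everything flushed so far is durable


log = GroupCommitLog("txn.log", batch_size=4)
for txn in range(10):
    log.commit(txn, b"UPDATE accounts ...")
log.flush()  # a crash before this final flush could lose the last, unflushed batch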

URL: https://www.sciencedirect.com/science/article/pii/S0065245818300135

Connected Computing Environment

Debraj De, ... Song Tan, in Advances in Computers, 2013

7.12 Over-the-Air Network Reprogramming

After a field deployment, the network functionality may need to be improved or newly discovered software failures may need to be fixed. Thus, it is important to support remote software upgrades. Deluge [16] is the de facto network reprogramming protocol; it provides an efficient method for disseminating a code update over the wireless network and having each node program itself with the new image. Deluge did not originally support the iMote2 platform, and porting it to that platform is not trivial. Deluge was also improved to ensure that it could handle some adverse situations, including the following. (1) Image integrity verification: if a node reboots during the download phase, it must be ensured that the node correctly resumes the download. To address this issue, a mechanism was implemented that verifies the image integrity during startup. If the image has been completely downloaded, the system continues with normal operations; otherwise it erases the entire downloaded image and resets the metadata to enable a fresh re-download. (2) Image version consistency: the original Deluge is based on sequence numbers, so if the gateway loses track of the sequence number and does not use a higher one, Deluge will not respond to a new code-update request. This problem was fixed by using the compilation timestamp to differentiate a new image from an old one.
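The two safeguards can be sketched as follows; this is an illustrative Python sketch rather than Deluge’s actual nesC implementation, and the parameter names (expected_length, expected_crc, the compile-timestamp arguments) are hypothetical.

import zlib


def verify_image_on_startup(image: bytes, expected_length: int, expected_crc: int) -> bool:
    """Image integrity check: resume only if the stored image is complete and uncorrupted."""
    if len(image) != expected_length or zlib.crc32(image) != expected_crc:
        # Caller should erase the partial image and reset metadata to force a fresh re-download.
        return False
    return True


def is_newer_image(current_compile_timestamp: int, incoming_compile_timestamp: int) -> bool:
    """Version consistency check: compare compilation timestamps instead of sequence numbers."""
    return incoming_compile_timestamp > current_compile_timestamp


# Illustrative use:
image = b"\x00" * 1024
ok = verify_image_on_startup(image, expected_length=1024, expected_crc=zlib.crc32(image))
print("resume normal operation" if ok else "erase and re-download")
print("accept new image" if is_newer_image(1700000000, 1700009999) else "ignore stale image")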

URL: https://www.sciencedirect.com/science/article/pii/B9780124080911000014

Checkpointing

Thomas Sterling, ... Maciej Brodowicz, in High Performance Computing, 2018

20.4 Summary and Outcomes of Chapter 20

Applications with long execution times run a significant risk of encountering a hardware or software failure before completion.

Long execution times also frequently violate supercomputer usage policies where a maximum wallclock limit for a simulation is established.

The consequences of a hardware or software failure can be very significant and costly in terms of time lost and computing resources wasted for long-running jobs.

At designated points during the execution of an application on a supercomputer, the data necessary to allow later resumption of the application at that point in the execution can be output and saved. This data is called a checkpoint.

Checkpoint files help mitigate the risk of a hardware or software failure in a long-running job.

Checkpoint files also provide snapshots of the application at different simulation epochs, help in debugging, aid in performance monitoring and analysis, and can help improve load-balancing decisions for better distributed-memory usage.

In HPC applications, two common strategies for checkpoint/restart are employed: system-level checkpointing and application-level checkpointing.

System-level checkpointing requires no modifications to the user code but may require loading a specific system-level library.

System-level checkpointing strategies center on full memory dumps and may result in very large checkpoint files.

Application-level checkpointing requires modifications to the user code. Libraries exist to assist this process.

Application-level checkpoint files tend to be more efficient, since they output only the most relevant data needed for restart; a minimal sketch follows.
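The Python sketch below illustrates application-level checkpointing in its simplest form; real HPC codes would typically use a checkpoint library and a parallel file system, and the file name and state layout here are invented for illustration.

import os
import pickle

CHECKPOINT_FILE = "simulation.ckpt"  # hypothetical checkpoint file name


def load_checkpoint():
    # Restart from the last saved state if a checkpoint exists.
    if os.path.exists(CHECKPOINT_FILE):
        with open(CHECKPOINT_FILE, "rb") as f:
            return pickle.load(f)
    return {"step": 0, "field": [0.0] * 1000}   # fresh start


def save_checkpoint(state):
    # Write to a temporary file first so a crash mid-write cannot corrupt the checkpoint.
    tmp = CHECKPOINT_FILE + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, CHECKPOINT_FILE)


state = load_checkpoint()
for step in range(state["step"], 10_000):
    state["field"] = [x + 1.0 for x in state["field"]]   # stand-in for real computation
    state["step"] = step + 1
    if state["step"] % 1_000 == 0:                       # checkpoint at designated points
        save_checkpoint(state)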

URL: https://www.sciencedirect.com/science/article/pii/B9780124201583000204

Mining Software Logs for Goal-Driven Root Cause Analysis

Hamzeh Zawawy, ... John Mylopoulos, in The Art and Science of Analyzing Software Data, 2015

18.10 Conclusions

In this chapter, we have presented a root cause analysis framework that allows the identification of possible root causes of software failures that stem either from internal faults or from the actions of external agents. The framework uses goal models to associate system behavior with specific goals and actions that need be achieved in order for the system to deliver its intended functionality or meet its quality requirements. Similarly, the framework uses antigoal models to denote the negative impact the actions of an external agent may have on specific system goals. In this context, goal and antigoal models represent cause and effect relations and provide the means for denoting diagnostic knowledge for the system being examined. Goal and antigoal model nodes are annotated with pattern expressions. The purpose of the annotation expressions is twofold: first, to be used as queries in an information retrieval process that is based on LSI to reduce the volume of the logs to be considered for the satisfaction or denial of each node; and second, to be used as patterns for inferring the satisfaction of each node by satisfying the corresponding precondition, occurrence, and postcondition predicates attached to each node. Furthermore, the framework allows a transformation process to be applied so that a rule base can be generated from the specified goal and antigoal models.
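As a rough illustration of the log-reduction step, the sketch below uses scikit-learn to build an LSI space (truncated SVD over TF-IDF) and keep only the log lines most similar to a goal node’s annotation. The toy log lines, the annotation text, and the similarity threshold are invented, and this is not the authors’ implementation.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

log_lines = [
    "order service started on port 8080",
    "payment gateway timeout while charging card",
    "user login succeeded for account 42",
    "retrying payment authorization after timeout",
    "cache warmed in 120 ms",
    "payment declined by external processor",
]
annotation = "payment authorization failure at external gateway"  # goal-node annotation (illustrative)

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(log_lines + [annotation])

lsi = TruncatedSVD(n_components=3, random_state=0)   # LSI: low-rank semantic space
vectors = lsi.fit_transform(tfidf)

query, docs = vectors[-1:], vectors[:-1]
scores = cosine_similarity(query, docs)[0]

THRESHOLD = 0.3                                      # illustrative cutoff
reduced_log = [line for line, s in zip(log_lines, scores) if s >= THRESHOLD]
print(reduced_log)                                   # log lines most similar to the annotation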

The transformation process also allows the generation of observation facts from the log data obtained. Together, the rule base and the fact base form a complete diagnostic knowledge base for a particular session. Finally, the framework uses a probabilistic reasoning engine based on the concept of Markov logic and MLNs. In this context, the use of a probabilistic reasoning engine is important because it allows inference to commence even in the presence of incomplete or partial log data, and it allows the root causes obtained to be ranked by their probability or likelihood values.

The proposed framework provides some interesting new points for root cause analysis systems. First, it introduces the concept of log reduction to increase performance and keep the analysis tractable. Second, it argues for the use of semantically rich and expressive models, such as goal and antigoal models, for denoting diagnostic knowledge. Finally, it argues that in real-life scenarios root cause analysis should be able to commence even with partial, incomplete, or missing log data.

New areas and future directions that can be considered include new techniques for log reduction, new techniques for hypothesis selection, and tractable techniques for goal satisfaction and root cause analysis determination. More specifically, techniques that are based on complex event processing, information theory, or PLSI could serve as a starting point for log reduction. SAT solvers, such as Max-SAT and Weighted Max-SAT solvers, can also be considered as a starting point for generating and ranking root cause hypotheses on the basis of severity, occurrence probability, or system domain properties. Finally, the investigation of advanced fuzzy reasoning, probabilistic reasoning, and processing distribution techniques such as map-reduce algorithms may shed new light on the tractable identification of root causes for large-scale systems.

URL: https://www.sciencedirect.com/science/article/pii/B9780124115194000185

Why SDN?

Paul Goransson, Chuck Black, in Software Defined Networks, 2014

2.4.2 Inadequacies in Networks Today

In Chapter 1 we discussed the evolution of networks that allowed them to survive catastrophic events such as outages and hardware or software failures. In large part, networks and networking devices have been designed to overcome these rare but severe challenges. However, with the advent of data centers, there is a growing need for networks not only to recover from these types of events but also to respond quickly to frequent and immediate changes.

Although the tasks of creating a new network, moving a network, and removing a network are similar to those performed for servers and storage, doing so requires work orders, coordination between server and networking administrators, and physical or logical coordination of links, network interface cards (NICs), and ToR switches, to name a few. Figure 2.4 illustrates the elapsed time in creating a new instance of a VM, which is on the order of minutes, compared to the multiple days it may take to create a new instance of a network. This is because the servers are virtualized, yet the network is still purely a physical network. Even when we are configuring something virtual for the network, such as a VLAN, making changes is more cumbersome than for their server counterparts. In Chapter 1 we explained that although the control plane of legacy networks had sophisticated ways of autonomously and dynamically distributing layer two and layer three states, no corresponding protocols exist for distributing the policies that are used in policy-based routing. Thus, configuring security policy, such as ACLs, or virtualization policy, such as to which VLAN a host belongs, remains static and manual in traditional networks. Therefore, the task of reconfiguring a network in a modern data center does not take minutes but rather days. Such inflexible networks hinder IT administrators in their attempts to automate and streamline their virtualized data center environments. SDN holds the promise that the time required for such network reconfiguration can be reduced to the order of minutes, as is already the case for reconfiguration of VMs.

Figure 2.4. Creating a new network instance in the old paradigm.

URL: https://www.sciencedirect.com/science/article/pii/B9780124166752000024

Big Data

Deborah Gonzalez, in Managing Online Risk, 2015

Data cycle

Data, like most other business assets, has a life cycle (see Figure 5.1) governing its transition in terms of substance and use. Its stage of evolution also determines its value at any given time.

FIGURE 5.1. The Data Life Cycle.

Create/capture

We can create data or capture data that already exists. Risks include human error in a manual data-entry process or a software failure in the automatic recording of input (e.g., behavior such as typing keys or URL tunneling). Intellectual property risks concerning data creation will be addressed in Chapter 6.

Index/classify

Indexing is a way to “point to the location of folders, files, and records. Depending on the purpose, indexing identifies the location of resources based on file names, key data fields in a database record, text within a file or unique attributes in a graphics or video file.”12 Classifying data, explored in more detail below, allows us to categorize data by certain characteristics and/or formats for easier determination of value and access. Risks can include data being lost because it was never indexed, or inappropriate allocation of security resources because misclassified data ends up being undervalued.

Store/manage (encryption)

Data has to be held somewhere in an electromagnetic form that can be accessed by a computer processor once it is collected and/or compiled. Storage options include business-owned devices, cloud service providers, hybrids, off-site physical backup storage, etc. Risks can include faulty security practices, server/hardware failure, cloud service interruption, and data breaches.

Retrieval/publish

Data needs to be extracted to be processed, analyzed, and used to make decisions. Queries can be set up to display information directly online (such as via apps) or in specific formatted reports (for later online sharing and/or actual printing). Risks can include unauthorized access, coding errors and bugs, redirection of electronic reports, misdirection of electronic information for printing or forwarding to a wrong device, errors and omissions from the published report, etc. We will discuss more on liability of publishing in Chapter 6.

Process

This is the handling and treatment of data to produce meaningful information. Risks can include malware threats, coding errors and bugs, unauthorized access, inaccurate results, etc.

Archive

Data does have a certain lifetime. At some point it becomes outdated, obsolete, and/or replaced with newer, more relevant data. Considering the costs of data storage and the potential liability of outdated data getting out, a protocol should be developed to determine what data should be archived, when, where, and for how long. Risks can include unauthorized access, inability to access vital information when required, archive device failure, etc.

Destroy

Destroy means to make the data unreadable. In today’s online digital environment, can data really be destroyed, never to be seen again?13 Cache systems, multiple backups, tagging by others, computer forensics, and other technology practices sort of hint at “not possible.” However, that does not stop businesses from attempting to clear out data storage and retire old devices (or even giving devices away without deleting the data they contained, such as copiers with hard drives). Risks can include data ending up in the wrong hands, theft of disposed-of devices, violations of legal holds and discovery requests for litigation, etc.

URL: https://www.sciencedirect.com/science/article/pii/B9780124200555000050

How Elasticity Property Plays an Important Role in the Cloud

M.A.N. Bikas, ... M. Grechanik, in Advances in Computers, 2016

5.1 How to Maximize Resource Availability in the Cloud?

One of the main issues in cloud computing is the availability of resources in the cloud. Several factors must be addressed to ensure high availability of an application on the cloud, including hardware and software failures (specifically single points of failure), network vulnerabilities, power outages, conservations, and denial-of-service attacks. Efficient deprovisioning of resources helps avoid low availability of services (e.g., applications) on the cloud. We summarize some of the solutions that address the resource availability problem.

Armbrust et al. [1] discussed how availability is one of the top obstacles in cloud computing. Although the authors have analyzed the high-availability approaches used by cloud providers, they do not discuss any existing solutions that enable a cross-cloud resource provisioning model. Cloud providers (eg, GCP, Amazon Web Services, Microsoft Azure, GoGrid, and Rackspace) lack a common platform for cross-cloud provisioning.

Galante et al. [13] pointed out that the use of multiple clouds is one of the solutions to the resource availability issue. The CCIF [44] attempted to overcome the limitations of interoperability standards among various cloud platforms by introducing an open and standardized cloud interface to unify different cloud platform APIs. IEEE also has a similar project, P2301 (Guide for Cloud Portability and Interoperability) [45], whose purpose is to guide cloud users in developing and using cloud services against a common standard to increase portability and interoperability among different providers.

Buyya et al. [51] proposed an architecture for cloud federation to integrate distributed clouds to meet business requirements. A federated cloud enables cloud providers to manage and deploy several external and internal cloud computing services. For example, it allows a cloud to meet demand that exceeds its own capacity by renting resources from other cloud service providers.

Pawluk et al. [52] developed an initial step toward the idea of a cloud of clouds [53] to enable an automated cross-cloud resource provisioning platform. They proposed a broker service that enables cross-cloud provisioning to facilitate the construction of an application topology platform and its runtime modification according to the objectives of an application deployer. In most cases, acquired resources are assumed to be homogeneous; however, the authors eliminated this assumption to support an actual intercloud platform. An open project [54] for cross-cloud acquisition and VM management lets a developer select which of the available clouds to use, whereas Pawluk et al. [52] select the available clouds for the developer.

An attempt to define unified access to multiple clouds via a unified API has been advanced by Refs. [44,45,54], whereas Refs. [51,52] introduced a methodology for the federation of cloud computing platforms. Neither line of work discusses an implementation that automates the resource acquisition procedure via unified access to multiple clouds. References [44,45] presented limited support for interoperability and intercloud interfaces, but they did not provide any mechanism (e.g., an implementation) to automate the resource acquisition procedure via unified access to multiple clouds through APIs.

These studies have proposed solutions to the problems associated with resource availability of cloud systems. These solutions are very valuable to ensure the high availability of the services on the cloud systems to maintain elasticity.

URL: https://www.sciencedirect.com/science/article/pii/S0065245816300250

High Availability and a Data Warehouse

Lilian Hobbs, ... Pete Smith, in Oracle 10g Data Warehousing, 2005

17.8 Summary

In this chapter, we have discussed various aspects of improving the availability of the data warehouse. Oracle Database 10g provides features such as RAC to provide fault-tolerant operation in the face of hardware and software failures and allows logical and physical reorganization of data without requiring downtime. We discussed the role of disaster recovery in a data warehouse and also how the data warehouse fits into an enterprise disaster recovery strategy using Data Guard. Finally, we also touched upon the subject of information life-cycle management, which ensures that the data warehouse will continue to be cost effective even as data sizes grow.

URL: https://www.sciencedirect.com/science/article/pii/B9781555583224500199

What is the best definition of security incident?

An occurrence that actually or potentially jeopardizes the confidentiality, integrity, or availability of an information system or the information the system processes, stores, or transmits or that constitutes a violation or imminent threat of violation of security policies, security procedures, or acceptable use ...

Which type of security control includes backup and restore operations as well as fault tolerant data storage?

Recovery controls are designed to recover a system and return it to normal operation following an incident. Examples of recovery controls include system restoration, backups, rebooting, key escrow, insurance, redundant equipment, fault-tolerant systems, failovers, and contingency plans (BCPs).

What is fault tolerance quizlet?

Fault tolerant: a system that is able to continue operating despite the existence of one or more faults.
