
Energy Efficiency in Data Centers and Clouds

Farhad Mehdipour, ... Bahman Javadi, in Advances in Computers, 2016

3.3.3 Processing and Analysis Tools and Techniques

Big data processing is a set of techniques or programming models for accessing large-scale data to extract useful information that supports decision making. In the following, we review some of the tools and techniques that are available for big data analysis in datacenters.

As mentioned in the previous section, big data is usually stored on thousands of commodity servers, so traditional programming models such as the message passing interface (MPI) [40] cannot handle it effectively. Therefore, new parallel programming models are utilized to improve the performance of NoSQL databases in datacenters. MapReduce [17] is one of the most popular programming models for big data processing using large-scale commodity clusters. MapReduce was proposed by Google and further developed by Yahoo. Map and Reduce functions are programmed by users to process the big data distributed across multiple heterogeneous nodes. The main advantage of this programming model is its simplicity, so users can easily adopt it for big data processing. A number of wrappers have been developed for MapReduce; they provide better control over the MapReduce code and aid in source code development. Apache Pig is a structured query language (SQL)-like environment developed at Yahoo [41] and is used by many organizations such as Yahoo, Twitter, AOL, and LinkedIn. Hive is another MapReduce wrapper developed by Facebook [42]. These two wrappers provide a better environment and make code development simpler, since programmers do not have to deal with the complexities of MapReduce coding.
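To make the division of labor concrete, the following is a minimal, illustrative sketch (in Python, mirroring the Hadoop Streaming convention of plain key/value records) of user-written Map and Reduce functions for a word count; it runs locally in a single process and is not taken from the cited systems.

```python
# Minimal word-count sketch in the Hadoop Streaming style (illustrative only).
# mapper: emits (word, 1) pairs; reducer: sums the counts per word.
from itertools import groupby

def mapper(lines):
    """Map phase: split each input line into words and emit (word, 1)."""
    for line in lines:
        for word in line.strip().split():
            yield word, 1

def reducer(pairs):
    """Reduce phase: pairs arrive grouped (sorted) by key; sum the counts."""
    for word, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

if __name__ == "__main__":
    # Local, single-process run; on a cluster the framework would shuffle
    # and sort the intermediate pairs between the two phases.
    lines = ["big data needs big clusters", "big data needs new models"]
    for word, count in reducer(mapper(lines)):
        print(f"{word}\t{count}")
```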

Hadoop [43,44] is the open-source implementation of MapReduce and is widely used for big data processing. This software is even available through some Cloud providers, such as Amazon EMR [96], to create Hadoop clusters that process big data using Amazon EC2 resources [45]. Hadoop adopts the HDFS file system, which was explained in the previous section. By using this file system, data are located close to the processing node to minimize the communication overhead. Windows Azure also uses a MapReduce runtime called Daytona [46], which utilizes Azure's Cloud infrastructure as the scalable storage system for data processing.

There are several new implementations of Hadoop that aim to overcome its performance issues, such as slowness in loading data and the lack of data reuse [47,48]. For instance, Starfish [47] is a Hadoop-based framework that aims to improve the performance of MapReduce jobs by exploiting the data lifecycle in analytics. It also uses job profiling and workflow optimization to reduce the impact of unbalanced data during job execution. Starfish is a self-tuning system that adapts to user requirements and system workloads without requiring users to configure or change settings or parameters. Moreover, Starfish's Elastisizer can automate the decision making for creating optimized Hadoop clusters, using a mix of simulation and model-based estimation to find the best answers to what-if questions about workload performance.

Spark [49], developed at the University of California at Berkeley, is an alternative to Hadoop that is designed to overcome the disk I/O limitations and improve the performance of earlier systems. The major feature of Spark that makes it unique is its ability to perform in-memory computations. It allows data to be cached in memory, thus eliminating Hadoop's disk overhead limitation for iterative tasks. The Spark developers have also proposed an entire data processing stack called the Berkeley data analytics stack [50].
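As an illustration of the in-memory caching idea, the following PySpark sketch (assuming a local Spark installation; the dataset is generated in-process rather than read from HDFS) caches an RDD once and reuses it across several iterative passes.

```python
# Illustrative PySpark sketch: cache a dataset in memory and reuse it across
# iterations, avoiding repeated disk reads.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache-demo").master("local[*]").getOrCreate()
sc = spark.sparkContext

numbers = sc.parallelize(range(1_000_000)).cache()  # keep the RDD in memory

# Each iteration reuses the cached partitions instead of recomputing them.
for threshold in (10, 100, 1000):
    count = numbers.filter(lambda x: x % threshold == 0).count()
    print(threshold, count)

spark.stop()
```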

Similarly, there are other proposed techniques for profiling MapReduce applications to find possible bottlenecks and simulate various scenarios for performance analysis of the modified applications [48]. This trend reveals that a simple Hadoop setup would not be efficient for big data analytics, and that new tools and techniques to automate provisioning decisions should be designed and developed. This could become a new service (i.e., big data analytics as-a-service) provided by Cloud providers for automatic big data analytics on datacenters.

In addition to MapReduce, there are other existing programming models that can be used for big data processing in datacenters, such as Dryad [51] and Pregel [52]. Dryad is a distributed execution engine for running big data applications in the form of a directed acyclic graph (DAG). Operations in the vertices are run on clusters, where data are transferred using data channels including files, transmission control protocol (TCP) connections, and shared memory. Moreover, any type of data can be directly transferred between nodes. While MapReduce supports only a single input and output set, users can use any number of inputs and outputs in Dryad. Pregel is used by Google to process large-scale graphs for various purposes such as analysis of network graphs and social networking services. Applications are presented to Pregel as directed graphs in which each vertex holds a modifiable, user-defined value and each edge connects a source vertex to a destination vertex.
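The vertex-centric style that Pregel popularized can be sketched in a few lines; the following Python fragment is a conceptual illustration only (not Google's API), in which each vertex holds a modifiable value and exchanges messages with its neighbors over supersteps until the values stabilize.

```python
# Conceptual sketch of Pregel-style vertex-centric processing: each vertex
# holds a user-defined value, consumes messages from its incoming edges, and
# may emit messages along its outgoing edges.  Here every vertex converges on
# the maximum value propagated through the graph.
graph = {"a": ["b"], "b": ["c", "d"], "c": ["a"], "d": []}   # vertex -> out-edges
values = {"a": 3, "b": 6, "c": 2, "d": 1}                    # initial vertex values

# Superstep 0: every vertex broadcasts its value to its out-neighbors.
inbox = {v: [] for v in graph}
for v, out in graph.items():
    for dst in out:
        inbox[dst].append(values[v])

changed = True
while changed:                      # one loop iteration == one superstep
    changed = False
    next_inbox = {v: [] for v in graph}
    for v, msgs in inbox.items():
        if msgs and max(msgs) > values[v]:
            values[v] = max(msgs)   # vertex updates its own value
            changed = True
            for dst in graph[v]:    # and notifies its out-neighbors
                next_inbox[dst].append(values[v])
    inbox = next_inbox

print(values)   # all vertices now hold 6, the maximum propagated along the edges
```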

Read full chapter

URL: //www.sciencedirect.com/science/article/pii/S0065245815000613

High-Performance Techniques for Big Data Processing

Philipp Neumann Prof, Dr, Julian Kunkel Dr, in Knowledge Discovery in Big Data from Astronomy and Earth Observation, 2020

7.5.2 Data Metrics: the Five Vs

Big Data processing is typically defined and characterized through the five Vs. The volume of the data, measured in bytes, defines the amount of data produced or processed. The velocity at which data are generated and processed (e.g., bytes per second) is another characteristic. Both data volume and velocity also play a role for computing and have similar metrics in this regard, with data velocity rather defined by throughput. The variety gives information on the diversity of data that are collected. This covers data format and structure (structured like a database, unstructured like human-generated text/speech, or semistructured like HTML). Data can be rather similar (e.g., when collecting measurements through the same apparatuses from different sources) or extremely different, without any kind of obvious relation, with such relations sometimes turning out to exist, and to be essential, only after some processing of the data. Besides these three Vs (Pettey and Goasduff, 2011; Laney, 2001), two more characteristics have evolved that are frequently referred to in the Big Data context. The validity denotes the quality or actual trustworthiness of the data. For example, damaged data or incorrect values due to wrong measurements may deteriorate a dataset. Finally, the value of data corresponds to the actual meaning of the data in a prescribed context. For example, data on customer satisfaction are very valuable for a company.

Big Data typically implies high-volume, nonstatic, and frequently updated (velocity) data that involve a variety of data formats, particularly unstructured data, potentially wrong data (validity), and low value in its origin. Refinement is hence required to add value. Overviews of the Vs for Earth observation data can be found in Guo et al. (2015) and Nativi et al. (2015). In particular, Nativi et al. (2015) conclude, among other things, that heterogeneity “is really perceived as the most important challenge for Earth Sciences and Observation infrastructures.”

Read full chapter

URL: //www.sciencedirect.com/science/article/pii/B9780128191545000175

Exploring the Evolution of Big Data Technologies

Stephen Bonner, ... Georgios Theodoropoulos, in Software Architecture for Big Data and the Cloud, 2017

14.7.2.3 Simplifying data centric development

Big data processing is typically done on large clusters of shared-nothing commodity machines. One of the key lessons from MapReduce is that it is imperative to develop a programming model that hides the complexity of the underlying system, but provides flexibility by allowing users to extend functionality to meet a variety of computational requirements. Whilst a MapReduce application, when compared with an MPI application, is less complex to create, it can still require a significant amount of coding effort. As data intensive frameworks have evolved, there has been an increasing number of higher-level APIs designed to further decrease the complexity of creating data intensive applications. Current data intensive frameworks, such as Spark, have been very successful at reducing the amount of code required to create a specific application. Future data intensive framework APIs will continue to improve in four key areas: exposing more optimal routines to users, allowing transparent access to disparate data sources, the use of graphical user interfaces (GUIs), and allowing interoperability between heterogeneous hardware resources.

Future higher-level APIs will continue to allow data intensive frameworks to expose optimized routines to application developers, enabling increased performance with minimal effort from the end user. Systems like Spark's DataFrame API have proved that, with careful design, a high-level API can decrease complexity for users while massively increasing performance over lower-level APIs.
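For example, a grouped aggregation that would take noticeably more code with low-level primitives is only a few lines with the DataFrame API; the sketch below is illustrative and assumes a local PySpark installation and made-up data.

```python
# Illustrative use of Spark's DataFrame API: a few lines of high-level code
# express a grouped aggregation that the framework can optimize internally.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("dataframe-demo").master("local[*]").getOrCreate()

events = spark.createDataFrame(
    [("alice", "click", 3), ("bob", "view", 7), ("alice", "view", 2)],
    ["user", "action", "duration"],
)

summary = (events
           .groupBy("user")
           .agg(F.count("*").alias("events"),
                F.avg("duration").alias("avg_duration")))
summary.show()
spark.stop()
```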

Future big data applications will require access to an increasingly diverse range of data sources. Future APIs will need to hide this complexity from the end user and allow seamless integration of different data sources (structured and semi- or unstructured data) being read from a range of locations (HDFS, stream sources, and databases).

One relatively unexplored way to lower the barrier of entry to data intensive computing is the creation of GUIs to allow users without programming or query-writing experience access to data intensive frameworks. The use of a GUI also raises other interesting possibilities such as real-time interaction with and visualization of datasets.

APIs will also need to continue to develop in order to hide the complexities of increasingly heterogeneous hardware. If coprocessors are to be used in future big data machines, the data intensive framework APIs will, ideally, hide this from the end user. Users should be able to write their application code, and the framework would select the most appropriate hardware to run it upon. This could also include pushing all or part of the workload into the cloud as needed.

For system administrators, the deployment of data intensive frameworks onto computer hardware can still be a complicated process, especially if an extensive stack is required. Future research is required to investigate methods to automatically deploy a modern big data stack onto computer hardware. These systems should also set and optimize the myriad of configuration parameters that can have a large impact on system performance. One early attempt in this direction is Apache Ambari, although further work still needs undertaking, such as integration of the system with cloud infrastructure. Could a system of this type automatically deploy a custom data intensive software stack onto the cloud when a local resource becomes full, and run applications in tandem with the local resource?

Read full chapter

URL: //www.sciencedirect.com/science/article/pii/B9780128054673000144

Data-Driven Architecture for Big Data

Krish Krishnan, in Data Warehousing in the Age of Big Data, 2013

Processing Big Data

Big Data processing involves steps very similar to processing data in the transactional or data warehouse environments. Figure 11.5 shows the different stages involved in the processing of Big Data; the approach to processing Big Data is:

Figure 11.5. Processing Big Data.

Gather the data.

Analyze the data.

Process the data.

Distribute the data.

While the stages are similar to traditional data processing, the key differences are:

Data is first analyzed and then processed.

Data standardization occurs in the analyze stage, which forms the foundation for the distribute stage where the data warehouse integration happens.

There is no special emphasis on data quality except for the use of metadata, master data, and semantic libraries to enhance and enrich the data.

Data is prepared in the analyze stage for further processing and integration.

The stages and their activities are described in the following sections in detail, including the use of metadata, master data, and governance processes.

Gather stage

Data is acquired from multiple sources including real-time systems, near-real-time systems, and batch-oriented applications. The data is collected and loaded to a storage environment like Hadoop or NoSQL. Another option is to process the data through a knowledge discovery platform and store the output rather than the whole data set.

Analysis stage

The analysis stage is the data discovery stage for processing Big Data and preparing it for integration into the structured analytical platforms or the data warehouse. The analysis stage consists of tagging, classification, and categorization of data, which closely resembles the subject-area creation and data-model definition stage in the data warehouse.

Tagging—a common practice that has been prevalent since 2003 on the Internet for data sharing. Tagging is the process of applying a term to an unstructured piece of information that will provide a metadata-like attribution to the data. Tagging creates a rich nonhierarchical data set that can be used to process the data downstream in the process stage.

Classify—unstructured data comes from multiple sources and is stored in the gathering process. Classification helps to group data into subject-oriented data sets for ease of processing. For example, classifying all customer data in one group helps optimize the processing of unstructured customer data.

Categorize—the process of categorization is the external organization of data from a storage perspective where the data is physically grouped by both the classification and then the data type. Categorization will be useful in managing the life cycle of the data since the data is stored as a write-once model in the storage layer.
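A toy sketch of the analyze stage is shown below; the keyword rules and records are hypothetical, but they illustrate how tagging attaches metadata-like terms to unstructured text and how classification then groups records into subject-oriented sets.

```python
# Hypothetical sketch of the analyze stage: tag unstructured records with
# metadata-like terms, then classify them into subject-oriented groups.
from collections import defaultdict

TAG_RULES = {
    "customer": ("account", "refund", "service"),
    "product":  ("warranty", "model", "battery"),
}

def tag(text):
    """Attach every tag whose keywords appear in the text."""
    lowered = text.lower()
    return {t for t, words in TAG_RULES.items() if any(w in lowered for w in words)}

def classify(records):
    """Group tagged records into subject-oriented data sets."""
    groups = defaultdict(list)
    for rec in records:
        for t in tag(rec) or {"unclassified"}:
            groups[t].append(rec)
    return groups

records = [
    "Please refund my account, the service was poor.",
    "The battery on this model drains overnight.",
]
for subject, recs in classify(records).items():
    print(subject, "->", len(recs), "record(s)")
```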

Process stage

Processing Big Data involves several substages, and the data transformation performed at each substage determines whether the output produced is correct or incorrect.

Context processing

Context processing relates to exploring the context in which data occurs within the unstructured or Big Data environment. The relevancy of the context helps in processing the appropriate metadata and master data set along with the Big Data. The biggest advantage of this kind of processing is the ability to process the same data for multiple contexts and then to look for patterns within each result set for further data mining and data exploration.

Care should be taken to process the right context for the occurrence. For example, consider the abbreviation “ha” used by all doctors. Without applying the context of where the pattern occurred, it is easily possible to produce noise or garbage as output. If the word occurred in the notes of a heart specialist, it will mean “heart attack” as opposed to a neurosurgeon who will have meant “headache.”
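The following Python fragment illustrates this kind of context processing; the specialty-to-expansion mapping is hypothetical, but it shows how the same token is resolved differently depending on the context in which it occurs.

```python
# Illustrative sketch of context processing: the same token "ha" is resolved
# differently depending on the specialty of the clinician whose notes it
# appears in.  The mapping is hypothetical.
CONTEXT_EXPANSIONS = {
    "cardiology":   {"ha": "heart attack"},
    "neurosurgery": {"ha": "headache"},
}

def expand(note, specialty):
    """Replace ambiguous abbreviations using the context-specific dictionary."""
    expansions = CONTEXT_EXPANSIONS.get(specialty, {})
    return " ".join(expansions.get(tok.lower(), tok) for tok in note.split())

print(expand("patient reports ha after exertion", "cardiology"))    # heart attack
print(expand("patient reports ha after surgery", "neurosurgery"))   # headache
```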

You can apply several rules for processing on the same data set based on the contextualization and the patterns you will look for. The next step after contextualization of data is to cleanse and standardize data with metadata, master data, and semantic libraries as the preparation for integrating with the data warehouse and other applications. This is discussed in the next section.

Metadata, master data, and semantic linkage

The most important step in creating the integration of Big Data into a data warehouse is the ability to use metadata, semantic libraries, and master data as the integration links. This step is initiated once the data is tagged and additional processing such as geocoding and contextualization are completed. The next step of processing is to link the data to the enterprise data set. There are many techniques to link the data between structured and unstructured data sets with metadata and master data. This process is the first important step in converting and integrating the unstructured and raw data into a structured format.

Linkage of different units of data from multiple data sets is not a new concept by itself. Figure 11.6 shows a common kind of linkage that is foundational in the world of relational data—referential integrity.

Figure 11.6. Database linkage.

Referential integrity provides the primary key and foreign key relationships in a traditional database and also enforces a strong linking concept that is binary in nature, where the relationship exists or does not exist.

Figure 11.6 shows the example of departments and employees in any company. If John Doe is an employee of the company, then there will be a relationship between the employee and the department to which he belongs. If John Doe is actively employed, then there is a strong relationship between the employee and department. If he has left or retired from the company, there will be historical data for him but no current record between the employee and department data. The model shows the relationship that John Doe has with the company: he is either an employee or not, so the probability of the relationship is either 1 or 0, respectively.

When we examine the data from the unstructured world, there are many probabilistic links that can be found within the data and its connection to the data in the structured world. This is the primary difference between the data linkage in Big Data and the RDBMS data.

Figure 11.7 shows an example of integrating Big Data and the data warehouse to create the next-generation data warehouse. This is an example of linking a customer’s electric bill with the data in the ERP system. The linkage here is both binary and probabilistic in nature. This is due to the customer data being present across both the systems.

Figure 11.7. Connecting Big Data with data warehouse.

A probabilistic link is based on the theory of probability: a relationship can potentially exist, but there is no binary confirmation of whether the probability is 100% or 10% (Figure 11.8). The higher the probability score, the more likely it is that a relationship exists between the different data sets; the lower the score, the lower the confidence. Additionally, there is a factor of randomness that we need to consider when applying the theory of probability. In a nutshell, we will either discover extremely strong relationships or no relationships. Adding metadata, master data, and semantic technologies will enable more positive trends in the discovery of strong relationships.

Figure 11.8. Probabilistic linkage.

Types of probabilistic links

There are multiple types of probabilistic links and depending on the data type and the relevance of the relationships, we can implement one or a combination of linkage approaches with metadata and master data.

Consider two texts: “long John is a better donut to eat” and “John Smith lives in Arizona.” If we run a metadata-based linkage between them, the common word that is found is “John,” and the two texts will be related even though there is effectively no real linkage or relationship between them. This represents a poor link, also called a weak link.

On the other hand, consider two other texts: “Blink University has released the latest winners list for Dean’s list, at deanslist.blinku.edu” and “Contact the Dean’s staff via deanslist.blinku.edu.” The email address becomes the linkage and can be used to join these two texts and additionally connect the record to a student or dean’s subject areas in the higher-education ERP platform. This represents a strong link. The presence of a strong linkage between Big Data and the data warehouse does not mean that a clearly defined business relationship exists between the environments; rather, it is indicative of a type of join within some context being present.
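A simple way to picture the difference between weak and strong links is sketched below; the key-detection heuristic and the strength labels are illustrative only, not a production matching algorithm.

```python
# Illustrative sketch of weak vs. strong links between two texts.  A link
# through a common word such as "John" is weak; a link through a shared
# identifying key such as an email address or host name is strong.
def keys(text):
    """Tokens that look like identifying keys (email addresses, host names)."""
    tokens = (tok.strip(".,") for tok in text.lower().split())
    return {t for t in tokens if "@" in t or t.count(".") >= 2}

def words(text):
    return {tok.strip(".,'\"") for tok in text.lower().split()}

def link(text_a, text_b):
    """Return a (strength, evidence) pair for a candidate link."""
    shared_keys = keys(text_a) & keys(text_b)
    if shared_keys:
        return "strong", shared_keys
    shared_words = words(text_a) & words(text_b)
    return ("weak" if shared_words else "none"), shared_words

print(link("long John is a better donut to eat",
           "John Smith lives in Arizona"))                       # weak, via "john"
print(link("The winners list is at deanslist.blinku.edu",
           "Contact the Dean's staff via deanslist.blinku.edu"))  # strong, via the address
```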

Consider a text or an email:

From: [email protected]

Subject: bill payment

Dear sir, we are very sorry to inform you that due to your poor customer service we are moving our business elsewhere.

Regards, John Doe

With the customer email address we can always link and process the data with the structured data in the data warehouse. This link is static in nature, as the customer will always update his or her email address. This link is also called a static link. Static links can become a maintenance nightmare if a customer changes his or her information multiple times in a period of time. This is worse if the change is made from an application that is not connected to the current platform. It is easy to process and create static linkages using master data sets.

Another type of linkage that is more common in processing Big Data is called a dynamic link. A dynamic relationship is created on-the-fly in the Big Data environment by a query. When a query executes, it first finds one part of the linkage in the unstructured data and then looks for the other part in the structured data. The linkage is complete when the relationship is not a weak probability. In probabilistic linking we will use metadata and semantic data libraries to discover the links in Big Data and apply the master data set when we process the data in the staging area.

Though linkage processing is the best technique known today for processing textual and semi-structured data, its reliance upon quality metadata and master data along with external semantic libraries proves to be a challenge. This can be overcome over a period of time as the data is processed effectively through the system multiple times, increasing the quality and volume of content available for reference processing.

To effectively create the metadata-based integration, a checklist will help create the roadmap:

1.

Definition:

Data element definitions

Data element business names

Data element abbreviations/acronyms

Data element types and sizes

Data element sources

Data-quality observations

2.

Outline the objectives of the metadata strategy:

Goals of the integration

Interchange formats

Data-quality goals

Data scalability of processing

3.

Define the scope of the metadata strategy:

Enterprise or departmental

4.

Define ownership:

Who is the steward of the metadata?

Who is the program sponsor?

Who will sign off on the documents and tests?

5.

Define stewardship:

Who owns the metadata processes and standards?

What are the constraints today to process metadata?

6.

Master repository:

A best-practice strategy is to adopt the concept of a master repository of metadata.

This approach should be documented, as well as the location and tool used to store the metadata. If the repository is to be replicated, then the extent of this should also be noted.

7.

Metadata maintenance process:

Explain how the maintenance of metadata is achieved.

The extent to which the maintenance of metadata is integrated in the warehouse development life cycle and versioning of metadata.

Who maintains the metadata (e.g., Can users maintain it? Can users record comments or data-quality observations?).

8.

User access to metadata:

How will users interact and use the metadata?

Once the data is processed through the metadata stage, a second pass is normally required with the master data set and semantic library to cleanse the data that was just processed, along with its applicable contexts and rules.

Standardize

Preparing and processing Big Data for integration with the data warehouse requires standardizing the data, which will improve its quality. Standardization requires processing the data with master data components. In this processing, if any keys are found in the data set, they are replaced with the master data definitions. For example, if you take data from a social media platform, the chance of finding keys or data attributes that can link to the master data is low, and the matches will most likely be limited to geography and calendar data. But if you are processing data that is owned by the enterprise, such as contracts, customer data, or product data, the chances of finding matches with the master data are extremely high, and the data output from the standardization process can be easily integrated into the data warehouse.

This process can be repeated multiple times for a given data set, as the business rule for each component is different.
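A minimal sketch of this standardization step is shown below; the master data table, keys, and records are hypothetical, but they illustrate how keys found in the processed data are replaced with, and enriched by, master data definitions.

```python
# Hypothetical sketch of the standardize step: keys found in the processed
# data are replaced with their master data definitions so that the output
# can be integrated into the data warehouse.
MASTER_CUSTOMERS = {
    "CUST-001": {"name": "John Doe", "segment": "Retail", "region": "AZ"},
}

def standardize(record):
    """Replace a customer key with the master data definition, if one exists."""
    master = MASTER_CUSTOMERS.get(record.get("customer_key"))
    if master:
        return {**record, **master, "standardized": True}
    # e.g., social media data with no enterprise key: only geo/calendar matches
    return {**record, "standardized": False}

print(standardize({"customer_key": "CUST-001", "complaint": "late delivery"}))
print(standardize({"post": "great product!", "geo": "US-AZ"}))
```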

Distribute stage

Big Data is distributed to downstream systems by processing it within analytical applications and reporting systems. Using the data processing outputs from the processing stage where the metadata, master data, and metatags are available, the data is loaded into these systems for further processing. Another distribution technique involves exporting the data as flat files for use in other applications like web reporting and content management platforms.

The focus of this section was to provide readers with insights into how, by using a data-driven approach and incorporating master data and metadata, you can create the strong, scalable, and flexible data processing architecture needed for processing and integrating Big Data with the data warehouse. There are additional layers of hidden complexity that are addressed as each system is implemented, since the complexities differ widely between different systems and applications. In the next section we will discuss the use of machine learning techniques to process Big Data.

Read full chapter

URL: //www.sciencedirect.com/science/article/pii/B9780124058910000118

A Deep Dive into NoSQL Databases: The Use Cases and Applications

Pethuru Raj, in Advances in Computers, 2018

4.1.1 Apache Hadoop Framework

Apache Hadoop is a big data processing framework that exclusively provides batch processing. The latest versions of Hadoop have been empowered with a number of powerful components or layers that work together to process batched big data:

HDFS: This is the distributed file system layer that coordinates storage and replication across the cluster nodes. HDFS is fault tolerant and highly available. It is used as the source of data, to store intermediate processed results, and to persist the final calculated results.

YARN (Yet another resource negotiator) is the cluster coordinating component of the Hadoop stack. It is responsible for coordinating and managing the underlying resources and scheduling jobs to be run.

MapReduce is Hadoop's native batch processing engine.

A MapReduce job splits a large dataset into independent chunks and organizes them into key and value pairs for parallel processing. The mapping and reducing functions receive not just values, but (key, value) pairs. This parallel processing improves the speed and reliability of the cluster, returning solutions more quickly.
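The (key, value) convention can be illustrated with a short sketch; the log data below is made up, and the offset-keyed input records mimic, but do not reproduce, Hadoop's default text input format.

```python
# Illustrative sketch of the (key, value) convention: each input record
# reaches the map function as a (key, value) pair -- here the byte offset of
# a log line and the line itself.
from itertools import groupby

log = "200 /index.html\n404 /missing\n200 /about\n"

def input_records(text):
    """Yield (offset, line) pairs, the (key, value) input to the map phase."""
    offset = 0
    for line in text.splitlines(keepends=True):
        yield offset, line.strip()
        offset += len(line)

def map_fn(offset, line):
    yield line.split()[0], 1          # intermediate pairs keyed by HTTP status

intermediate = [kv for off, line in input_records(log) for kv in map_fn(off, line)]
for status, group in groupby(sorted(intermediate), key=lambda kv: kv[0]):
    print(status, sum(v for _, v in group))     # reduce: count per status code
```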

Read full chapter

URL: //www.sciencedirect.com/science/article/pii/S0065245817300475

Infrastructure and technology

Krish Krishnan, in Building Big Data Applications, 2020

Big data processing requirements

What is unique about big data processing? What makes it different or mandates new thinking? To understand this better, let us look at the underlying requirements. We can classify big data requirements based on its characteristics:

Volume

Size of data to be processed is large; it needs to be broken into manageable chunks

Data needs to be processed in parallel across multiple systems

Data needs to be processed across several program modules simultaneously

Data needs to be processed once and processed to completion due to volumes

Data needs to be processed from any point of failure, since the data is too large to restart the process from the beginning

Velocity

Data needs to be processed at streaming speeds during data collection

Data needs to be processed for multiple acquisition points

Variety

Data of different formats needs to be processed

Data of different types needs to be processed

Data of different structures need to be processed

Data from different regions need to be processed

Complexity

Big data complexity needs to use many algorithms to process data quickly and efficiently

Several types of data need multi-pass processing and scalability is extremely important

Read full chapter

URL: //www.sciencedirect.com/science/article/pii/B9780128157466000028

Introducing Big Data Technologies

Krish Krishnan, in Data Warehousing in the Age of Big Data, 2013

Big Data processing requirements

What is unique about Big Data processing? What makes it different or mandates new thinking? To understand this better let us look at the underlying requirements. We can classify Big Data requirements based on its five main characteristics:

Volume:

Size of data to be processed is large—it needs to be broken into manageable chunks.

Data needs to be processed in parallel across multiple systems.

Data needs to be processed across several program modules simultaneously.

Data needs to be processed once and processed to completion due to volumes.

Data needs to be processed from any point of failure, since the data is too large to restart the process from the beginning.

Velocity:

Data needs to be processed at streaming speeds during data collection.

Data needs to be processed for multiple acquisition points.

Variety:

Data of different formats needs to be processed.

Data of different types needs to be processed.

Data of different structures needs to be processed.

Data from different regions needs to be processed.

Ambiguity:

Big Data is ambiguous by nature due to the lack of relevant metadata and context in many cases. An example is the use of M and F in a sentence—it can mean, respectively, Monday and Friday, male and female, or mother and father.

Big Data that is within the corporation also exhibits this ambiguity to a lesser degree. For example, employment agreements have standard and custom sections and the latter is ambiguous without the right context.

Complexity:

Big Data complexity needs to use many algorithms to process data quickly and efficiently.

Several types of data need multipass processing and scalability is extremely important.

Processing large-scale data requires an extremely high-performance computing environment that can be managed with the greatest ease and can be performance-tuned with linear scalability.

Read full chapter

URL: //www.sciencedirect.com/science/article/pii/B9780124058910000040

A Taxonomy and Survey of Stream Processing Systems

Xinwei Zhao, ... Rajkumar Buyya, in Software Architecture for Big Data and the Cloud, 2017

11.4.3.4 Spring XD

Spring XD is a unified big data processing engine, which means it can be used either for batch data processing or for real-time streaming data processing. It is free and open source, released under the Apache license. The goal of Spring XD is to simplify the development of big data applications. Spring XD uses cluster technology to build up its core architecture. The entire structure is similar to the general model discussed in the previous section, consisting of a source, a cluster of processing nodes, and a sink. However, Spring XD uses the term XD nodes to represent both the source nodes and the processing nodes. XD nodes can be either the entry point (source) or the exit point (sink) of streams. The XD admin plays the role of a centralized task controller that undertakes tasks such as scheduling, deploying, and distributing messages. Since Spring XD is a unified system, it has special components, called taps and jobs, to address the different requirements of batch processing and real-time stream processing of incoming data streams. Taps provide a noninvasive way to consume stream data to perform real-time analytics. The term noninvasive means that taps do not affect the content of the original streams.
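The pipeline-and-tap idea can be illustrated generically; the Python sketch below is a conceptual stand-in, not the Spring XD DSL or API, showing a source feeding a processor and a sink while a tap observes the stream without altering it.

```python
# Conceptual sketch of the source -> processors -> sink pipeline and of a
# "tap" that consumes the stream without altering it (generic illustration,
# not the Spring XD API).
def source():
    for reading in (3, 8, 5, 12):          # stand-in for an ingesting XD node
        yield reading

def processor(stream):
    for value in stream:
        yield value * 2                    # stand-in processing step

def tap(stream, analytics):
    """Pass every item through untouched while feeding a copy to analytics."""
    for item in stream:
        analytics(item)                    # real-time analytics on the side
        yield item                         # original stream content unchanged

def sink(stream):
    print("sink received:", list(stream))

running_max = []
sink(tap(processor(source()),
         lambda x: running_max.append(max(running_max + [x]))))
print("tap analytics (running max):", running_max)
```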

Read full chapter

URL: //www.sciencedirect.com/science/article/pii/B9780128054673000119

System Optimization for Big Data Processing

R. Li, ... K. Li, in Big Data, 2016

9.8 Conclusions and Future Directions

Hadoop has become the most important platform for Big Data processing, while MapReduce on top of Hadoop is a popular parallel programming model. This chapter discusses the optimization technologies of Hadoop and MapReduce, including the MapReduce parallel computing framework optimization, task scheduling optimization, HDFS optimization, HBase optimization, and feature enhancement of Hadoop. Based on the analysis of the advantages and disadvantages of the current schemes and methods, we present the future research directions for the system optimization of Big Data processing as follows:

1.

Implementation and optimization of a new generation of the MapReduce programming model that is more general. Improvements to the MapReduce programming model are generally confined to a particular aspect, and thus a shared memory platform was needed. The implementation and optimization of the MapReduce model on a distributed mobile platform will be an important research direction.

2.

A task-scheduling algorithm that is based on efficiency and equity. The existing Hadoop scheduling algorithms focus largely on equity. However, the computation in real applications often requires higher efficiency. Designing fairer and more efficient scheduling algorithms that take into account the system resources and the current state of the workload remains an important research direction.

3.

Data access platform optimization. At present, HDFS and HBase can support structured and unstructured data. However, the rapid generation of Big Data places more real-time requirements on the underlying access platform. Hence, the design of an access platform with high efficiency, low delay, and support for complex data types becomes more challenging.

4.

Hadoop optimization based on multicore and high-speed storage devices. Future research should consider the characteristics of the Big Data system, integrating multicore technologies, multi-GPU models, and new storage devices into Hadoop for further performance enhancement of the system.

Read full chapter

URL: //www.sciencedirect.com/science/article/pii/B978012805394200009X

Challenges in Storing and Processing Big Data Using Hadoop and Spark

Shaik Abdul Khalandar Basha MTech, ... Dharmendra Singh Rajput PhD, in Deep Learning and Parallel Computing Environment for Bioengineering Systems, 2019

11.6 Apache Storm

It is a distributed real-time big data processing system designed to process vast amounts of data in a fault-tolerant and horizontally scalable manner with very high ingestion rates [16]. It manages the distributed environment and cluster state via Apache ZooKeeper. It reads a raw stream of real-time data at one end, passes it through a sequence of small processing units, and outputs useful information at the other end. The components in Fig. 11.7 represent the core concepts of Apache Storm.

Fig. 11.7. Application process of Apache Storm.

One of the main highlights of Apache Storm is that it is a fast, fault-tolerant distributed application with no single point of failure (SPOF) [17]. The important high-level components in each Supervisor node include the topology, which runs distributed across multiple worker processes on multiple worker nodes; the spout, which reads tuples off a messaging framework and emits them as a stream of messages, or may connect to the Twitter API and emit a stream of tweets; and the bolt, which is the smallest unit of processing logic within a topology. The output of a bolt can be fed into another bolt as input in a topology.
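The spout/bolt flow can be pictured with a plain-Python stand-in; the sketch below is conceptual only (Storm's actual API is different and typically used from Java), with made-up tweet text.

```python
# Conceptual illustration of a spout -> bolt -> bolt topology: the spout
# emits a stream of tuples, the first bolt splits them into words, and a
# second bolt aggregates word counts.
from collections import Counter

def tweet_spout():
    """Spout: emits a stream of raw tuples (here, fake tweet texts)."""
    yield from ("storm processes streams", "streams never stop")

def split_bolt(tuples):
    """Bolt 1: smallest processing unit; splits each tweet into word tuples."""
    for text in tuples:
        yield from text.split()

def count_bolt(words):
    """Bolt 2: consumes the output of bolt 1 and aggregates word counts."""
    return Counter(words)

print(count_bolt(split_bolt(tweet_spout())))
```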

Read full chapter

URL: //www.sciencedirect.com/science/article/pii/B978012816718200018X
