Big Data, Big Data Analytics and the Hadoop Framework

Taruna

Assistant Professor, Department of CS, Kalinga University, New Raipur

Big data is data that is too big, too fast, or too hard for existing tools to process. Organizations today must deal with petabyte-scale collections of data coming from sources such as clickstreams, transaction histories and sensors, and this data must be processed quickly so that results can be generated and action taken in real time. The vast quantities of data generated daily have led to the development of frameworks built to handle data at this scale. Apache Hadoop is one such framework. It is an open-source implementation of Google's MapReduce model and a highly available engine that combines processing with storage: Hadoop uses the Hadoop Distributed File System (HDFS) for storage and MapReduce for processing. The Hadoop ecosystem is managed by the Apache Software Foundation.

MapReduce is a simple model for processing large amounts of data in a distributed, parallel manner; a single MapReduce cluster can contain thousands of machines. Data is stored in HDFS and sent to MapReduce for processing. Hadoop thus provides a framework for working with petabytes of data distributed over a large number of nodes, and its design favors storing a small number of large files rather than a large number of small files. A major advantage of Hadoop is that it provides both storage and processing, and it detects and handles failures at the application layer rather than depending on hardware to do so.

HDFS (Hadoop Distributed File System): HDFS handles the storage part of Hadoop. It distributes multiple copies of each data block across the nodes for better reliability and faster computation. HDFS follows a master/slave architecture and stores the data used for computation. It has two main node types: the NameNode and the DataNodes. MapReduce fetches the data stored in HDFS for computation.
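As a concrete illustration, the sketch below uses the HDFS Java client API to write and then read a small file. The file path is hypothetical, and it is assumed that a core-site.xml on the classpath points fs.defaultFS at the cluster's NameNode:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class HdfsReadWrite {
    public static void main(String[] args) throws Exception {
        // Assumes core-site.xml on the classpath sets fs.defaultFS
        // to the cluster's NameNode (e.g. hdfs://namenode:9000).
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/tmp/hello.txt"); // hypothetical path

        // Write: the client asks the NameNode where to place blocks,
        // then streams the bytes directly to the chosen DataNodes.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
        }

        // Read: the NameNode returns block locations; the bytes are
        // fetched from the DataNodes that hold the replicas.
        try (FSDataInputStream in = fs.open(file);
             BufferedReader reader = new BufferedReader(
                     new InputStreamReader(in, StandardCharsets.UTF_8))) {
            System.out.println(reader.readLine());
        }
    }
}
```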

Hadoop MapReduce: MapReduce is the processing framework of Hadoop. MapReduce nodes can process huge amounts of data in parallel. The data is processed in two stages: the Map stage, which transforms input records into intermediate key-value pairs, and the Reduce stage, which aggregates the values emitted for each key.
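To make the two stages concrete, here is a minimal version of the classic word-count job written against the Hadoop MapReduce Java API; the input and output paths are supplied as command-line arguments:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map stage: emit (word, 1) for every word in the input split.
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce stage: sum the counts emitted for each word.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // local pre-aggregation
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

The combiner line is an optional optimization: it pre-aggregates counts on each mapper node, reducing the amount of intermediate data shuffled to the reducers.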

Name node – The NameNode is the master node in the HDFS master–slave architecture. Its main tasks are to monitor the DataNodes, assign tasks to them, and maintain the filesystem metadata: the number of data blocks, block locations, free nodes, storage information, replication details, and so on. This metadata enables faster retrieval of the data.

Data nodes – The DataNodes are the slaves to the master node and are deployed on every machine in the cluster. They store the actual data; the NameNode, as the master, keeps only the metadata. The DataNodes are in direct contact with the NameNode and serve the read and write requests made by clients. They report their status to the NameNode by sending periodic heartbeat signals, and they create and delete block replicas as instructed by the NameNode.
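This division of labor is visible from a client program. The hypothetical sketch below asks the NameNode which DataNodes hold each block of a given file; the NameNode answers from its metadata alone, and no DataNode is contacted until actual bytes are read:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockLocations {
    public static void main(String[] args) throws Exception {
        // Assumes fs.defaultFS in core-site.xml points at the NameNode.
        FileSystem fs = FileSystem.get(new Configuration());

        Path file = new Path(args[0]); // e.g. /tmp/hello.txt (hypothetical)
        FileStatus status = fs.getFileStatus(file);

        // The NameNode answers this query from its metadata:
        // for each block, which DataNodes hold a replica.
        BlockLocation[] blocks =
                fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.printf("offset=%d length=%d hosts=%s%n",
                    block.getOffset(), block.getLength(),
                    String.join(",", block.getHosts()));
        }
    }
}
```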

It is said that the amount of information in the world doubles every two years. Social networking sites such as Facebook, Twitter, Google and YouTube generate huge amounts of data, and storing and retrieving this information efficiently is a difficult task. Not only is this data huge in size, it is also highly unstructured and arrives in a variety of formats. Traditional databases can store and manipulate large quantities of data, but that data is mostly structured; they cannot cope with modern data that is unstructured and huge in size. Such data is known as Big Data, and it is characterized by Volume, Velocity and Variety. There is thus an exponential growth of data, both structured and unstructured, generated from a variety of sources such as social media, e-mails, sensors, flight data and surveillance reports. Big Data can be structured, semi-structured or unstructured, and is generated by many digital and non-digital sources. Using Big Data effectively is a challenge, since it drives decision making through technology and data mining techniques.

Big Data analytics can be defined as the process by which companies examine big data sets comprising a range of data types. With Big Data analytics it is possible for companies to analyze large volumes of data precisely and reveal hidden patterns, correlations and trends, helping them understand consumer preferences and other information.

Architecture of Hadoop

Hadoop is a framework that utilizes large clusters of commodity hardware for storing and maintaining data of big size. Hadoop is built on the MapReduce programming model introduced by Google, and it is used by many companies, such as Facebook, Yahoo and Netflix, to manage big data. The architecture of Hadoop consists of four components: MapReduce, HDFS, YARN, and Hadoop Common.

CONCLUSION

Apache Hadoop was created out of necessity as data from the web exploded and grew beyond the ability of traditional systems to handle it. Built to handle this avalanche of data, Hadoop is now the most popular standard for storing, processing and analyzing data at terabyte and petabyte scale. Apache Hadoop is open source and pioneered a new way of storing and processing data. Rather than relying on expensive, proprietary hardware and separate systems to store and process data, it enables distributed parallel processing of huge amounts of data across inexpensive, industry-standard servers that handle both storage and processing, and it is highly scalable.
