Hello friends, how are you? It is good to see you again. This article is a continuation of our previous article on MapReduce, so if you are a first-timer, I would recommend that you read the previous article first. Every technology follows some architecture, so don't you wonder about the MapReduce architecture? Let us begin with the architecture first. The MapReduce architecture consists of three different types of servers, each with its own unique tasks.
The master server assigns the user tasks to the map servers and reduce servers, and tracks or monitors the state of these tasks.
The map servers accept the user inputs, perform map operations on them, and write the results to intermediate files.
The reduce servers accept the intermediate files from the map servers and perform reduce operations on them.
So for example, if you want to count the occurrences of every word across all the web pages, you will have to feed all the stored pages from GFS to MapReduce. This won't happen with the help of one or two machines; thousands of machines will be used simultaneously, and all the scheduling, failure handling, coordination, maintenance, and data transfer will be done automatically. If we summarize the whole process as a timeline, it starts from GFS and leads to Map, from Map to Shuffle, from Shuffle to Reduce, and from Reduce to Storage; the results then come back to GFS. A map transforms one view of the data into another, producing key-value pairs. In our example, the key-value pair is a word and its count. The shuffle groups the pairs by key, and the reduce sums up all the values for each key, which leads to the final answer.
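The word-count flow above can be sketched in a few lines of Python. This is only a toy, in-memory illustration of the idea, not the distributed implementation; the names map_fn, shuffle, and reduce_fn are our own, chosen for clarity.

```python
from collections import defaultdict

def map_fn(document):
    # Map: emit a (word, 1) pair for every word in the page.
    for word in document.split():
        yield (word, 1)

def shuffle(pairs):
    # Shuffle: group all values belonging to the same key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_fn(key, values):
    # Reduce: sum the counts for one word.
    return (key, sum(values))

pages = ["the quick fox", "the lazy dog"]
pairs = [p for page in pages for p in map_fn(page)]
counts = dict(reduce_fn(k, v) for k, v in shuffle(pairs).items())
print(counts)  # {'the': 2, 'quick': 1, 'fox': 1, 'lazy': 1, 'dog': 1}
```

In the real system each of these three stages runs on different machines, but the data flow is exactly this: pairs out of Map, grouped by key in Shuffle, summed in Reduce.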
Once we are well versed in the architecture of MapReduce, understanding the execution becomes easier. In MapReduce, the Map invocations are automatically distributed across multiple machines: the input data is divided into M splits, which can be processed in parallel by different machines. The Reduce invocations are likewise distributed by partitioning the intermediate key space into R pieces. How is this distribution done? It is done using a partitioning function such as hash(key) mod R. The number of partitions R and the partitioning function are specified by the user as per the requirement. The overall flow of a MapReduce operation can be seen in the accompanying image. The process begins when the user program calls the MapReduce function, and from then on the complete sequence is executed. Please have a look at the image; we will describe the complete sequence step by step, following the corresponding sequence numbers.
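A minimal sketch of that default partitioning scheme, assuming a small R of 4 for illustration:

```python
R = 4  # number of reduce partitions (user-specified)

def partition(key, r=R):
    # Default partitioning: hash(key) mod R decides which
    # reduce task receives this intermediate key.
    return hash(key) % r

# Every occurrence of the same key lands in the same partition,
# so a single reduce task sees all the values for that key.
assert partition("apple") == partition("apple")
assert 0 <= partition("banana") < R
```

The important property is determinism within a job: the same key always hashes to the same reduce partition, which is what makes per-key aggregation possible.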
Number 1: Execution begins at the top-most position in the image. The user program is linked with the MapReduce library, which drives its execution. When the program runs, the library splits the input files into M pieces, each typically 16 to 64 megabytes in size (the piece size can be controlled by the user). It then starts many copies of the program on different machines.
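The splitting step can be pictured with a simple byte-level sketch (the real system also respects file and record boundaries; here we use a tiny piece size just to see the mechanics):

```python
def split_input(data: bytes, piece_size: int = 64 * 1024 * 1024):
    # Split the input into pieces of at most piece_size bytes.
    # The paper uses 16-64 MB pieces; tiny values work for illustration.
    return [data[i:i + piece_size] for i in range(0, len(data), piece_size)]

pieces = split_input(b"abcdefgh", piece_size=3)
print(pieces)  # [b'abc', b'def', b'gh']
```

Each of these M pieces becomes the input of one map task, which is what allows the map phase to run in parallel.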
Number 2: One copy of the program is special: the master. The rest are workers, and the master assigns work to them. There are M map tasks and R reduce tasks to assign. Like any other real-world scenario, the master picks idle workers and assigns each one a map task or a reduce task.
Number 3: A worker assigned a map task reads the content of the corresponding input split. It parses key/value pairs out of the input data and passes each pair to the user-defined Map function. The intermediate key/value pairs produced by the Map function are buffered in memory.
Number 4: Periodically, the buffered pairs are written to the local disk, partitioned into R regions by the partitioning function. The locations of these buffered pairs on the local disk are passed back to the master, which is responsible for forwarding them to the reduce workers.
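A sketch of this spill-to-disk step, assuming R of 3; in the real system each region is a file on the map worker's local disk, while here we just model the regions as lists:

```python
from collections import defaultdict

R = 3  # number of reduce partitions

def write_partitioned(buffered_pairs):
    # Spill the in-memory buffer into R regions; the partitioning
    # function decides which region each pair belongs to.
    regions = defaultdict(list)
    for key, value in buffered_pairs:
        regions[hash(key) % R].append((key, value))
    # In the real system, the locations of these regions on local
    # disk are what get reported back to the master.
    return regions

regions = write_partitioned([("a", 1), ("b", 1), ("a", 1)])
```

Note that both ("a", 1) pairs necessarily land in the same region, so the reduce worker responsible for that partition will see every value for key "a".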
Number 5: It is notification time, so the master notifies a reduce worker about these locations. The reduce worker uses remote procedure calls to read the buffered data from the local disks of the map workers. Once a reduce worker has read all of its intermediate data, it sorts it by the intermediate keys so that all occurrences of the same key are grouped together. The sorting is needed because many different keys map to the same reduce task. Typically, an external sort is used, because the amount of intermediate data is too large to fit in memory.
Number 6: Now, the reduce worker iterates over the sorted intermediate data. For every unique intermediate key encountered, it passes the key and the corresponding set of values to the user's Reduce function. The output of the Reduce function is appended to a final output file for this reduce partition.
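Steps 5 and 6 together are essentially a sort followed by a grouped iteration. Here is a minimal in-memory sketch (the real system sorts externally and streams from disk; user_reduce is our stand-in for the user's Reduce function):

```python
from itertools import groupby
from operator import itemgetter

def user_reduce(key, values):
    # User-defined Reduce: for word count, sum the values.
    return sum(values)

intermediate = [("dog", 1), ("cat", 1), ("dog", 1)]
intermediate.sort(key=itemgetter(0))  # group equal keys together

output = []
for key, group in groupby(intermediate, key=itemgetter(0)):
    # One Reduce call per unique intermediate key.
    output.append((key, user_reduce(key, [v for _, v in group])))

print(output)  # [('cat', 1), ('dog', 2)]
```

The sort is what makes the single pass possible: once equal keys are adjacent, the worker never needs to hold more than one key's values at a time.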
Number 7: This is the last step. When all the map and reduce tasks have been completed, the master wakes up the user program, and the MapReduce call returns to the user code. The procedure ends here.
The output of the MapReduce execution is available in the R output files (one per reduce task, with file names as specified by the user). Typically, users do not need to combine these R output files into one file; they often pass them as input to another MapReduce call, or use them from another distributed application that is able to deal with input partitioned into multiple files.
Let us discuss some practical aspects. Google's indexing pipeline includes 20 different MapReduce operations. Each one approaches the data as a bunch of records, aggregates them by key, and then another MapReduce takes the result and processes it in its own way; the cycle continues. How big are the MapReduce programs? Most of the time, they are as small as 20 to 50 lines of code. The transferred data is compressed because the servers are not CPU bound, so spending CPU cycles on compression and decompression is feasible, and it helps in saving bandwidth and I/O.
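The compression trade-off is easy to demonstrate: intermediate records are often highly repetitive, so a standard compressor shrinks them dramatically. A small sketch using Python's zlib (any general-purpose compressor illustrates the same point):

```python
import zlib

# Repetitive record data, typical of intermediate MapReduce output.
payload = b"the quick fox " * 1000
compressed = zlib.compress(payload)

# Spending a little CPU here saves a lot of network bandwidth
# and disk I/O when the data is shipped between workers.
print(len(payload), len(compressed))  # the compressed copy is far smaller
assert zlib.decompress(compressed) == payload
```

Since the workers have spare CPU but limited bandwidth, this trade almost always pays off.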
MapReduce is used in numerous applications such as search, ads, and more. These computations are important and effective, so almost every application uses it.
We can summarize the reasons why MapReduce is so widely used in three points. First, MapReduce offers an efficient way to partition tasks across many machines. Second, MapReduce is an expert at handling machine failures. Third, with MapReduce, the computation automatically moves closer to the I/O source, that is, the data.
But is there any alternative to MapReduce? It is important to understand that MapReduce is a framework or a concept, and Google MapReduce is one implementation of that concept. There is no copyright over the MapReduce idea; it is like using the Pythagorean theorem to solve a problem: anyone can use it. Like Google MapReduce, there is Hadoop MapReduce, which is open source. It is implemented in Java, while Google MapReduce is implemented in C++. Hadoop MapReduce runs on top of the Hadoop Distributed File System, also known as HDFS, while Google MapReduce runs on top of GFS. So yes, Hadoop MapReduce can be considered an alternative.
The article begins with a discussion of the architecture of MapReduce, which basically consists of three types of servers: the master, map, and reduce servers. Each of them has its own unique role, and together they simplify the task with the help of key-value pairs. Once we understand the architecture, we move on to the execution part and discuss the execution overview of MapReduce. By referring to the image of the sequential process, the whole process is described step by step, in seven steps or procedure levels.
The usage of MapReduce is the next thing. By discussing the usage, we understand the value of MapReduce, how important it is, and how widely it is being used. We conclude the article by mentioning Hadoop MapReduce as an alternative to Google MapReduce. So this is all about MapReduce and I enjoyed discussing it with you. See you soon guys with another interesting topic.
Here is the link to the previous article from this series.