Hello friends, have you ever heard of MapReduce? If you have gone through the BigTable articles, I mentioned MapReduce there and also promised that we would discuss it later in a separate article. So here we are with MapReduce. It is yet another amazing piece of Google infrastructure technology, and it is extremely helpful when it comes to processing data at scale. Traditional databases simply don't scale to the extent that MapReduce does. So let us understand MapReduce and its concepts.
While I was working at Google, I witnessed that Google had implemented hundreds of special-purpose computations. They process large amounts of raw data, like crawled documents and web request logs, to compute numerous kinds of derived data and produce precise results. This derived data could be anything: inverted indices, representations based on the graph structure of web documents, summaries of the total number of pages crawled, the number of requests per day or per hour for a particular high-traffic website, and so on. If you give it a thought, although each such computation is conceptually straightforward, the input data is large, so the computation has to be distributed across hundreds or thousands of machines to finish in a reasonable amount of time; roughly speaking, the more machines you use, the faster the result arrives. The main concerns are how to parallelize the computation, distribute the data, and handle failures, and these concerns conspire to obscure the originally simple computation.
To tackle all these concerns, Google came up with a new abstraction in the early 2000s. It permits developers to express the simple computation they want while hiding the messy details of parallelization, fault tolerance, data distribution, and load balancing in a library. All this is inspired by the map and reduce primitives already present in Lisp and some other functional languages. Here, the major operations consist of applying a map operation to each logical record of the input to compute a set of intermediate key/value pairs, and then applying a reduce operation to all the values that share the same key, so that the derived data is combined in an appropriate manner. Because the user-specified map and reduce operations follow a functional model, large computations parallelize easily, and fault tolerance improves because re-execution can serve as the primary recovery mechanism.
The computational model is simple: it takes a set of input key/value pairs and produces a set of output key/value pairs. The user of the MapReduce library expresses the computation as two functions: Map and Reduce. The Map function, written by the user, takes an input pair and produces a set of intermediate key/value pairs. The MapReduce library then groups together all the intermediate values associated with the same intermediate key and passes them to the Reduce function. The Reduce function, also written by the user, accepts an intermediate key and the set of values for that key. It merges these values together to form a possibly smaller set of values; in fact, typically zero or one output value is produced per Reduce invocation. The intermediate values are supplied to the user's reduce function via an iterator. This is how lists of values that are too large to fit in memory are handled!
So we can consider MapReduce as a programming model that both processes and generates large datasets. The process is easy to understand: the user writes a Map function that processes a key/value pair to produce intermediate key/value pairs, and a Reduce function that merges all the values associated with the same intermediate key. This model has helped in solving many real-world tasks. Programs written in this style are automatically parallelized and executed on a large cluster of commodity machines. Every detail is considered, monitored, and taken care of by the run-time system: details like partitioning the input data, scheduling the program's execution across machines, managing the required inter-machine communication, and more. This approach gives an additional advantage to programmers with no experience in parallel and distributed systems, because they can easily utilize the resources of a large distributed system. Hence, this approach is widely followed.
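To make the model concrete, here is a minimal in-process sketch of the Map/Reduce flow. The driver function, its name, and the word-count example are illustrative assumptions of mine, not Google's actual library API; a real MapReduce run distributes this work across many machines.

```python
from collections import defaultdict

def run_mapreduce(inputs, map_fn, reduce_fn):
    # Map phase: apply the user's map function to every input key/value pair.
    intermediate = defaultdict(list)
    for key, value in inputs:
        for ikey, ivalue in map_fn(key, value):
            intermediate[ikey].append(ivalue)
    # Reduce phase: merge all values that share the same intermediate key.
    return {k: reduce_fn(k, vs) for k, vs in intermediate.items()}

# Classic word count, expressed as user-written Map and Reduce functions.
def word_map(doc_name, text):
    for word in text.split():
        yield word, 1          # emit (word, 1) per occurrence

def word_reduce(word, counts):
    return sum(counts)         # total occurrences of this word

docs = [("d1", "the quick fox"), ("d2", "the lazy dog")]
print(run_mapreduce(docs, word_map, word_reduce))
# {'the': 2, 'quick': 1, 'fox': 1, 'lazy': 1, 'dog': 1}
```

Note that the driver never inspects the user's functions; it only moves key/value pairs around, which is exactly why the library can parallelize and re-execute them freely.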
I always believe that theory can be nailed down easily with the help of practical examples. So let us have a look at some simple, interesting programs that will help us understand MapReduce computations.
You know, one of Google's jobs is to monitor site access activity so that Google knows the actual number of visits for any page on the web.
Consider that you need to count the URL access frequency for your website. For this, the Map function processes the logs of web page requests and outputs a pair for each request, with the URL as the key and the number 1 as the value. The Reduce function then adds together all the values for the same URL and ultimately emits a (URL, total count) pair. This is how the URL access frequency can be obtained.
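The URL access frequency example above can be sketched as follows. The log lines and function names are hypothetical; a real job would read actual web server logs.

```python
from collections import defaultdict

# Hypothetical request-log lines standing in for real web server logs.
logs = [
    "GET /home", "GET /about", "GET /home",
    "GET /home", "GET /contact",
]

def url_map(log_line):
    # Map: emit (URL, 1) for each page request found in the log.
    _, url = log_line.split()
    yield url, 1

def url_reduce(url, counts):
    # Reduce: sum all the 1s emitted for the same URL.
    return url, sum(counts)

# Group intermediate pairs by key, then reduce each group.
grouped = defaultdict(list)
for line in logs:
    for url, one in url_map(line):
        grouped[url].append(one)

freq = dict(url_reduce(u, c) for u, c in grouped.items())
print(freq)  # {'/home': 3, '/about': 1, '/contact': 1}
```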
For calculating PageRank, a reverse web-link graph is necessary. For this, the Map function outputs a (target, source) pair for each link to a target URL found in a page named source. Here, the target URL is the key and the source URL is the value. The Reduce function then concatenates the list of all source URLs associated with a given target URL, finally giving an output pair whose key is the target and whose value is the list of sources.
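A quick sketch of the reverse web-link graph in the same style; the page data and function names are again my own illustrative assumptions, with link targets assumed to be already parsed out of each crawled page.

```python
from collections import defaultdict

# Hypothetical (source page, outgoing links) data from a crawl.
pages = {
    "a.com": ["b.com", "c.com"],
    "b.com": ["c.com"],
}

def link_map(source, targets):
    # Map: for every link found in `source`, emit (target, source).
    for target in targets:
        yield target, source

def link_reduce(target, sources):
    # Reduce: collect all sources pointing at the same target
    # (sorted here only to make the output deterministic).
    return target, sorted(sources)

grouped = defaultdict(list)
for src, tgts in pages.items():
    for tgt, s in link_map(src, tgts):
        grouped[tgt].append(s)

reverse_graph = dict(link_reduce(t, s) for t, s in grouped.items())
print(reverse_graph)  # {'b.com': ['a.com'], 'c.com': ['a.com', 'b.com']}
```

The forward links are inverted into incoming links, which is exactly the structure a PageRank computation consumes.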
This article is all about MapReduce, the next best thing to discuss after GFS and BigTable. We began by understanding the basics: what MapReduce is, how it evolved as a solution to existing problems, and how the Map and Reduce operations help in parallelizing the computation, distributing the data, and handling failures. Then we looked at the computational model, which takes a set of input key/value pairs and produces a set of output key/value pairs, and which helps in both processing and generating large datasets.
Once we had discussed the model, it was time to solidify our understanding with the help of examples. So we discussed two real-world examples: counting URL access frequency and building the reverse web-link graph. These examples show that MapReduce is a technology worth implementing and that it ends up producing efficient results. We still have a lot more to discuss on MapReduce, so see you guys in the next article, which is the continuation of this one.
Here is the link to the previous article from this series.