Map Reduce Explained
In this post, I'm going to write about Map Reduce Algorithm.
Map Reduce has taken the computing world by storm in recent years. It's massive scalability has given an edge to solving complicated problems.
Map Reduce is often explained by the famous counting example (word count more often than not). I would also take the same route, albeit, a different example., let us count the number of books with the same name.
Example:
Suppose, there are 3 book shelves and each of them have different books.
For brevity, let us say there are 4 different books in each of these shelves.
"The Da Vinci Code" (count: 1)
"Think and Grow Rich" (count: 2)
"The Bridges of Madison County" (count: 2)
"The Da Vinci Code" (count: 2)
"Think and Grow Rich" (count: 3)
"The Bridges of Madison County" (count: 1)
"The Da Vinci Code" (count: 4)
"Think and Grow Rich" (count: 2)
"The Bridges of Madison County" (count: 1)
Following is the pseudo code for the map function:
function map(books){
Loop thru all books
emit book.name as key, 1 as value.
}
Let us apply the map function for the first shelf. Note that since there are 3 "Harry Potter and the Half-Blood Prince" books, there will be 3 key/value pairs for this book each with value 1. The mapper will not (in this example), try to aggregate the counts.
Similarly, a mapper is associated with all the three shelves, and similar key/value pairs are generated for each shelf.The key/value pairs as in the following diagram will be generated.
In this step, all the map entries with the same key will be assigned to one reducer. For eg, all "Harry Potter and the Half-Blood Prince" key/value entries will be assigned to one reducer.
A reduce function will receive a key and a list of aggregated values as input.
For the first shelf, it will receive "Harry Potter and the Half-Blood Prince" as key input and a list of [1,1,1] as value input.
function reduce(name as key, List of values){
Loop thru each value in List of values
sum += value
End Loop
emit name, sum
}
Following diagram shows the 3 Reducers in action:
So, we get the book name "Harry Potter and the Half-Blood Prince" and the corresponding count 3 as value.
Depending on requirement, the system can have multiple mappers and reducers, using a master/slave architecture, with the master deciding on shuffling and providing data to each reducer.
There are different frameworks which provide Map/Reduce programming model, Hadoop being one of them.
Map Reduce has taken the computing world by storm in recent years. It's massive scalability has given an edge to solving complicated problems.
Map Reduce is often explained by the famous counting example (word count more often than not). I would also take the same route, albeit, a different example., let us count the number of books with the same name.
Example:
Suppose, there are 3 book shelves and each of them have different books.
For brevity, let us say there are 4 different books in each of these shelves.
First Shelf:
"Harry Potter and the Half-Blood Prince" (count: 3)"The Da Vinci Code" (count: 1)
"Think and Grow Rich" (count: 2)
"The Bridges of Madison County" (count: 2)
Second Shelf:
"Harry Potter and the Half-Blood Prince" (count: 2)"The Da Vinci Code" (count: 2)
"Think and Grow Rich" (count: 3)
"The Bridges of Madison County" (count: 1)
Third Shelf:
"Harry Potter and the Half-Blood Prince" (count: 1)"The Da Vinci Code" (count: 4)
"Think and Grow Rich" (count: 2)
"The Bridges of Madison County" (count: 1)
Map function:
The map function will create a map with a key and value. In our case, since we are counting books of each name, we can have the book name as the key. Let us put a value of 1 for each book.Following is the pseudo code for the map function:
function map(books){
Loop thru all books
emit book.name as key, 1 as value.
}
Let us apply the map function for the first shelf. Note that since there are 3 "Harry Potter and the Half-Blood Prince" books, there will be 3 key/value pairs for this book each with value 1. The mapper will not (in this example), try to aggregate the counts.
Similarly, a mapper is associated with all the three shelves, and similar key/value pairs are generated for each shelf.The key/value pairs as in the following diagram will be generated.
Shuffling:
In this step, all the map entries with the same key will be assigned to one reducer. For eg, all "Harry Potter and the Half-Blood Prince" key/value entries will be assigned to one reducer.
Reduce:
A reduce function will receive a key and a list of aggregated values as input.
For the first shelf, it will receive "Harry Potter and the Half-Blood Prince" as key input and a list of [1,1,1] as value input.
function reduce(name as key, List of values){
Loop thru each value in List of values
sum += value
End Loop
emit name, sum
}
Following diagram shows the 3 Reducers in action:
So, we get the book name "Harry Potter and the Half-Blood Prince" and the corresponding count 3 as value.
Overall Architecture:
Depending on requirement, the system can have multiple mappers and reducers, using a master/slave architecture, with the master deciding on shuffling and providing data to each reducer.
There are different frameworks which provide Map/Reduce programming model, Hadoop being one of them.
Comments
Post a Comment