Explaining MapReduce to My Distant Relatives

You need to understand what MapReduce is.

If you've never heard the term and you don't work in the tech sector, stick around. It's easy to understand, and it's important.

A one-line answer

Q: What is MapReduce?
A: MapReduce is a counter-intuitive but very powerful way to answer questions and perform calculations.

Ok. But how does it work, and why does it matter? Here's one way to think about it.

Ants and Elephants: a quick analogy

Databases are behind most of the software that you encounter and care about. And the dominant life form in the world of databases, for the last several years, has been the relational database. Your local library catalog system? Runs on a relational database. The software controlling the grocery store checkout lines (the ones that list the items that you bought and their prices) uses a relational database to keep track of everything. Wikipedia sits on top of a giant relational database (of pages and edits to them).

Now for the analogy. Imagine that database-related tasks (asking questions, counting, calculating) are similar to... carrying huge, heavy sacks of rice across a field.

Given that image, relational databases are smart, hardworking elephants that can lift entire sacks and carry them across. They are well-trained and good at what they do. You point them to the sack of rice, give them a few commands, and they lift the sack and carry it across the field. (If you're worried about animal rights... imagine that they're robot elephants.)

MapReduce-based systems, on the other hand, are like having a giant army of obedient ants at your command. If you need a sack of rice carried across, you compose a set of individual instructions (take a single grain of rice out of the sack, drag it across the field, and put it in a sack on the other side) and send it out to all the ants simultaneously. They surge into action, and although each ant is much weaker and simpler than an elephant, when the dust settles, the sack of rice still ends up transported across the field.

You can probably see what the drawbacks of working with ants are. It's much easier and more intuitive for a programmer (in charge of transporting rice across a field) to learn how to point a smart elephant to a large bag and tell it where to carry it. The elephant knows how to lift -- it's done it before, it knows how to walk and how to keep its eyes on the goal. And the rice stays together, in an intuitive logical grouping. The ants, on the other hand, have to be micro-managed. You have to direct them carefully on how to unload the rice, how to carry it across the field without bumping into other ants, and how to load it back into a sack. And if you're not careful, you'll end up with rice scattered all over the field.

So why is MapReduce so important? Well, there are several reasons, which we'll discuss in a bit. But I'll give you the first hint.

Ants are cheap and interchangeable. If the elephant falls sick one day, what are you going to do? The rice still has to get hauled across the field. Sure, you can keep a backup elephant around, to take up the slack while the first one recovers from an elephant cold. Except, now you have to buy (and feed) two elephants. And what if they both fall sick at the same time? If a single ant gets sick (or stepped on, or eaten)... it's much easier to replace. You order another bagful of ants, and off they go.

Keep that image, of ants and elephants, in the back of your mind. Meanwhile... what have we really explained here? Carrying grains of rice is easy to imagine, but it's a bit too abstract. How do you actually perform calculations with MapReduce? And what, again, are its advantages over relational databases?

Students versus Museum Directors, Fight!

Imagine that there is a book museum, with an extensive rare book collection.

And you wake up in the middle of the night, in a cold sweat, and you absolutely must know: How many poetry books with red covers are there in the rare book collection? The success of your business depends on it.

Here's the traditional way to get the answer to that question:

Answering questions, the relational database edition

You call up the museum's Director. And you pose that question to him - How many poetry books with red covers are there in your collection?

(Here's what you must know about the Director. He's trained all of his life to answer these kinds of questions. He's really fast at counting. He has a great memory, and a complete map of the library and all of its shelves in his head. He reads at a Guinness World Record level speed, and his movements are efficient and precise.)

The Director frowns. He's usually well prepared for questions like these. For example, he has some common questions already researched and pre-computed; if you merely asked him "How many poetry books are in your collection?", all he would have to do is to consult his ledger, with neat totals of all the books by section -- he wouldn't even have to leave his office to answer. Color, however? That's not in the ledgers.

But no matter. This is what he does best. He swiftly moves to the Poetry section, and walking the shelves methodically, he scans all of the poetry books in his museum, one by one, counting the red books on display. He can count really high without losing his place, and he does not stumble or miss a book. Pretty soon, he comes back to his office with an answer, and phones you with an exact total of red poetry books in his museum.
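For the programmers in the family, here is roughly what asking the Director looks like. This is only a sketch using Python's built-in sqlite3 module; the `books` table, its columns, and the four sample rows are made up for illustration.

```python
import sqlite3

# Hypothetical catalog: the table name, columns, and rows are invented for this sketch.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE books (title TEXT, section TEXT, cover_color TEXT)")
conn.executemany(
    "INSERT INTO books VALUES (?, ?, ?)",
    [
        ("Sonnets", "Poetry", "Red"),
        ("Odes", "Poetry", "Black"),
        ("Ballads", "Poetry", "Red"),
        ("Atlas", "Maps", "Red"),
    ],
)

# One question, posed to one Director, who scans the shelves and counts.
red_poetry_count = conn.execute(
    "SELECT COUNT(*) FROM books WHERE section = ? AND cover_color = ?",
    ("Poetry", "Red"),
).fetchone()[0]
print(red_poetry_count)
```

Notice that you only describe what you want counted. How to walk the shelves is entirely the Director's business.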

Not only that, but, assuming that you'll ask that question again, he can add a 'Cover Color' column to his ledgers, and order his assistants to start keeping track of red and green and yellow cover totals, tabulated by section and by author and everything. The next time you ask, he won't even have to leave his office to answer.
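The ledger trick can be sketched in a few lines too. This is a loose illustration in plain Python, not how any particular database does it; the `shelves` list and the `books_by_color` helper are hypothetical names.

```python
from collections import Counter

# Hypothetical shelf list of (section, cover color) pairs; not real data.
shelves = [
    ("Poetry", "Red"),
    ("Poetry", "Black"),
    ("Poetry", "Red"),
    ("Maps", "Red"),
]

# The assistants fill in the ledger once, tallying every (section, color) pair.
ledger = Counter(shelves)

def books_by_color(section, color):
    """Answer from the ledger alone, without walking the shelves again."""
    return ledger[(section, color)]

print(books_by_color("Poetry", "Red"))
```

The cost moves from answering the question to keeping the ledger up to date, which is the trade the Director is making.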

This is how the world of traditional relational databases works. They are good at what they do. There is a powerful, smart, precise Director of whom you can ask questions (if you know how to speak his language). Traditional databases were an astoundingly useful technology, and still are. They shape much of the modern world.

But there is another way to get the answer to the question about book covers. This is MapReduce:

Answering questions, the MapReduce edition

Imagine, for a second, that you had access to a nearly limitless, very inexpensive labor pool, that was not skilled in anything in particular, but highly trained to follow directions. Like, say, teenagers fresh out of high school. (If you're worried about teenager rights... imagine that they're robot high school students.)

You hire a large group of them. Maybe twice as many people as there are poetry books in the museum, plus a little more on top of that. You divide them into two teams.

The first is the Map team. (You can give them red armbands, to tell them apart from the second team).

You arm the Map team with some very simple tools. Each one of them gets a blank paper index card, and a pen. And you give each of them an identical set of directions.

Their directions are: at the appointed hour, every member of the Map team streams into the museum and heads to the Poetry section. Each team member lines up in front of one book. (Pretend for a second that the museum is spacious enough to accommodate them all.) Once in position, each member simply writes down the color of the cover of the book they're standing in front of, and puts a 1 next to it. Like so:

(Index card of Map team member #1 contains) Red: 1
(Index card of Map team member #2 contains) Black: 1

And so on, until each book is recorded, one index card per book.

After all the books have been recorded, the Map team heads out of the museum, each carrying their index card, each with a single color and the count: 1.

That's it! That's all the directions given to each member of the Map team. Go in. Find one book (no overlapping, no duplicates, no book left behind). Record the color of its cover. Put a count of 1 next to it. Get out.
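If you'd like to see the Map team's directions written out as code, here is a minimal sketch. The `poetry_covers` list and the `map_book` function are names I made up for the example.

```python
# Hypothetical poetry shelf: one cover color per book.
poetry_covers = ["Red", "Black", "Red", "Green", "Red"]

def map_book(cover_color):
    """One Map team member's whole job: write the color and a count of 1."""
    return (cover_color, 1)

# One index card per book. Here the cards are filled in one after another,
# but no card depends on any other, so the work could be split among many workers.
index_cards = [map_book(color) for color in poetry_covers]
print(index_cards)
```

Each call to `map_book` looks at exactly one book and nothing else, which is what lets you throw as many students at the job as you have books.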

Now comes the handoff to the Reduce team (who are also armed with pens and blank index cards).

The Reduce team (you can give them blue armbands) stands outside the museum and collects the Map team's index cards. They stand in rows, like a pyramid, a long front line ready to greet the incoming Map team. Then a smaller row behind that. The rows reduce in size, until there is one final member of the Reduce team all the way in the back, standing holding a phone.

The directions given to the Reduce team are a tiny bit more complex, but still easy for each individual member to understand:

Take the index card that's being passed to you (by the Map team coming out of the museum, or by a Reduce team member before you). Look at the color written on it, and throw away anything that's not Red. Go through all the filled-out cards in your possession, and for every color, add up the totals for it. (Though since you threw away all the other colors, you'll only have Red totals, in this example.) Write down each color and its sum on your own card, and pass it back one row.

Eventually, the totals start adding up as the cards move through the rows. Finally, the last Reduce student receives the two semi-final cards with the two subtotals, adds them up, and calls you with the answer.

The work that each student does is easy. They just throw away unneeded colors, add two numbers together, and pass on their work down the line. Most importantly, they don't have to keep a long-running total in their head, they can't lose count (all the relevant information is written down before them), and they don't care what the other students are doing.
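The Reduce team's pyramid can be sketched the same way. This is a simplified illustration, assuming each student handles at most two cards at a time; `reduce_cards`, `reduce_in_rows`, and the sample `index_cards` are hypothetical names and data.

```python
# Hypothetical index cards handed over by the Map team: (color, 1) pairs.
index_cards = [("Red", 1), ("Black", 1), ("Red", 1), ("Green", 1), ("Red", 1)]

def reduce_cards(cards, wanted="Red"):
    """One Reduce student: discard unwanted colors, add up the rest."""
    total = sum(count for color, count in cards if color == wanted)
    return (wanted, total)

def reduce_in_rows(cards, wanted="Red"):
    """Pass cards back row by row, two cards per student, until one card is left."""
    row = list(cards)
    while len(row) > 1:
        # Each student in this row takes up to two cards and writes one new card.
        row = [reduce_cards(row[i:i + 2], wanted) for i in range(0, len(row), 2)]
    # The student with the phone does one last check and reports the total.
    return reduce_cards(row, wanted)

print(reduce_in_rows(index_cards))
```

No student ever holds more than two cards or remembers anything between cards, yet the final card carries the full Red total.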

This is how you answer the question "How many red poetry books are there in the museum?", MapReduce style.

Now you may be wondering: Who does that? Why would you do ridiculous things like write down a single color and put the numeral 1 next to it? Can't you get the students to start counting books right away, or something?

This is what I mean by counter-intuitive. MapReduce requires a slightly different mindset that may seem strange and intimidating at first. But after a couple of examples, you get the hang of it, I promise. You catch on.

But the initial confusion is worth it. MapReduce is a skill and a mindset worth learning. (And if you're not a programmer, it's a skill worth teaching to somebody in your organization).

Why is it worth learning? Why do I keep saying that MapReduce is important? I'll tackle that in another post.

Until then, I'll leave those answers as an exercise for the reader. Just keep in mind the parameters set down in this artificial analogy. The difference between museum directors and generic high school students. The amount of training each one gets, the ease of recruitment, and the amount of salary each one demands. Keep in mind the ease of talking to an intelligent Director, versus the hassle of hammering out foolproof individual instructions, and wrangling hordes of students yourself, but also the opportunity that the second method presents.

Computing Joy

Dmitri Zagidulin