A bunch of people have sent me links to an article about MapReduce. I've hesitated to write about it, because the currently hyped MapReduce stuff was developed, and extensively used by Google, my employer. But the article is really annoying, and deserves a response. So I'm going to be absolutely clear. I am not commenting on this in my capacity as a Google employee. (In fact, I've never actually used MapReduce at work!) This is strictly on my own time, and it's purely my own opinion. If it's the dumbest thing you've ever read, that's my fault, not Google's. If it's the most brilliant thing you've ever read, that's my credit, not Google's. I wasn't asked to write this by Google, and I didn't ask their permission, either. This is just me, the annoying geek behind this blog, writing solely
on my own behalf, speaking for no one but me. Got it?
The reason that I'm interested in this is because it's related to my PhD work. Back in
grad school, what I worked on was applying parallelism to structured data in
non-scientific applications. That's pretty much what MapReduce does. And the solution
that I proposed was a kind hierarchical scatter/gather operation - which is, very nearly, the
way that MapReduce works. The big difference? MapReduce beats what I did. The guys who designed MapReduce noticed something that I didn't, and took advantage of it, which made M/R programs a lot cleaner and easier to write. The article that I'm going to discuss criticizes M/R for exactly that key idea.
Let's start at the beginning. What is MapReduce? What does it do?
Suppose you're at work, and you need to do something that's going to take a long time to
run on your computer. You don't want to wait. But you don't want to go out and spend a couple
of million dollars buying a supercomputer. How do you make it run faster? One way is buy a
whole bunch of cheap machines, and make it run on all of them at once. Another is to notice
that your office has lots of computers - pretty much every office has a computer on the desk of every employee. And at any given moment, most of those computers aren't doing much. So why not take advantage of that? When your machine isn't doing much, you let you coworkers borrow the capability you're not using; when you need to do something, you can borrow their machines. So when you need to run something big, you can easily find a pool of a dozen machines.
The problem with that approach is that most programs aren't written to run on a dozen machines. They're written to run on one machine. To split a hard task among a lot of computers is hard.
MapReduce is a library that lets you adopt a particular, stylized way of programming that's easy to split among a bunch of machines. The basic idea is that you divide the job into two parts: a Map, and a Reduce. Map basically takes the problem, splits it into sub-parts, and sends the sub-parts to different machines - so all the pieces run at the same time. Reduce takes the results from the sub-parts and combines them back together to get a single answer.
The key to how MapReduce does things is to take input as, conceptually, a list of records. The records are split among the different machines by the map. The result of the map computation is a list of key/value pairs. Reduce takes each set of values that has the same key, and combines them into a single value. So Map takes a set of data chunks, and produces key/value pairs; reduce merges things, so that instead of a set of key/value pair sets, you get one result. You can't tell whether the job was split into 100 pieces or 2 pieces; the end result looks pretty much like the result of a single map. (The key idea that differentiates M/R from my PhD work is that realization that by treating results as key/value maps, you get a very clean, uniform reduction process. I got the idea of the scatter/gather pattern, but my reduces were really ugly in comparison to M/R.)
The beauty of MapReduce is that it's easy to write. M/R programs are really as easy as parallel programming ever gets.
So, getting back to the article. They criticize MapReduce for, basically, not being based on the idea of a relational database.
When I first wanted to spend some time learning about relational databases, my then boss told me a story about database people, which I've found to be remarkably true. The story is, RDB people have found the most beautiful, wonderful, perfect hammer in the whole world. It's perfectly balanced - not too heavy, not too light, and swings just right to pound in a nail just right every time. The grip is custom-made, fitted to the shape of the owners hand, so that they can use it all day without getting any blisters. It's also beautifully decorated - encrusted with gemstones and gold filigree - but only in places that won't detract from how well it works as a hammer. It really is the greatest hammer ever. Relational database guys love their hammer. It's just such a wonderful tool! And when they make something with it, it really comes out great. In fact, they like it so much that they think it's the only tool they need. If you give them a screw, they'll just pound it in like it's a nail. And when you point out to them that dammit, it's a screw, not a nail, they'll say "I know that. But you can't expect me to use a crappy little screwdriver when I have a magnificent hammer like this!"
That's exactly what's going on here. They've got their relational databases. RDBs are
absolutely brilliant things. They're amazing tools, which can be used to build amazing
software. I've done a lot of work using RDBs, and without them, I wouldn't have been able
to do some of the work that I'm proudest of. I don't want to cut down RDBs at all: they're truly great. But not everything is a relational database, and not everything is naturally suited towards being treated as if it were relational. The criticisms of MapReduce all come down to: "But it's not the way relational databases would do it!" - without every realizing that that's the point. RDBs don't parallelize very well: how many RDBs do you know that can efficiently split a task among 1,000 cheap computers? RDBs don't handle non-tabular data well: RDBs are notorious for doing a poor job on recursive data structures. MapReduce
isn't intended to replace relational databases: it's intended to provide a lightweight way of programming things so that they can run fast by running in parallel on a lot of machines. That's all it was intended to do.
To be specific, the criticisms from the DB guys were that MapReduce is:
- A giant step backward in the programming paradigm for large-scale data intensive applications
- A sub-optimal implementation, in that it uses brute force instead of indexing
- Not novel at all -- it represents a specific implementation of well known techniques developed nearly 25 years ago
- Missing most of the features that are routinely included in current DBMS
- Incompatible with all of the tools DBMS users have come to depend on
Point one is rubbish. M/R is a great way of programming some large-scale data
intensive applications. If you've got a large computation whose data is naturally described
relationally, and which is feasible to run on a single really big machine, and you have access
to said really big machine, then no question - you should be using an RDB. And for many
computations that seem too complex to run is a reasonable amount of time on a single machine, working out the proper layout in tables, with the proper indexing scheme, can make what seemed non-feasible downright easy. But that's not every computation. In fact, that's not even most computations. MapReduce isn't intended to replace the relational database.
Let me give an example. When I was writing about Fractals, one reader was fascinated by the BuddhaBrot, and decided to try generating an image of a related fractal, called the NebulaBrot. The basic idea of that is to take the set of points which are on the border of the mandelbrot set, and trace their paths as you perform the mandelbrot iteration. The more times a given point is crossed by one of those paths, the brighter you make that point. The end result is a very cool image. The problem with it is, it's not zoomable. The paths are erratic, and there's no way of figuring out which initial points are the roots of paths that are going to pass through a particular sub-area of the figure. So what the very bright reader did was build a huge NebulaBrot - the idea being that instead of starting small and then zooming, you start big and then squish. That gives you the ability to look at very high resolution views of small parts. Sort of like pre-zooming.
The way that he implemented the NebulaBrot was as a MapReduce. He needed to trace through paths of millions of points. That can take quite a bit of time. So what he did was divide up the initial points into subsets, and farm them out with a map. Each map instance computed the paths for some set of points. The result of the map was a set of pairs: points, and the number of times that paths had crossed those points. The points were the keys, the number of crossings were the values. The reduce took the set of point/crossing pairs, and summed up the crossings for each point. Then the result was translated into an image. Presto! Giant NebulaBrot, very quickly.
You can't do that in a relational database. It's not a relational application. It doesn't really start with row data - MapReduce just calls its sets of input data rows, because it's a nice metaphor; it doesn't mean that they're really tabular data - in the NebulaBrot, the
data isn't really rows, it's points. No relational tool is going to help with that - there's no scheme of careful, elegant table layout and indexing that's going to make that computation tractable on a single small machine. And relational tools aren't going to help make it run in parallel.
Moving on to point two, which was that MapReduce is primitive and sub-optimal, because it doesn't use indexing. Same basic response: indexing is a great tool if your data is tabular, and you have a central index that you can work with. But if your task isn't fundamentally relational, and what you really need is computation - like the massive numbers of floating point multiplications in the NebulaBrot - then indexes aren't going to help. The problem that MapReduce is trying to solve is making it easy to write a program that does a huge amount of computation in a reasonable amount of time. Indexes don't help if the essential task is computationally intense.
Point three: "it's not novel." No, stop, you must be kidding! It never claimed to be
novel. It's not supposed to be novel. In fact, the initial publication which
describes MapReduce specifically states that it was inspired by the Map and Reduce primitives from functional programming languages! MapReduce is, basically, a form of data
parallel computing - that is, a scalable implementation of a well known, well understood,
easy-to-use model of parallel programming. How on earth is that a problem?
Point four: "missing features in RDBMS". Isn't this really just a different way of saying
the initial criticism? It's really just another version of "It's not relational, so it's not
good". Once again: that's deliberate. MapReduce isn't a database application. If you've got a
relational database problem, use a relational database. If you've got a massive computational
task that you want to distribute among dozens, or hundreds, or thousands of small machines,
then use MapReduce. Those wonderful DBMS tools? They weren't designed for massively parallel
cluster/cloud computing. In fact, they don't work in the massive cluster
Point five: "incompatible with tools that DB developers have come to depend on". Are
you starting to see why I told the hammer story? This is just another restatement of the same old original criticism. It's not a relational database, so it's not good. No, database-oriented tools and processes don't work for MapReduce programming. In fact, what
makes for a good DB design will often make for a really piss-poor M/R design. They're not trying to do the same thing.
So in the end, is there anything worth taking from the database guys critique of
MapReduce? Frankly, I don't think so. They really don't seem to understand what M/R is
designed for, or why it's designed the way it is. They seem to have homed in on the fact that
M/R uses the terms "table" and "row", and concluded that it's intended to be something like
the kinds of tasks for which you'd use an RDBMS. But aside from the superficial resemblance in
terminology, there's very little connection between the two ideas. M/R is a great tool for a
particular family of computational tasks. That doesn't detract from RDBMS - relational
databases are also an amazing tool for a large family of computational tasks. The fact that
the relational database approach isn't the ideal solution to every task isn't the
fault of M/R. The fact that M/R uses similar terminology to RDBMS despite being intended for a very different family of tasks than RDBMS doesn't mean that M/R is wrong. M/R isn't an alternative to relational databases. It's something entirely different.
Just because you've got the best hammer in the entire world doesn't make everything a nail. If you've got a screw, even a cheap, old, rusty screwdriver is going to do a better
job. And MapReduce is a lot better than a cheap, old, rusty screwdriver.