Today the kids and I were discussing how fast Santa has to travel to get to everyone’s house in one night. I asked some questions and got the typical “1 million hundred times fast daddy!”…but this got me thinking.
Santa’s elf engineers need to get a Hadoop cluster.
Santa’s Christmas Eve duties are a perfect Hadoop MapReduce use case. I always have trouble explaining MapReduce, and the obligatory word count example…well, it’s just counting words; it doesn’t resonate with anyone. We’ve all heard “MapReduce brings the code to the data,” or “you can analyze all your data at once.” To the lay person who can barely spell Hadoop, that’s pretty meaningless, and if you read one of my earlier posts, just “buying” Hadoop doesn’t get you anywhere. I like this use case, so I figured I’d put it into words.
So here we are, another Christmas Eve, and Santa has to visit every child’s house (well, every Christian child’s house) in sequence. If Santa’s journey tonight were done in code, it would look something like this:
while ( more Christian houses ) –> get children –> for each child –> get child wish list –> for each wish –> deliver gift
Santa probably can’t even fit all of that into memory, but even if he could, it would take a while, even in code. This is a classic vertical scaling situation: beef up the server with memory and processing and you could probably run it fast…but it will never scale as far as the horizontal approach, aka Hadoop MapReduce.
As you can see, this is a perfect use case for parallelization, and really, that’s all MapReduce is: a very tidy abstraction of parallelization. It’s an easy-to-understand pattern where you map by something, and then reduce by what you just mapped. Huh?
Ok, so think of it this way: let’s pretend that Santa could pick up houses and drop them in several available “processing plants” manned by hundreds of elves. Those processing plants can deliver gifts to the houses and then put the houses back. So what would be the most efficient way to organize which houses go to which of Santa’s processing stations?
By the wished-for gift.
By the way, you just created your first MapReduce mapper: you map by the wished-for gift. So let’s back up for a minute and look at the data so this can make a little more sense. Here is a sample file Santa has to work with:
House 1, Bobby, Train
House 1, Lucy, Doll
House 1, Lucy, Train
House 2, Steve, Train
House 3, Lily, Train
House 3, Isla, Doll
For example’s sake, let’s pretend there are only two gifts that Santa can deliver. Looking at the above list, and using the logic we just came up with, we only need two processing stations: one for Trains and one for Dolls. House 1 would actually go to two stations (yep, we are now duplicating houses, but they are already duplicated in the file). House 2 would go to the Train processing station only; House 3 would also go to both.
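That routing step can be sketched in a few lines of Python. The records below are the sample file from above, parsed into tuples; the grouping shows which houses land at which station.

```python
from collections import defaultdict

# The sample file, parsed into (house, child, gift) records.
records = [
    ("House 1", "Bobby", "Train"),
    ("House 1", "Lucy", "Doll"),
    ("House 1", "Lucy", "Train"),
    ("House 2", "Steve", "Train"),
    ("House 3", "Lily", "Train"),
    ("House 3", "Isla", "Doll"),
]

# Map each record by its gift: the gift is the key, and the house
# is routed to that gift's processing station. House 1 ends up at
# both stations (a clone at each), just like in the prose above.
stations = defaultdict(set)
for house, child, gift in records:
    stations[gift].add(house)
```

Two keys, two stations, and House 1 is (correctly) in both.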
We could have mapped by house, but that would have been inefficient: there are 3 houses and only 2 gifts, so we would have needed an extra processing plant. Oh, and by the way, mapping by house is already what Santa is doing – one house at a time. It would still be faster than today using this paradigm, since he’d at least be doing the houses in parallel, but not as efficient as having processing plants focused only on gifts. Mapping by gift also makes the code a lot less complicated…for example:
If we map by house, the processing plant has to:
get children –> for each child –> get child wish list –> for each wish –> deliver gift
If we map by gift, the processing plant has to only:
–> deliver gift
There is a downside to this, though: think about what happens if we had 2 billion houses and only two possible gifts. Each processing plant now has to deliver ~1 billion gifts to all the houses flowing to it…we are back to a memory issue (this is the classic skewed-key problem in MapReduce). Of course, there are millions of possible gifts, so this isn’t a real issue for Santa, but it could be in a different use case and needs to be considered.
So now we have two sets of code, the mapper:
parse gift from the above file –> key on this for delivery to the processing plant
the reducer (the processing plant):
–> deliver gift
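Put together, the whole thing can be sketched as a tiny in-memory MapReduce. This is only a stand-in for what Hadoop does across a cluster: the “shuffle” here is just a dictionary, and the function names are mine, not Hadoop’s.

```python
from collections import defaultdict

# The raw lines of Santa's file.
lines = [
    "House 1, Bobby, Train",
    "House 1, Lucy, Doll",
    "House 1, Lucy, Train",
    "House 2, Steve, Train",
    "House 3, Lily, Train",
    "House 3, Isla, Doll",
]

def mapper(line):
    """Parse the gift from a line of the file and key on it."""
    house, child, gift = (field.strip() for field in line.split(","))
    yield gift, (house, child)

def reducer(gift, wishers):
    """The processing plant: deliver its one gift to every wisher."""
    return [f"{gift} delivered to {child} at {house}"
            for house, child in wishers]

# Shuffle: group every mapped value under its key (gift).
shuffled = defaultdict(list)
for line in lines:
    for key, value in mapper(line):
        shuffled[key].append(value)

results = {gift: reducer(gift, wishers)
           for gift, wishers in shuffled.items()}
```

In a real cluster, the mapper runs where each file block lives, the framework does the shuffle over the network, and each reducer runs on its own node – but the map-by-key, reduce-by-key shape is exactly this.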
So, you’re probably wondering where that mapper code runs. Think of it this way (and this is what people mean when they say you “push the code to the data”): the mapper code gets pushed out to every Christian neighborhood (or file in HDFS) and runs there. The code runs on the server where the data/file lives, and all of these run in parallel across the neighborhoods at once – think of it as the elves going out to every neighborhood to do the work at the same time. However, depending on your Hadoop cluster size (the number of survey elves you have), some neighborhoods will have to wait in line while an elf finishes the one in front of them. (Note there is also a limit on processing plants based on your cluster size, so some processing plants may not start until others are done delivering gifts.)
So as an elf processes a house, it’s picked up and copied to one or more processing plants. Here’s where some slowdown can occur, because of the copying and shuffling. You really don’t want to copy the entire house if you can avoid it – keep these objects small.
There are some other considerations here. If you “map” by the gift, how do you know which house it belongs to (or even which kid) in the processing plant? You basically lose that history if all you send to the processing plant (the reducer) is the gift. You can do things like add the house and kid as part of the “value” that is sent, e.g. the gift and the array of houses:kids that wished for it. In fact, that’s why the word count example is elegant: if all you’re doing is counting, you don’t need to worry about mapping back to which book a word came from. Another important point is that this is a separate output from the processing plant (reducer) – you are not actually putting the house back, you are just delivering the gift in the house clone that went to the plant…are we talking mirror universes here??
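For contrast, here is why word count gets away with sending so little. This is a toy version in the same style as the sketches above (the sample text is made up): the mapper keys on the word and sends only a 1, because nothing downstream needs to know where the word came from.

```python
from collections import defaultdict

def wc_mapper(line):
    # Key on the word itself; the value is just a count of 1.
    # No house, no kid, no book -- counting needs none of that history.
    for word in line.lower().split():
        yield word, 1

def wc_reducer(word, counts):
    # The "processing plant" for one word: add up its 1s.
    return word, sum(counts)

text = ["the night before christmas", "the christmas train"]

# Shuffle: group the 1s under each word.
shuffled = defaultdict(list)
for line in text:
    for word, one in wc_mapper(line):
        shuffled[word].append(one)

counts = dict(wc_reducer(word, ones) for word, ones in shuffled.items())
```

Santa’s mapper has to carry the houses:kids along as the value; word count’s mapper can throw everything away but the word.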
To recap, Santa’s Christmas eve now goes like this:
Santa’s survey elves go out to every neighborhood (file) in the world (HDFS), determine the wished-for gifts, and transport each house to the appropriate processing plant based on the gift(s), which may mean cloning the house (the mapper).
Each processing plant then receives the houses and drops in the one gift it is responsible for, a pretty simple task. Note that the more complex task is left to the survey elves, since they are local to the houses (data locality) and they send only what the processing plant needs to do its job (the reducer).
Ok, that sounds even crazier than what I just told my kids, but hell, it would work a lot better. Now I’m going back to my Christmas eve champagne, my wife is starting to look at me funny.