Dreams: Low Latency Hadoop
Hadoop works wonderfully, but it’s passé…the future is ___________
This really should be rephrased as “Hadoop works wonderfully, but I need a low-latency solution that allows on-the-fly exploration of data.” So what is that? It’s a combination of Hadoop and a sidecar that helps with this real-time exploration: your “big data architecture”.
If you are worried about low latency, then your problem just became a lot more complex, because you’ve now introduced the classic consistency problem. A great and popular blog entry by Nathan Marz describes the Lambda architecture, and notes that Hadoop can sidestep the consistency issue by relaxing the need for low-latency results. However, most scenarios require the most up-to-date results possible, and hence the search for a Hadoop+ architecture.
- There is no single architectural solution.
- There is no “next thing” that is so much better or obvious.
Here’s a typical scenario requiring a big data architecture, from nearly all my customers: “I want to ask complex multi-dimensional questions against huge data sets.” Key to this is the fact that the questions they want their data to answer aren’t necessarily explicit components in the data. What I mean is that the multi-dimensional questions include derived data, potentially from disparate sources.
Hey, a great scenario for Hadoop.
So, back in 2011, we started writing custom Hadoop MapReduce jobs for every possible question an analyst came up with. Yeah, that’s scalable [insert sarcasm here]. It’s actually horrible on multiple levels: the analysts don’t trust that you did it right (black box), you likely return too many results, and once you’re done, the job is difficult if not impossible for others to leverage.
So what to do?
We built Amino. Bear with me: Amino alone doesn’t solve the issue of latency in Hadoop (today), but it is a building block to do so.
We noticed that many of our custom MapReduce jobs were built from the same boilerplate components. Hey, what if we pre-computed those components…and indexed them in a way that allows analysts to mix and match them on the fly, asking these complex multi-dimensional questions as a query?
The special sauce, from a technical standpoint, is the Amino index, which is fully scalable and provides sub-second queries regardless of the number of dimensions. This is enabled through Accumulo iterators in combination with compressed bit vectors.
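To make the bit-vector idea concrete, here is a minimal sketch of how a multi-dimensional query can be answered by intersecting per-dimension bit vectors. The feature names, record IDs, and the use of plain Python ints are all illustrative assumptions; Amino’s actual index uses compressed bit vectors inside Accumulo iterators.

```python
# Illustrative only: each dimension's component maps a feature value to a
# bit vector over record IDs -- bit i is set if record i has that feature.
# (Plain Python ints stand in for compressed bit vectors here.)
index = {
    ("country", "US"):    0b10110,   # records 1, 2, 4
    ("status", "active"): 0b00111,   # records 0, 1, 2
}

def query(*features):
    """AND the bit vectors for every requested feature; the surviving
    bits are the records matching all dimensions at once."""
    result = -1  # start with all bits set
    for f in features:
        result &= index[f]
    return [i for i in range(result.bit_length()) if (result >> i) & 1]

print(query(("country", "US"), ("status", "active")))  # [1, 2]
```

The key property is that the intersection is a bitwise AND, which stays cheap no matter how many dimensions the analyst stacks on.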
The special sauce, from a concept standpoint, is that developers from across an organization can create components (through MapReduce) and then contribute those to the Amino index. Now developers who know nothing of each other can contribute components that together answer an analyst’s question.
The flow goes like this: at some interval, the contributed jobs run and spit their results to HDFS; once all the jobs complete, Amino sucks up the results and does some conversions to index them in the “special” index.
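The ingest step above can be sketched as folding each contributed job’s output into one shared index. The `(feature, record_id)` output shape and dict-based index are assumptions for illustration; the real pipeline reads HDFS files and writes Accumulo.

```python
# Sketch of Amino-style ingest: each contributed job's output is a list of
# (feature, record_id) pairs; ingest folds them all into one index of
# bit vectors. (Dicts and ints stand in for HDFS and Accumulo.)
def ingest(job_outputs):
    index = {}
    for output in job_outputs:            # one list per contributed job
        for feature, record_id in output:
            index[feature] = index.get(feature, 0) | (1 << record_id)
    return index

job_a = [(("country", "US"), 1), (("country", "US"), 4)]   # from developer A
job_b = [(("status", "active"), 1), (("status", "active"), 2)]  # from developer B
idx = ingest([job_a, job_b])
# idx[("country", "US")] now has bits 1 and 4 set: 0b10010
```

Note that the two jobs never coordinate; the shared output convention is what lets their components compose in the index.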
I don’t want to downplay the power here. We were able to enable analysts to answer questions in seconds that previously took days to weeks, or were not possible at all. But that’s not good enough; there are two issues:
1. Since the components are derived data, typically aggregates across the entire set, the jobs must run across everything, every time. The Lambda architecture attempts to solve this through sub-aggregates, like storing results by hour and then aggregating up at query time. I would argue this is either severely limiting (you can only do counts by hour) or requires insanely complex logic in the query engine to account for every possible roll-up, particularly when considering multiple different MapReduce outputs (the point of Amino).
2. Your results are only as new as the last time you ran the job(s).
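The sub-aggregate approach in point 1 can be sketched in a few lines. The hourly keys and counts are made up for illustration; the point is that this only works cleanly for decomposable aggregates like counts and sums.

```python
# Lambda-style sub-aggregates: the batch layer stores counts per hour,
# and query time rolls them up over a range. This is safe for counts
# because they are associative -- but an aggregate like "distinct users"
# cannot be rebuilt from hourly partials, which is the limitation above.
hourly_counts = {
    "2014-08-15T00": 120,
    "2014-08-15T01": 95,
    "2014-08-15T02": 210,
}

def rollup(hours):
    # summing sub-aggregates at query time stands in for the roll-up logic
    return sum(hourly_counts.get(h, 0) for h in hours)

print(rollup(["2014-08-15T00", "2014-08-15T01"]))  # 215
```

Now imagine writing roll-up logic like this for every combination of every developer’s MapReduce output; that is the “insanely complex logic in the query engine” above.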
This is where we are today, but I believe this is solvable in a Lambda-kinda way. In the Lambda architecture, the final step is to aggregate your real-time data with the batch data as a query-time roll-up. In a great blog post, Jay Kreps describes the Lambda downfall as maintaining the real-time and batch code paths in parallel. I would pile on and say, as previously mentioned: what about the roll-up code? If you are doing anything more complex than counts by hour, this is going to get super hairy, even if you don’t have the real-time stuff to worry about.
Well, what if we went ahead and did the aggregating (roll ups) in real time?
So there’s this thing called iterators in Accumulo; I mentioned them earlier. They do the bitwise operations in the Amino index that enable our multi-dimensional queries. Think of an iterator as a mini reduce step: at query time, your “map” is the keys of the query and your “reduce” is the iterator. Iterators can also run on insert.
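The “mini reduce” framing can be modeled in a few lines. Real iterators implement Accumulo’s Java iterator interfaces (e.g. its Combiner) on the tablet server; this toy version just shows the shape of the computation, with made-up keys and values.

```python
# Toy model of an Accumulo combiner-style iterator: the server hands it
# the values for one key (at scan time, compaction time, or insert time)
# and it folds them into a single value -- a "mini reduce" per key.
from itertools import groupby

def combiner_scan(sorted_entries, reduce_fn):
    """Group consecutive entries by key and reduce each group --
    roughly what a combiner iterator does server-side at query time."""
    for key, group in groupby(sorted_entries, key=lambda kv: kv[0]):
        yield key, reduce_fn(v for _, v in group)

entries = [("US", 1), ("US", 1), ("US", 1), ("UK", 1), ("UK", 1)]
print(list(combiner_scan(sorted(entries), sum)))  # [('UK', 2), ('US', 3)]
```

Because the reduction runs next to the data, the client only ever sees the already-rolled-up value per key.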
So, take a step back and remember Amino is a bunch of MapReduce results from different developers that act as components for analyst questions. Well, we could instead pre-set those components in the index and update them with iterators as data arrives. So instead of the Amino API abstracting MapReduce, it should abstract iterator roll-ups that occur at insert time.
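An insert-time roll-up, stripped to its essence, looks something like this. The record shape and the count-by-country component are illustrative assumptions; the real mechanism would be a developer-contributed iterator updating the Amino index.

```python
# Sketch of an insert-time roll-up: instead of recomputing a component
# in a periodic batch job, each arriving record is folded into the
# stored aggregate immediately, so the component is always current.
component = {}  # e.g. a count-by-country component kept in the index

def on_insert(record):
    # the developer-supplied roll-up: whatever logic they want,
    # as long as the output meets the query's requirements
    key = record["country"]
    component[key] = component.get(key, 0) + 1

for r in [{"country": "US"}, {"country": "US"}, {"country": "UK"}]:
    on_insert(r)
# component == {"US": 2, "UK": 1}, with no batch re-run needed
```

Contrast this with the batch flow, where these counts would only be as fresh as the last full job run.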
So why is this better?
Remember that Amino offloaded the creation of components to various developers; they could do whatever they wanted in those jobs, as long as the output met the conventions to be effectively indexed in the Amino index. Similarly, if developers implement their own iterator roll-ups, they can do whatever they want in there, as long as the output meets the requirements of the query.
This removes the issue of complex logic in the query-time roll-ups, and also removes Jay Kreps’ point about needing code that outputs the same thing in batch and stream.
But alas, as Jay mentions, iterators, like stream processing, are not immediately consistent, and as with Lambda, the CAP theorem still applies.
The team is working it now…we’ll let you know how it goes.
Quick update (8/15/2014): Reading the Google Mesa paper, and there are some striking similarities between what I describe and their data model, specifically the versioned aggregation technique.