It’s all about the data

by swtouw

Hadoop, Storm, HBase, Cassandra, … – powerful stuff. It’s the new age of distributed big data tools, and anyone who’s anyone should be using them.

However, we’ve all heard something like this before: “My CIO said I had to buy some Hadoop, so now we are using HDFS as our data backup system because we didn’t know what else to do with it”.

Yup.  If you don’t think that’s happening – ignorance is bliss.

Technology pivots are scary.

Leadership understands the value of big data; however, the risk of moving whole hog is great.  Instead, big data typically ends up in the corner wearing Birkenstocks – a side project that’s never going to see the light of day.  It was doomed from the start…not because the technology isn’t great, but because it’s not solving a real problem.  With the exception of the Googles, Facebooks, and LinkedIns of the world – those that know their product depends on it – the big data technology pivot is scary, damn scary.

Why?

Learning Curve.

Security.

Let’s start with security.  If you’re going to “do” big data – build correlations, make predictions, drive more value to your customers – your data is going to get you there, but that’s never going to happen with data stovepipes across a large organization.  SOA pretends to solve this, but you’ll never get there with SOA.  That’s worth a whole other blog post, but if someone claims they can MapReduce across a SOA-backed enterprise system, good luck.

To me, the big data technology pivot is about commoditizing your data.  It’s not necessarily the tool (Hadoop, Storm, etc.) but the fact that your developers and analysts can now “play” with all the data – easily.  You’ve created data-as-a-service for your organization.  In a SOA world, or an even worse stovepiped world, where every new use of data means another round of procurement and refresh, you’ll never get where you need to be.

Commoditization of data sounds great, right?  Except that not everyone in your organization is authorized to see every piece of data.  You need fine-grained access controls.  It’s not that NoSQL can’t give you this – it can, just like relational databases can – it’s a context problem.  The big data tools are there to build value; the relational databases are there to build an application.  The latter feels safer, doesn’t it?
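To make that concrete, here’s a minimal sketch of object-level access control in the style of cell-level visibility labels.  The names and the all-labels-required rule are simplifications invented for illustration (real systems support richer label expressions), but it shows where the check belongs: in the data tier, not in each application.

```java
import java.util.Set;

// Minimal sketch of object-level access control via visibility labels.
// All names are hypothetical; this is not any product's actual API.
public class VisibilitySketch {

    // Each record carries the labels a reader must hold to see it.
    record Record(String rowId, byte[] value, Set<String> requiredLabels) {}

    // A user's authorizations, typically resolved from an enterprise
    // attribute service at query time.
    record Authorizations(Set<String> labels) {}

    // The data tier enforces the check, so every application built on top
    // inherits the same security model. (Simplified: all labels required;
    // real systems support boolean label expressions.)
    static boolean canRead(Record r, Authorizations auths) {
        return auths.labels().containsAll(r.requiredLabels());
    }

    public static void main(String[] args) {
        Record r = new Record("cust-42", "balance=100".getBytes(),
                              Set.of("FINANCE", "PII"));
        Authorizations analyst = new Authorizations(Set.of("FINANCE"));
        Authorizations auditor = new Authorizations(Set.of("FINANCE", "PII"));
        System.out.println(canRead(r, analyst)); // false: missing PII
        System.out.println(canRead(r, auditor)); // true
    }
}
```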

Stovepiped applications are dead.

The killer app is that you can rapidly build lenses into your data in minutes and add functionality in days – because you’ve commoditized your data.  Applications are lenses into your data; they are a mashup of functions and views – it’s all about the data.

Learning curve – this one is pretty simple: the talent just isn’t there in most organizations.  That’s a combination of open source moving so fast, fear of the technology pivot, and churn among developers working on legacy applications.

Unfortunately, you can’t just buy Hadoop and expect magic to happen.  You need to invest developer capital on top of that Hadoop investment.  Administration has also been a challenge; distributed applications are not simple to install, monitor, or maintain.

Recently, there has been a push to make the administration leg easier.  Tools like Puppet, Mesos, YARN, Kubernetes, Slider, and Docker are saving head pounding around the world.  On-demand, as-a-service big data tools are here, even on-premises – this is the first step in collapsing the learning curve problem.

What’s left to do?

We need the glue between the applications and the data tier.

API.

By providing an abstraction API on top of your data tier, you can enforce security while also commoditizing your data.  This isn’t a new idea; Google App Engine, Cloudant, and Continuuity have all touched on it.  Focus on indexing patterns – developers shouldn’t need to worry about whether it’s HBase or Cassandra under the covers.  However, the current hosted solutions don’t cut it; they end up being hosted stovepipes.
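As a rough sketch of what that abstraction might look like (the interface and class names below are invented for illustration, not from any particular product):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.NavigableMap;
import java.util.Optional;
import java.util.TreeMap;

// Hypothetical sketch of a data-tier abstraction: applications code against
// indexing patterns (put/get/scan), never against a specific store.
interface DataStore {
    void put(String index, String key, byte[] value);
    Optional<byte[]> get(String index, String key);
    Iterable<byte[]> scan(String index, String startKey, String endKey);
}

// In-memory stand-in for illustration. An HBaseStore or CassandraStore would
// implement the same interface, so the backend can be swapped without
// breaking the contract with application developers.
class InMemoryStore implements DataStore {
    private final Map<String, NavigableMap<String, byte[]>> indexes = new HashMap<>();

    public void put(String index, String key, byte[] value) {
        indexes.computeIfAbsent(index, i -> new TreeMap<>()).put(key, value);
    }

    public Optional<byte[]> get(String index, String key) {
        return Optional.ofNullable(
            indexes.getOrDefault(index, new TreeMap<>()).get(key));
    }

    public Iterable<byte[]> scan(String index, String startKey, String endKey) {
        return indexes.getOrDefault(index, new TreeMap<>())
                      .subMap(startKey, true, endKey, false)
                      .values();
    }
}
```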

And no, this isn’t the same thing as SOA.  The abstractions provide fine-grained queries in a common way, but also coarse-grained access for large-scale analytics.  You aren’t keeping up with some other application’s API; you are all working from the same cloth.
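Continuing the hypothetical sketch above, both granularities run through the same contract: a point lookup for application building and a range scan for analytics.

```java
// Uses the hypothetical DataStore/InMemoryStore sketch from above.
public class UsageSketch {
    public static void main(String[] args) {
        DataStore store = new InMemoryStore();
        store.put("customers", "cust-42", "Jane".getBytes());
        store.put("customers", "cust-43", "Joe".getBytes());

        // Fine-grained: an application fetches one record via the common API.
        store.get("customers", "cust-42")
             .ifPresent(v -> System.out.println(new String(v)));

        // Coarse-grained: an analytic sweeps a whole key range through the
        // same API, instead of paging through another app's bespoke endpoints.
        int count = 0;
        for (byte[] ignored : store.scan("customers", "cust-", "cust-zzz")) {
            count++;
        }
        System.out.println(count + " records scanned");
    }
}
```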

Our team has spent the last year working on EzBake.  The idea behind EzBake is to create that abstraction layer API between your apps and your data, one that also enforces object-level security.  This allows technology to be swapped out under the API without breaking the contract with developers.

More importantly, the abstraction layer allows us to distribute questions across the data and return results – with application context.  We now have indexes to enable application building (the lenses) while at the same time providing data commoditization.  What the heck, we also threw in classic PaaS provisioning of the applications and the abstraction layer.
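Here’s a loose sketch of that fan-out, simplified to a local thread pool; in a real deployment the question ships to the nodes holding the data, and every name below is invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Loose sketch of distributing one question across data partitions and
// merging the partial answers. All names are hypothetical.
public class FanOutQuery {

    // Each partition answers the same question over its slice of the data.
    interface Partition {
        List<String> answer(String question);
    }

    static List<String> ask(String question, List<Partition> partitions)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(partitions.size());
        try {
            List<Future<List<String>>> futures = new ArrayList<>();
            for (Partition p : partitions) {
                // The question ships to the data, not the data to the question.
                futures.add(pool.submit(() -> p.answer(question)));
            }
            List<String> merged = new ArrayList<>();
            for (Future<List<String>> f : futures) {
                merged.addAll(f.get()); // gather and merge partial results
            }
            return merged;
        } finally {
            pool.shutdown();
        }
    }
}
```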

Easy.  Secure.

EzBake will be open sourced soon.  Our hope is that the community can contribute to our software; we call it a data productivity platform, not a PaaS.  We just have the starting point; there’s work left to do, and we need to do it together.

Let’s make big data a happy place with unicorns and rainbows.  It’s too scary right now.

After all, it’s all about the data.