There's No "I" in Data

A blog about data + technology and sometimes other things…

Month: June, 2014

1 Picture = 1,000 Words. 1 Picture x Earth’s Surface x 365 x 2 = Knowledge

Look at an overhead picture of anywhere at any time.

Sit back and think about the possibilities for a second.

Thinking about going to the gym?  Check the parking lot to see how crowded it is first.

Wondering what side of the road the accident ahead of you is on?  Look at the image of the road ahead.

Trying to find where your buddies are tailgating?  Look for their car before you drive over.

Long ago, in an old job I can’t talk about, high resolution pictures taken from outer space were magic stuff…

It wasn’t that long ago, though.

Now we take it for granted that we can look at an image of our house on Google Maps, or even a street view of the address for tomorrow’s meeting.

Typically these satellite images are years old, maybe months if you’re lucky.  Google buys them from a commercial satellite company like DigitalGlobe (another previous employer) and fuses them into their globe for your enjoyment – the acquisition, transfer, and fusion take time, so you’ll never see an image just hours old.  Hi, my house last summer!

Until now.

Well, not now, but really soon.

Satellites are the easy part – at least it seems that way now.  Check out the latest Google acquisition, Skybox.  What does Google want with a satellite company?  More like, what doesn’t Google want with a satellite company!?

In fact, they aren’t a satellite company at all, and their founders agree.  They believe they are creators of knowledge – from data that happens to be collected across the globe, twice a day by 2016.  Will you be able to get it in minutes or on-demand by 2020?  That doesn’t seem to be crazy talk anymore.

You want to talk about big data – that’s a hell of a lot of pixels to store and index every day.  This is the kind of massive data dreams are made of.  Not only must Skybox and Google figure out how to index this data effectively for rapid retrieval (wonder if Google will do that well?), but they must be able to analyze it.

This isn’t some college grad on a light table – no, to create knowledge, the image data must be analyzed in automated ways, at scale.

A developer API for this data has been rumored.  Maybe that means on-demand query of imagery by location at first.  But let your imagination go crazy for a second – what about running algorithms over the same orthorectified area across the past year to create time series statistics?  That gets really complex for the API provider: now it’s a framework that not only provides access to the data, but also scales your algorithms to process it.  Dream job?
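To make that concrete, here’s a toy sketch of what such a time series query might boil down to.  `Tile`, `brightness_series`, and the mean-brightness metric are all invented for illustration – no such API existed at the time of writing – and a real platform would run the reduction server-side, next to the pixels, instead of shipping imagery to the developer.

```python
# Hypothetical sketch: reducing a stack of captures of the same
# orthorectified area into a time series.  All names are invented.
from dataclasses import dataclass
from datetime import date
from statistics import mean

@dataclass
class Tile:
    """One capture of an area: acquisition date plus raw pixel values."""
    captured: date
    pixels: list[int]  # grayscale values, 0-255

def brightness_series(tiles: list[Tile]) -> list[tuple[date, float]]:
    """Collapse each capture to one number (here, mean brightness),
    ordered by date -- e.g. a crude proxy for parking-lot fullness."""
    ordered = sorted(tiles, key=lambda t: t.captured)
    return [(t.captured, mean(t.pixels)) for t in ordered]

# Toy stack: three captures of the same parking lot.
stack = [
    Tile(date(2014, 6, 3), [70, 80, 90]),
    Tile(date(2014, 6, 1), [10, 20, 30]),
    Tile(date(2014, 6, 2), [40, 50, 60]),
]
for captured, value in brightness_series(stack):
    print(captured, value)
```

The interesting part isn’t the arithmetic, it’s who runs it: in the dream version, the developer submits only the reduction and the provider handles data access and scale.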

The future should be about providing not only data services, but data-analytic services.  Developer APIs should be about the logic of what developers are trying to accomplish, and nothing else.  Let the API service provider worry about data access, scaling of jobs, and indexing of results.

So, yeah, Google could keep this to themselves and build their own hedge fund based on this new source of knowledge no one else has (and also build some damn good maps).  Or, they could open up a platform API that allows retrieval and analysis of this data to build knowledge I can’t even imagine.

I hope they do the latter.

Quick update (7/30/2014):  Just saw Will Marshall of Planet Labs’ exceptional talk at OSCON, very exciting stuff…looks like this is going to happen, here’s the video.

It’s all about the data

Hadoop, Storm, HBase, Cassandra, … – powerful stuff.  It’s the new age of distributed big data tools that anyone who’s anyone should be using.

However, we’ve all heard something like this before: “My CIO said I had to buy some Hadoop, so now we are using HDFS as our data backup system because we didn’t know what else to do with it”.

Yup.  If you don’t think that’s happening – ignorance is bliss.

Technology pivots are scary.

Leadership understands the value of big data; however, the risk in moving whole hog is great.  Instead, big data typically ends up in the corner wearing Birkenstocks – a side project that’s never going to see the light of day.  It was doomed from the start…not because the technology isn’t great, but because it’s not solving a real problem.  With the exception of the Googles, Facebooks, and LinkedIns of the world – those that know their product depends on it – the big data technology pivot is scary, damn scary.

Why?

Learning Curve.

Security.

Let’s start with security.  If you’re going to “do” big data – build correlations, make predictions, drive more value to your customers – your data is going to get you there, but that’s never going to happen with data stovepipes across a large organization.  SOA pretends to solve this, but you’ll never get there with SOA – that’s worth a whole other blog post, but if someone claims they can MapReduce across a SOA-backed enterprise system, good luck.

To me, the big data technology pivot is about commoditizing your data.  It’s not necessarily the tool (Hadoop, Storm, etc.), but the fact that your developers and analysts can now “play” with all the data – easily; you’ve created data-as-a-service for your organization.  In a SOA world – or, even worse, a stovepiped world – you’ve added data procurement and refresh steps, and you’ll never get where you need to be.

Commoditization of data sounds great, right?  Except that not everyone in your organization has authorization to every piece of data.  You need fine-grained access controls.  It’s not that NoSQL can’t give you this – it can, just like relational databases can – it’s a context problem.  The big data tools are there to build value; the relational databases are there to build an application.  The latter feels safer, doesn’t it?

Stove-piped applications are dead.

The killer app is that you can build lenses into your data in minutes and add functionality in days – because you’ve commoditized your data.  Applications are lenses into your data, a mashup of functions and views – it’s all about the data.

Learning curve – this one is pretty simple: the talent just isn’t there in most organizations.  That’s a combination of open source moving so fast, fear of the technology pivot, and churn among developers working on legacy applications.

Unfortunately you can’t just buy Hadoop and expect magic to happen.  You need to invest developer capital on top of that Hadoop investment.  Administration has also been a challenge; distributed applications are not simple to install, monitor, or maintain.

Recently, there has been a push to make the administration leg easier.  Tools like Puppet, Mesos, YARN, Kubernetes, Slider, and Docker are saving head pounding around the world.  On-demand, as-a-service big data tools are here, even on-premises – the first step in collapsing the learning curve problem.

What’s left to do?

We need the glue between the applications and the data tier.

API.

By providing an abstraction API on top of your data tier, you can enforce security while also commoditizing your data.  This isn’t a new idea – Google App Engine, Cloudant, and Continuuity have all touched on it.  Focus on indexing patterns; developers shouldn’t need to worry about whether it’s HBase or Cassandra under the covers.  However, the current hosted solutions don’t cut it – they end up being hosted stovepipes.

And no, this isn’t the same thing as SOA.  The abstractions provide fine-grained query in a common way, but also coarse-grained access for large-scale analytics.  You aren’t keeping up with some other application’s API; you are all working from the same cloth.
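A toy sketch of those two access styles against one store – a keyed lookup for applications and a whole-corpus fold for analytics.  Everything here (`records`, `get`, `scan`) is invented for illustration, not any real API:

```python
# Hypothetical sketch: one data tier, two access granularities.
# All names are invented for illustration.
records = {
    "user/1": {"name": "ada", "logins": 5},
    "user/2": {"name": "bob", "logins": 12},
    "user/3": {"name": "cam", "logins": 7},
}

def get(key):
    """Fine-grained: an application fetches one record by key."""
    return records.get(key)

def scan(reducer, initial):
    """Coarse-grained: an analytic folds over every record in the store."""
    acc = initial
    for value in records.values():
        acc = reducer(acc, value)
    return acc

print(get("user/2"))                              # one record, for an app view
print(scan(lambda acc, r: acc + r["logins"], 0))  # a corpus-wide aggregate
```

The point is that both styles come from the same abstraction, so the analytic never has to chase some other application’s bespoke API.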

Our team has spent the last year working on EzBake.  The idea behind EzBake is to create that abstraction layer API between your apps and your data that also enforces object-level security.  This allows technology swaps under the API without breaking the API contract with developers.

More importantly, the abstraction layer allows us to distribute questions across the data and return results – with application context.  We now have indexes that enable application building (the lenses) while at the same time providing the data commoditization.  What the heck – we also threw in the classic PaaS provisioning of the applications and the abstraction layer.
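Here’s a minimal sketch of the shape of such a layer: a swappable store underneath, with the abstraction enforcing a per-record visibility check on the way out.  This is not EzBake’s actual API – `RecordStore`, `DataService`, and the visibility labels are all invented to illustrate the idea of object-level security behind an abstraction:

```python
# Hypothetical sketch of object-level security in a data abstraction
# layer.  All names are invented for illustration.
class RecordStore:
    """In-memory stand-in for whatever sits under the API (HBase, Cassandra, ...).
    Swapping this class out should not change the DataService contract."""
    def __init__(self):
        self._rows = []

    def put(self, key, value, visibility):
        self._rows.append({"key": key, "value": value, "visibility": visibility})

    def scan(self, key_prefix):
        return [r for r in self._rows if r["key"].startswith(key_prefix)]

class DataService:
    """The abstraction layer applications talk to."""
    def __init__(self, store):
        self._store = store

    def query(self, user_auths, key_prefix):
        # Object-level security: a row only comes back if the caller
        # holds that row's visibility label.
        return [
            r["value"]
            for r in self._store.scan(key_prefix)
            if r["visibility"] in user_auths
        ]

store = RecordStore()
store.put("sensor/1", "public reading", visibility="U")
store.put("sensor/2", "restricted reading", visibility="S")
svc = DataService(store)
print(svc.query(user_auths={"U"}, key_prefix="sensor/"))       # public row only
print(svc.query(user_auths={"U", "S"}, key_prefix="sensor/"))  # both rows
```

Because the security check lives in the abstraction layer rather than in each application, every lens over the data inherits it for free.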

Easy.  Secure.

EzBake will be open sourced soon.  Our hope is that the community can contribute to our software; we call it a data productivity platform, not a PaaS.  We have just the starting point, there’s work left to do, and we need to do it together.

Let’s make big data a happy place with unicorns and rainbows.  It’s too scary right now.

After all, it’s all about the data.