Big Data == Big Variety & Big Variability

Posted: July 18, 2012
I read a decent amount about Big Data in the trade blogs. There’s a common assumption among tech authors that Big Data is new because the Volume and Velocity of incoming data were previously just too big to handle. That’s patently false. I worked in Telecom for nearly 10 years, where data volumes, already arriving in structured form, exceeded, both in motion and at rest, what would classify as Big Data today. Techniques were developed that dealt very successfully with that Volume and Velocity, even providing detailed and quick reporting across Terabytes or Petabytes at rest.
That’s not to say the technology was easily accessible to everyone. The legacy BI vendors will sell it to you, but the costs exceed what most enterprises can afford. What Hadoop and others have done in the Big Data space is commoditize it to the point where it’s now accessible to companies well outside the Fortune 100. Where all the solutions fall down, until we get into true Big Data stores like Hadoop, Splunk, Cassandra, Mongo, etc., is in dealing with Variety and Variability.
The Achilles heel of legacy BI technologies is requiring that data be put into a structure at insert time. The structure also needs to be consistent within each data type, and all data types must be known in advance. This is the Variety problem. We don’t know, when we begin to build our data warehouse, what interesting data we’ll find. What we do know is that we want to stuff it somewhere so it can later be added to the soup that is our intelligence gained from disparate data types.
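To make the contrast concrete, here’s a minimal schema-on-read sketch in Python. The record shapes and field names are hypothetical; the point is that nothing declares a schema at insert time, and structure is imposed only when a consumer asks a question:

```python
import json

# Hypothetical event records from disparate sources; no shared schema is
# declared anywhere -- each record simply carries whatever fields it has.
records = [
    {"type": "call", "duration_sec": 312, "caller": "555-0101"},
    {"type": "web_hit", "url": "/pricing", "user_agent": "Mozilla/5.0"},
    {"type": "invoice", "amount": 42.50, "currency": "USD"},
]

# Store as newline-delimited JSON: structure is applied at read time,
# not insert time, so new record types can land here at any point.
with open("events.jsonl", "w") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")

# Later, a consumer imposes only the structure its question needs.
with open("events.jsonl") as f:
    events = [json.loads(line) for line in f]

calls = [e for e in events if e.get("type") == "call"]
print(len(calls))  # only the call record survives the read-time filter
```

A legacy warehouse would require three tables (or one very wide one) designed up front; here the invoice and web-hit records ride along untouched until someone cares about them.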
What happens when that data format changes? This is the true differentiator for Big Data technologies. Anyone who has worked in a legacy BI environment can tell you that simply adding a field to a report can take months, especially if the change has to ripple all the way through the architecture, touching every processing step from the raw data on up. Legacy BI environments simply have no answer for Variability.
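The Variability case can be sketched the same way. Assume a hypothetical feed that sprouts a new field a month in; a schema-on-read consumer absorbs it with a default instead of a months-long change request:

```python
import json

# Day 1: the upstream feed delivers call records with three fields.
day1 = {"type": "call", "duration_sec": 120, "caller": "555-0101"}

# Day 30: the upstream system adds a field without warning.
day30 = {"type": "call", "duration_sec": 95, "caller": "555-0102",
         "dropped": True}

store = [json.dumps(day1), json.dumps(day30)]  # stand-in for files at rest

# Old records report a default, new records report the real value.
# No table ALTER, no ETL rewrite, no change ticket.
recs = [json.loads(raw) for raw in store]
for rec in recs:
    print(rec["caller"], rec.get("dropped", False))
```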
The Data Warehouse of the future won’t sit in a structured database. Some sources of that data will remain structured, because it makes sense to keep financials in an ACID-compliant RDBMS, but all of it will be fed into an unstructured repository, likely HDFS, as files. Those semi-structured files can then be joined to other semi-structured data through query languages that are still in their infancy or not yet written. The concepts of rigid tables, OLAP cubes, etc., will become foreign, and the time sunk into writing ETL processes and managing environments full of copies of that structure just to test migrating changes will be reclaimed for gaining intelligence about the Enterprise.
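Joining semi-structured files doesn’t require rigid tables. A rough sketch, with hypothetical call and customer records, using a plain hash join in Python as a stand-in for what those future query languages would compile down to:

```python
import json

# Two semi-structured "files" at rest, joined on a shared key at query time.
calls = [json.loads(s) for s in (
    '{"caller": "555-0101", "duration_sec": 312}',
    '{"caller": "555-0102", "duration_sec": 45}',
)]
customers = [json.loads(s) for s in (
    '{"phone": "555-0101", "name": "Acme Corp", "tier": "gold"}',
)]

# Build a hash index on the join key, then probe it: the same primitive
# a higher-level query engine would generate for this join.
by_phone = {c["phone"]: c for c in customers}
joined = [
    {**call, "name": by_phone[call["caller"]]["name"]}
    for call in calls
    if call["caller"] in by_phone
]
print(joined)  # the 555-0102 call has no matching customer and drops out
```

Neither side ever lived in a table; the join key and output shape were chosen at query time, not at load time.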
The big promise of Big Data isn’t that we’re going to see so much more; it’s that we’ll suddenly be free of the shackles of legacy technology that’s been stretched well beyond its logical breaking point! The man-decades of lost productivity industry-wide will be freed and unleashed like a tsunami. Insight will abound, decisions will be informed, and IT will finally begin to deliver on promises made decades ago: true, real value to the business. All because we’re going to free the most valuable IT asset, the IT worker, now given the more glamorous title of Data Scientist, from the horrendously unproductive way we’ve been managing our data for the last 20 years. Be prepared, the revolution is coming.