How to Get it All Wrong Because of Big Data

Hadoop is no longer the elephant in the room

I have heard, time and again, over the past year or so, a question that may appear naïve to insiders. I have heard managers asking themselves, “should we build a real Data Warehouse storing all our data in Hadoop, replacing the relational databases we are using now?”


My reply is, usually, “where are you getting this idea from?”

One step back. The aim of Business Intelligence (as usual, I include the DW under this umbrella term) is to support the control process. The aim of the control process is to build a model that describes the business in numerical terms, to monitor the metrics and KPIs that measure business performance and, crucially, to predict the effect of business decisions and external perturbations on that performance.

Hadoop is a piece of software, usually bundled with a lot of other tools with funny names, that can store any sort of data (organized in files) and execute transformations or queries on it quickly and efficiently. It does so through parallel processing, that is, by sharing the processing burden among Hadoop nodes running on different physical machines.
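To make the idea concrete, here is a minimal sketch of the map/reduce pattern that underlies Hadoop's parallel processing, in plain Python. The chunks, function names, and data are invented for illustration; in Hadoop the "map" step would run on different machines over different file splits, and the "reduce" step would merge their partial results.

```python
from collections import defaultdict

def map_phase(chunk):
    # The map step runs independently on each chunk of data
    # (in Hadoop, each chunk would live on a different node).
    # Emit (word, 1) pairs for one chunk of text.
    return [(word, 1) for word in chunk.split()]

def reduce_phase(pairs):
    # The reduce step merges the pairs emitted by all mappers,
    # summing the counts per word.
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

chunks = ["big data big", "data warehouse"]  # two toy "splits"
mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
counts = reduce_phase(mapped)
print(counts)  # {'big': 2, 'data': 2, 'warehouse': 1}
```

Because each chunk is mapped independently, the work distributes naturally across machines; only the final reduce needs to see the combined output.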

So, what does Hadoop have to do with the business model mentioned above?

If you answered “nothing”, you are right.

Hadoop enables the model to be fed with an entirely new class of data (sometimes called “black data” but generally known as “big data”) whose volume is large enough to make processing with traditional technologies too expensive. This will improve the model and bring an entirely new class of performance metrics into play, but it is certainly not going to replace the model's current implementation. Some of the most crucial data for managing any business are very “small” data (for example, bank account balances, which are numbers that practically everything influences), and there is no point in storing them in Hadoop.

Since the output of the model is always numeric, and the quantities that make up the model's inputs and outputs are naturally expressed in tabular form, Hadoop offers no decisive advantage over classic relational databases for storing them and performing the calculations the model requires.
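This is the kind of tabular, numeric work a relational backend handles well. A minimal sketch, using SQLite in memory with an invented table and invented figures, of storing small facts and aggregating them with plain SQL:

```python
import sqlite3

# A toy fact table (names and numbers invented for the example):
# per-region sales amounts, aggregated with a GROUP BY, which is
# exactly the kind of calculation a DW model performs routinely.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("North", 100.0), ("North", 50.0), ("South", 75.0)],
)

rows = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
for region, total in rows:
    print(region, total)  # North 150.0 / South 75.0
conn.close()
```

For data at this scale, the relational engine does the aggregation in milliseconds; distributing it across a cluster would add complexity without any gain.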

While, in principle, it is perfectly possible to rebuild a DW in Hadoop, there is actually no reason to do so.
In addition, at least today, a certain level of impracticality makes working with Hadoop a bumpier ride than having a relational backend.

Let’s not forget that some other solutions compete with Hadoop in the big data space and may offer alternatives not to be ignored, such as Microsoft Parallel Data Warehouse, Oracle Exadata, or SAP HANA. Hadoop is just one of the new technologies enriching our world, a very important one, with a great future ahead, but just one.

So, the next time you hear someone ask “should I replace my DW with a Hadoop-based solution?”, you know what to answer.

Enjoy!