Real-time BI is still perceived as one of the great challenges facing BI in an evolving corporate environment.
Some companies address it by adopting software designed around specific real-time concepts. While the outcome of implementing such software is often remarkable, low-impact and low-cost solutions are often readily available in the company landscape. As can easily be guessed, it is the actual business case and the related financial figures that drive the implementation.
The domain in which real time BI technologies or methodologies are to be applied is often prone to misunderstandings.
Let us recall how we define BI, Business Intelligence: the complex of everything necessary (applications, data, processes) to support control.
The control of operational systems by real-time mimics and dashboards is not business intelligence. It supports an activity; it is not a decision-making tool. These systems are a means to an end (the production of a certain amount of goods, the security of a compound, etc.) but they do not help identify what the end should be. However, the data they generate, once aggregated and transformed into metrics and analytics, are generally a factor in the decision-making process.
A submarine has an extremely complex control centre, with thousands of systems reporting real-time data about their status. However, the commander grounds his decisions on far fewer inputs, deriving either from specific tools (communication systems) or from the transformation of those data into meaningful metrics (tactical consoles or the good old annotated maps). All those screens and dials are not to be considered BI systems, because they directly support operations and are not used to identify the purpose of the submarine's actions at that specific moment of the mission.
So, in our view and for the purpose of this article, real-time BI systems provide an aggregated, transformed and enriched view of operational data to support decisions.
What Does Real Time Actually Mean?
True real-time BI is hardly feasible, because commercial operating systems are not real-time operating systems. By "real time" we actually mean that the lag between a transaction occurring and its effect becoming available for consumption by users through BI tools is low compared to the lag experienced in batch systems. While the idea of updating data once a day is still ubiquitous, shorter intervals are becoming more and more common. At the lower end of this trend, when intervals become short (less than an hour) or very short (a few minutes), the batch update blends into real time.
When the business talks about real time, it is often satisfied with these lower-bound latency levels. As you can see, this matches the idea above that operational systems do not count as real-time BI systems.
The Technological Approach
Implementing real time analytics in a transactional environment requires, obviously, a choice of the enabling technologies.
From a technology perspective, the critical drivers are transaction frequency and data volume.
Some applications, such as telco or web analytics, may have a very high transaction frequency. The transaction inflow can be metaphorically likened to a stream, with the properties associated with one.
This kind of high-frequency data is usually handled by software platforms designed to manage "feeds" of data. These platforms tap into some sort of API or messaging application, sequentially calculating time-based metrics and KPIs. They basically aggregate the last "x" transactions to calculate a time-dependent measure/metric/KPI.
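The "aggregate the last x transactions" pattern can be sketched in a few lines. The following is a minimal, illustrative sliding-window aggregator, not any specific vendor's API; the event shape (a timestamp plus an amount) and the metric names are assumptions for the example.

```python
from collections import deque

class SlidingWindowKPI:
    """Compute time-dependent metrics over a trailing window of events.

    Hypothetical sketch of the feed-processing pattern: each incoming
    transaction carries a timestamp (seconds) and an amount, and every
    metric is evaluated over the trailing window only.
    """

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.events = deque()  # (timestamp, amount) pairs, oldest first

    def ingest(self, timestamp, amount):
        self.events.append((timestamp, amount))
        self._evict(timestamp)

    def _evict(self, now):
        # Drop events that have fallen out of the trailing window.
        while self.events and self.events[0][0] < now - self.window_seconds:
            self.events.popleft()

    def throughput(self, now):
        """Transactions per second over the trailing window."""
        self._evict(now)
        return len(self.events) / self.window_seconds

    def total(self, now):
        """Sum of amounts over the trailing window."""
        self._evict(now)
        return sum(amount for _, amount in self.events)

# Usage: a 60-second window fed with three transactions.
kpi = SlidingWindowKPI(window_seconds=60)
kpi.ingest(timestamp=0, amount=10.0)
kpi.ingest(timestamp=30, amount=20.0)
kpi.ingest(timestamp=90, amount=5.0)   # the t=0 event is now out of window
print(kpi.total(now=90))               # 25.0
```

Real platforms add persistence, out-of-order handling and parallelism, but the core logic is this windowed fold over the stream.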
There are plenty of solutions on the market, each with its own flavor, that implement these features. This is no longer new today; it is something that can be purchased off the shelf. There are dozens of web analytics solutions (Google and Adobe dominate the market) and only slightly fewer generalist ones (Vertica at one end, Vitria at the other, some Oracle and SAP integrated solutions, etc.).
Just a step below these challenging environments, there are other cases that may benefit from a more low-profile approach and some intelligent design. Not all companies have millions of facts to process every minute; many nonetheless need near real-time processing, where a lag in the order of tens of minutes to a few hours is acceptable.
The Zero Footprint Solution
At the lower end of the complexity spectrum, we can simply query the transactional systems for up-to-the-minute data. This solution is particularly viable when tools featuring a semantic layer, such as BusinessObjects, Cognos or MicroStrategy, are available, since the queries issued are likely far less straightforward than those issued against a structured datamart, and hence not within the users' reach.
This method may work beautifully and be a long-term solution with minimal effort and impact.
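To make the approach concrete, here is a minimal sketch of a live query against an operational store, using an in-memory SQLite database as a stand-in for the transactional system. The orders schema and the revenue-by-region aggregate are illustrative assumptions; the point is that the query hits the source system directly, with no intermediate staging.

```python
import sqlite3

# Hypothetical operational schema: an orders table as the OLTP system
# keeps it. The "zero footprint" approach simply queries it live.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (
        order_id   INTEGER PRIMARY KEY,
        region     TEXT,
        amount     REAL,
        created_at TEXT
    );
    INSERT INTO orders VALUES
        (1, 'EMEA', 120.0, '2024-05-01 09:15'),
        (2, 'EMEA',  80.0, '2024-05-01 09:40'),
        (3, 'APAC', 200.0, '2024-05-01 09:55');
""")

# The kind of aggregate a semantic layer would generate on the user's
# behalf: up-to-the-minute revenue by region, straight from the source.
rows = conn.execute("""
    SELECT region, SUM(amount) AS revenue, COUNT(*) AS n_orders
    FROM orders
    GROUP BY region
    ORDER BY region
""").fetchall()
print(rows)  # [('APAC', 200.0, 1), ('EMEA', 200.0, 2)]
```

In practice the semantic layer hides the SQL generation; the value of the tool is exactly that users never have to write this query themselves.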
Obviously, there are a number of things that can go wrong.
The first issue is that, by definition, we are querying and presenting siloed data, the very siloed data that the traditional DW paradigm is designed to integrate and enrich. These siloed data, presented in an interactive format, are often enough to take real-time decisions; when they are not, the tools mentioned above offer a solution. They all provide some sort of data federation at query time, or data integration from multiple sources at report level. So, for example, the raw customer description coming from the transactional system may be integrated with the much more sophisticated conformed dimension and all its attributes.
A second issue, this time in the technical domain, may jeopardize the effort. Operational database structures are often inherently unfit to handle BI queries efficiently. They are optimized to return, add or update small sets of rows very frequently. BI queries typically span large table segments, aggregating a non-negligible percentage of the total rows. These queries may either perform badly or hinder the performance of the operational queries. There is no zero-footprint solution to this issue.
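The report-level integration just described boils down to a join between raw operational rows and the conformed dimension. The following is a minimal in-memory sketch under assumed names (the `customer_id` key, the dimension attributes) chosen purely for illustration, not drawn from any specific tool.

```python
# Report-level federation sketch: raw rows from the operational system
# joined in memory with the conformed customer dimension from the DW.
# All names and attributes here are illustrative.

operational_rows = [
    {"customer_id": "C1", "amount": 50.0},
    {"customer_id": "C2", "amount": 75.0},
]

# Conformed dimension as the warehouse exposes it, keyed by customer_id.
customer_dim = {
    "C1": {"name": "Acme Corp", "segment": "Enterprise", "country": "IT"},
    "C2": {"name": "Rossi Srl", "segment": "SMB", "country": "IT"},
}

# Enrich each operational row with the dimension attributes; rows with
# no match in the dimension simply keep their raw description.
report = [
    {**row, **customer_dim.get(row["customer_id"], {})}
    for row in operational_rows
]
print(report[0]["segment"])  # Enterprise
```

BI tools perform this merge at query or report time; the sketch only shows what the federation conceptually produces.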
Replication is a database engine feature that synchronizes two instances of the same database. The master database is updated by the transactional systems and then replicated into a slave, which is an exact copy of the source. The slave is the instance queried by the BI systems. Obviously, the replication lag is a key factor, but modern solutions can reduce it to negligible levels.
In this way the two systems are isolated and the BI system can run its queries autonomously. Obviously, any requirement for federating these data with the DW remains intact.
If query performance remains an issue, though, a different solution is required.
One way to improve performance is to replicate just the data needed for the real-time decisions. This may be a built-in replication feature, or it may be implemented ad hoc. In the latter case, it is highly advisable to rely on timestamps or last-modified dates and not on change data capture, since hashing may easily become complex enough to slow down the entire process.
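The timestamp-based approach amounts to keeping a "watermark" (the latest last-modified value already copied) and extracting only rows beyond it. Here is a minimal sketch against an in-memory SQLite source; the table, columns and watermark format are illustrative assumptions.

```python
import sqlite3

# Source system stand-in, with a last_modified column maintained by the
# transactional application (an assumption of this sketch).
src = sqlite3.connect(":memory:")
src.executescript("""
    CREATE TABLE orders (order_id INTEGER, amount REAL, last_modified TEXT);
    INSERT INTO orders VALUES
        (1, 100.0, '2024-05-01 08:00'),
        (2, 150.0, '2024-05-01 10:30'),
        (3,  90.0, '2024-05-01 11:05');
""")

def extract_delta(conn, watermark):
    """Return rows modified since the watermark, plus the new watermark."""
    rows = conn.execute(
        "SELECT order_id, amount, last_modified FROM orders "
        "WHERE last_modified > ? ORDER BY last_modified",
        (watermark,),
    ).fetchall()
    # Advance the watermark to the newest row seen; keep it if no rows.
    new_watermark = rows[-1][2] if rows else watermark
    return rows, new_watermark

# Only orders 2 and 3 changed after the last run at 09:00.
delta, watermark = extract_delta(src, "2024-05-01 09:00")
print([r[0] for r in delta])  # [2, 3]
```

A production version must also handle deletes and clock skew between application servers, which is exactly where ad hoc implementations tend to grow complex.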
If the BI environment includes cubes, then cube processing time adds to the real-time lag. Even though modern cubes are very fast at loading and processing very large quantities of data, they are hardly compatible with a real-time environment. While it is in theory possible to load small batches of data and process just those, this may easily prove a complex and unreliable mechanism.
A good alternative is to consider pure ROLAP cubes, where the cube becomes just a proxy to the relational database underneath. This sacrifices the most important (and, many would say, the only) feature OLAP has to offer, the pre-aggregation of the underlying facts, but it preserves a consolidated path to access the data.
If Everything Else Fails
If none of the approaches described above works for the case at hand, it is possible, as a last resort, to fire the ETL process at short intervals. This is obviously the most complex solution to adopt.
First, you have to make sure that every dataset being lifted is internally coherent. For example, there is no point in updating the products mid-morning if they do not yet possess all the attributes required to perform the subsequent transformations.
Second, some metrics based on long-term counts and averages may not be efficiently calculated with an incremental logic. For example, if customers are segmented by the long-term pattern of their purchases, it is generally complex to update these segments incrementally.
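A simple way to enforce this is a coherence gate that refuses to lift a batch whose records are incomplete. The required attributes below are purely illustrative; the point is that the check runs before, not after, the transformations that depend on them.

```python
# Coherence gate sketch: refuse to lift a dataset whose records are not
# yet complete. The required attributes are illustrative assumptions.

REQUIRED_ATTRIBUTES = ("sku", "category", "unit_price")

def is_coherent(products):
    """True only if every product carries all the attributes the
    downstream transformations depend on."""
    return all(
        p.get(attr) is not None
        for p in products
        for attr in REQUIRED_ATTRIBUTES
    )

batch = [
    {"sku": "A-1", "category": "tools", "unit_price": 9.9},
    {"sku": "A-2", "category": None, "unit_price": 4.5},  # not yet enriched
]
print(is_coherent(batch))  # False: A-2 lacks its category, so skip this run
```

When the gate fails, the intraday run simply skips that dataset and retries at the next interval, rather than propagating half-enriched rows downstream.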
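The contrast can be shown with two toy metrics: a running average folds each new observation in O(1), while a history-based segmentation has no such shortcut and must be recomputed over the full purchase history. Both metric definitions here are hypothetical examples.

```python
# A running average updates incrementally: only the previous count and
# mean are needed, never the full history.
def update_mean(count, mean, new_value):
    """Fold one new observation into a running average in O(1)."""
    count += 1
    mean += (new_value - mean) / count
    return count, mean

count, mean = 0, 0.0
for purchase in [10.0, 20.0, 30.0]:
    count, mean = update_mean(count, mean, purchase)
print(mean)  # 20.0

# A long-term segmentation, by contrast, depends on the whole pattern of
# purchases (illustrative threshold), so each intraday run would have to
# re-read the full history rather than apply a cheap incremental update.
def segment(purchase_history):
    total = sum(purchase_history)
    return "high" if total >= 100 else "low"
```

This is why intraday ETL designs usually keep the additive metrics fresh and refresh the history-based ones only in the nightly run.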
Third, and most important, different users may receive different figures during the day simply because they issued a query at different times. While this is perfectly acceptable, and indeed desired, for the metrics that drive the decisions taken upon the real-time analytics, it is a potentially endless source of confusion for all the other information. For example, the monthly closure process requires drawing a line across various streams of information, and managing it while continuing to update everything else is, indeed, very complex.
This article is just a short review of the alternatives available to an organization considering the adoption of real-time analytics. The basic idea is that specific real-time technologies are necessary in some challenging environments, but they are not the most cost-effective solution in others. In those cases, a smart use of existing technologies appears to offer the best return on investment.