What Methodologies are Good for

I have always had my own approach to methodologies: what I find useful, I take; what I find useless or harmful, I leave.

That should be true every time you want to get something done, every time reaching the result is the main thing. More often than not, however, conforming to the methodology appears to come first.

The methodology debate in IT has long been a heated one, and Agile appears to be winning. I have nothing against Agile or any other methodology; I have a lot against abusing the ideas underlying them.

Every methodology has its own take on how the project is approached, how phases should be identified, who the stakeholders are, who needs to be informed, which documents are required and in what format, etc. Each of them is useful and gives suggestions about how a project should be managed. The basic idea, however, is that if you are able to break the process down into elementary building blocks, everything will be easier. As the saying goes, plan your action and then action your plan. Do small, manageable chunks of work and, if you complete enough of them successfully, eventually the project will be successful too.

How to identify them is open to debate.

Unfortunately, pretty much every methodology lacks two crucial pieces of guidance to be really effective:

  • It doesn't tell you what to do.
  • It doesn't tell you how to do it.

What is the best way to implement a specific dimension, to design a data mart or to broadcast a dashboard? No methodology tells me that. You may say that this is the domain of designers rather than project managers, but then, depending on the choices you make, your "chunks of work" change. The methodology may tell me "first, choose the technology" or "do some design first, then do something else", but the choice influences the domain of the methodology itself.

Then, once I have my plan, the methodology tells me nothing about how it is going to be implemented. How can I tell which small choices I need to make to complete the plan elements? The plan's success depends on the quality of execution as much as on the soundness of the original idea. Methodologies may offer broad guidelines, but the detailed information required to do the right thing in the right situation, again, is not within the realm of methodologies.

What keeps puzzling me is the amount of literature about methodologies (including this post, actually...) compared with how little there is about the what and the how. Worse, this subject is left to the market players, who obviously have a biased view.

Mind that I am not advocating technology agnosticism; what I am advocating is a discipline of comparing technologies, patterns and practices, like "comparative law" or "comparative linguistics".

What I would really need is a book that tells me things like: "mind, if you choose MySQL as your database, at some point you will have to tune the xyz parameter for your environment. It is a cumbersome task because it requires understanding this and that. In Oracle, on the contrary, you don't do any of that, but you do this instead."

This would be a real help in bringing a project over the line. A methodology can only tell you to "assess your choices". Well, I really do not need that kind of suggestion; I would do it anyway.

So, if the choices you make affect what you need to do in ways no methodology can address, what is the point of having one?

Actually, there is no point: every methodology is reasonably neutral; the people applying it are the real factor. Project success doesn't depend on choosing the right methodology but on making the right choices within the context of the methodology.

At the end of the day, the only reasonable methodology is:

  • Decide what needs to be done
  • Do it

If methodologies fail to offer guidance on the crucial choices, then people become crucial.

There is no substitute for a person, or a very small group of people, who carry in their heads all the subject matter expertise required to manage the project.

It is the "wide" knowledge I have already advocated, as opposed to the "deep" knowledge that is so popular today.

The most successful project leaders I have seen were those able to move effortlessly from a database design session to a stakeholder meeting, passing through an analysis session with the users.

These people used a methodology as a frame of reference, providing bits and pieces to keep everything in order, but they did not hesitate to deviate from it whenever their experience told them it was the right thing to do.

Maybe we are all more tied to pre-defined roles and personas than is really necessary.

What do you think?

The Requirement Illusion

Once upon a time, in a previous life, I used to hire contractors and freelancers to staff the projects I was managing.
Obviously I needed to interview them first. I generally looked for generalists, not specialists; if a report designer needed to build a server-side view, I always thought it was a waste of money and time to call in the database designer just for that. The roles on the project were almost always assigned according to the customer organization, not to technology expertise.
The obvious consequence was that I was always recruiting fairly senior people, with many years of experience under their belt; I wanted people who were able to interact with the customer and get their hands dirty at the same time.
One of the first questions I used to ask while interviewing was "What are the typical stages of a DW/BI project?". About half of them said "First, collect user requirements". Such an answer was invariably and swiftly followed by a "thank you for your time, we will let you know."

Expecting to be told what to do is generally a problematic approach to a project; in DW/BI it is quite close to an illusion. It is widely accepted that creating sound specifications in this world is extremely difficult because users are unable to define exactly what they need, so it is necessary to work in iterations, each time polishing the deliverables a bit more.
My point of view, however, is that even the deliverables produced by this process are an inefficient solution. The reasons are simple and twofold.

First, a deliverable that doesn't raise further questions is a bad one because it is not fostering the quest for insight. Apart from legal statements, which have established rules and formats, you get a return from your solutions when knowledge helps improve and reshape the business.
Second, you may deliver what you were asked for and yet not resolve the customer's issues. Our job is to solve problems, not to deliver stuff. You may be contractually compliant but, as long as you are not fixing the issues, the customer will be unhappy. In the long run this is going to hurt as much as being explicitly in breach of contract.

So, what is the right answer to the original question? In my view, something along these lines. First, get to know the people you are working with; understand what they do, what they are focusing on, and learn about their concerns. Take care to assimilate the key elements of their language so that you can communicate effectively. Then investigate the reasons why they started the project in the first place, what they would ideally like to have, the position they would like to be in after being provided with some deliverables. On that knowledge, figure out a potential solution, prototype it and put it to the test with the users. Rinse and repeat till the customer is satisfied.


Emerging Complexity


Ideas like entropy, Darwinian evolution and energy are part of our language and are often invoked in our discussions as images or metaphors. A bit rarer are images involving complexity as an idea, and "emerging complexity" is a hardly used term.

The term "emerging complexity" refers to complex structures being built upon simple components and simple rules. The awesome development that led life from simple prokaryotic cells to the unique biological architectures of humans and other large animals is the first and most famous example we could give.
Since I trained as an aerospace engineer, I also like to mention the fluid that starts as a perfect laminar flow out of a nozzle into the water, only to develop waves and vortexes downstream without anything apparently provoking them. In the end the flow motion becomes extremely complex and, ultimately, it dissolves into the larger body of fluid.
Even simple dynamic systems, governed by simple equations, may give rise to extremely complex patterns, often known as chaotic.

I really wish that a BI system, an MIS considered in all its components, could be described by a system of equations. This is not the case. However, we often see behaviors that should be "linear" become "chaotic" and then show a new complexity.

As usual, let's work through an example.
Let us consider the familiar sales space, in particular for an Internet generalist retailer, selling different product sectors to a wide range of customers.
The first reporting produced will likely be the usual better/worse by product and by product group, over various time windows (year vs. year, YTD, last week vs. current week, this week last year, this Christmas vs. last year's Christmas, etc.). Measures will be quantities and gross, discount and net values.
This setting alone may produce a lot of complexity, which may be (roughly and not so rigorously) measured as:

C = (number of product classifications) x (number of time windows)

Not all the combinations will be relevant to management; some will just be dummy element combinations, so:

Cr = C x R, where R < 1

Initially, only some of the combinations will be relevant to the users but, slowly, as the business goes on, each combination will come under scrutiny as the hunger for knowledge increases, and the overall complexity goes up as a result.
That is

R -> 1 as time -> infinity (or, more simply, R keeps growing)

When R approaches 1, typically C increases as well because other factors are added:

C = (number of product classifications) x (number of time windows) x (number of customer classifications)

So C and R tend to increase because, naturally, there is always a growing need for information and analysis to keep improving the business.
The maximum value of R is 1, and the maximum value of C is determined by the number of attributes that can be attached to the sales and used to analyze them.
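The rough measure above can be sketched in a few lines of Python; the dimension sizes and the relevance ratio R below are hypothetical numbers, chosen only to make the arithmetic concrete:

```python
# Rough complexity measure for a reporting setting:
# C  = product of the sizes of the analysis dimensions,
# Cr = C x R, where R (<= 1) is the share of combinations
# actually relevant to the users.

def complexity(dimension_sizes):
    """C: total number of combinations across all dimensions."""
    c = 1
    for size in dimension_sizes:
        c *= size
    return c

def relevant_complexity(c, r):
    """Cr = C x R, the combinations actually under scrutiny."""
    assert 0 < r <= 1
    return c * r

# Hypothetical figures: 5 product classifications, 8 time windows.
c = complexity([5, 8])            # C = 40
cr = relevant_complexity(c, 0.3)  # initially only 30% is relevant

# As the hunger for knowledge grows, R -> 1 and new dimensions appear:
c2 = complexity([5, 8, 4])        # add 4 customer classifications
cr2 = relevant_complexity(c2, 0.9)

print(c, cr, c2, cr2)
```

The point of the sketch is only that C grows multiplicatively with each new dimension, while R creeps toward 1 over time.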
In proximity of these maximum levels, something new happens. So far we have expressed the complexity as a linear combination of elements but, when predictive algorithms or data science enter the arena, we start adding strongly non-linear complexities to our system. In this case it is difficult to attach a number to the complexity, and I prefer to think of it as a phase transition, where new elements enter the game and rewrite the rules.
And, obviously, we do not stop here, since we may use those results to implement a suggestion engine or another of those extremely ingenious artifacts that affect our life online, pushing the complexity we are managing to new highs.

If you consider that all of this derives from the simplest raw material, an order or an invoice, we can see how growing complexity is the inevitable bedfellow of the BI/DW discipline.

Today, two kinds of reaction to complexity are emerging.

In some organizations complexity is simply ignored. Management practices like lean advocate this line of thought. It is intellectually easy and reassuring to think that "we are focusing on fundamentals", but there is probably a "long tail" of advantages being overlooked just because the effort to master complexity seems too much to tackle thoroughly.

The other reaction is to let complexity permeate the organization, letting individual information workers deal with it by themselves or within their small groups, on the theory that "people know" what to do if they are close enough to the issues. Unfortunately, this just adds a new, extremely complex and rather uncontrollable factor to the complexity: human judgment. Which is fine, as long as it is not used to duplicate analyses, produce slightly different versions of the same thing, set local standards, confuse numbers, and so on.

I think we should not be surprised: these are human reactions and, as such, are likely to take hold. It is of little use to point out that a more coherent approach would be to adopt the right tools and policies to implement and govern complexity.

This is what I wanted to say and I'd be delighted to have someone elaborate on this, like finding practical ways of calculating C and R. However, just let me know your opinion!



We Live in Exciting Times for Everything Data!

Do we? Of course we do! I have been in BI and data management for about 20 years now, and I have never witnessed a time with such excitement in the market.

Not only can we do what we used to do with traditional small data much better and much faster, we have also introduced the new categories of Big Data and Data Science. To tell the truth, they aren't really new, but the costs involved in dealing with them have plummeted thanks to technological and cultural progress, which is making them increasingly common.

Sometimes I have the sensation of firefighting, trying to absorb and master as quickly as possible all the new stuff being thrown at the market. This is a difficult but also highly enjoyable process; it is fun to learn new things!

However, I often come across people who are not excited at all; people who would prefer that all this progress were much slower or just went away. I have to admit, if I put myself in their shoes, they have plenty of good reasons to be exasperated.

Who are these people? Those who pay the bills; the CIOs, CEOs and other executives involved.

In today's confused and ever-shifting scenario, it is difficult to commit to a long-term investment. We are facing the same level of complexity and cost that we faced in the past when choosing an ERP: large investments that will take years to pay off and will orient the entire technology landscape of the organization.

Let me say this immediately: I wish there were a simple way through it. There is none, at least if you are an unbiased market observer. What we can do is try to table some considerations that may help.


So, let's shift our point of view for a moment and look at the data market through the eyes of, say, the marketing director of a medium-sized retailer. She will be more and more conscious of the potential of big data and data science to engage customers and, ultimately, sell more. She has already conducted many exploratory projects that returned good results, up to the point of realizing that this data-driven approach must be engineered and embedded in the marketing department's operations. She goes to the CIO asking for a new handful of the same stuff she has already asked for, and the problems begin.

"Well, we did the customer clustering on a one off set of data, today it will be already changed, and we used a tool named Mahout on top of Hadoop, which is good but very complicated. Today we have some other stuff, for example a thing called Spark might do better but we haven't tried it yet."

"The sentiment analysis, yes that was cool, but it is based on a new Machine Learning cloud technology from MS. It is a bit expensive and unpractical, at least by now. But we could always redo the thing on a different technology. Also that one was just Twitter, but you need also Facebook, Instagram, some blogs etc. sure. That has to be implemented."

"You hired a half dozen data scientists and I know they are amazing. The level of insight they are producing is amazing too. They are working mainly with R. However, I'd need to buy them computers with more RAM because they tell me they can't work with the largest datasets. We also really do not know how to save their results in the data warehouse so they are readily available."

"I know you would like the output of the predictive analytics in our reports. I understand that having the budget compared with actual and prediction for the end of the period is an insightful piece of information, but we are using a client tool to do the predictions and we have yet to understand how to integrate it in the morning data build".

And so on.

Today we are in the middle of a data gold rush, where evolution is constant. Traditional players (Microsoft, Oracle, IBM, SAP) are bringing their solutions to the market. New players are becoming big players (Cloudera, Hortonworks, Tableau). Every day a new data-centered startup makes the headlines. A manager who has to make a decision about the technologies to rely on for the next 5 to 10 years is in deep trouble.

At this point you would expect me to make a forecast. In a sense I will, but I think it is more important to establish a framework to evaluate how these new technologies are going to stand the test of time.

The first aspect to consider is whether the vendor has a clear public roadmap and a tradition of mostly sticking to it. SAP is probably the most reliable in this area: they generally plan many years ahead and steer gently. Oracle and IBM, to a lesser extent, behave similarly. Pretty much all the traditional big players are quite dependable on their roadmaps.

Other players, however, conceal their lack of vision and planning by terming their behavior "Dynamic" or "Market Oriented". They are less reliable, and you may well expect their technology stack to get old fast, forcing adopters to take difficult decisions.

Another key aspect is the endorsement of a technology by different players. There is little doubt that Hadoop is here to stay: too many players are now working on it or making it part of their offering. The picture is different for all the Byzantine stacks built on top of it: some are de facto standards (Pig, Hive, Sqoop...), others are just bets, or stunts, trying to fill a market niche. Here, too, you can tell the difference between seriously developed complements built to tackle a market requirement (Spark, Impala) and components that nobody felt the need for (why should I need a thing like Kafka? OK, OK, I know, I know, it is just for fun...).

Finally, the direction the market is heading in is an important indicator. It is a bit complex to keep up with the surveys and inquiries that feel the pulse of the market, and many of them may be affected by the "cool" factor around a technology. Others may not be completely unbiased, and those are of little value. A smart way to keep track of where the market is heading is to analyze the stream of job offerings and the technologies mentioned in them. Being actual job openings, they show in "real time" what other organizations are doing.
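A minimal sketch of that job-ad analysis follows; the postings and the technology list below are made up for illustration, and a real run would use actual job-board data:

```python
from collections import Counter

# Hypothetical sample of job-ad texts; in practice these would be
# scraped or exported from a job board.
postings = [
    "Data engineer: Hadoop, Spark, Hive experience required",
    "BI developer: SQL, Tableau, data warehouse modelling",
    "Analytics lead: Spark, Kafka, cloud data platforms",
]

technologies = ["hadoop", "spark", "hive", "tableau", "kafka", "sql"]

# Count in how many open positions each technology is mentioned.
counts = Counter()
for ad in postings:
    text = ad.lower()
    for tech in technologies:
        if tech in text:
            counts[tech] += 1

# Rank technologies by how often they appear in open positions.
for tech, n in counts.most_common():
    print(tech, n)
```

Run periodically over a real feed, the ranking gives a rough, near-real-time picture of which technologies organizations are actually hiring for.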

So, there are ways to find a path through this forest of names and technologies and make a choice that will stand the test of time. Waiting for the market to settle is not an option, because the risk is being put on the back foot by competitors who are already doing something. So, best of luck with your choices.

After all, we are paid for this, aren't we?



Too Much


I think the BI world needs a serious shake-up, because I am seeing a well-established trend that is now going entirely against the interest of the organizations being served.



  • Self-service tools like Tableau or Spotfire are becoming more and more popular, despite not being based on a strong, controllable semantic model.
  • I see business users trying to learn how to model in DAX or the M language.
  • Tools like Alteryx and Lavastorm put ETL in the hands of the analysts.
  • Data scientists join the business, ask for flat files and consider it normal to spend 90% of their time cleansing data.
  • Tools that have never been scrutinized by experts are being used by the business and introduced as a fait accompli.

In general, the analytical functions are moving more and more into business areas, replacing the services offered by traditional specialists. Data manipulation and information management are being popularized and dispersed, as if handling raw data were something the average knowledge worker could master. This is happening because traditional IT-managed solutions are not supposed to give users the freedom to pursue their ideas and do not adapt fast enough to changing business conditions.

I wonder if I am the only one who is seeing that this trend has made too much headway in corporate culture.

The deep flaw in this "modern" approach, which should be blatantly visible to anyone, is that we are now building very low-efficiency information management systems.

Try an exercise. Use activity-based costing to evaluate the cost of managing information in a system like that and compare it with a traditional solution. Compare the cost of, say, 200 analysts and data scientists in a "modern" solution with 20 analysts/DS plus 12-15 professionals in a classical configuration (the professionals are going to be there anyway), and then calculate the difference.
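As a toy sketch of that exercise, here is the comparison in a few lines of Python; every headcount and salary figure below is invented purely for illustration, and real numbers would come from your own activity data:

```python
# Toy activity-based costing comparison. All figures are hypothetical.

def yearly_cost(roles):
    """Sum of headcount * fully loaded yearly cost per role."""
    return sum(headcount * cost for headcount, cost in roles)

# "Modern" self-service setup: 200 analysts/data scientists,
# each handling their own data management.
modern = yearly_cost([(200, 90_000)])

# Classical configuration: 20 analysts/DS served by ~14
# BI professionals (who, as noted, are there anyway).
classical = yearly_cost([(20, 90_000), (14, 100_000)])

print(f"modern: {modern:,}")
print(f"classical: {classical:,}")
print(f"ratio: {modern / classical:.1f}")
```

The exact ratio depends entirely on the assumptions; the point is only that the comparison is straightforward to set up once activity data is available.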

So far, I have done a few of those exercises, and the answer was never well received by the people in charge of the money. To avoid some minor complications connected with IT operations and project management, organizations are throwing away enormous quantities of resources, burning full-time equivalents in the attempt to manage the way information is consumed. A "modern" arrangement may cost three times as much as a traditional one but, as long as the money comes from different budgets within the organization, everything looks fine.

In addition to the costs, we also have all the traditional issues: duplication of effort, lack of a single version of the truth, processes out of control, decisions made on uncertified or uncertifiable data, wrong data disclosed with legal issues arising, and so on.

The counterargument is, of course, that the issues connected with a centralized IT approach are still there, and the higher cost is just the cost of sorting out those problems.

The point is that it is not a law of physics that IT must be slow to react and tie the users' hands. In my experience it is just an organizational problem, and a quite simple one indeed.
In a typical BI unit you have developers, analysts, BAs, PMs, architects and so on.
Instead of fragmenting the functions like this, try having individuals who are embedded in user groups (I call them BI Officers) and hold the responsibility and full authority, from source system extraction through the databases to the last of the reports, to keep those users happy, and you will see the difference.
IT is slow because plenty of people each have to do their small piece of the job to satisfy a request, and an entire management system is in place to coordinate these efforts.
Put everything in the hands of a single skilled professional (possibly with an assistant or two) who has full authority to take decisions on the spot without consulting anyone and, as if by magic, all these issues disappear.
It is a bit difficult to find such a professional, but only because the market goes in another direction. Obviously you will need a manager to coordinate the officers' work and someone to keep servers and databases healthy, but that's all.

However, even those in the BI industry are starting to ask themselves questions. I had the opportunity to sit in the same room with a founder of one of the most successful of these "modern" vendors and he privately admitted that they are rushing to add enterprise features: these issues are starting to emerge, bids are being lost for lack of a solid enterprise architecture, and this is limiting growth.

So, after all, I am not going against the common wisdom without reason...


Being short in life is not advisable; there are statistics saying that your paycheck correlates more strongly with your height than with your studies and experience. Unluckily, it is one of the things you can do very little about if fate was not kind to you.

On the other hand, having a "short" data warehouse may help a lot.

In the classic data warehouse paradigm, an integral part of the process is to extract and transform the data to populate structures adequate for querying.

It is often reported that building and maintaining the ETL process is the most resource-intensive technical task among those necessary to build a BI environment. The flip side: this also means it is the process where the biggest efficiency gains can be achieved.

Many vendors over the last three decades have supported this view. The wave started with ETL tools that (almost) avoided writing code, followed by tools that let you design a data warehouse conceptually while the tool takes care of all the rest. ERP vendors all provide extractors or other solutions to simplify the ETL process.

This paradigm is losing some relevance with the appearance of in-memory solutions, but it remains one of the focal points for DW professionals.

So, what is the most cost-effective approach to ETL? What is the most efficient initiative to reduce the amount of resources devoted to ETL? 

Simple, keep it short.

There are plenty of ways to measure software complexity; just try googling it and you will receive pages and pages of academic articles. For our purpose, let's just consider the number of objects (database fields) involved in the process, multiplied by the number of times they are invoked in the transformations. It is a very simple measure, only roughly correlated with actual complexity, but good enough to make the point.
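One way to read that measure in code; the pipeline below, with its field names and invocation counts, is entirely hypothetical:

```python
# One reading of the rough measure: every database field involved,
# weighted by how many transformation steps touch it.

def etl_complexity(field_invocations):
    """Sum of invocation counts over all fields in the process."""
    return sum(field_invocations.values())

# Hypothetical pipeline: field -> number of transformation steps
# that read or write it.
pipeline = {
    "invoice_id": 4,    # keys tend to be touched at every step
    "customer_id": 3,
    "net_amount": 2,
    "discount_pct": 2,
    "order_date": 1,
}

print(len(pipeline), "fields, complexity =", etl_complexity(pipeline))
```

Shortening the ETL, in this reading, means driving down the invocation counts rather than the number of fields, which the business largely dictates.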

The set of objects required to correctly model a business is pretty much given. You may implement your marts in steps or deliberately starve the business, but what is needed for the business to work is generally well identifiable. There is not much to do on this side. On the other hand, the transformations you may need to turn your raw data into something consumable by a front-end application can vary greatly.
Building these transformations is the most expensive stage, the one that introduces the most complexity, consumes the most time and inherently creates the lineage problem.

There are two approaches to simplifying the transformations.

The most common raw data for a classical data warehouse are facts derived from documents: invoices, orders, bills of lading, warehouse picking notices, etc. There are not many ways to model these documents: they all feature a header/footer, with some attributes and quantities defined at that level, and a body that generally contains rows, with other attributes and measures.
Mind that representing the documents in this format is a sort of "natural modeling", since it embeds in data structures the way they are perceived and "lived" by the business.
The difficulty arises when, in the source system, they are not modeled as a header/footer logical construct but are dispersed across different tables, sourced from logs or queues, etc. In this case, the integrator's job is really uphill. Obviously there is no technical shortcut but, on the project management side, the DW/BI team may ask the people responsible for those applications to provide the header/footer structures themselves. This is generally advisable, since the people managing the applications are also those who know best how to extract and transform the data, and it is easier and less resource-intensive for them than for a team without that specific domain knowledge.
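That header/body "natural model" can be sketched with plain data structures; the field names below are illustrative, not a standard:

```python
from dataclasses import dataclass, field
from typing import List

# Header/body structure common to invoices, orders and picking notes:
# attributes and totals at header level, detail rows in the body.

@dataclass
class DocumentRow:
    product_code: str
    quantity: float
    net_amount: float

@dataclass
class DocumentHeader:
    document_id: str
    customer_id: str
    doc_date: str
    rows: List[DocumentRow] = field(default_factory=list)

    def total(self) -> float:
        """Footer-level total derived from the body rows."""
        return sum(r.net_amount for r in self.rows)

inv = DocumentHeader("INV-001", "C42", "2016-03-01")
inv.rows.append(DocumentRow("SKU-1", 2, 19.90))
inv.rows.append(DocumentRow("SKU-2", 1, 5.00))
print(inv.total())
```

When the source team delivers data already shaped like this, the integrator's transformation work shrinks to mapping, not reconstruction.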

The flip side of the coin is that you will immediately need a process contract with another team and a technical contract among different applications and platforms. The contract details are often overlooked: what to do when a new field comes along, what is the table update policy, who is in charge of signing off updates? Once again, rather than technical wizardry, what we really need is a solid process to ensure a constant flow of reliable data.

At the other end of the DW process there is another approach that may greatly simplify ETL and the need for integration. Some large BI suites feature a back-end layer including a data federation engine and a semantic layer.
With data federation, the consumer can be shown a single data structure that integrates physically separated data as if it had been copied, joined, cleaned and saved to a table. It is generally easier and less expensive, in terms of development, maintenance and computing resources, to federate two tables than to maintain the corresponding ETL process.
This doesn't mean that data federation is easy and ETL is hard, but the balance in most cases tips in favor of federation, especially when it is an out-of-the-box feature, as it is for some BI suites.
The semantic layer, on its side, embeds rules and descriptions that let the ETL process stop short of some final operations, thus becoming simpler. A semantic layer may also include complex calculations, relieving the ETL process of the time and resources necessary to execute them. The calculation burden moves to runtime and is incurred only when necessary; a semantic layer virtue that is often overlooked when choosing which BI suite to use in an organization.
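As a toy illustration of the federation idea, using sqlite3 from the Python standard library as a stand-in for a federation engine (all table and column names are invented): two physically separate databases are exposed to the consumer as a single view, with no copy step and no ETL job:

```python
import sqlite3

# Main connection holds the sales data.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (product_id INT, amount REAL)")
con.executemany("INSERT INTO sales VALUES (?, ?)",
                [(1, 10.0), (2, 7.5), (1, 2.5)])

# A second, separate database is attached rather than copied;
# in a real federation engine this would be a remote source.
con.execute("ATTACH DATABASE ':memory:' AS ref")
con.execute("CREATE TABLE ref.products (product_id INT, name TEXT)")
con.executemany("INSERT INTO ref.products VALUES (?, ?)",
                [(1, "widget"), (2, "gadget")])

# The consumer sees one integrated structure; nothing was moved.
# (A TEMP view is used because it may reference attached databases.)
con.execute("""
    CREATE TEMP VIEW sales_by_product AS
    SELECT p.name, SUM(s.amount) AS total
    FROM sales s JOIN ref.products p ON p.product_id = s.product_id
    GROUP BY p.name
""")

for row in con.execute("SELECT * FROM sales_by_product ORDER BY name"):
    print(row)
```

The trade-off, as with any federation, is that the join now runs at query time against the live sources instead of once in a batch.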

We can summarize the two points above in a simple rule: your ETL process must be "short", relying on pre-digested input from the data producers on one side, and on smart tools to consume the data on the other.

So, sometimes, being short in life is not that bad, after all.

See ya.







I am borrowing Mel Gibson's film title to critically examine the mystique of collecting user requirements. While this practice makes perfect sense in transaction system design, and as an overarching framework for your data management initiative, in BI there are some interesting twists.

It is considered a best practice; every book on methodologies takes for granted that this is the place to start. Even data scientists suppose that they should start their efforts from a specific business question. This approach is widely used because it is considered the only sensible one: risks on the user side are minimized and the BI professional has a comfortable, stable point to start from.

However, it is widely accepted that, when it comes to BI and DW, it is difficult to establish clear requirements. An old adage says that projects are "done once to get the answers required by the users, and a second time to give them what they really need". This aspect is widely discussed in the literature.

The reason for this is quite simple to pinpoint. Designing a transaction system means translating into an IT system a process that already exists and can be described in detail by the users working in it. There is a large, but finite, number of ways to manage invoicing; with a good dose of patience, you can describe them all or, at least, all those of interest for a project. In BI, user requirements describe report layouts and the business rules to calculate measures; but as soon as you have built them, they start generating new questions that require new reports, or entirely new data structures, to be answered.
An invoice, when it is done, is done; a report is never definitively done. Transaction systems mirror processes and actions; BI systems mirror, and influence, mental business maps, a more elusive substance.
This issue is often addressed by some flavor of agile methodology, where the work is broken down in very small bits, and it is done in close connection with the users. These methodologies generally work and those who do not know better are generally satisfied. Obviously they shouldn't.

As already explained in other posts, BI is about creating a business model that can be used to do analysis and, like every model, to make predictions.

If we start from this point, the logical consequence is that we have to create data structures that represent the fundamental facts and master data of the business we are targeting. These should exist in an easily consumable form, independently of users' requests. There is no point in modeling invoices and leaving out the invoice operator field just because nobody asked for it. Obviously, this approach should not be taken to extremes; there is no point either in modeling the annual leave request modules if the project is focused on sales.
From a more general perspective, the overall architecture should be rich enough to sustain the scrutiny of the users and bring them real insight. And, mind, users won't ask for that the first or the second time; it is the BI professional who must be experienced enough to understand what will likely be necessary and include it from the beginning. There are specific architectures, relational or not, designed to achieve this result (I am not covering them here; that is food for another, scientific-paper-like post), where the need for changes and updates is minimized and generally arises only when the business changes substantially.
All of this may seem like a revival of the old monolithic DW architectures that used to fail in the past and that we are all happy to have left behind. Obviously, this is not the case. Projects should be organized "vertically", that is, they should be aimed at providing something to a group of users: domestic sales, international sales, logistics, finance etc. DW structures should be generic and not strictly tied to the original systems they are sourced from; they should mirror the business model that, in turn, reflects the business mental image in the heads of the users.

Structures of this kind can, and should, be designed with minimal input from the users, whose vision will guide the BI professional in defining the scope of the project and the kind of answers that are going to be provided but, critically, NOT the details. The purpose is thus to cover an entire area, not to answer a precise business question.

The obvious objection to this approach is: "If the users are happy with something, why shouldn't I provide exactly that?" The simple answer is that the users think they will be happy, but they won't. In more formal terms, it is far less resource intensive, hence cheaper, to design the right architecture the first time, something that will require limited servicing, than to go over the same points again and again to accommodate a never-ending stream of further requests and refinements.
Doing this is obviously difficult; it requires a lot of foresight that comes only with experience. Ideally, the BI professional should know the users' job, from the point of view of the information involved, almost as well as the users themselves.
Acquiring this knowledge should be at the heart of a BI professional's growth effort because, in terms of effectiveness, it is much more important than learning yet another SQL dialect.

I am doing my best to be controversial and obnoxious; please let me know how successful I am!




Engineering The Information Flow

The Role of User Generated  Content

In many organizations, even those with a developed BI landscape, the notion of engineering the information flow to support processes and decisions is not entirely understood.

What I am going to describe in this post is how processes stimulate the growth of an organization's information assets, and the healthy life cycle to manage them.

How the information process is born

It is often thought that the process of providing information through BI is quite straightforward: users have requirements and BI professionals/analysts satisfy them. Since every answer, in this environment, is just the beginning of another question, the loop starts again. In this way we have simply moved the point of use of BI closer to the users than it was when it sat with IT. Self-service BI tools, now widely in use, reduce the number of cycles, but only for simple and fairly standard queries.

In a more modern and structured environment, though, every group of users is provided with a set of reports, datasets, dashboards or interactive reports (and any other form of consumption you can think of) that covers the entire breadth of the available information pertaining to that group. Users will choose the information they need from a comprehensive "conceptual" menu. This is obviously the environment where self-service analysis can thrive and prosper.


As the organization's processes evolve, the information process supporting them evolves too.
In a structured environment, users will start from the information available, "build" something, "hack" together something else, and add some user-generated data to get results.

For example, a marketing department may produce, on the basis of available customer data, a model to increase sales in a price-competitive market without denting margins, by pushing the customers toward a different purchasing mix.
At the beginning this model will be "on paper" and will not be integrated with any system. It will be put into practice with ad-hoc operations and tested for effectiveness.
Once its effectiveness is confirmed, it will become part of standard procedures, part of a process or a process of its own. At this point, the model must be engineered into the information flow to actively support its development. Following the same example, the model might require the eligible customers to be e-mailed automatically, and the e-mail campaign results should be made available to the BI users.
What was calculated and reported once on an ad-hoc basis now needs to automatically generate customer attributes to be mapped in semantic layers and product discount schemes available for analysis in standard tools, and to feed those data to the transactional systems that operate the scheme.
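
To make this concrete, here is a minimal Python sketch of that engineering step; the scoring rule, the threshold and the attribute names (`mix_score`, `eligible_flag`) are all invented for the illustration, not taken from any real system.

```python
THRESHOLD = 0.6  # hypothetical cut-off agreed with marketing

def score_customer(purchases):
    """Toy stand-in for the marketing model: share of high-margin items bought."""
    if not purchases:
        return 0.0
    high_margin = sum(1 for p in purchases if p["margin"] >= 0.3)
    return high_margin / len(purchases)

def derive_eligibility(customers):
    """Produce the attribute rows to be loaded into the DW / semantic layer."""
    rows = []
    for cust_id, purchases in customers.items():
        score = score_customer(purchases)
        rows.append({
            "customer_id": cust_id,
            "mix_score": round(score, 2),
            "eligible_flag": "Y" if score >= THRESHOLD else "N",
        })
    return rows

# Invented sample data standing in for the customer purchase history
customers = {
    "C1": [{"margin": 0.35}, {"margin": 0.10}],
    "C2": [{"margin": 0.40}, {"margin": 0.45}, {"margin": 0.50}],
}
print(derive_eligibility(customers))
```

Once a job like this runs in the main flow, the flag is a regular attribute anyone can slice and filter on, rather than a number living in one analyst's spreadsheet.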

The cost of not engineering the scheme (notice that, according to the traditional view, it would not have been engineered until the users explicitly asked for it) would be an increase in manual operations, information entropy and the headcount to manage it.

What is described here is a healthy cycle where the business initiates a class of analysis, which is then engineered and implemented in the main flow and becomes part of the organization's information assets.


The enhanced information assets are now part of the information flow and their benefit grows beyond the purpose they were created for. They may become part of the factors composing other departments' critical information assets or high-level business views.
Sticking with the previous example, the execution of the initiative may be immediately known to the shipping department, which is downstream in the cycle and will be required to fulfill a different order type.
The same information may become part of the revenue drivers analysis, which is, generally, a finance and executive domain.

The Paradigm Shift

This probably doesn't look particularly revolutionary, after all it is basically good sense. However, there is a paradigm shift hidden in the approach described above.

In the classic approach, information sits somewhere until someone thinks it may be useful. In the approach described here, the information provided to users is all the information they need and ask for, plus all the information they will likely need in the future. It is part of the BI professional's job to identify which information might become useful in the future, and for which users.

Part of the extra information is surely going to be derived from the classical sources but, as the complexity of the overall system increases, a growing amount will be the result of engineering users' analyses. Every addition provides deeper insight and more business-specific decision-making elements.

As a consequence, BI systems tend to start with simple business models that shape the information content, and evolve toward increasing complexity and, potentially, a better decision-making instrument.

See you next time!

Who is the project co-manager?


In this post we go back to describing the professionals who live around a BI project. My ambition is to be of help to all those who actually have a project to bring over the line.

Under the term co-manager we actually classify two different characters, depending on whether we are in a client-consultant setting or the project is internal to an end-user company. I purposely do not use the terms “client side project manager” or “business side project manager” because there can be only one project manager/leader; the weight of command must never be a shared burden.

In a project run by a consulting firm for a customer, the co-manager is the client liaison officer. She is the official contact point with the customer; every query and every communication must pass through the co-manager unless a specific different arrangement is in place. A consulting project should not even start without a well-defined point of contact; its absence guarantees that the project will soon head in some pointless direction, overstating some goals and understating others equally important. At the root lies the fact that a BI project is a business project that requires direct involvement from business people. The project manager usually does not have any direct authority over those people and must not appear to be begging for their help. Often the co-manager does not have the authority to take all the project decisions, but she must have enough knowledge of the organization and a personal contact network to identify, for every specific case, the correct course of action.

So, the desired characteristics for a co-manager in a client-consultant arrangement are:

* Personal empathy between the project leader and the co-manager

* Good general knowledge of the issues connected with a BI project

* Excellent knowledge of the organization affected by the project

* Good network of interpersonal relations within the organization

Her principal tasks will be:

* Arranging and coordinating all the project activities involving people from the customer

* Managing the communication between the consulting team and the client at every level

* Managing the project's internal communication, expectations and change, with the help of the consulting team.

For an internal project, though, we will likely have a different arrangement. The project manager/leader will likely be the company's BI manager, and the co-manager may be either:

* The manager whose unit will benefit the most from the project

* A professional outside the hierarchy who is tasked to act as a coordinator

The first case is the easier to deal with. The relation between the project manager and the co-manager will resemble the relation with a customer. Since the co-manager has full authority over the business resources involved, if the two succeed in having a good personal interaction, few problems are to be expected. If the co-manager is some sort of professional actually outside of the hierarchy, the situation falls back on an arrangement similar to the one described above for consulting.
In the meanders of a company's organization chart you can sometimes find people of exceptional value in low-ranking positions. This is due to the fact that, broadly speaking, networking and affiliation matter much more to a career than competence. These people, though, make ideal co-managers because they are generally well known and esteemed, but are low enough in the ranks not to be seen as a potential menace by decision makers.

An experienced project manager does not spare any effort to set up the best cooperation possible with the co-manager, who may become a terrific asset.

See You!

The End of the Year

It is here again!

This was a good year from different angles, and I really needed it.

From the blog's perspective, I thank, from the depths of my heart, all my readers; you are the reason why I keep writing.

I also need to thank some people who wrote some very kind words about my blog.

ng data featured my blog among the top BI blogs, way too kind!

BI Software Insights classifies UpStream as one of the most influential BI blogs (if only it were true ...)

My old friend Peter James Thomas kindly mentioned me again on his blog.

Finally, my old BBBT friends, whom I will never be thankful enough to, now feature my content.

Have a Merry Christmas and a Happy New Year, my friends.

See you soon!


On the BI Platform, Again ...


I am working again on the theme of the BI platform, since in my view it is becoming increasingly important. This time I am being a bit more direct.

I was recently at a conference where one of the new and rampant names in the BI landscape was demoing its product. What struck me was one of the initial slides, where the "Pendulum of BI" analogy was described.

In the beginning (just the end of the '90s) there were many different BI vendors. Then, during the '00s, the pendulum swung toward coalescence into a few large platforms harnessing all the elements of BI. Now the pendulum is swinging back again under the push of some innovative newcomers because, and this is the key element, "big and complex suites ultimately do not deliver what users want".

Well, what do users want, then, that is not delivered by the heavyweights?

  • Big platforms free users from SQL and MDX through their semantic layers, while more "user oriented" tools always start from some sort of query (or from an Excel worksheet, which requires a query anyway).
  • Big platforms can federate different data sources, while tools just integrate them client side.
  • Big platforms integrate the complex security required by an enterprise environment in a (relatively) easy management environment, while tools often rely on database security.
  • Big platforms manage the way information is distributed through scheduling and profiling engines, integrated with security and event driven, while this is not always the case with some of the modern tools.
  • Big platforms let the workload be distributed among different servers or users' workstations as the system designer sees fit; tools are end-user tools.
  • Big platforms have more than one way to interact with users; tools have just one client.

I could go on but, is there anything in the list above that users do not want? Is a coherent and integrated environment something that users do not want? Not in my opinion.

What users do not want is a badly designed integrated system that does not make use of all the right features. They do not want slow updates or rigid rules that force them to go back to IT for trivial tasks.

They want official data, certified by IT, to be combined with other data on the fly on their clients.

Some of the "modern" tools really have shifted the paradigm in terms of pure usability and the capacity for data discovery and analysis. They produce a refined result faster than traditional clients, but they do so at the price of losing control and coherence in the company's data landscape. Even though the marketing message does not stress the point, all the runners-up in the BI market are rushing to add enterprise features, because they are well aware that this is what they need to conquer some of the big contracts with large organizations.

I will come back to this subject in the future, because in my view it is often overlooked. In the meanwhile, let's open the discussion!


When Change is Relished

There is a huge literature on change management, but the bulk of it assumes that change is hard to accept and is being opposed. Change embraced with too much enthusiasm is a rarely covered issue, but it may cause some headaches as well. Let's try to partially fill the gap for BI.

Business does not always oppose change; sometimes it relishes it. When the business relishes change too much, this is an issue that must be addressed. Some would think that a business eager to embrace the changes brought along by a new BI system would be a favorable condition for a project but, indeed, it may create issues as nasty as those generated by an impenetrable audience.

These issues fall mainly into three categories.

Inflated or unrealistic expectations

If the business expects the system to cover more areas than anticipated, or expects it to fix more issues than those it is designed to fix, even a perfectly working and well-performing system won't match their expectations. This will start conversations about the effective ROI of the initiative and, potentially, lead to cuts or cancellations.

An example is a sales datamart that is expected to cover marketing as well, but does not contain typical marketing information like customer segmentation or demographic data. In this case it was not made clear what was covered in the datamart and how/when the marketing issues were going to be covered.

To avoid this kind of issue, it is necessary to:

  • Clearly communicate the project's planned progression, possibly with planned delivery dates. Make clear that the delivery is not going to be a big bang but just the start of a process that will lead to the point where the old practices are no longer required.
  • Illustrate the reasons why some areas are covered before others. This decision is generally taken while setting up the project and must be well motivated and shared. There must be an owner for this decision, who can vary the progression under different business conditions as priorities shift.
  • Give a real look and feel of how the system will look when the new information is made available. Continuing the previous example, marketing will be shown where the new customer data will appear and how they are going to use them ("look, your customer segments will appear among these other attributes and you will be able to slice/dice/filter/group etc. as you do now with other attributes").

Improper system use

There are various possible improper uses of a BI system. Basically, for every improper use you may try to list, enthusiastic users are going to find a new one that was never thought of before! At the root, however, there is often a misunderstanding about the information being provided.


BI data may be used to validate transactional system data.
Example: finance gets sales data from BI and from accounting systems. They find differences and try to judge the accuracy of one from the other. While a well-designed system has known rules identifying the transformations required to move from one to the other, these may not be well known to the users, or they may be too technical to actually be available in the front-end applications.
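
To give an idea of what such a "known rule" may look like, here is a toy sketch in Python; the adjustment items (intercompany sales, a returns accrual) and all the figures are purely hypothetical.

```python
# Illustrative reconciliation between a BI sales figure and the accounting one.
bi_sales = 1000.0          # net sales as shown in the BI front-end
intercompany = 120.0       # excluded from BI, included in accounting
returns_accrual = -30.0    # accrued in accounting, not yet loaded into BI

# The documented rule: accounting figure = BI figure + known adjustments
expected_accounting = bi_sales + intercompany + returns_accrual
accounting_sales = 1090.0  # figure coming from the accounting system

# Any residual difference is a genuine data quality issue, not a judgment call
assert abs(expected_accounting - accounting_sales) < 0.01, "unexplained gap"
print(f"reconciled: {expected_accounting}")
```

When rules like this are explicit and surfaced in the front end, "which number is right?" stops being a matter of opinion.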


BI data used to feed transactional data.

While the data warehouse may be the place where some data exclusively live (for example, market data acquired from a third party), in general it is not to be used to complement missing transactional data. Example: clients are up-sold from one channel to another, and a different CRM system is used for the new channel. Demographic customer data should be acquired again and eventually updated in the DW, rather than being copied from the DW to fill a mandatory field in the CRM system. Do it the other way around, and you end up with no clear authority among the systems.

Real life: the author has been involved in setting up the budgeting/planning process in a large company in the food sector. One of the planned measures was the returns value. Unluckily, owing to a poorly coordinated IT decision, the returns value, in a form suitable for planning, turned out to be unavailable in some source systems. The decision was to use the budget to fill the missing actuals. After that, everybody was happy about having met the budget so precisely ...


BI system features misused or ignored.

When the advantages of the new system become evident, some users may set out to replicate by hand advanced features they do not know about or that are not yet implemented.
Example: in an environment where Excel report files used to be e-mailed manually to recipients, the same thing is done by saving the new system's reports to Excel instead of using a proper scheduling/broadcasting feature.


All these cases produce inefficiencies and potential errors. These may affect trust in the new system, even though they are, ultimately, user mistakes. They are often difficult to track and fix because they are not surfaced by the users. Part of the activities required to assist the start-up must be aimed at verifying that there are no misunderstandings about how to use the systems, and that the context in which the information provided is valid is well known.

When one of these occurrences is spotted, unluckily there is no alternative to a painful direct correction of the issue.

Disconnected initiatives

On the wings of the enthusiasm for the new system, groups of users who have been kept at the side of the process may spawn personal initiatives that potentially jeopardize the long-term progress of the larger BI initiative. If the entire information flow within the company has been designed upfront, and the design is appropriate, then excessive activism from the users may introduce sub-processes that are not going to be well integrated into the wider landscape.

The increasingly powerful BI clients put users in the position of "running ahead", anticipating future implementations. For example, it is tempting to cluster your customers on the client side and make the result the official clustering document, just to discover that the clusters can't be turned into attributes available to others.
Another classic case is the warehouse team who are given access to the tools, just to discover that one nightly load is not adequate for them and they need real-time access to their stock.
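
A toy sketch of the clustering pitfall, assuming an invented segmentation rule: the clusters only become useful to others once they are published back as a shared attribute table, instead of living as labels buried in one analyst's workbook.

```python
def assign_cluster(annual_revenue):
    """Invented segmentation rule standing in for a client-side clustering."""
    if annual_revenue >= 100_000:
        return "KEY"
    if annual_revenue >= 10_000:
        return "REGULAR"
    return "OCCASIONAL"

# The client-side extract the analyst is working on (hypothetical figures)
extract = {"C1": 250_000, "C2": 4_000, "C3": 55_000}

# The publishing step: a conformed attribute table, keyed by customer id,
# that other users (and the semantic layer) can join on
customer_cluster = {cid: assign_cluster(rev) for cid, rev in extract.items()}
print(customer_cluster)
```

Engineering that last step into the DW is exactly what turns an enthusiastic one-off analysis into a shared information asset.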

Depending on the initiative taken, a direct correction may be required. However, some of these initiatives may be leveraged to anticipate analysis and gather feedback. They are actually a good example of what the users want on the subject.

I have briefly covered this subject according to my experience; I am sure there is a lot more to be said. Please let me know your thoughts and your experience!

Have fun.


Where we bust a myth, avoid a pitfall and embrace a vision …

There is a lot of literature about how to design a data warehouse. There are endless discussions on the internet about the design principles to be applied. There are fierce battles over which ETL philosophy is intrinsically better.


There is nearly no debate on how the information should flow within the company.

There is nearly no discussion on how the control process should be supported by this flow.

There is nearly no discussion about business intelligence platforms.

Or better, we see such discussions in vendors’ brochures but, among professionals, they are often dismissed as marketing speak or a minor aspect, just to be thought of when you are tired of doing the real, meaty stuff.

But let’s start from the beginning …

It’s a no brainer!

At the beginning of Business Intelligence, in the '90s, you could sometimes hear discussions like this: “what's really important is the data warehouse, how you collect and harmonize your data. The tool used to consume them is not really important. You see all those DSSs built without a DW to support them? They are all being decommissioned. Data is what matters. And cubes. Lots of fat, meaty but shiny OLAP cubes, because users want to drill down into data”.
Apparently this attitude has somehow managed to survive while the entire BI landscape was evolving and shifting. After all, everything users ask for is reports, or to drill/slice/dice/filter if they are analysts. After all, when we collect user requirements we are told about reports made so and so. At the top of the weirdness are datasets to be consumed mainly in Excel. And we all work with Agile, right? Where everything you do is a user request. To give all this stuff to the user, there is not much difference among the different tools out there. One is graphically excellent, another produces very good printed reports, another works with cubes etc., but at the end of the day it is always the same stuff, isn't it? And some of them cost so much!
Yes, in the last two or three years those freaky blokes called data scientists came up, trying to do all sorts of alchemy on their laptops with tools like R or SPSS; but they just ask for flat files, nothing really new from our point of view.

The Naturally Evolving Architecture

Bill Inmon coined this term in 1992 to illustrate, in his seminal work “Building the Data Warehouse” (did you read it? Because if you didn't, stop reading me and read him!), what was going to happen if reporting was generated in the way that looked obvious at the time, by extractor programs. The final result was a maze of uncoordinated and inherently irreconcilable information living and prospering in every corner of a data-centric organization (yes, data-centric organizations are not a thing of the 2010s, they have always existed, but we will talk about this somewhere else). The lack of reliability, the inherent complexity and a truckload of other issues called for a new paradigm. That new paradigm was the Data Warehouse.

Now pretty much no one still relies on the NEA, and everyone recognizes the necessity of a DW. Apparently, though, no one has realized that the Naturally Evolving Architecture is still with us, alive and kicking; it has just shifted a bit ahead.


To understand this point, let us wear the user's shoes for a minute and sit on the sideline, watching the entire process from the outside. Let's imagine not being very technical, but knowing enough to have an informed global view.
You see the source systems, then some magic happens (well, actually the ETL process) and the DW is built every morning, then on top of it sits an OLAP cube. Then you get some data (mainly from the cube), manipulate and enrich them (usually in Excel), and distribute them to some other people by e-mail or a shared drive. As a user, I see my job as being the data extractor, enricher and distributor to the real consumers, those who actually identify courses of action upon that information.

This looks normal, doesn't it? Is it normal to do this day after day, week after week, month after month?



This is just letting things go as they naturally go, with everyone doing what she naturally thinks of doing. It isn't really that different from the time when data were collected and organized on paper.

The Comedy of Misunderstanding

BI software has evolved deeply in the last two decades (yes, I am that old), from desktop tools through the web revolution, spreading to mobile platforms; visualizations became more appealing and powerful, with more and more components, under the push of some young, visionary companies who aimed at a more personal BI. The pendulum swung from large BI suites to leaner clients. Now the focus is on consuming big data in an insightful way, with little or no overhead for the user.

This is the story that is usually told when an otherwise hectic industry stops for a moment and looks back at its past. And it is true, but it ignores the single most important feature that characterizes a BI product: its infrastructure.

The traditional large BI suites all rely on a foundation that provides a range of essential services, like user definition and security, scheduling, a safe repository, data distribution, messaging, social features. Most critically, they feature a layer of shared data access services that provides a common vision throughout the organization. This layer may include federation or other integration forms, while advanced clients leverage it for lightweight integration at user level.

So the effective deployment of one of these suites in a complex organization may change the process described before quite heavily.


With respect to the model described above, we may have deep differences.

  • For example, we may have user-controlled, automated report mass distribution.
  • Reports that remain interactive and contain data may be served to analyst users, potentially greatly alleviating query performance issues.
  • Alerting systems may reduce or eliminate the need for entire report families.
  • Dashboards reduce the need to supply executive information.
  • Data integration performed at the client level, with the possibility for users to engineer it directly, reduces the DW maintenance overload.
  • Tools that can produce data-based presentations help analysts in their “storytelling” duty.
  • Google-style interrogations give the casual user a lot of information at their fingertips.
  • Data federation may shortcut substantial chunks of the integration process.
  • Semantic layers may tap directly into source systems to provide real-time BI.
  • Reports may be turned into data sources for other reports and analyses with no IT involvement.
  • Social features may be used to foster healthy discussions on data.

An example of a well designed BI architecture

I could go on, but I think that anyone who has worked with one of the large BI suites, and most of the smaller ones, can recognize the pattern. Each one of the points above is a distinct advantage. Everything that makes the process less time- and resource-consuming adds to the bottom line. Everything that turns data into information more quickly, but still consistently, adds to the ability to respond to a changing environment.

If you have 20 people consuming information, you can do pretty much anything you want and it will not make any particular difference.

If you have 1000 people consuming information and you spare 15 minutes, every day, for each one of them, how many resources are you freeing for other uses? Well, you do the math.

Down the Feature Drain

However, the world is not a perfect place.

Many of the BI professionals who have worked with a BI suite will agree that many of these features remain unutilized or underutilized. This happens despite the potential described before, and the reasons for it are hiding in plain sight.

The BI vendors' marketing bears part of the responsibility, since it is often unable to take a stance. In the effort to talk to everyone, it ends up talking to no one. The message is always confused, as it oscillates between “you can do exactly what you want” and “look at this comprehensive panorama made of 137 perfectly integrated software modules that will revolutionize your company”. The real potential behind it is often drowned in the glare of shiny charts.

The bulk of the responsibility, though, lies with us, the BI professionals. It is much easier to focus on the basics and forget the possibilities. Maybe because creating and maintaining a DW is such a taxing effort, we find it easy to just translate user requirements into a bunch of reports, de facto replicating the NEA on the BI platform. This condition for satisfaction is also easier to add to a contract, so it appears a sensible way to go; I have been guilty of this myself, sometimes.

We are also obsessed with users’ requirements: every methodology starts with collecting user requirements, and the capacity to translate them into technical requirements is considered an important piece of know-how. Unluckily, the users do not know what they want and do not have a clue about what they really need. In BI, identifying the “what”, that is, the data that may answer a business question, is intrinsically complex enough. Specs are never 100% correct the first time; this is a given.
The issue becomes nightmarish when we let the users design the “how”. The average business user will ask for a better version of what she already has; she will fiddle with the tools already at her command to find a solution and she will ask you to help with that. This happens because it is not the users’ job to know what is available in the BI software market and, crucially, how it can be used to improve the way information is managed within the organization.
Users, questioned on what may be of help, will cover just the segment of the NEA under their responsibility and will miss the bigger picture. They will likely ask for some little features or improvements, sometimes doomed to be utterly irrelevant in the overall BI strategy. Paradoxically, what is going to emerge from an uneducated user survey is just ways to improve the NEA, making it more difficult to eradicate.

Henry Ford used to say: “If I had let the clients design my cars, I would have ended up with faster horses”. It is our duty, as BI professionals, to show the users that an entire panoply of vehicles is available out there and to harness their informed contribution to identify the best software and the best process.


Hmmm ... I know, I know, this is going to be controversial as it looks like an advertisement for the big vendors. Well, I think this is the truth according to my experience; I am happy to be disproved. Up to you!

Business Intelligence and the Business Model

The business running like a clockwork. Is it impossible?


In my previous post, I said, en passant, that the purpose of Business Intelligence is building a business model that returns the KPIs required to assess business performances and to predict how these performances may vary in response to internal decisions and external perturbations.

I had some private discussions on this subject, which I believe is central, that made me think that a clarification is required.

I am old enough to remember when, in the first Decision Support Systems, some vendors used to call their cubes "Models". At the time we were implicitly conscious that we were building mathematical models that were describing part of the business.

Even the figures in a simple report comparing sales vs target are the outcome of a mathematical definition that is built into the BI system in use. Actually, in my previous post the term Key Performance Indicator (KPI) was used quite loosely to name all the numbers that may be extracted from a business model and represent something meaningful for the purpose of controlling the business.

The output looks different from what is obtained from models in scientific research because presentation for the business is generally made simpler and more graphically appealing. The process and the tools used to implement the model are different as well, because business requires much higher ease of use and constant updates.

Nevertheless, there is a set of mathematical rules that link the raw data to the outcome consumed. I think that we can all agree with that.

The crucial point that is often missed is that the model, to be really effective, must encompass as many business processes as possible and identify the mathematical rules that link them. If we see the company as composed of linked processes, one change in one of them will bring along changes in others. The real power of BI is unleashed when I can numerically assess these dependencies. 

Since I produced 100 widgets and sold 50, my stock went up, my assets did it as well, receivables and bank account varied, the number of customers buying widgets as well, the post sale support calls went up and since I sold in Japan for the first time I had to hire some Japanese speaking staff and payroll thus changed, bringing along changes in my credit lines etc. etc.

Once these dependencies have been assessed and a sound business mathematical model has been identified, we finally get the possibility to assess the likely outcome of a management decision or the impact of a change in external conditions. Every internal project may be weighed against a sound estimate of its effects, thus reducing the amount of gut feeling involved in business management. I have been asked time and again by far-sighted managers to build something like this, and planning and budgeting applications are usually the place where the model may live. This is the reason why I keep including Performance Management in BI, as it is its natural extension.

If we compare this vision with the traditional BI view of "providing the right data to the right person at the right time", we see how the latter is really simplistic and naive.

What is the role of big data technologies in all of this? They just provide new inputs and a new processing power that may help to make the model more accurate at a cost lower than what was possible just few years ago.

What is the role of the data scientist in all of this? She is one of the assets required to design the model and to implement it in an engineered way.

The road that leads to such an integrated business vision is often long and hard. In my career, I have not had more than a couple of customers who reached a point where a consistent business model was in place. The results, however, were shining brightly. They could run their business with way less cash at hand, they cut all the business's dead branches and all the new initiatives were accurately assessed, so no more money sinks were created. This is evidence-based management at its best.

I hope that this post might spark some discussion about the means and the purpose of BI. If someone among the professionals in the BI space were to start elaborating along these lines, I would be absolutely happy.




How to Get it All Wrong Because of Big Data

Hadoop is no longer the elephant in the room


I have heard, time and again, during the past year or so, a question that may appear naïve to insiders. I have heard some managers asking themselves: “should we create a real Data Warehouse storing all our data in Hadoop, replacing the old relational databases that we are using now?”


My reply is, usually, “where are you getting this idea from?”

One step back. The aim of Business Intelligence (as usual I include also DW under this umbrella term) is to support the control process. The aim of the control process is to create a model to describe the business in numerical terms, monitor the metrics and KPIs describing the business performance and, crucially, predict the effect of business decisions and external perturbations on those performances.

Hadoop is a piece of software, usually bundled with a lot of other stuff with funny names, that is able to store any sort of data (organized in files) and execute transformations or queries on them in a fast and efficient way. It does so by parallel processing, that is, by sharing the processing burden with other Hadoop instances on different physical machines.

So, what has Hadoop to do with the business model mentioned before?

If you answered “nothing”, you are right.

Hadoop enables the model to be fed with an entire new class of data (sometimes called “dark data” but generally known as “big data”) whose volume is big enough to make processing with traditional technologies too expensive. This is going to improve the model and bring in an entire new class of performance metrics, but it is surely not going to replace its current implementation. Some of the most crucial data to manage any business are very “small” data (for example, the balances of bank accounts, numbers that practically everything has an influence on) and there is no point in storing them in Hadoop.

Since the output of the model is always going to be a numeric output, and the quantities being the model input and output are naturally expressed in tabular form, Hadoop does not offer any decisive advantage over the classic relational databases in storing them and doing the calculations required by the model.

While, in principle, it is perfectly possible to rebuild a DW in Hadoop, there is actually no reason for doing so.
In addition, at least today, there is a certain level of impracticality that makes working with Hadoop bumpier than having a relational backend.

Let’s not forget that there are some other potential solutions that compete with Hadoop in the big data space and may offer alternatives not to be ignored, like Microsoft Parallel Data Warehouse, Oracle Exadata or SAP HANA. So Hadoop is just one of the new technologies that are enriching our world, a very important one, with a great future ahead, but just one.

So, the next time you hear someone asking “Should I replace my DW with a Hadoop-based solution?”, you know what to answer.


A Single Version of Truth - Reloaded

I already wrote a post about this topic in the past, but this one is much more interesting and ambitious than the old one.

She is supposed to be single ...


It is often mentioned as one of the cornerstone benefits of Data Warehousing and Business Intelligence (DW and BI respectively), and it is always depicted as the end of a bane, a sort of data management Holy Grail: the adoption of a Single Version of Truth throughout the company.

However, sometimes, this very idea is interpreted quite naively by the business. The upper management loves to hear that the numbers are exactly these and not others, so they are often given what they want without much hesitation. The truth is: the numbers may well have been different, depending on the people and the use they were assembled for.

As an example, it is obvious that under the simple term "Sales", there are many different definitions of the same number. Sales for a salesman are the value of signed contracts, while sales for operations are the orders invoiced. For finance, sales are the balance of a specific account, while customer service is interested in returns too. They will all call sales what are actually different measures. While business people sometimes find it hard to understand this difference, it is clear that the difference exists. So, what do we mean exactly with the expression "Single Version of Truth"?

The term "Single Version of Truth" identifies the ability to decompose every result obtained from a query in terms of the results of another query.

In other terms, providing the Single Version of Truth means being able to answer the question "Why are EMEA sales for 201X here £YYY and here £YYZ?"

The Scenario

The differences among versions of what is supposed to be the same measure arise from various factors. While we stick with the sales example, our coverage may easily be extended to be as general as possible. We also suppose that we actually have a central DW to refer to.

Do not be overwhelmed, after all, if the business did not have all these issues you would not be paid to fix them.


Different systems provide the same data in different formats to the DW. A company, or a group, especially if it is the result of a recent aggregation, may have different systems managing the same process for what used to be separate entities, now brought together. The figures may come in differently from different systems. For example, one system may feature tax and net value separately while another has a single value and the tax percentage. Order header level measures (shipping expenses, discounts etc.) may be provided as special products appended to the order, or in a separate table extracted in an autonomous flow, etc. In short, the data format may be, and in general is, different.

Data may not be homogeneous from the beginning, since there are two or more radically different processes at their source. In this case they require different dimensional models to be described, with different definitions of measures.
For instance, consider a tractor manufacturer: it will feature two utterly different sales processes, new tractors and servicing. The former will deal with lead times, personalizations, credit etc., the latter with parts availability, licensed resellers etc.
Even at grassroots level, if we compare a paper invoice of the two businesses, they will look massively different, the only thing in common being the fact that they both have a header and some details with numbers.

Trying to apply the same rules to these two processes is obviously meaningless. For example, including "Returns" as a negative in the sales total has the clear purpose of identifying a commercial margin; a defective part may be returned, but a defective tractor is never returned: it is fixed for free, bearing a cost for the supplier, while replacement is an unlikely event. So the new tractor sales measure definition needs a rule to account for these occurrences and may not mention returns at all.
In short, below the same measure there are different processes.

The dimensions used to slice and dice measures may be subtly different from what the consumers expect, though having the same name and look.
For example, there may be different warehouses from which shipments are coming, but they are also part of different companies controlled by a central warehouse holding, which in turn is part of a consolidated group.
So, shipments may be associated with the physical warehouse actually shipping them or with the company of the same name running the warehouse. Consider an emergency shipment, where the manufacturing plant is instructed by a warehouse to skip the usual procedure and ship directly to the customer. It will be associated with the company but not with the physical warehouse. The finance controller will be interested in it, but the field manager will not.

Another dimension that can easily give rise to this kind of mismatch is the customer dimension. The customer to which a sale or any other transaction is associated may be of disparate kinds: individuals, family households, single companies, invoicing points with their addresses, group holdings, franchisers, shipment points with a specific address down to the receiving door, temporary company associations that buy once all together and later separately, and so on. Additionally, all the data pertaining to customers quickly grow old and, unless there is a sound MDM initiative in place to ensure that they are kept current and propagated, they become a relevant source of mismatches. Trying to view customers as a homogeneous category is often misleading and may lead to subtle confusions.
Finally, even simpler dimensions may hide differences in definitions. Let's consider for instance the Cost Center: something bought by a cost center for the use of another cost center may well be classified under either of them, depending on the type of analysis.

There is always a different bit, which is the one you want to chip away


In a more general perspective, rapidly changing Master Data, when not managed centrally, are one of the hardest challenges to providing a single version of truth.

Transactions may not be obvious in their interpretation and may derive from different systems. Invoices will provide a different kind of information than accounting entries, but both may provide sales. Besides, in every system we have some transactions that are not easy to identify. They are generally manual transactions involving dummy products and customers, issued for "miscellanea" or "other" reasons. They consume a disproportionate amount of time to make sense of, and it is hardly possible to find consensus on their processing within the DW.

Last but not least, even when the meaning of every dimension and every transaction is well defined and understood, people do actually need different numbers. Different measures, calculated in different ways, are called by the same name and generate all possible confusion. While names should be somehow adapted to the actual content, the critical question to be answered is: what makes this "Sales" number different from that "Sales" number? Knowing the answer, as stated above, is providing a Single Version of Truth.

The Logical Model

To actually be in a position to know the answer to the question stated above, we must briefly recall the nature of data warehousing itself. While the formal definition of DW may vary, all the authors agree that there is going to be a "Transformation" phase (the T in ETL). This transformation is required to obtain a physical model matching the logical model and an easy-to-query data arrangement. From these data, through further stages and further calculations, the actual content to be served to users is created. That is, we have two, ontologically different, transformation sets to be considered.

If A is a set of data, T the transformation and B the resulting set, we can express the ETL process like

T(A) = B

Notice that in general the inverse transformation may or may not exist, depending on the transformation itself, that is,

T-1(B) = A

may or may not be true.

For instance, when some sales data are aggregated by month, in general there is no inverse transformation that could give us the sales by day starting from the aggregated result set.

So, every time we run a transformation we risk losing information. If we are querying the A and B data sets separately, there may be no way to compare the two results to verify what the difference is.
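A minimal sketch of why T may have no inverse (the daily figures are invented for illustration): once daily sales are aggregated by month, many different daily datasets map to the same monthly result set, so the original detail cannot be recovered.

```python
from collections import defaultdict

def aggregate_by_month(daily_sales):
    """T: collapse (day, amount) rows into monthly totals."""
    monthly = defaultdict(float)
    for day, amount in daily_sales:
        month = day[:7]          # 'YYYY-MM-DD' -> 'YYYY-MM'
        monthly[month] += amount
    return dict(monthly)

a1 = [("2015-01-10", 100.0), ("2015-01-20", 50.0)]
a2 = [("2015-01-05", 150.0)]  # different detail, same total

# Both produce the identical result set B, so no T^-1 exists
# that could take B back to the original daily rows.
print(aggregate_by_month(a1))  # {'2015-01': 150.0}
print(aggregate_by_month(a2))  # {'2015-01': 150.0}
```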

However, the norm in the enterprise environment is to cover multiple areas with different datamarts, such that we will have this arrangement:

T1(A) = B

T2(A) = C
If we need to compare B with C, it is enough that one of the Ts features an inverse transformation, let's say T2, such that a path from C to A to B may be established. It is then possible to go from the result set C back to the original dataset A by T2-1 and then to B by applying T1.

Comparing result sets derived from user queries, on the other hand, may be split in two cases: queries that derive from different datamarts and queries that derive from the same datamart.

In the former case we can easily see that we are in the same situation as the ETL transformations, with just one more step, that is, the two transformations from the datamarts to the users.

The latter case is the more interesting. Let us suppose that we have two queries (SQL is assumed here, but the idea can be extended) which return different results. In principle, it is always possible to "morph" one query into another by adding or removing fields (objects), tables, joins (links) and conditions (where clauses or filters). There are one or more transformations that, in a finite number of steps, turn one query into the other.

In a formal notation, if the Qs are the queries, A the datamart and B, C the results, we have:

Q1(A) = B

Q2(A) = C

If Q1(A) -> Q2(A) then B -> C    (I am not using a proper limit since we are dealing with discrete variations of a script, which is not a very mathematical entity)

In this process we expect to have

B + Δ1B + Δ2B + ... = C

That is, every discrete query variation returns an associated result variation. 

If we consider the value of a single field in B, the ΔBs and C, we expect the ΔBs to be between 0 (no change for a given variation) and the aggregated value for the entire A. In the simplest case, the deltas tend to become smaller and smaller and the results converge to C.

The business people generally expect that the errors may be fixed chipping away small pieces from your numbers.  Obviously this is not always the case.


There are cases, though, when the deltas, on the contrary, suddenly vary by orders of magnitude.

This happens when there is no enforced relation between two entities. To stay in the SQL area, we are facing a fan effect or a Cartesian product. If we meet this occurrence (and, of course, we are not making a trivial mistake while querying), B is not going to converge to C.
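A minimal illustration of the fan effect in plain Python (the order and line rows are invented): joining a header-level measure onto its detail rows multiplies it, so the queried total suddenly jumps by a multiple rather than shifting by a small delta.

```python
# One order header with a 100.0 shipping fee and three order lines.
order_headers = [{"order_id": 1, "shipping": 100.0}]
order_lines = [
    {"order_id": 1, "line": 1},
    {"order_id": 1, "line": 2},
    {"order_id": 1, "line": 3},
]

# Naive join: the header measure "fans out" over every line.
joined = [
    {**line, "shipping": h["shipping"]}
    for h in order_headers
    for line in order_lines
    if line["order_id"] == h["order_id"]
]

total_shipping = sum(row["shipping"] for row in joined)
print(total_shipping)  # 300.0 -- three times the true 100.0
```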

This means that, if we want to be able to produce a single version of truth according to the meaning we have given before, every datamart should rigorously feature fact tables with data at exactly the same grain.

This point has a number of consequences on the overall DW design as a part of an enterprise wide BI solution and MIS system.

Proposal for a Real Life Model

Not surprisingly, the previous paragraph has reached a conclusion that is well known to every BI professional. The classic books on the subject that describe the star schema as the principal datamart layout prescribe a single fact table at the lowest practical grain. This has many other advantages but, as we discovered, is also essential to maintain the possibility of identifying a single version of truth.

A bit less studied, though, are the consequences on the DW design as a whole.

If every datamart models a single process at the lowest possible grain, what are the consequences for a DW where we have different datamarts at different grains and yet we have to provide data coming from more than one of them?

The issue arising is pretty well known: fanning and filtering effects may affect the results. These effects are implicit in the way SQL works, but similar effects may still be present in MDX or DAX or any other query language. Also, presentation tools often struggle to show together data at different granularities.

All the pieces must fit together. Build a model of your business, not just tables!


This is a fundamental issue that can't be fixed by a simple methodological trick. These differences are inherently embedded in the nature of the processes and they leak into the data describing them. This demands a coherent approach.

To tackle the issue we can envisage a DW with a specific modified structure. Taking inspiration from the well-known Kimball theory, we introduce, at staging level, an intermediate structure that we call "Process Mart". The Process Mart will feature:

  • Fact records at the lowest granularity possible, with a clear key field identifying each of them.
  • Fact record fields actually mirroring the quantities modeling the process.
  • Additive numeric fields: the sum of a numeric field across all the records is a number that still retains an actual meaning.
  • All the relevant dimensional and attribute fields.
  • As many key fields from other process marts as possible.
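As an illustrative sketch of the points above (all table, field and key names are invented), here is a tiny in-memory process mart at order line grain, with additive measures and key fields pointing to another process mart:

```python
# Hypothetical orders process mart: lowest grain = order line.
# Each record has its own key, dimensional attributes, additive
# measures, and a foreign key to the payments process mart.
order_process_mart = [
    {"order_line_key": "O1-1", "product": "blue widget", "region": "EMEA",
     "net_value": 80.0, "tax": 16.0, "payment_key": "P1"},
    {"order_line_key": "O1-2", "product": "red widget", "region": "EMEA",
     "net_value": 40.0, "tax": 8.0, "payment_key": "P1"},
    {"order_line_key": "O2-1", "product": "blue widget", "region": "APAC",
     "net_value": 60.0, "tax": 12.0, "payment_key": "P2"},
]

# Additivity check: summing a numeric field over all the records is
# still a number with a business meaning (total net sales, total tax).
total_net = sum(r["net_value"] for r in order_process_mart)
total_tax = sum(r["tax"] for r in order_process_mart)
print(total_net, total_tax)  # 180.0 36.0
```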

Data from every process mart will then be consolidated in datamarts, each datamart being derived from one or more process marts. For example, finance will be interested in sales and payments, while sales will be interested in sales and costs (i.e. margins), HR will be interested in people's skills but also in the cost of those people, etc.

This "consolidation" process may happen only if we can connect a process mart to another by their keys. The different granularity may be addressed in various ways, to be chosen upon the specific interests of the users and the query tools that will consume data from the datamart.

  • The datamart may be at the most aggregated granularity level, aggregating the more detailed data in a single measure.
  • The datamart may be at the least aggregated granularity level, allocating the more aggregated data on the most detailed rows.
  • The datamart may be at the least aggregated granularity level, with more detailed data being pivoted out.
  • The datamart may be at the least aggregated level with aggregated data artificially brought to the same detail by adding dummy dimensional attributes.
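As a sketch of the second option, allocation (all values and proportions are invented), an order-level discount can be spread over the order's lines in proportion to their value, bringing the aggregated data down to the detailed grain:

```python
# One order-level discount of 30.0 allocated over the order's lines
# proportionally to each line's net value.
lines = [
    {"line": 1, "net_value": 80.0},
    {"line": 2, "net_value": 40.0},
]
order_discount = 30.0

total = sum(l["net_value"] for l in lines)
for l in lines:
    l["allocated_discount"] = order_discount * l["net_value"] / total

print([l["allocated_discount"] for l in lines])  # [20.0, 10.0]
# The allocation is invertible at order level: summing the allocated
# values gives back the original 30.0.
```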

This is not the place to discuss each of the options in depth; what matters to us is their consequences on the ability to find a single version of truth. The crucial point to understand is that the process mart may be sliced in terms of the other process marts, thus preserving the traceability between the two. Operations like aggregation or allocation that would normally destroy backward traceability remain invertible in this model. I can, at any moment, identify the rows involved in the query using the key of the other process mart.

For example, let us suppose we have the orders process mart and the payments process mart. More than one order line may be covered by a single payment and a single order may have more than one payment, possibly not related to an order line. Then we have two datamarts, one including just the orders for the commercial back office and one combining orders and payments for the bank manager. Both datamarts have a "Sales" measure, but the former keeps all the details about the orders, while the latter has the payment detail and the order is just a property of the payment; no detail of the order lines is preserved.

Now, we have two queries from the two datamarts, both including EMEA sales in Q1 for blue widgets. They return different numbers; how can we explain the difference? The process requires us to:

  • Identify all the payments included in the query from the datamart.
  • Use the payments to select the orders involved in the orders+payments process mart.
  • Apply the same order selection to the orders-only datamart.
  • Apply the remaining conditions, where applicable, to the orders-only datamart.

At some point, during the process, the reason for having a difference will come out.
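The steps above can be sketched in a few lines (all record structures and keys are invented for illustration); the payment keys carried by the orders process mart are what makes the path between the two datamarts traversable:

```python
# Orders process mart rows carry the key of the payment covering
# them, so a selection on payments can be mapped back to orders.
orders_mart = [
    {"order_line": "O1-1", "region": "EMEA", "product": "blue widget",
     "sales": 80.0, "payment_key": "P1"},
    {"order_line": "O1-2", "region": "EMEA", "product": "blue widget",
     "sales": 40.0, "payment_key": None},   # not yet paid
]
payments_mart = [
    {"payment_key": "P1", "region": "EMEA", "product": "blue widget",
     "sales": 80.0},
]

# Step 1: payments included in the payments-side query.
paid_keys = {p["payment_key"] for p in payments_mart}
# Steps 2-4: select the corresponding orders and re-apply conditions.
matched = [o for o in orders_mart if o["payment_key"] in paid_keys]
unmatched = [o for o in orders_mart if o["payment_key"] not in paid_keys]

# The difference between the two "Sales" numbers is explained by the
# order lines that no payment covers yet.
print(sum(o["sales"] for o in matched))    # 80.0
print(sum(o["sales"] for o in unmatched))  # 40.0
```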


This has been a lengthy explanation trying to ground in a formal framework some of the criteria that underlie the idea of a "Single Version of Truth". This idea is integral to the business requirements that lead down the data warehouse path, and it is going to remain despite all the hype on "unstructured analysis". I hope that this post is going to help someone design better DWs and promote a conscious approach to achieving the Single Version of Truth.

Maintaining the internal coherence in the way described above is a process that stretches along the entire life of the DW, since it will be challenged by the inevitable new implementations, but now you will understand that it is well worth its complexity.

We can have reports and numbers even without a DW, but they will not be comparable and will be way less meaningful, useful and authoritative. If our DW/BI system is unable to do anything better, then why bother to have one?


Further Readings

The Ugly Truth About "One Version of the Truth", where you can have a general overview of the problem with a couple of big-name references.

Is There a Single Version of the Truth? where Robin Bloor gives a nice, customer-centric, vision of the problem.

The Myth of One Version of the Truth: an Oracle paper that goes into further detail along the same lines discussed in this article.

Why There Shouldn't Be a Single Version of Truth where Chuck Hollis gives us one of the classic articles on the subject.

Is this the End... ? This is a guy who does not get it...

Turning the World

This is a very short post, waiting for more beefy stuff coming up next.


There are a lot of BI vendors who, in their value proposition, support the idea of the business adopting BI tools with no IT involvement.

To my knowledge the first was Tableau, but now many others are coming up with the same approach.

These tools are supposed to free the information workers to do their job, cutting out IT entirely.

This is because IT is slow to act and it is perceived as a rule enforcer, just capable of proposing big and costly programs to address any business need.


Well, I have news for you, information workers: IT is more than happy to let you do all your analysis yourself. All those technical BI people prefer not to deal with all those business users’ requests that, from a technical perspective, are nearly always the same and look like endless variations of a reduced set of queries, charts or reports.

What IT is not happy about, though, is to let you make mistakes.

The “raw data” advocated by tool vendors (or the “big data” they are so often confused with) often house all sorts of pitfalls and traps that may create such a mess for the analysis that, in the end, you will need a professional to clean them in a repeatable and consistent way. On the other hand, the chances that you are going to find your business concepts in raw data or undocumented databases are often so slim that you have to resort to a professional to carry the job over the line.

At that point, maybe, it is worth cooperating with IT, who in turn is more than happy to leave all that shiny, interesting stuff to you.



Real Time BI - The Other Side of the Moon

Real time BI is still perceived as one of the huge challenges that BI may face in an evolving corporate environment.

There are some companies that try to address it by adopting specific software designed according to specific real-time concepts. While the outcome of implementing this software is often remarkable, low-impact and low-cost solutions are often readily available in the company landscape. As can be easily guessed, it is the actual business case and the related financial figures that drive the implementation.

The Domain

The domain in which real time BI technologies or methodologies are to be applied is often prone to misunderstandings.

We recall how we define Business Intelligence (BI): it is the complex of what is necessary (applications, data, processes) to support control.

The control of operational systems by real-time mimics and dashboards is not business intelligence. It is just an activity, not a decision-making tool. These systems are means to an end (the production of a certain amount of goods, the security of a compound etc.) but they do not help identify what the end should be. However, the data generated by them, when aggregated and transformed into metrics and analytics, are generally a factor in the decision-making process.

A submarine has an extremely complex control centre, with thousands of systems reporting in real time data about their status. However, the commander grounds his decisions on far fewer inputs, deriving either from specific tools (communication systems) or from the transformation of those data into meaningful metrics (tactical consoles or the good old annotated maps). All those screens and dials are not to be considered BI systems because they directly support operations and are not used to identify the purpose of the submarine's actions in that specific moment of the mission.

So, according to our view and for the purpose of this article, real-time BI systems provide an aggregated, transformed and enriched view of operational data to support decisions.

What Does Real Time Exactly Mean?

True real-time BI is hardly feasible because commercial operating systems are not real-time operating systems. With the term "real time" we actually mean that the lag between a transaction occurring and its effect being available for consumption by users through BI tools is low, compared to the lag experienced in batch systems. While the idea of updating data once a day is ubiquitous, shorter intervals are becoming more and more common. At the lower end of this trend, when intervals become short (less than an hour) or very short (a few minutes), the batch update blends with real time.

When the business talks about Real Time, it is often satisfied with these lower-bound latency levels. As you can see, this matches the idea above that operational systems do not count as real-time BI systems.

The Technological Approach

Implementing real time analytics in a transactional environment requires, obviously, a choice of the enabling technologies.

From a technology perspective, the critical drivers are transaction frequency and data volume.

Some applications, like telco or web analytics, may have a very high transaction frequency. The transaction inflow may be metaphorically likened to a stream, with the properties usually associated with one.

This kind of high frequency data is usually handled by software platforms designed to manage "feeds" of data. These platforms tap into some sort of API or messaging application, sequentially calculating time-based metrics and KPIs. They basically aggregate the last "x" transactions to calculate a time dependent measure/metric/KPI.
There are plenty of solutions on the market, each with its own flavor, that can implement these features. This is no longer new today; it is something that can be purchased off the shelf. There are dozens of web analytics solutions (Google and Adobe dominate the market) and only slightly fewer generalist ones (Vertica at one end, Vitria at the other, some Oracle and SAP integrated solutions, etc.).
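The core mechanics of these feed-processing platforms, aggregating the last "x" seconds of transactions into a time dependent metric, can be sketched in a few lines. This is a minimal illustration, not a description of any specific product; the class name and the event data are made up for the example.

```python
from collections import deque

class SlidingWindowKPI:
    """Maintain a time-based metric over the last `window_seconds` of events.

    A minimal sketch of what feed platforms do at scale: events that fall
    out of the window are evicted, and the running aggregate is updated
    incrementally instead of rescanning all transactions.
    """

    def __init__(self, window_seconds):
        self.window = window_seconds
        self.events = deque()      # (timestamp, value) pairs, oldest first
        self.running_sum = 0.0

    def add(self, timestamp, value):
        self.events.append((timestamp, value))
        self.running_sum += value
        self._evict(timestamp)

    def _evict(self, now):
        # Drop events strictly older than the window
        while self.events and self.events[0][0] < now - self.window:
            _, old_value = self.events.popleft()
            self.running_sum -= old_value

    def average(self):
        return self.running_sum / len(self.events) if self.events else 0.0

# Example: average transaction amount over a 60-second window
kpi = SlidingWindowKPI(window_seconds=60)
for ts, amount in [(0, 10.0), (30, 20.0), (90, 30.0)]:
    kpi.add(ts, amount)
# By t=90 the event at t=0 has been evicted
print(kpi.average())  # 25.0
```

Real platforms add persistence, parallelism and backpressure handling, but the incremental-aggregation idea is the same.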

Just a step below these challenging environments, there are other cases that may benefit from a more low-profile approach and some intelligent design. Not all companies have millions of facts to process every minute; many nonetheless need near real time processing, where a lag in the order of tens of minutes to a few hours is acceptable.

The Zero Footprint Solution

At the lower end of the complexity spectrum, we can simply query the transactional systems for up-to-the-minute data. This solution is particularly viable when tools featuring a semantic layer, like BusinessObjects, Cognos or MicroStrategy, are available, since the queries issued against an operational schema are likely far less straightforward than those issued against a structured data mart, and hence beyond the reach of most users.
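The zero footprint idea boils down to running analytical SQL straight against the operational schema. The sketch below uses an in-memory SQLite database as a stand-in for the transactional system; the table, columns and data are illustrative, and the query is the kind a semantic layer would generate on the user's behalf.

```python
import sqlite3

# Stand-in for the operational database: a normalized order-lines table,
# optimized for frequent small writes, not for analytics.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE order_line (
        order_id   INTEGER,
        product_id INTEGER,
        qty        INTEGER,
        amount     REAL,
        created_at TEXT
    );
    INSERT INTO order_line VALUES
        (1, 101, 2, 40.0, '2024-05-01 09:12:00'),
        (2, 101, 1, 20.0, '2024-05-01 09:45:00'),
        (3, 102, 5, 75.0, '2024-05-01 10:03:00');
""")

# An up-to-the-minute aggregate taken directly from the transactional
# tables, with no intermediate data mart.
rows = db.execute("""
    SELECT product_id, SUM(qty) AS units, SUM(amount) AS revenue
    FROM order_line
    WHERE created_at >= '2024-05-01 00:00:00'
    GROUP BY product_id
    ORDER BY product_id
""").fetchall()

for product_id, units, revenue in rows:
    print(product_id, units, revenue)
```

The footprint on the source system is zero in terms of new infrastructure, which is precisely why the performance caveats discussed below matter.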

This method may work beautifully and become a long-term solution with minimal effort and impact.

Obviously, there are a number of things that can go wrong.


The first issue is that, by definition, we are querying and presenting siloed data: the very data that the traditional DW paradigm is designed to integrate and enrich. These siloed data, presented in an interactive format, are often enough to support real time decisions; when they are not, the tools mentioned above offer a way out. They all provide some sort of data federation at query time, or data integration from multiple sources at report level. So, for example, the raw customer description coming from the transactional system may be integrated with the much more sophisticated conformed dimension and all its attributes.
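Report-level federation of this kind is conceptually just a lookup join between the live transactional rows and the conformed dimension. A minimal sketch, with made-up customer codes and attributes:

```python
# Rows returned in real time from the operational system: only a raw
# customer code is available (data are illustrative).
transactions = [
    {"customer_code": "C001", "amount": 120.0},
    {"customer_code": "C002", "amount": 80.0},
]

# Conformed customer dimension from the warehouse, refreshed in batch,
# carrying the enriched attributes the source system lacks.
customer_dim = {
    "C001": {"name": "Acme Corp", "segment": "Enterprise"},
    "C002": {"name": "Smith Ltd", "segment": "SMB"},
}

# Federation at report level: enrich each live row with the warehouse
# attributes, falling back to a placeholder for unmatched codes.
report = [
    {**tx, **customer_dim.get(tx["customer_code"],
                              {"name": "Unknown", "segment": "Unclassified"})}
    for tx in transactions
]

for row in report:
    print(row["name"], row["segment"], row["amount"])
```

BI tools implement the same pattern declaratively; the fallback for unmatched codes matters because the dimension, being batch-refreshed, may lag behind brand-new transactional records.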


A second issue, this one purely technical, may jeopardize the effort. Operational database structures are often inherently unfit to handle BI queries efficiently: they are optimized to return, add or update small sets of rows very frequently, while BI queries typically span large table segments, aggregating a non-negligible percentage of the total rows. These queries may either perform badly or hinder the performance of the operational queries themselves. There is no zero footprint solution to this issue.


Replication is a database engine feature that synchronizes two instances of the same database. The master database is updated by the transactional systems and is then replicated into a slave, which will be exactly identical to the source database; this replica is the one queried by the BI systems. Obviously, the replication lag is a key factor, but modern solutions can reduce it to negligible levels.

In this way the two systems are isolated and the BI system can run its queries autonomously. The possible requirement of federating these data with the DW, of course, remains intact.
If query performance keeps being an issue, though, a different solution is required.


One way to improve performance is to replicate just the data needed for the real time decisions. This may be a feature included in the replication engine, or it may be implemented ad hoc. In the latter case, it is highly advisable to rely on timestamps or last-modified dates and not on change data capture, since hashing may easily become complex enough to slow down the entire process.
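A timestamp-driven partial replication can be very simple: each run extracts only the rows whose last-modified date is newer than the previous extraction. The sketch below uses SQLite as a stand-in source; table and column names are illustrative.

```python
import sqlite3

# Stand-in for the transactional source, with a last_modified column
# maintained by the application (data are illustrative).
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE orders (
        order_id      INTEGER,
        amount        REAL,
        last_modified TEXT
    );
    INSERT INTO orders VALUES
        (1, 50.0, '2024-05-01 08:00:00'),
        (2, 70.0, '2024-05-01 09:30:00'),
        (3, 90.0, '2024-05-01 09:55:00');
""")

def extract_delta(conn, last_extracted):
    """Replicate only rows changed since the previous run, relying on the
    last_modified column instead of hash-based change data capture."""
    return conn.execute(
        "SELECT order_id, amount, last_modified FROM orders "
        "WHERE last_modified > ? ORDER BY last_modified",
        (last_extracted,),
    ).fetchall()

# Suppose the previous run covered everything up to 09:00;
# this run picks up only the delta (rows 2 and 3).
delta = extract_delta(db, "2024-05-01 09:00:00")
print(delta)
```

The watermark ("last extracted" timestamp) must be persisted between runs, and deletes are invisible to this scheme, two classic caveats of timestamp-based extraction.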




If the BI environment includes cubes, then cube processing time adds to the real time lag. Even though modern cubes are very fast at loading and processing very large quantities of data, they are hardly compatible with a real time environment. While it is in theory possible to load small batches of data and process just those, this may easily prove to be a complex and unreliable mechanism.

A good alternative is to consider pure ROLAP cubes, where the cube becomes merely a proxy to the relational database underneath. This sacrifices the most important (many would say the only) feature OLAP has to offer, the pre-aggregation of the underlying facts, but it preserves a consolidated path to access the data.


If Everything Else Fails


If none of the approaches described above works for the case at hand, it is finally possible to consider firing the ETL process at short intervals. This is obviously the most complex solution to adopt.

First, you have to make sure that every dataset being lifted is internally coherent. For example, there is no point in updating the products at mid-morning if they do not yet possess all the attributes required to perform the subsequent transformations.
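A simple pre-load guard can enforce this kind of coherence: before lifting the new facts, check that every product they reference already carries the attributes the downstream transformations need. The attribute names and data below are made up for the example.

```python
# Attributes that subsequent transformations require to be populated
# (illustrative names).
REQUIRED_ATTRIBUTES = ("category", "unit_cost")

def find_blockers(facts, products):
    """Return the product ids that would break the load, if any."""
    missing = set()
    for fact in facts:
        product = products.get(fact["product_id"])
        if product is None or any(product.get(a) is None
                                  for a in REQUIRED_ATTRIBUTES):
            missing.add(fact["product_id"])
    return missing

products = {
    "P1": {"category": "Hardware", "unit_cost": 9.5},
    "P2": {"category": None, "unit_cost": 4.0},  # attribute not yet populated
}
facts = [
    {"product_id": "P1", "qty": 3},
    {"product_id": "P2", "qty": 1},
]

blockers = find_blockers(facts, products)
if blockers:
    print("postpone load, incomplete products:", sorted(blockers))
```

Whether the right reaction is to postpone the whole load or just quarantine the offending rows is a design decision that depends on how the downstream transformations fail.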

Second, some metrics based on long-term counts and averages may not be efficiently calculated with an incremental logic. For example, if customers are segmented upon the long-term pattern of their purchases, it is generally complex to update these segments incrementally.
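The distinction is between metrics that fold over deltas and metrics that do not. A running average is the textbook case of the first kind, as a minimal sketch shows; a pattern-based segmentation is typically of the second kind.

```python
# Additive metrics can be maintained incrementally: a running count and
# sum let each intraday batch update a long-term average in O(1).
class RunningAverage:
    def __init__(self):
        self.count = 0
        self.total = 0.0

    def update(self, batch):
        self.count += len(batch)
        self.total += sum(batch)

    def value(self):
        return self.total / self.count if self.count else 0.0

avg = RunningAverage()
avg.update([10.0, 20.0])   # overnight batch
avg.update([30.0])         # intraday delta
print(avg.value())  # 20.0

# A segmentation based on the *pattern* of purchases, by contrast, may
# need the whole history again: one new order can reshuffle how all past
# orders are interpreted, so the metric is not expressible as a fold over
# deltas and a full recomputation is usually safer.
```

In practice this often means updating the additive metrics at short intervals while leaving the pattern-based ones on the nightly schedule.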

Third, and most important, different users may receive different figures during the day just because they issued a query at different times. While this is perfectly acceptable, and indeed desired, for the metrics that drive the decisions taken upon the real time analytics, it is a potentially endless source of confusion for all the other information. For example, the monthly closure process requires drawing a line across various streams of information, and managing that while continuing to update everything else is, indeed, very complex.




This article is just a short review of the alternatives available to an organization considering the adoption of real time analytics. The basic idea is that specific real time technologies are necessary in some challenging environments, but they are not the most cost effective solution in others. In those cases, a smart use of existing technologies appears to offer the best return on investment.