I already wrote a post about this topic in the past, but this one is much more interesting and pretentious than the old one.
It is often mentioned as one of the cornerstone benefits of data warehousing and business intelligence (DW and BI respectively); it is always depicted as the end of a bane, a sort of Holy Grail of data management: the adoption of a Single Version of Truth throughout the company.
However, this very idea is sometimes interpreted quite naively by the business. Upper management loves to hear that the numbers are exactly these and not others, so they are often given what they want without much hesitation. The truth is: the numbers may well have been different, depending on the people who assembled them and the use they were assembled for.
As an example, under the simple term "Sales" there are many different definitions of the same number. Sales for the sales force are the value of signed contracts, while sales for operations are the orders invoiced. For finance, sales are the balance of a specific account, while customer service is interested in returns too. They will all call "sales" what are actually different measures. While business people sometimes find it hard to understand this difference, it clearly exists. So, what exactly do we mean by the expression "Single Version of Truth"?
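To make the point concrete, here is a minimal, purely hypothetical sketch (transaction types and amounts are invented) of how three departments could derive three different "Sales" numbers from the same transaction log:

```python
# Hypothetical transaction log shared by all departments.
transactions = [
    {"type": "contract_signed", "amount": 100},
    {"type": "invoice",         "amount": 80},
    {"type": "invoice",         "amount": 15},
    {"type": "return",          "amount": -10},
]

def sales_for_salesforce(txns):
    # Sales force: value of signed contracts
    return sum(t["amount"] for t in txns if t["type"] == "contract_signed")

def sales_for_operations(txns):
    # Operations: orders invoiced
    return sum(t["amount"] for t in txns if t["type"] == "invoice")

def sales_for_customer_service(txns):
    # Customer service: invoiced amounts net of returns
    return sum(t["amount"] for t in txns if t["type"] in ("invoice", "return"))

print(sales_for_salesforce(transactions))        # 100
print(sales_for_operations(transactions))        # 95
print(sales_for_customer_service(transactions))  # 85
```

Three honest answers, three different numbers, one word: "Sales".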
The term "Single Version of Truth" identifies the ability to decompose every result obtained from a query in terms of the results of another query.
In other words, providing the Single Version of Truth means being able to answer the question "Why are EMEA sales for 201X £YYY here and £YYZ there?"
The differences among versions of what is supposed to be the same measure arise from various factors. While we stick with the sales example, the discussion can easily be extended to the general case. We are also supposing that there actually is a central DW to refer to.
Different systems provide the same data in different formats to the DW. A company, or a group, especially if it is the result of a recent merger, may have different systems managing the same process for what used to be separate entities. The figures may come in differently from different systems: one system may carry tax and net value separately, while another has a single gross value and the tax percentage. Order-header-level measures (shipping expenses, discounts, etc.) may be provided as special products appended to the order, or in a separate table extracted by an autonomous flow, and so on. In short, data formats may be, and in general are, different.
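A tiny, hypothetical sketch of that normalization (field names and figures are invented): two source systems deliver the same order line in different shapes, and the transformation brings both to a common (net, tax) form.

```python
# Hypothetical sketch: normalize two source formats to one shape.
def from_system_a(row):
    # System A carries net value and tax value separately.
    return {"net": row["value"], "tax": row["tax"]}

def from_system_b(row):
    # System B carries a single gross value plus the tax percentage.
    net = row["gross"] / (1 + row["tax_pct"] / 100)
    return {"net": round(net, 2), "tax": round(row["gross"] - net, 2)}

print(from_system_a({"value": 100.0, "tax": 20.0}))    # {'net': 100.0, 'tax': 20.0}
print(from_system_b({"gross": 120.0, "tax_pct": 20}))  # {'net': 100.0, 'tax': 20.0}
```

The same order line, arriving in two formats, ends up as one record; this is exactly the job of the T step discussed later.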
Data may not be homogeneous from the beginning, since there are two or more radically different processes at their source. In this case they require different dimensional models to be described, with different definitions of measures.
For instance, consider a tractor manufacturer: it will feature two utterly different sales processes, new tractors and servicing. The former will deal with lead times, personalizations, credit, etc., the latter with parts availability, licensed resellers, etc.
Even at grassroots level, if we compare a paper invoice from each of the two businesses, they will look massively different, the only thing in common being that both have a header and some detail lines with numbers.
Trying to apply the same rules to these two processes is obviously meaningless. For example, including "Returns" as a negative in the sales total has the clear purpose of identifying a commercial margin; a defective part may be returned, but a defective tractor is never returned: it is fixed for free, bearing a cost for the supplier, while replacement is an unlikely event. So the new-tractor sales measure needs a rule to account for these occurrences and may not mention returns at all.
In short, below the same measure there are different processes.
The dimensions used to slice and dice measures may be subtly different from what consumers expect, despite having the same name and look.
For example, there may be different warehouses that shipments come from, but those warehouses also belong to different companies controlled by a central warehouse holding, which in turn is part of a consolidated group.
So a shipment may be associated with the physical warehouse actually shipping it, or with the company of the same name running the warehouse. Consider an emergency shipment, where the manufacturing plant is instructed by a warehouse to skip the usual procedure and ship directly to the customer. It will be associated with the company but not with the physical warehouse. The finance controller will be interested in it, but the field manager will not.
Another dimension that can easily give rise to this kind of mismatch is the customer dimension. The customer with which a sale or any other transaction is associated may be of disparate kinds: individuals, family households, single companies, invoicing points with their addresses, group holdings, franchisers, shipment points with a specific address down to the receiving door, temporary company associations that buy together once and separately later, and so on. Additionally, all the data pertaining to customers quickly grow old and, unless there is a sound MDM initiative in place to keep them current and propagated, they become a relevant source of mismatches. Trying to view customers as a homogeneous category is often misleading and may lead to subtle confusion.
Finally, even simpler dimensions may hide differences in definition. Consider, for instance, the cost center: something bought by one cost center for the use of another may well be classified under either of them, depending on the type of analysis.
In a more general perspective, rapidly changing Master Data, when not managed centrally, are one of the hardest challenges to providing a single version of truth.
Transactions may not be obvious to interpret and may derive from different systems. Invoices provide a different kind of information than accounting entries, but both may provide sales. Besides, in every system there are transactions that are not easy to identify: generic manual transactions involving dummy products and customers, issued for "miscellanea" or "other" reasons. They consume a disproportionate amount of time to make sense of, and it is hardly possible to find consensus on their processing within the DW.
Last but not least, even when the meaning of every dimension and every transaction is well defined and understood, people actually need different numbers. Different measures, calculated in different ways, are called by the same name and generate all possible confusion. While names should somehow be adapted to the actual content, the critical question to be answered is: what makes this "Sales" number different from that "Sales" number? Knowing the answer, as stated above, is providing a Single Version of Truth.
The Logical Model
To actually be in a position to answer that question, we must briefly recall the nature of data warehousing itself. While the formal definition of a DW may vary, all authors agree that there is going to be a "Transformation" phase (the T in ETL). This transformation is required to obtain a physical model matching the logical model and an easy-to-query data arrangement. From these data, through further stages and further calculations, the actual content to be served to users is created. That is, we have two ontologically different transformation sets to consider.
If A is a set of data, T the transformation, and B the resulting set, we can express the ETL process as

T(A) = B

Notice that, in general, the inverse transformation may or may not exist, depending on the transformation itself. That is,

T⁻¹(B) = A

may or may not be true.
For instance, when some sales data are aggregated by month, in general there is no inverse transformation that could give us the sales by day starting from the aggregated result set.
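The lossiness of aggregation can be sketched in a few lines of Python (dates and amounts are invented):

```python
from collections import defaultdict

# Aggregating daily sales by month is a transformation T with no
# inverse: the monthly total cannot be decomposed back into days.
daily_sales = {"2024-03-01": 10, "2024-03-15": 25, "2024-04-02": 7}

def aggregate_by_month(daily):  # T
    monthly = defaultdict(int)
    for day, amount in daily.items():
        monthly[day[:7]] += amount  # keep only the YYYY-MM prefix
    return dict(monthly)

print(aggregate_by_month(daily_sales))  # {'2024-03': 35, '2024-04': 7}
# There is no T⁻¹: infinitely many daily breakdowns sum to 35.
```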
So, every time we run a transformation we risk losing information. If we query the A and B data sets separately, there may be no way to compare the two results to verify what the difference is.
However, the norm in an enterprise environment is to cover multiple areas with different datamarts, so that we will have an arrangement like

T1(A) = B
T2(A) = C
If we need to compare B with C, it is enough that one of the Ts features an inverse transformation, say T2, such that a path from C to A to B may be established. It is then possible to go from the result set C back to the original dataset A via T2⁻¹, and then to B by applying T1.
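Under the assumption that T2 preserves a row key (and is therefore invertible) while T1 aggregates, the C → A → B path might be sketched like this; all names and data are illustrative:

```python
# Hypothetical source dataset A: keyed rows.
A = [
    {"key": 1, "region": "EMEA", "amount": 10},
    {"key": 2, "region": "EMEA", "amount": 20},
    {"key": 3, "region": "APAC", "amount": 5},
]

def T1(rows):
    # Aggregate by region: lossy, no inverse.
    out = {}
    for r in rows:
        out[r["region"]] = out.get(r["region"], 0) + r["amount"]
    return out

def T2(rows):
    # Reshape but keep the key: lossless, invertible.
    return {r["key"]: {"region": r["region"], "amount": r["amount"]} for r in rows}

def T2_inv(c):
    # T2⁻¹: rebuild A from C.
    return [{"key": k, **v} for k, v in c.items()]

B, C = T1(A), T2(A)
# From C we can go back to A and then forward to B:
print(T1(T2_inv(C)) == B)  # True
```

Because T2 kept the key, C can always be explained in terms of B by walking back through A.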
Comparing result sets derived from user queries, on the other hand, may be split into two cases: queries that derive from different datamarts and queries that derive from the same datamart.
In the former case we can easily see that we are in the same situation as with the ETL transformations, with just one more step: the two transformations from the datamart to the user.
The latter case is the more interesting one. Let us suppose that we have two queries (SQL is assumed here, but the idea can be extended) which return different results. In principle, it is always possible to "morph" one query into the other by adding or removing fields (objects), tables, joins (links) and conditions (where clauses or filters). There is one or more transformations that, in a finite number of steps, turn one query into the other.
In a formal notation, if the Qs are the queries, A the datamart and B, C the results, we have:
Q1(A) = B
Q2(A) = C
If Q1(A) -> Q2(A) then B -> C (I am not using a proper limit since we are dealing with discrete variations of a script, which is not a very mathematical entity)
In this process we expect to have
B + Δ1B + Δ2B + ... = C
That is, every discrete query variation returns an associated result variation.
If we consider the value of a single field in B, the ΔBs and C, we expect the ΔBs to be between 0 (no change for a given variation) and the aggregated value for the entire A. In the simplest case, the deltas become smaller and smaller and the results converge to C.
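A toy illustration of the morphing process (filters and figures are invented): each discrete step removes one filter from Q1, producing a delta, until the result reaches C.

```python
# Hypothetical rows in datamart A.
rows = [
    {"region": "EMEA", "year": 2024, "amount": 10},
    {"region": "EMEA", "year": 2023, "amount": 4},
    {"region": "APAC", "year": 2024, "amount": 7},
]

def q(rows, filters):
    # A query reduced to its essence: filter, then aggregate.
    return sum(r["amount"] for r in rows
               if all(r[k] == v for k, v in filters.items()))

b = q(rows, {"region": "EMEA", "year": 2024})  # Q1 -> B = 10
step1 = q(rows, {"region": "EMEA"})            # drop the year filter: Δ1 = 4
c = q(rows, {})                                # Q2 -> C: Δ2 = 7
print(b, step1, c)  # 10 14 21
```

Here the deltas (4, then 7) stay bounded by the total of A, and the sequence converges to C as expected.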
There are cases, though, when the deltas, on the contrary, suddenly vary by orders of magnitude.
This happens when there is no enforced relation between two entities. To stay in the SQL domain, we get a fan trap or a Cartesian product. When this occurs (and, of course, we are not making a trivial mistake while querying), the B transformation is not going to converge to C.
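A fan trap in miniature (table names and figures are invented): joining an order header to its lines and then summing a header-level measure multiplies it by the number of lines, so the delta explodes instead of shrinking.

```python
# One hypothetical order header with a header-level measure...
headers = [{"order": 1, "shipping": 12.0}]
# ...and three order lines at a finer grain.
lines = [{"order": 1, "product": "A"},
         {"order": 1, "product": "B"},
         {"order": 1, "product": "C"}]

# The join fans the header out over the three lines.
joined = [dict(h, **l) for h in headers for l in lines if h["order"] == l["order"]]

print(sum(h["shipping"] for h in headers))  # 12.0  (correct)
print(sum(r["shipping"] for r in joined))   # 36.0  (fanned out: one copy per line)
```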
This means that, if we want to be able to produce a single version of truth in the sense given above, every datamart should rigorously feature fact tables with data at exactly the same grain.
This point has a number of consequences on the overall DW design as part of an enterprise-wide BI solution and MIS system.
Proposal for a Real Life Model
Not surprisingly, the previous paragraph has reached a conclusion that is well known to every BI professional. The classic books on the subject that describe the star schema as the principal datamart layout prescribe a single fact table at the lowest practical grain. This has many other advantages but, as we discovered, is also essential to preserving the possibility of identifying a single version of truth.
A bit less studied, though, are the consequences on the DW design as a whole.
If every datamart models a single process at the lowest possible grain, what are the consequences for a DW where we have different datamarts at different grains and yet have to provide data coming from more than one of them?
The issue arising is pretty well known: fanning and filtering effects may affect the results. These effects are implicit in the way SQL works, but similar effects may still be present in MDX, DAX or any other query language. Presentation tools also often struggle to show together data at different granularities.
This is a fundamental issue that can't be fixed by a simple methodological trick. These differences are inherently embedded in the nature of the processes, and they leak into the data describing them. This demands a coherent approach.
To tackle the issue we can envisage a DW with a specific modified structure. Taking inspiration from the well-known Kimball theory, we introduce, at staging level, an intermediate structure that we call a "Process Mart". The Process Mart will feature:
- Fact records at the lowest granularity possible, with a clear key field identifying each of them.
- Fact record fields actually mirroring the quantities modeling the process, i.e. the sum of a homogeneous numeric field across all the records is still a number with an actual meaning.
- All the relevant dimensional and attribute fields.
- As many key fields from other process marts as possible.
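A minimal sketch of a process mart fact record following the list above; all field names are illustrative, not a prescribed schema:

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical fact record for a sales process mart.
@dataclass
class SalesFactRecord:
    fact_key: str               # key at the lowest granularity
    order_key: str              # key into the orders process mart
    payment_key: Optional[str]  # key into the payments process mart, if any
    customer: str               # dimensional attribute
    product: str                # dimensional attribute
    quantity: float             # additive: summing it keeps a meaning
    net_amount: float           # additive: summing it keeps a meaning

r = SalesFactRecord("F1", "O1", None, "ACME", "blue widget", 2.0, 50.0)
print(r.fact_key, r.net_amount)
```

The keys into sibling process marts are what will later make consolidation, and reconciliation, possible.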
Data from every process mart will then be consolidated into datamarts, each datamart being derived from one or more process marts. For example, finance will be interested in sales and payments, sales will be interested in sales and costs (i.e. margins), HR will be interested in people's skills but also in the cost of those people, and so on.
This "consolidation" process may happen only if we can connect one process mart to another by their keys. The different granularity may be addressed in various ways, to be chosen based on the specific interest of the users and the query tools that will consume data from the datamart.
- The datamart may be at the most aggregated granularity level, aggregating the more detailed data in a single measure.
- The datamart may be at the least aggregated granularity level, allocating the more aggregated data on the most detailed rows.
- The datamart may be at the least aggregated level, with the more detailed data being pivoted out.
- The datamart may be at the least aggregated level with aggregated data artificially brought to the same detail by adding dummy dimensional attributes.
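As a sketch of the allocation option (the second in the list), a payment covering several order lines can be spread over the detailed rows proportionally to their value; keys, line values and the payment are all invented:

```python
# Hypothetical order lines and a partial payment covering both.
order_lines = [{"line": 1, "amount": 60.0}, {"line": 2, "amount": 40.0}]
payment = {"payment_key": "P1", "paid": 90.0}

total = sum(l["amount"] for l in order_lines)
allocated = [
    dict(l, payment_key=payment["payment_key"],
         paid=round(payment["paid"] * l["amount"] / total, 2))
    for l in order_lines
]
print(allocated)
# Line 1 gets 54.0 and line 2 gets 36.0; carrying the payment key on
# every detailed row is what keeps the allocation invertible.
```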
This is not the place to discuss each option in depth; what matters to us is their consequences on the ability to find a single version of truth. The crucial point to understand is that a process mart may be sliced in terms of the other process marts, thus preserving the traceability between the two. Operations like aggregation or allocation that would normally destroy backward traceability remain invertible in this model: at any moment, I can identify the rows involved in a query using the key of the other process mart.
For example, let us suppose we have the orders process mart and the payments process mart. More than one order line may be covered by a single payment, and a single order may have more than one payment, possibly not related to an order line. Then we have two datamarts: one including just the orders, for the commercial back office, and one combining orders and payments, for the bank manager. Both datamarts have a "Sales" measure, but the former keeps all the details about the orders, while in the latter the payment is the detail and the order is just a property of the payment; no detail of the order lines is preserved.
Now, we have two queries from the two datamarts, both covering EMEA sales in Q1 for blue widgets. They return different numbers; how can we explain the difference? The process requires us to:
- Identify all the payments included in the query from the datamart.
- Use the payments to select the orders involved in the orders+payments process mart.
- Apply the same order selection to the orders-only datamart.
- Apply the remaining conditions, where applicable, to the orders-only datamart.
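The walk-through above can be sketched as follows; both datamarts, every key and every figure are illustrative stand-ins:

```python
# Hypothetical orders+payments datamart rows.
payments_mart = [
    {"payment": "P1", "order": "O1", "region": "EMEA", "sale": 50},
    {"payment": "P2", "order": "O2", "region": "EMEA", "sale": 30},
]
# Hypothetical orders-only datamart rows.
orders_mart = [
    {"order": "O1", "region": "EMEA", "product": "blue widget", "sale": 50},
    {"order": "O2", "region": "EMEA", "product": "blue widget", "sale": 40},
    {"order": "O3", "region": "EMEA", "product": "blue widget", "sale": 20},
]

# 1. Identify the payments included in the query.
selected_payments = [r for r in payments_mart if r["region"] == "EMEA"]
# 2. Use them to select the orders involved.
order_keys = {r["order"] for r in selected_payments}
# 3-4. Apply the same order selection (plus any remaining conditions)
#      to the orders-only datamart.
matched = [r for r in orders_mart if r["order"] in order_keys]
unexplained = [r for r in orders_mart if r["order"] not in order_keys]

print(sum(r["sale"] for r in payments_mart))  # 80
print(sum(r["sale"] for r in orders_mart))    # 110
print(unexplained)  # O3 has no payment yet: part of the difference
```

Here the shared order key is what lets us pin the discrepancy down to specific rows (an unpaid order, a partial payment) instead of arguing about two bare totals.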
At some point during the process, the reason for the difference will come out.
This has been a lengthy attempt to ground in a formal framework some of the criteria that underlie the idea of a "Single Version of Truth". This idea is integral to the business requirements that lead down the data warehouse path, and it is going to remain despite all the hype about "unstructured analysis". I hope this post will help someone design better DWs and promote a conscious approach to achieving the Single Version of Truth.
Maintaining the internal coherence described above is a process that stretches along the entire life of the DW, since it will be challenged by the inevitable new implementations, but now you will understand that it is well worth its complexity.
We can have reports and numbers even without a DW, but they will not be comparable and will be far less meaningful, useful and authoritative. If our DW/BI system is unable to do any better, then why bother having one?
Further Reading
The Ugly Truth About "One Version of the Truth", where you can get a general overview of the problem with a couple of big-name references.
Is There a Single Version of the Truth?, where Robin Bloor gives a nice, customer-centric vision of the problem.
The Myth of One Version of the Truth, an Oracle paper that goes into further detail along the same lines discussed in this article.
Why There Shouldn't Be a Single Version of Truth, where Chuck Hollis gives us one of the classic articles on the subject.
Is this the End... ?, where the author, in my view, does not really get the point.