Being short in life is not advisable, there are statistics that say that your paycheck has a stronger correlation with your height than with your studies and experience. Unluckily, it is one of the things you can do very little about if the fate was not good enough with you.
On the other hand, having a "short" data warehouse may help a lot.
In the classic data warehouse paradigm, an integral part of the process is to extract and transform the data to populate structures adequate for querying.
It is often reported that building and maintaining the ETL process is the most resource intensive technical process among those necessary to build a BI environment. The flip side is: this also means that this is the process where the biggest gain in efficiency can be achieved.
Many vendors in the last three decades supported this view. The wave started with the ETL tools that (almost) avoided writing code, followed by the tools that let you design a data warehouse just conceptually while the tools takes care of all the rest. ERP vendors all provide extractors or other solutions to simplify the ETL process.
This paradigm is loosing some relevance with the appearance of in-memory solutions but it is still one of the focal points for DW professionals.
So, what is the most cost-effective approach to ETL? What is the most efficient initiative to reduce the amount of resources devoted to ETL?
Simple, keep it short.
There are plenty of ways to measure software complexity. Just try googling it and you will receive pages and pages of academic articles. For our purpose, let's just consider the number of objects (database fields) that are involved in the process multiplied the number of times they are invoked in the transformation. It is a very simple measure, roughly correlated with the actual complexity but good enough to show the point.
The amount of objects that are required to correctly model a business is pretty much given. You may implement your marts in steps or deliberately starve the business but what is needed for the business to work is generally well identifiable. There is not much to do on this side. On the other hand, the transformations that you may need to turn your raw data into something consumable by a front-end application may greatly vary.
Building these transformations is the most expensive stage, the one that introduces more complexity, consumes more time and it inherently creates the lineage problem.
There are two approaches to simplifying the transformations.
The most common raw data for a classical data warehouses, are based on facts derived from documents: invoices, Orders, bills of laden, warehouse picking notices etc. There are not many ways to model these documents: they all feature a header/footer, with some attributes and some quantities defined at that level; and a body that, generally contains rows, with other attributes and measures.
Mind that that representing the documents in this format is a sort of "natural modeling", since it embeds in data structures the way they are perceived and "lived" by the business.
The difficulty rises when, in the source system, they are not modeled as a header/footer logical data construct but they are dispersed in different tables, sourced from logs or queues etc. In this case, the integrator job is really uphill. Obviously there is no technical shortcut to this but, on the project management side, the DW/BI team may ask to the people responsible for those applications to provide themselves the header/footer structures required. This is generally advisable since the people managing the applications are also those who better know how to extract and transform the data and it is easier and less resource intensive for them rather than for a team which doesn't have a very specific domain knowledge.
The flip side of the coin, is that you will immediately need a process contract with another team and a technical contract among different applications and platforms. The contract details are often overlooked: what to do when a new field comes along, what is the tables update policy, who is in charge of signing off updates? Once again, instead of just a technical wizardry, what we really need is a solid process ensure a constant flow of reliable data.
At the other end of the DW process there is another approach that may greatly simplify the ETL process and the need for integration. Some large BI suites feature a back-end layer including a data federation engine and a semantic layer.
With data federation, the consumer can be shown a single data structure that integrates physically separated data as if they had undergone the process of having been copied, joined together, cleaned and saved to a table. It is generally easier and less expensive in terms of development, maintenance and computing resources to federate two tables rather than maintaining the appropriate ETL process.
This doesn't mean that data federation is easy and ETL is hard, but the balance in most cases goes in favor of federation, especially if it is an out of the box feature like it is for some BI suites.
The semantic layer, on the other side, embeds rules and descriptions that may let the ETL process stop short of some final operations, thus becoming simpler. A semantic layer may also include complex calculations, relieving the ETL process from the time and resources necessary to execute them. The calculation burden is moved at runtime and it is executed only if necessary. This is a semantic layer's virtue that it is often overlooked while considering which BI suite to use in an organization.
We can summarize the two point above in a simple rule: your ETL process must be "short", relying on one side on pre-digested input from the data producers and on smart tools to consume data on the other side.
So, sometimes, being short in life is not that bad, after all.