
This post concerns the various layers that we describe as part of the data warehouse architecture.  Basically, this is MDM (master data management) for the business terms used in the data warehousing industry.

Ronald Damhof wrote some excellent posts on his blog and kicked off a much-needed discussion on this topic.  The layers that we actually deploy in DV-based data warehouses today, given a common categorization, probably don’t vary as much as we think.  In any case, they would probably all fit into a small set of valid variations.

The focus of this discussion has been the Raw Data Vault, or Raw Vault, a term that has in fact been used with different meanings.  As Ronald points out, the Raw Vault from an “auto-generated” perspective is of limited value in the overall DV architecture because it sits (just as the sources do) on the left side of the semantic gap.  The Data Vault is based on Business Keys (Hubs), and these are defined by the business.  The keys we derive from the sources are not these same keys.

The Raw Vault from an “auditable, DV-based, Business Key aligned” perspective represents a layer that moves towards semantic integration, but only to the point that it remains traceable back to the sources without the need for soft business rules.  Because the DV-based data warehouse is charged with auditability (it is an integrated mirror of the sources), this layer needs to be persisted before soft business rules are applied.

So the layers that we work with, going back to Ronald’s blog, include (1) Staging, either persistent or not, (2) Data Vault, (3) Staging out/EDW+/CDW+, and (4) Data Marts.  Let’s look at each one a bit more closely:

(1) Staging.  Persisted or not.  This layer is a copy of the sources, used primarily to support the process of moving data from the various sources to the data warehouse.  It is 1:1 with the source systems, typically in the same format as the sources, has no rules applied, and is commonly not persisted.  Alias: System of Record, SoR, and Stage-in.

(2) Data Vault.  The core historized layer, aligned with business keys (to the extent possible).  All in-scope data is loaded and persisted, and auditability is maintained.  At the heart of all data warehousing is integration, and this layer integrates data from multiple sources around the enterprise-wide business keys.  Alias: EDW, CDW, Raw Vault, data warehouse.

(3) Staging out/EDW+/CDW+.  This layer represents the data following the application of the soft business rules that may be required a) for the alignment with the business keys, and b) for common transformations required by the enterprise.  This layer makes the final move to the enterprise-wide business view, including gold-record designations, business-driven sub-typing, classifications, categorizations and alignment with reference models.  Alias: EDW, CDW, Business Data Vault (BDV), Business Data Warehouse (BDW), and Mart Stage.

(4) Data Marts.  These represent the presentation layer and are intended to be requirements-driven, scope-specific subsets of the data warehouse data that can be regenerated (so typically not persisted).  With the DV data warehouse approach, these are very flexible and NOT restricted by federated mart paradigms (limited to dimensional models, maintaining conformed dimensions, persistence, etc.).  While this layer tends to be deployed mainly using dimensional modeling, marts can also be flat files, XML, and other structures.  Alias: DM, Marts.
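To make the four layers a bit more concrete, here is a minimal Python sketch of data passing through them.  Everything in it is an assumption made for illustration: the two sources ("crm" and "billing"), their column names, the sample values, and the "prefer the CRM name" rule are all hypothetical, and plain in-memory structures stand in for real tables.  It is not a reference implementation of Data Vault loading, just a way to see where the soft business rules enter.

```python
from datetime import datetime, timezone

# (1) Staging: 1:1 copies of source rows, no rules applied.
# Source names, columns, and values are hypothetical.
STAGING = {
    "crm": [{"cust_no": "C-100", "name": "Acme Corp"}],
    "billing": [{"customer_id": "C-100", "name": "ACME Corporation", "balance": 250.0}],
}

# Hypothetical mapping: which column in each source carries the business key.
KEY_COLUMN = {"crm": "cust_no", "billing": "customer_id"}


def load_data_vault(staging, load_ts):
    """(2) Data Vault: integrate around business keys, keep every source row."""
    hub = set()      # the distinct business keys
    history = []     # every loaded row, traceable to its source and load time
    for source, rows in staging.items():
        key_col = KEY_COLUMN[source]
        for row in rows:
            key = row[key_col]
            hub.add(key)
            history.append({
                "business_key": key,
                "record_source": source,
                "load_ts": load_ts,
                "attributes": {k: v for k, v in row.items() if k != key_col},
            })
    return hub, history


def apply_soft_rules(hub, history, preferred_source="crm"):
    """(3) Staging out/EDW+: a soft rule designates one gold name per key."""
    gold = {}
    for key in sorted(hub):
        candidates = [r for r in history if r["business_key"] == key]
        # Hypothetical rule: take the preferred source if present, else the first row.
        best = next(
            (r for r in candidates if r["record_source"] == preferred_source),
            candidates[0],
        )
        gold[key] = {"name": best["attributes"]["name"],
                     "chosen_from": best["record_source"]}
    return gold


def build_mart(gold):
    """(4) Data Mart: a regenerable, requirement-specific subset."""
    return [{"customer": key, "display_name": rec["name"]} for key, rec in gold.items()]


hub, history = load_data_vault(STAGING, datetime.now(timezone.utc))
gold = apply_soft_rules(hub, history)
mart = build_mart(gold)
print(mart)
```

Note that both source rows survive untouched in `history`, so the gold-record choice in step (3) can always be traced back to, or recomputed from, the persisted layer.  The mart is derived purely from the layers before it, which is why it can be regenerated rather than persisted.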

A quick look at the aliases shows that the terms are not universally applied: EDW and CDW, for example, appear under both the Data Vault and the Staging out layers.  However, the delineation of the core layers represents a common understanding.  Deployments that differ from this common view represent exceptions.  In most cases, these exceptions are valid and represent acceptable and encouraged data warehousing practices.  Of course, they could also represent an issue that will cause problems in the long run.  But by having a standard understanding of the core layers, we will continue to have a reference point for our analysis and discussions.
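The alias overlap can be checked mechanically.  The short Python sketch below simply records the alias lists given for the four layers and inverts them to find terms attached to more than one layer.  The variable names are my own; the layer names and aliases are taken from the list above.

```python
from collections import defaultdict

# Aliases as listed for each of the four layers above.
LAYER_ALIASES = {
    "Staging": ["System of Record", "SoR", "Stage-in"],
    "Data Vault": ["EDW", "CDW", "Raw Vault", "data warehouse"],
    "Staging out/EDW+/CDW+": [
        "EDW", "CDW", "Business Data Vault (BDV)",
        "Business Data Warehouse (BDW)", "Mart Stage",
    ],
    "Data Marts": ["DM", "Marts"],
}

# Invert the mapping: alias -> every layer it is used for.
layers_by_alias = defaultdict(list)
for layer, aliases in LAYER_ALIASES.items():
    for alias in aliases:
        layers_by_alias[alias].append(layer)

# An alias is ambiguous when it points at more than one layer.
ambiguous = {alias: layers for alias, layers in layers_by_alias.items() if len(layers) > 1}

for alias, layers in ambiguous.items():
    print(f"{alias}: {' / '.join(layers)}")
```

With the lists as given, only EDW and CDW come out as ambiguous, and both straddle the same boundary: the one between the persisted, auditable layer and the layer where soft business rules have been applied.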