Tag Archives: Architecture

Unit of Work Demystified

The Data Vault is about business key alignment, auditability, enterprise-wide data integration, effectively managing history, and unprecedented adaptability.  But in addition, the Data Vault is also about clarity, usability and simplicity.  The concept of Unit of Work (UOW) continues to be an elusive concept and most can agree that it is among the most difficult to understand.

Unit of Work is the label we attach to one of the factors we consider when we are designing our Links in the Data Vault. 

The analysis of the Unit of Work related to the design of our Links will typically impact the decisions we make concerning the number of FKs in a particular Link.  We may, for example, consider having two Links connecting three Hubs or we may have a three-way Link connecting all three. 

The Unit of Work is one factor used to help us represent in the Data Vault our understanding of the relationship between Hubs

So when you are working with Unit of Work analysis, you are really working with one of the factors for understanding the relationship between business keys (Links).  Just as the Hubs should be based on true Business Keys, all Links should focus on natural business relationships.   

In teaching Data Vault we have been using a set of factors to take into consideration (a correlated set of data, a way to keep grouped things together, establishing consistency between arriving data & data stored in and around Links, and grouped relationships consistent with enterprise wide business keys). 

When you are working on the analysis and design of Links, and contemplating the Unit of Work, you can consider the business defined correlations (correlated set of relationships).  For lack or a better way to say it, these are “relationships between relationships.”  

Relationship:  Customer is related to Motorcycle. 

Correlated set of Relationships.  Customer is related to Sale, Customer is related to Motorcycle, Customer is related to Employee, Sale is related to Motorcycle, and Sale is related to Employee.

We always have to deal with this situation when building Links – find the natural business relationships.  So here we proceed with a diligent design process and analyze the relationships.  Applying the factor of Unit of Work, we find that the Sale is the “correlating” factor for a set of relationships. 

Unit of Work Relationship Link.  A single Link combining Sale with Employee, Customer and Motorcycle. 

But what drives our design of Unit of Work?  In the case above it was the actual business process.  Where does the source data feed come in?   Assume for example that a commission processing system sends us a file with all Sales and the Employee on those Sales.  Another source (POS system) sends us a set of Sales with the Customers and Motorcycles.  Would our Unit of Work analysis of these source systems lead to a different design?   So how do we design our Links?

Let’s first step away from the trees and have a look at the forest… From the top of the page: “The Data Vault is about business key alignment, auditability, enterprise-wide data integration, effectively managing history, and unprecedented adaptability.  But in addition, the Data Vault is also about clarity, usability and simplicity.”    

So the guidelines for what should drive the Unit of Work can be found in the answers to these questions:

  • Are you building a single source data warehouse or an enterprise data warehouse?
  • Are you planning to maintain separate source silos or integrate around business key?
  • Are you designing source driven links or looking for natural business relationships?

Assuming you are working on an EDW (the right side of each of these questions above) then the only remaining issue is related to the question of how to maintain auditability and traceability when the sources do not align.  By asking this question, you have moved the conversation to another fun discussion – the Raw Data Vault.   

So going back to the question raised earlier, the EDW should be built on the enterprise wide business keys and natural business relationships – including the interpretation of Unit of Work at this level.  Frequently (most often) the source systems do not deliver data consistent with this view.  In these cases we need to take two steps into the data warehouse where the first landing point is the Raw and directly auditable/traceable layer.

Concerning the various layers that we describe as part of the data warehouse architecture.  Basically, this is MDM for the business terms used in the data warehousing industry. 

Ronald Damhof wrote some excellent posts on his blog and kicked off a much needed discussion on this topic.  The layers that we actually deploy in dv-based data warehouses today – given a common categorization – probably don’t vary as much as we think.  In any case, they would probably all fit into a small set of valid variations. 

The focus of this discussion has been the Raw Data Vault or Raw Vault.  And in fact the term has been used in connection with different meanings.  As Ronald points out, the Raw Vault from an “auto-generated” perspective is of limited value in the overall DV architecture as it is (just as the sources are) on the left side of the semantic gap.  The Data Vault is based on Business Keys (Hubs) and these are defined by the business.  The keys we derive from the sources are not these same keys. 

The Raw Vault from a perspective of an “auditable, DV-based, Business Key aligned” perspective represents a layer that moves towards the semantic integration but does so only to the point that it can remain traceable back to the sources without the need for soft business rules.  Because the DV based data warehouse is charged with auditability diligence (an integrated mirror of the sources), this layer needs to be persisted before soft business rules are applied.

So the layers that we work with, now going back to Ronald’s blog, include (1) Staging – either persistent or not, (2) Data Vault, (3) Staging out/EDW+/CDW+, (4) datamarts.  Let’s look at each one a bit more closely:

(1) Staging.  Persisted or Not.  This layer is a copy of sources primarily utilized for supporting the process of moving data from various sources to the data warehouse.  This layer is 1:1 with the source systems, typically in the same format as the sources, has no rules applied, and is commonly not persisted.  Alias: System of Record, SoR, and Stage-in. 

(2) Data Vault.  The core historized layer, aligned with business keys (to the extent possible), all in-scope data is loaded and persisted, auditability is maintained.  At the heart of all data warehousing is integration – and this layer integrated data from multiple sources around the enterprise-wide business keys.  Alias: EDW, CDW, Raw Vault, data warehouse.

(3) Staging out/EDW+/CDW+.  This layer represents the data following the application of the soft business rules that may be required for a) the alignment with the business keys, and b) for common transformations required by the enterprise.  This layer makes the final move to the enterprise-wide business view, including gold-record designations, business driven sub-typing, classifications, categorizations and alignment with reference models.  Alias: EDW, CDW, Business Data Vault (BDV), Business Data Warehouse (BDW), and Mart Stage.

(4) Data Marts.  These represent the presentation layer and are intended to be requirements-driven, scope specific, subsets of the data warehouse data that can be regenerated (so typically not persisted).  With the DV data warehouse approach, these are very flexible and NOT restricted by federated mart paradigms (limited to dimensional models, maintaining conformed dimensions, persistance, etc.).    While this layer tends to be mainly deployed using dimensional modeling, marts can also be flat files, xml, and other structures.  Alias: DM, Marts. 

A quick look at the Alias possibilities and we can see that the terms are not universally applied.  However, the delineation of the core layers represents a common understanding.  Deployments that differ from this common view represent exceptions.  In most cases, these exceptions are valid and represent acceptable and encouraged data warehousing practices.  Of course they could also represent an issue that will cause problems in the long run.  But by having a standard understanding of the core layers, we will continue to have a reference point for our analysis and discussions.