Ensemble Modeling Forms:  Modeling the Agile Data Warehouse

Anchor Modeling. Data Vault Modeling. Focal Point Modeling. To name a few. In fact, dozens of data warehouse data modeling patterns have been introduced over the past decade. Among the top ones there is a shared set of defining characteristics. These characteristics are combined in the definition of Ensemble modeling forms (AKA Data Warehouse Modeling). See coverage notes on the Next Generation DWH Modeling conference here (and summary here).

The differences between them define the flavors of Ensemble Modeling. These flavors have vastly more in common than they have differences. When compared to 3NF or Dimensional modeling, the defining characteristics of the Ensemble forms follow an 80/20 rule of commonality.

  • All these forms practice Unified Decomposition (breaking concepts into component parts) with a central unique instance as a centerstone (Anchor, Hub, Focal Point, etc.). 
  • Each separates context attributes into dedicated table forms that are directly attached to the centerstone. 
  • Each uncouples relationships from the concepts they seek to relate. 
  • Each purposefully manages historical time-slice data with varying degrees of sophistication concerning temporal variations. 
  • Each recognizes the differences between static and dynamic data.
  • Each recognizes the reality of working with constantly changing sources, transformations, and rules. 
  • Each recognizes the dichotomy of the enterprise-wide natural business key.

From that foundation of commonality, the various forms of Ensembles begin to take on their own flavors. 

  • While Data Vault is foundationally based on the natural business key as the foundation of the centerstone (Hub), both Anchor and Focal Point center on a unique instance of a concept where the business key is strongly coupled but separate from the centerstone (Anchor, Focal Point). 
  • Both Data Vault and Anchor aim to model Ensembles at the Core Business Concept level while Focal Point tends to deploy slightly more abstracted or generic forms of concepts. 
  • Data Vault and Focal Point utilize forms of attribute clusters (logical combinations) for context while Anchor relies on single attribute context tables. 

And there are other differentiating factors as well.

There is one thing that we can all agree on: modeling the agile data warehouse means applying some form of Ensemble modeling approach.  The specific flavor choice (Data Vault, Anchor, Focal Point, etc.) should be based on the specific characteristics of your data warehouse program. 

* Learn more about Anchor Modeling with Lars Rönnbäck here: Anchor Modeling


Forward: my book “Modeling the Agile Data Warehouse with Data Vault” is available on Amazon in the USA, UK, and EU. You can order your copy here. This is a data vault data modeling book that also covers related data warehousing topics, including some new concepts such as Ensemble Modeling. 

Ensemble modeling is based on the core idea of Unified Decomposition™ (please see my blog post from October 3rd 2012). Basically this idea recognizes that we want to break things out into component parts for reasons of flexibility, adaptability, agility, and generally to facilitate the capture of things that are either interpreted in different ways or changing independently of each other. But at the same time data warehousing is about data integration (a common standard view of unified concepts). So we also want to bring together the context around the core concept. Ensemble Modeling is a unifying theory – unifying from a data modeling landscape perspective, and also unifying as the primary theme of the modeling approach. With Ensemble Modeling the Core Business Concepts that we define and model are represented as a whole – an ensemble – including all of the component parts that have been broken out. An Ensemble is based on all things defining a Core Business Concept that can be uniquely and specifically said for one instance of that Concept. 

So an Ensemble is effectively designed in the same way we would design a core Entity in a 3NF model.  We would establish a key based on a natural business key for the Core Business Concept and then include the context attributes that depend on that key, as well as FK relationships to instances of other concept tables relating to that key. 

But unlike Entities and Dimensions, Ensembles are actually collections of integrated lower level constructs – a result of breaking out core concepts into logical component parts.

Ensemble Overview

In the figure above we can see an example of a Core Business Concept “Customer” modeled as an Entity (left circle) versus modeled as an Ensemble (right circle).  As with all Ensemble Modeling forms or variations (Data Vault Modeling, Anchor Modeling, 2G modeling, Focal Point, etc.) the set of component parts together form a concept. 

Consider the definition of ensemble:

All the parts of a thing taken together, so that each part is considered only in relation to the whole.

It is important to note that the Core Business Concept is fully and completely defined in either model.  At the same time, the parts in the Ensemble modeling variation exist only as parts that make up the whole.  So they are not themselves Core Business Concepts but rather lower level components.  In Ensemble Modeling the Ensemble is a representation of a Core Business Concept including all of its parts – the business key, with context and relationships.  This Set or Grouping of all related parts is treated as a singular concept.  Individual parts do not have their own identity and so can only be considered in relation to the whole.
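To make this concrete, below is a minimal sketch in Python (all class and field names are illustrative, not from the original post) of the same Customer concept modeled as a single entity versus decomposed into ensemble parts that all circle the business key:

```python
from dataclasses import dataclass
from datetime import date

# Entity style: one structure carries the key, the context, and the
# relationship together.
@dataclass
class CustomerEntity:
    customer_number: str      # natural business key
    name: str                 # context
    segment: str              # context
    account_manager_id: str   # FK-style relationship

# Ensemble style: the same concept decomposed into parts, each pointing
# back to the unique instance (the centerstone).
@dataclass
class CustomerHub:            # the unique instance of the concept
    customer_number: str      # natural business key

@dataclass
class CustomerProfileSat:     # context, attached directly to the centerstone
    customer_number: str
    name: str
    segment: str
    load_date: date           # history: when this time slice arrived

@dataclass
class CustomerManagerLink:    # relationship, uncoupled from the concept
    customer_number: str
    account_manager_id: str
    load_date: date

# Either form fully defines the concept; the ensemble parts have meaning
# only in relation to the whole.
hub = CustomerHub("C-1001")
sat = CustomerProfileSat("C-1001", "Acme Ltd", "Enterprise", date(2012, 10, 3))
link = CustomerManagerLink("C-1001", "AM-7", date(2012, 10, 3))
```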

Data Vault modeling embraces this idea.  Note the vaulting and unvaulting in the diagram introduced in my blog entry from September 27, 2012.  Here we can see that when we are vaulting we are practicing unified decomposition – since the core business concept is broken out into component parts but all parts remain circling the business key. 

But data vault is only one of the flavors of Ensemble Modeling.  There are several approaches within this paradigm.  They all share certain core characteristics and one of the fundamental ones is in the high level view of the table constellations that are created. 

Ensemble Flow

In this diagram we can see a simple architectural view of a data warehouse. Starting from the operational source systems, moving to the EDW layer, and lastly to the data marts, we can see how these concepts fit in. Why does this ensemble modeling pattern work? Because when changes need to be tracked, it is more efficient to split out and isolate the changing parts. For more on Ensemble Modeling and in particular the data vault modeling approach, please order the book Modeling the Agile Data Warehouse with Data Vault. 

© Copyright Hans Hultgren, 2012. All rights reserved. Ensemble Modeling™

Unified Decomposition sounds a bit like an oxymoron. And sure enough, combining the idea of unifying with the idea of breaking things into parts does seem innately contradictory. But upon closer inspection this idea makes a good deal of sense – especially for the field of data warehousing. 

With an enterprise data warehouse (EDW), we want to break things out into component parts for reasons of flexibility, adaptability, agility, and generally to facilitate the capture of things that are either interpreted in different ways or changing independently of each other.  At the same time a core premise of data warehousing is integration and moving to a common standard view of unified concepts.  So we want to tie things together at the same time as we are breaking them out into parts.

Unifying

If you have ever worked with object oriented design, you are probably accustomed to the idea of encapsulation. The idea of encapsulation is to bring together methods and data into the same object so that everything that deals with that object is contained within it. One of the advantages of this kind of design is the ability to take an object class from one area and place it in another area knowing that everything it needs to exist (keys and descriptive context) and to perform (behaviors) moves along with it. The object is self-contained. 

Another way to look at this is to think about a “self-contained underwater breathing apparatus” or “SCUBA” for short. The idea is that everything you need to breathe underwater is contained in the same thing (apparatus). You don’t need hoses feeding you air from a boat above, because the air, the tank, the hoses, the mask, the regulator, etc. are all contained in the same apparatus. 

These concepts both deal with bringing component parts together to form a whole. This is the idea of unifying: that we encircle everything we need to define a concept and keep all of the component parts together in this circle somehow.

Breaking into Parts

The other part of unified decomposition is the idea of breaking things into component parts.  The decomposition is in some ways the opposite of unifying.  If we strive to keep things together, why then would we want to break them apart?  One major reason in data warehousing is that things change.  In fact things change all of the time.  If there is one constant it is that things change.  But not everything about a concept changes at the same time. 

Unified Decomposition

If the concept parts are all kept together (in the same table for example) then any change to any one component part would have an impact on the whole. If we want to limit the impact of the changes we need to isolate the part that is changing. In data modeling (especially for data warehousing) this theory is being deployed in many different forms. If we are designing a database that needs to integrate data and also needs to maintain history then the benefits of decomposing the core concepts are very compelling. This happens in Dimensional modeling with mini-dimensions and factless facts, it happens in Data Vault with hubs, links and satellites, and it also happens with other approaches such as Anchor Modeling, 2G and Focal Point. The common theme is data warehousing and the common thread is decomposition.

Putting it all Together

If all we did was to break things apart then we would be missing half the story.  Much like the modem translates a digital signal into an analog signal (modulation) it is not of much use without taking that analog signal and translating it back to digital at the other end (demodulation). 

Taking a core concept which is represented as an entity (physically a table) and breaking it into component parts (lower level tables, held together by a common key) is the “mo” (modulator) part of the modem.  While this is great for data warehousing agility, it does not do much for the business users.  People who are considering some form of table decomposition variation (Data Vault, Anchor, 2G, Focal Point) are often stumped when they think about how to get their business intelligence team to access this data for their reporting and analytics.  Really the answer is simple.  Don’t access it.  We need the “dem” (demodulator) part of the modem – we need to first translate it back to digital – or in this case move it back to a combined form (entity).  And this combined form is typically a Dimension in a Data Mart. 

Think of it this way.  In your data warehouse architecture, you take a concept which is represented as a combined form (entity) and you break it into parts (hub, link, satellite).  You then put it back together into a combined form (dimension) and deliver it to your downstream users. 

Unified Decomposition Data Vault

Because we need to put it back together before we deliver to the data marts, the factor of unifying is a critical feature of unified decomposition. That is to say that the modeling pattern we deploy must somehow address the unifying at the same time as it is breaking things into parts. With data vault modeling the decomposition is the breaking out into Hubs, Links and Satellites, and the unifying is accomplished through the direct connection between the Hub and the surrounding Satellites and Links. 
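As a rough illustration of the “dem” step, here is a hypothetical Python sketch (table and column names invented for the example) that recombines a hub and its satellites back into flat, dimension-style rows for delivery to a data mart:

```python
def rebuild_dimension(hub_keys, satellite_tables):
    """Recombine a hub and its satellites into flat dimension rows.

    hub_keys: iterable of business keys (the hub).
    satellite_tables: list of dicts mapping business key -> attribute dict.
    """
    dimension = []
    for business_key in hub_keys:
        row = {"customer_number": business_key}
        # Each satellite contributes its slice of context back to the whole.
        for sat in satellite_tables:
            row.update(sat.get(business_key, {}))
        dimension.append(row)
    return dimension

hub = ["C-1001", "C-1002"]
name_sat = {"C-1001": {"name": "Acme Ltd"}, "C-1002": {"name": "Globex"}}
profile_sat = {"C-1001": {"segment": "Enterprise"}}

for row in rebuild_dimension(hub, [name_sat, profile_sat]):
    print(row)
```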

 © Copyright Hans Hultgren, 2012. All rights reserved. Unified Decomposition™

Satellites in the data vault modeling approach contain all context (context attributes, descriptive properties, defining characteristics) concerning core business concepts (Hubs) and their relationships.  With EDW deployments they also contain all of the history – all of the time slices that capture changes over time.

Because Satellites can split out the things that can change into separate structures, the impact of changes over time is lessened (isolated to smaller and fewer structures). This concept is high on the list of features that allow data vault EDW deployments to more easily adapt to changes (AKA Agility). 
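A minimal sketch (structures and names assumed for illustration) of why this isolation keeps change tracking cheap: a satellite load appends a new time slice only when the context actually changed, and only the affected satellite is ever touched:

```python
from datetime import datetime

customer_profile_sat = []  # rows: (business_key, load_ts, attributes)

def load_satellite(sat, business_key, attributes, load_ts):
    """Append a new time slice only if the context differs from the
    most recent slice for this key (insert-only history)."""
    slices = [r for r in sat if r[0] == business_key]
    latest = max(slices, key=lambda r: r[1], default=None)
    if latest is None or latest[2] != attributes:
        sat.append((business_key, load_ts, attributes))

load_satellite(customer_profile_sat, "C-1001",
               {"segment": "SMB"}, datetime(2012, 1, 1))
load_satellite(customer_profile_sat, "C-1001",
               {"segment": "SMB"}, datetime(2012, 6, 1))         # unchanged: skipped
load_satellite(customer_profile_sat, "C-1001",
               {"segment": "Enterprise"}, datetime(2012, 9, 1))  # new time slice

print(customer_profile_sat)  # two time slices, not three
```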

But as modelers we still need to make design decisions concerning how many Satellites to create and how to split out and distribute attributes among them.  We know from course materials, websites, blogs and other publications that we can split out Satellites by a) rate of change, b) type of data, and c) source system.  But these are just ideas concerning factors we could use in our design.  There are no set rules that guide our decision process outside of these general guidelines.  But that stands to reason.  After all we are data modelers, architects and designers.  We should be modeling, architecting and designing things. 

In this article we want to explore an additional factor to consider while designing Satellites: the average number of Satellites per Hub.  To better analyze this factor we can look to the extremes (bookends) of possibilities:

     One Satellite per Hub      versus      One Attribute per Satellite

In the first extreme we create just one Satellite for every Hub. 

When we do this we have effectively re-created a conformed dimension.  With some differences including these two primary ones:

a) All Satellites are pre-built to track history (innately accommodating Type-2 deployments)

b) The Business Key is separated from the Context

If we compare the agility of this structure with that of a conformed dimension we can see that there is little difference between the two.  If there are changing relationships between Hub keys then we are still more agile with data vault modeling but if the changes are primarily to context attributes then the single-Satellite pattern is not significantly more agile than either dimensional or 3NF modeling.  Why?  Because changes to attributes will always require changes to the Satellite.  These changes require modeling, database changes, ETL changes and related testing, documentation and other development lifecycle components. 

On the other end of the spectrum we could also contemplate one attribute per satellite.  This means that each and every attribute would have its own data structure, its own Satellite. 

Actually we could argue that this is really the proper interpretation of 3NF in an EDW, because having context dependent on its key in this case means having context related to an n+1 part key (where n is the number of parts in the key before history, and the 1 we add is the date/time stamp). In terms of context attributes with slowly changing dimensional data, this means each attribute could change independently of the status of any other attribute. * Note: there is of course the concept of an attribute cluster, where a group of attributes does change together if and when it changes.

Now from an agility perspective this is a far more adaptable model. Additional attributes will never cause changes to any existing Satellites. Note that not all work is avoided even in this case, because we still need to add tables when we get new attributes. 
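The two bookends can be sketched side by side. In this hypothetical Python fragment (names invented), note the n+1 part key: every satellite row is keyed by the hub key plus a load date/time stamp:

```python
from datetime import datetime

# Bookend 1: one Satellite per Hub. A new attribute means altering this
# structure (along with its loads, tests, and documentation).
customer_sat = {
    ("C-1001", datetime(2012, 1, 1)): {"name": "Acme Ltd",
                                       "segment": "SMB",
                                       "region": "EMEA"},
}

# Bookend 2: one attribute per Satellite. Each attribute changes
# independently of the others; each table key is (hub key, load ts).
customer_name_sat = {("C-1001", datetime(2012, 1, 1)): "Acme Ltd"}
customer_segment_sat = {("C-1001", datetime(2012, 1, 1)): "SMB"}
customer_region_sat = {("C-1001", datetime(2012, 1, 1)): "EMEA"}

# Adding a "status" attribute under bookend 2 touches nothing existing:
# it is a new table, never an ALTER.
customer_status_sat = {("C-1001", datetime(2012, 9, 1)): "Active"}
```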

This concept is inherent to the patterns of Anchor Modeling.  I would encourage you to visit the Anchor Modeling site to learn more (there is also an interactive modeling tool available that you can use to better understand the dynamics of working with Anchor Modeling).

Data Vault Modeling

So what about Data Vault modeling?  What about Satellite design and the number of Satellites?  Well with data vault modeling we have observed that somewhere between 3 and 5 Satellites per Hub appears to be common.

From an agility perspective, this approach takes us several steps in the right direction.  Things that change frequently are grouped into a common structure.  This limits the impact of changes during the lifecycle of the data warehouse.  The 3 to 5 Satellites above are commonly distinguished from each other based on differing rates of change, different types of data and also different record sources.  Let’s take a brief look at each one of these considerations:

Rates of change.  There will inevitably be some attributes that change over time more frequently than others. When we apply this criterion to modeling Satellites we need to consider the 80/20 or even the 90/10 rule. That is to say we generally group rapidly changing attributes together and then leave the rest. We may even have three categories of rates of change – Rapid, Sometimes, Never – for example. But if we analyze too much, and try to use too much precision in our decision, then we may find ourselves gravitating toward the One Attribute per Satellite end of this continuum.

Types of Data.  There are natural delineations or groupings concerning the types of context data that we have available to describe Hubs.  So groups of attributes may fall into types such as profile attributes, descriptive attributes, tracking attributes, status attributes, base attributes, geographic attributes, physical attributes, classification attributes, etc.  There are no hard and fast rules on these groupings.  Each should become apparent when you as the designer are working with the business to define these Hub keys.

Record Sources.  Even though we are designing the EDW data vault based on the central view (specifically not based on source systems or departmental views) we do have groupings of attributes that come from certain areas within the organization.  Often this is in line with the design we were contemplating based on the Types of Data factor.  This is because certain groupings of Types of Data tend to also be related to specific business processes and so also certain source systems.  As a first step then we should look to see how in line these two factors are for our given EDW scenario. 

People often ask if we should make it a rule to just split by record source to avoid this issue.  But this is not a viable rule.  With some Hubs being loaded by 20, 50, even 120 different source systems, the sheer number of Satellites would be staggering.  So no, not a good candidate for a general rule.  In most cases we have a goal of integrating data into a common central structure.  Keeping a structure for each separate source does not help us reach this goal.
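One way to reason about these three factors together is to write the design decision down as data. The sketch below (attribute groupings and source names invented for illustration) records a hub with three Satellites, split by rate of change, type of data, and record source:

```python
# Hypothetical Satellite design for a Customer Hub, captured as data.
customer_satellite_design = {
    "sat_customer_base": {
        "rate_of_change": "never",      # static context
        "type_of_data": "base",
        "record_source": "MDM",
        "attributes": ["name", "date_of_birth"],
    },
    "sat_customer_profile": {
        "rate_of_change": "sometimes",
        "type_of_data": "profile",
        "record_source": "CRM",
        "attributes": ["segment", "preferred_channel"],
    },
    "sat_customer_status": {
        "rate_of_change": "rapid",      # dynamic context
        "type_of_data": "status",
        "record_source": "Billing",
        "attributes": ["account_status", "credit_score"],
    },
}

# Three Satellites for this Hub: within the common 3-to-5 range.
print(len(customer_satellite_design))
```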

Bottom Line

So we have three (3) main factors to consider (differing rates of change, different types of data and also different record sources) each of which is a general consideration and not a specific rule. 

Perhaps the most compelling thing about Satellite design is the degrees of freedom that you have.  You can in fact model across the entire range from one Satellite per Hub to One Attribute per Satellite.  With either extreme you are still vaulting. 

The most important thing about Satellite design is to entrust the design process to a qualified data vault data modeler, one who understands the implications of the design decisions and the modeling patterns applied.