
Ensemble Modeling Forms:  Modeling the Agile Data Warehouse

Anchor Modeling.  Data Vault Modeling.  Focal Point Modeling.  To name a few.  In fact, dozens of data warehouse data modeling patterns have been introduced over the past decade.  Among the top ones there is a set of defining characteristics.  These characteristics are combined in the definition of Ensemble modeling forms (AKA Data Warehouse Modeling).  See coverage notes on the Next Generation DWH Modeling conference here (and summary here).

The differences between them define the flavors of Ensemble Modeling.  These flavors have vastly more in common than they have differences.  When compared to 3NF or Dimensional modeling, the defining characteristics of the Ensemble forms follow an 80/20 rule of commonality. 

  • All these forms practice Unified Decomposition (breaking concepts into component parts) with a central unique instance as a centerstone (Anchor, Hub, Focal Point, etc.). 
  • Each separates context attributes into dedicated table forms that are directly attached to the centerstone. 
  • Each uncouples relationships from the concepts they seek to relate. 
  • Each purposefully manages historical time-slice data with varying degrees of sophistication concerning temporal variations. 
  • Each recognizes the differences between static and dynamic data.
  • Each recognizes the reality of working with constantly changing sources, transformations and rules. 
  • Each recognizes the dichotomy of the enterprise-wide natural business key.

From that foundation of commonality, the various forms of Ensembles begin to take on their own flavors. 

  • While Data Vault is foundationally based on the natural business key as the foundation of the centerstone (Hub), both Anchor and Focal Point center on a unique instance of a concept where the business key is strongly coupled but separate from the centerstone (Anchor, Focal Point). 
  • Both Data Vault and Anchor aim to model Ensembles at the Core Business Concept level while Focal Point tends to deploy slightly more abstracted or generic forms of concepts. 
  • Data Vault and Focal Point utilize forms of attribute clusters (logical combinations) for context while Anchor relies on single attribute context tables. 

And there are other differentiating factors as well.

There is one thing that we can all agree on: modeling the agile data warehouse means applying some form of Ensemble modeling approach.  The specific flavor choice (Data Vault, Anchor, Focal Point, etc.) should be based on the specific characteristics of your data warehouse program. 

* Learn more on Anchor Modeling with Lars Rönnbäck here: Anchor Modeling

Head and Version Modeling at Trivadis.

Data modeling for the Data Warehouse is different than data modeling for operational systems or for data marts.  See my prior blogs on Unified Decomposition and Ensemble Modeling for some discussion on common themes and several contemporary modeling patterns.

Several weeks back I held a session at the Trivadis TechEvent in Switzerland.  This was a fantastic event with several concurrent sessions and a very engaged set of professional participants.  Following the event I had the opportunity to chat with a small group from the Trivadis team.  We spoke about agility, modeling, data vault, ensemble and also this Head & Version modeling technique.  See the Trivadis blog entry by Dani Schnider here.  In that entry Dani describes Head and Version modeling in more detail, as well as a comparison to data vault modeling.

Here I am posting some ERwin models that relate to this discussion and comparison.  First here is a view (disclaimer: I am entirely new to Head and Version so please take this as a draft by a beginner):

Head & Version


What can be seen above is that the Head tables are much like Hubs except that they include the context attributes determined to be static (those that maintain a 1:1 relationship with the key).  The Version tables are much like Satellites except that they have their own surrogate keys, they contain end-date effectivity, and they can contain foreign keys if they have a M:1 relationship with another concept (only one value at a given point in time).  If a Head table has a static M:1 relationship with another concept, then there can be a foreign key in the Head table as well (see how Prod_Sub_Cat_Head has a FK to Product_Cat_Head).
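To make the split concrete, here is a minimal sketch of the Head & Version shapes described above (all column names here are hypothetical illustrations, not taken from the actual Trivadis models):

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

# A Head row carries the key plus the attributes judged static (1:1 with the
# key), including FKs for static M:1 relationships; a Version row carries its
# own surrogate key, end-date effectivity, and the dynamic attributes.

@dataclass
class ProductHead:
    product_head_id: int          # surrogate key
    product_no: str               # natural business key
    launch_date: date             # static attribute (1:1 with the key)
    prod_sub_cat_head_id: int     # static M:1 relationship -> FK in the Head

@dataclass
class ProductVersion:
    product_version_id: int       # the Version's own surrogate key
    product_head_id: int          # FK back to the Head
    valid_from: date
    valid_to: Optional[date]      # end-date effectivity; None = current slice
    price_group: str              # dynamic attribute, one value at a time
```

The point of the sketch is only to show where each kind of attribute lands, per the description above.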

Here is a similar scope in the data vault modeling pattern:



 Here we can note that the static attributes are in their own Satellite (S_Product_Main) and the dynamic ones are in another Satellite (S_Product_Expanded).  The only thing we find in the Hub is the business key and the surrogate.  Satellites are keyed by the same surrogate as the Hub plus Date/Time Stamp.
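The same scope in data vault form can be sketched the same way (again with illustrative column names): the Hub holds only the business key and surrogate, and each Satellite is keyed by the Hub surrogate plus a load date/time stamp.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass(frozen=True)
class HProduct:
    product_sqn: int              # Hub surrogate
    product_no: str               # the business key -- nothing else lives here

@dataclass(frozen=True)
class SProductMain:               # static attributes in their own Satellite
    product_sqn: int              # same surrogate as the Hub
    load_dts: datetime            # part of the Satellite key
    launch_date: str

@dataclass(frozen=True)
class SProductExpanded:           # dynamic attributes, time-sliced separately
    product_sqn: int
    load_dts: datetime
    price_group: str
```

Contrasting this with the Head & Version sketch, the key difference is that here nothing but the key sits with the Hub, and history lives entirely in the Satellites.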

Please see the Trivadis blog for more discussion on this topic.

Forward: my book “Modeling the Agile Data Warehouse with Data Vault” available on Amazon in USA, UK, and EU.  You can order your copy here.  This is a data vault data modeling book that also includes related data warehousing topics including some new concepts such as Ensemble Modeling. 

Ensemble modeling is based on the core idea of Unified Decomposition™ (please see my blog post from October 3rd 2012).  Basically this idea recognizes that we want to break things out into component parts for reasons of flexibility, adaptability, agility, and generally to facilitate the capture of things that are either interpreted in different ways or changing independently of each other.  But at the same time data warehousing is about data integration (a common standard view of unified concepts).  So we also want to bring together the context around the core concept.  Ensemble Modeling is a unifying theory.  Unifying from a data modeling landscape perspective – and also unifying as the primary theme of the modeling approach.  With Ensemble Modeling the Core Business Concepts that we define and model are represented as a whole – an ensemble – including all of the component parts that have been broken out.  An Ensemble is based on all things defining a Core Business Concept that can be uniquely and specifically said for one instance of that Concept. 

So an Ensemble is effectively designed in the same way we would design a core Entity in a 3NF model.  We would establish a key based on a natural business key for the Core Business Concept and then include the context attributes that depend on that key, as well as FK relationships to instances of other concept tables relating to that key. 

But unlike Entities and Dimensions, Ensembles are actually collections of integrated lower level constructs – a result of breaking out core concepts into logical component parts.

Ensemble Overview

In the figure above we can see an example of a Core Business Concept “Customer” modeled as an Entity (left circle) versus modeled as an Ensemble (right circle).  As with all Ensemble Modeling forms or variations (Data Vault Modeling, Anchor Modeling, 2G modeling, Focal Point, etc.) the set of component parts together form a concept. 

Consider the definition of ensemble:

All the parts of a thing taken together, so that each part is considered only in relation to the whole.

It is important to note that the Core Business Concept is fully and completely defined in either model.  At the same time, the parts in the Ensemble modeling variation exist only as parts that make up the whole.  So they are not themselves Core Business Concepts but rather lower level components.  In Ensemble Modeling the Ensemble is a representation of a Core Business Concept including all of its parts – the business key, with context and relationships.  This Set or Grouping of all related parts is treated as a singular concept.  Individual parts do not have their own identity and so can only be considered in relation to the whole.

Data Vault modeling embraces this idea.  Note the vaulting and unvaulting in the diagram introduced in my blog entry from September 27, 2012.  Here we can see that when we are vaulting we are practicing unified decomposition – since the core business concept is broken out into component parts but all parts remain circling the business key. 

But data vault is only one of the flavors of Ensemble Modeling.  There are several approaches within this paradigm.  They all share certain core characteristics and one of the fundamental ones is in the high level view of the table constellations that are created. 

Ensemble Flow


In this diagram we can see a simple architectural view of a data warehouse.  Starting from operational source systems and moving to the EDW layer and lastly to the data marts we can see how these concepts fit in.  Why does this ensemble modeling pattern work?  Because when changes need to be tracked, it is more efficient to split out and isolate the changing parts.  For more on Ensemble Modeling and in particular the data vault modeling approach, please order the book Modeling the Agile Data Warehouse with Data Vault. 

© Copyright Hans Hultgren, 2012. All rights reserved. Ensemble Modeling™

Always noted as an exception structure, a transactional link is actually not a link. A table construct that contains a business key, context attributes that describe that key, and relationships in the form of foreign keys is simply not in vault form. This type of table construct would be a closer candidate for a third normal form (3NF) designation. Though that too might be a stretch.

A Transactional Link contains the relationship forming foreign keys (FKs) similar to that of a Link but also includes some set of context attributes (typically metrics, measures) and possibly also business keys from related Hubs as well as some form of transaction key (unique identifier for a specific transaction). It is built out of necessity when the volumes of data and/or burst rate is so high that the standard vaulted form constructs cannot manage (Telco CDRs or Financial Services stock trades for example).
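As a rough illustration of the shape being described (column names are hypothetical, using the stock-trade example), a transactional link carries relationship-forming FKs like a Link but also context and a transaction key in the same construct:

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass(frozen=True)
class TLinkStockTrade:
    trade_id: str                 # unique transaction key
    h_customer_sqn: int           # relationship-forming FKs, as in a Link
    h_security_sqn: int
    trade_dts: datetime
    quantity: int                 # context attributes (metrics, measures)
    price: float                  # carried in the same construct
```

It is the presence of those context columns alongside the FKs that makes this construct something other than a pure Link.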

While there are good reasons to deploy this form of construct, I think it is also important to recognize the impact of doing so. Why? Because otherwise we may incorporate it into our standard modeling pattern and make the exception a rule. And it would be odd and potentially risky to use an exception structure by default. But this is easy to fall into. Data vault means we have to think differently. So we can say that vaulting is unconventional and in the same way unvaulting is conventional (it is how we are accustomed to doing things).

How much does this matter? That depends on your specific situation. But it could have a big impact on your EDW. If you are vaulting for agility, for a lower TCO, for easy incremental builds, for ease in absorbing changes (new attributes, sources, rules, etc.) then what is the cost of unvaulting?

Unvaulting is the opposite of vaulting. Let me explain. With data vault we break out the component parts of a core business concept into their own constructs. Whereas a 3NF Entity keeps together the business key, the context attributes, and the relationships (M:1 relationships) in a single table construct, the data vault modeling representation of the same concept includes separate table constructs for each component part.


Vaulting and Unvaulting

There would be a Hub (business key), Satellites (context attribute) and Links (relationships) all working together to represent the same concept. Breaking the concept into component parts is the process of vaulting. See the above diagram. Now if we take the component parts and put them back together into a single table construct, then we are in effect “unvaulting” that concept. Hence a transactional link is more of an unvaulted construct.
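As a toy illustration of this idea (field names are made up for the example), vaulting splits a flat 3NF-style record into its component parts, and unvaulting merges them back into one construct:

```python
from datetime import datetime

def vault(record, business_key_field, relationship_fields):
    """Split a flat record into hub / satellite / link parts."""
    load_dts = datetime(2012, 9, 27, 12, 0, 0)  # fixed stamp for the example
    hub = {business_key_field: record[business_key_field]}
    link = {f: record[f] for f in relationship_fields}
    sat = {k: v for k, v in record.items()
           if k != business_key_field and k not in relationship_fields}
    sat["load_dts"] = load_dts
    return hub, sat, link

def unvault(hub, sat, link):
    """Put the parts back together into a single flat construct."""
    return {**hub,
            **{k: v for k, v in sat.items() if k != "load_dts"},
            **link}
```

A transactional link is, in effect, the output of `unvault` used as a design pattern: everything back together in one table.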

To answer whether this is good or bad is not possible without looking at each unique situation. It could be great and it could be not so great. The only thing we can do here is to help understand why this is an exception and also to make ourselves aware of the impact of what we are doing. Now to understand the impact of unvaulting, one of the first questions might be to ask why we were vaulting in the first place.

We vault to absorb changes without impacting the architecture. So, will the attributes defining a transaction be subject to change in the future?

With vaulting we track slowly changing dimensional attributes easily. So, will the data feeding your transactional link be subject to change? Can any of the context of the transaction be updated by subsequent feeds?

Relationships are split out so can be added or changed easily. So, will the foreign keys making up your transactional link be subject to change in the future?

Well if none of these are an issue, then it could very well be that a transactional link exception structure is the best choice for this case. But there is one more question. Is the volume or burst rate a problem that cannot be solved in any other way? Because if it can, then maybe it makes sense to stay closer to your design pattern.

In any event, there is no form of judgment here on the use of transactional links. Merely information to help us categorize them and to help us make informed decisions concerning our modeling patterns.

Satellites in the data vault modeling approach contain all context (context attributes, descriptive properties, defining characteristics) concerning core business concepts (Hubs) and their relationships.  With EDW deployments they also contain all of the history – all of the time slices that capture changes over time.

Because Satellites can split out the things that can change into separate structures, the impact of changes over time is lessened (isolated to smaller and fewer structures). This concept is high on the list of features that allow data vault EDW deployments to more easily adapt to changes (AKA Agility). 

But as modelers we still need to make design decisions concerning how many Satellites to create and how to split out and distribute attributes among them.  We know from course materials, websites, blogs and other publications that we can split out Satellites by a) rate of change, b) type of data, and c) source system.  But these are just ideas concerning factors we could use in our design.  There are no set rules that guide our decision process outside of these general guidelines.  But that stands to reason.  After all we are data modelers, architects and designers.  We should be modeling, architecting and designing things. 

In this article we want to explore an additional factor to consider while designing Satellites: the average number of Satellites per Hub.  To better analyze this factor we can look to the extremes (bookends) of possibilities:

     One Satellite per Hub      versus      One Attribute per Satellite

In the first extreme we create just one Satellite for every Hub. 

When we do this we have effectively re-created a conformed dimension.  With some differences including these two primary ones:

a) All Satellites are pre-built to track history (innately accommodating Type-2 deployments)

b) The Business Key is separated from the Context

If we compare the agility of this structure with that of a conformed dimension we can see that there is little difference between the two.  If there are changing relationships between Hub keys then we are still more agile with data vault modeling but if the changes are primarily to context attributes then the single-Satellite pattern is not significantly more agile than either dimensional or 3NF modeling.  Why?  Because changes to attributes will always require changes to the Satellite.  These changes require modeling, database changes, ETL changes and related testing, documentation and other development lifecycle components. 

On the other end of the spectrum we could also contemplate one attribute per satellite.  This means that each and every attribute would have its own data structure, its own Satellite. 

Actually we could argue that this is really the proper interpretation of 3NF in an EDW.  Because in fact, having context dependent on its key in this case means having context related to an (n+1)-part key (where n is the number of parts of the key before history, and the 1 we add is the date/time stamp).  In terms of context attributes with slowly changing dimensional data, this means each attribute can change independently of the status of any other attribute.  * Note: there is of course the concept of an attribute cluster, where a group of attributes do change together if and when they change.

Now from an agility perspective this is a far more adaptable model.  Additional attributes will never cause changes to any existing Satellites.  Note that all work is not avoided even in this case because we still do need to add tables when we get new attributes. 

This concept is inherent to the patterns of Anchor Modeling.  I would encourage you to visit the Anchor Modeling site to learn more (there is also an interactive modeling tool available that you can use to better understand the dynamics of working with Anchor Modeling).

Data Vault Modeling

So what about Data Vault modeling?  What about Satellite design and the number of Satellites?  Well with data vault modeling we have observed that somewhere between 3 and 5 Satellites per Hub appears to be common.

From an agility perspective, this approach takes us several steps in the right direction.  Things that change frequently are grouped into a common structure.  This limits the impact of changes during the lifecycle of the data warehouse.  The 3 to 5 Satellites above are commonly distinguished from each other based on differing rates of change, different types of data and also different record sources.  Let’s take a brief look at each one of these considerations:

Rates of change.  There will inevitably be some attributes that change over time more frequently than other attributes.  When we apply this criterion to modeling Satellites we need to consider the 80/20 or even the 90/10 rule.  That is to say we generally group rapidly changing attributes together and then leave the rest.  We may even have three categories of rates of change – Rapid, Sometimes, Never – for example.  But if we analyze too much, try to use too much precision in our decision, then we may find ourselves gravitating towards the One Attribute per Satellite end of the continuum.

Types of Data.  There are natural delineations or groupings concerning the types of context data that we have available to describe Hubs.  So groups of attributes may fall into types such as profile attributes, descriptive attributes, tracking attributes, status attributes, base attributes, geographic attributes, physical attributes, classification attributes, etc.  There are no hard and fast rules on these groupings.  Each should become apparent when you as the designer are working with the business to define these Hub keys.

Record Sources.  Even though we are designing the EDW data vault based on the central view (specifically not based on source systems or departmental views) we do have groupings of attributes that come from certain areas within the organization.  Often this is in line with the design we were contemplating based on the Types of Data factor.  This is because certain groupings of Types of Data tend to also be related to specific business processes and so also certain source systems.  As a first step then we should look to see how in line these two factors are for our given EDW scenario. 

People often ask if we should make it a rule to just split by record source to avoid this issue.  But this is not a viable rule.  With some Hubs being loaded by 20, 50, even 120 different source systems, the sheer number of Satellites would be staggering.  So no, not a good candidate for a general rule.  In most cases we have a goal of integrating data into a common central structure.  Keeping a structure for each separate source does not help us reach this goal.

Bottom Line

So we have three (3) main factors to consider (differing rates of change, different types of data and also different record sources) each of which is a general consideration and not a specific rule. 

Perhaps the most compelling thing about Satellite design is the degrees of freedom that you have.  You can in fact model across the entire range from one Satellite per Hub to One Attribute per Satellite.  With either extreme you are still vaulting. 

The most important thing about Satellite design is to entrust the design process to a qualified data vault data modeler.  One who understands the implications of the design decisions and the modeling patterns applied.

What will data vault be in 2020?  Since we are dealing with the future I am going to avoid exact predictions and instead forward three (3) lines of thought for us to contemplate.

Thought 1) Adoption curves in the DWBI landscape.

So if we look at how all of DWBI went from theory to early adopters and to common practice, we can see a general pattern with about a 10-15 year adoption curve.  Some international regions tend to pick things up early – as do certain industries.  This was true also for Dimensional Modeling within the DW space.  For data vault modeling, the leading region globally in terms of adoption is the Netherlands.  Today the hot zones around the world point to early adoption being complete within a couple years and the next 3-5 major regions likely to be in full adoption within that same timeframe (Nordics, Western Europe, Down Under, and perhaps stronger regional centers in the US).

No surprise to local readers of this blog, but this move is actually well underway today.  Putting data vault into major regions with common adoption at this 2016-2017 time frame means that data vault modeling will likely be the commonly accepted leading modeling approach for data warehousing in 2020 throughout the world.  While people will of course utilize other forms of modeling, data vault modeling would be seen as the standard approach (just as 3NF is today for operational systems and Dimensional modeling is today for data marts).

Thought 2) Data Vault modeling will be different in 2020.

A couple of students from a recent CDVDM class in Eindhoven were familiar with the progress and history of data vault in the Netherlands.  As we worked on the interactive modeling exercises, some of this historical background helped to identify things that have changed over the years.  While some things were simply different now, there were also changes in the general emphasis of particular features.  We always knew, though it was perhaps not as strongly presented, that you can't build an EDW central data model by focusing your analysis on a couple of source systems.  Also, the effort involved in the definition of the central integration point with common enterprise semantics is nothing short of, well, huge.  There are also many factors that have to do with the changing dynamics in the industry.  The move to operational data warehousing, unstructured data integration, huge data volumes, and the emphasis on agility have caused the entire industry to re-think what it is doing (and to react).

The point is that over the next 7-8 years we can expect that the changing industry dynamics, along with new business requirements, and lessons learned from the field will cause us to re-think components of data vault modeling.  Could there be a new standard construct?  Will agility pressures drive modeling pattern changes for even greater adaptability?  Will operationally governed processes coexist on our same platform causing a ripple effect to the role of the business data vault?  Since we can all – even today – relate to some or all of these examples, I think we can agree that it is sound reasoning to assume something will change by 2020.

Thought 3) None of this will matter since we won’t be modeling in 2020.

Krish Krishnan (among others) spent the mid-2000s working with what Bill Inmon had already started doing: unstructured data ETL.  Back then it was something you still had to convince people to try and understand.  Today you can’t swing a dead cat without hitting somebody talking about the broader topic of Big Data.  To avoid the potential semantic gap with our industry’s latest top abused term… Big Data means any [forms or volumes or burst rates or rates of change] of data that we can’t readily address using current approaches, tools and techniques.  From a current-day and practical perspective that means unstructured text, semistructured, NVP/KVP, doc-style data, XML, and so on (with the future looking to images, sound, video, etc. for another day).  To address these things today we mainly call on NoSQL, Hadoop, and doc-style (MongoDB) solutions.  As Krish Krishnan puts it, we are here working with schema-on-read versus schema-on-write.

Schema-less. Model-less. AKA, no data modeling.  But of course there has to be a way to understand this data.  So we eventually must somehow parse, index, correlate and integrate this data – at least the data that we are going to use.  And these activities, from parsing to integrating, still need us to define some central semantics and meaning.  So at least the information modeling must occur to some level before we can use this data.

This is all interesting for the 80% of data that we estimate is in unstructured or semistructured format.  But what of the 20% of data that is already in a structured format?  Well you can relax because even though the concept of multi-structured data would theoretically encompass both, as of today all of the major vendors (Teradata, Oracle, IBM, Microsoft) have published target architectures that include both a traditional EDW component and a Big Data component working side by side (or over/under, or upstream/downstream).

But wait, we are now talking about 2020.  Could this change in 7-8 years?  We have heard it asked more than once: Is data modeling dead?  I suppose the question here is will data modeling be dead in 2020?  But as we consider the future of data modeling we should not forget that information used at a central level must be integrated around something. That something will require a level of central meaning and common semantics (remember we already discussed above that this is huge).  When the integration happens, however it happens, it will have to deal with sources that don’t share the same view.  To deal with this, we will need some kind of model.  A flexible and agile model…

Back to the Krish Krishnan statement about schema-on-read; there is still a schema.  By removing the schema as a pre-requisite to a write in our data warehouse we are effectively doing the same thing as we do when we create highly generic physical models – we are separating the body from the head.  We still need a head if we want to use it.  Whether this head is a semantic model, logical model, conceptual model, information model, or other form that captures our central view and meaning, data vault modeling techniques might indeed be found to meet these needs.

What does 2020 hold for data vault?  I suppose the only thing we do know is that it will be different from today.  That and we also know that we will all be party to what does happen.  So above are three trains of thought, none of which are mutually exclusive of the other, but all of which have already left the station. Let’s enjoy the ride.

EDW: All Data is Unstructured

While I was preparing to speak at Bill Inmon’s Advanced Architecture conference the other day I had a couple of epiphanies or “sudden realizations of great truth”… two of them in fact.

They both deal with enterprise data warehousing (EDW); meaning that they relate to real data warehousing including the classic characteristics of integrated, non-volatile, time-variant, etc.  So here is the first one:

1)      All data is Unstructured to an EDW.

To be more specific, all of the data that we source into the EDW can be considered unstructured.  For all of you who just said “unstructured data is a misnomer, all data has some structure” – please use semi-structured or multi-structured instead.  For now let’s use the label of n-structured for the superset of these categories.

The concept of “structure” in the world of data effectively translates to context, consistency and predictability.  A structured source of data has table definitions that contain attributes with field definitions and some representation of context.  In this way we have predictability (we know what to expect), there is consistency (all data for this source arrives in the same structures as defined), and we have some level of associated context (from the simple association of entity keys defined by their attributes to more comprehensive metadata and possibly domain values for validation). 

The contemporary concepts of n-structured data stem from the idea of working with data that somehow does not fit the above description of structured data.  This is to say that this broad category of data falls short somewhere among the concepts of context, consistency and predictability.  To carry this further, this data may not have table definitions with set attributes and field definitions.  We often don’t know what to expect, the data does not arrive consistently, and there is little to no associated context.  Examples include text blobs (contracts, emails, doctor notes, call logs, blogs, social media feeds, etc.), multi-media files (scans, images, videos, sound files, etc.), as well as key-value pair or name-value pair (KVP, NVP) data, XSD-free XML feeds, and other similar types.      

We recognize this type of data exists and that it should also be included in the scope of our data warehouse.  But the assertion here is that all data is Unstructured to an EDW.

Consider that an EDW, by design, a) integrates data from multiple different sources, and also b) maintains the history associated with this data.  We know that the source systems do not share the same structures or context.  We also know that source systems will change over time.  So when the sources are contemplated together, and over time, they do not have context, consistency and predictability.  Since there are changes over time, all of the source data does not have consistent table definitions with set attributes and field definitions.  We don’t know what to expect over time, data does not arrive consistently, and there is little to no associated enterprise-wide context. 

So, from several disparate systems, and over time, data is not structured.  Since the EDW integrates data from several disparate systems, and maintains history over time, the EDW sees all data as not structured.  In this regard, all data is Unstructured to an EDW.

2)      With an EDW, data integration is impossible.

One of the core concepts of data warehousing is data integration.  To this point nearly everyone in the industry will agree.  Data integration implies that we put data together with like data so that we can support a higher level, central view of the data.  But there is a problem.  To integrate data we need to integrate around something – some form of integration point.  With an EDW, the integration point is some form of central, enterprise-wide concept or key.  This enterprise-wide key should then represent the enterprise view of that concept or key.  In other words, the integration point is not source-system centric, is not department centric but rather is organization-wide centric. 

These should then be consistent with the ongoing MDM initiatives, business glossary, and other data centralization initiatives.  The problem is that no such initiatives have been fully completed and adopted in any company.  In fact we can expect that true semantic integration at the organizational level will never happen. 

So if we don’t have a defined integration point, we can’t integrate to it.  Which means that in an EDW, data integration is impossible.

Then why do we continue to try and integrate data in the EDW?  There are two answers to this.  First, we can do our best to integrate data around keys that have already been defined, and second, we can target an expanded concept of integration, alignment and reconciliation.  This second point implies that we integrate where possible, then align keys where they remain separated, and at the same time provide for the ability to reconcile these differences.

If we put off the data warehouse until all central meaning has been defined and adopted then we will never have a data warehouse.  So by adopting this concept of integrate / align / reconcile we can start now and be part of the process to move towards central context and meaning.

And I believe that this is our charge.  We should be part of the process to move towards an integrated and centralized view of an organization’s data.  At the same time, however, we should recognize that the end goal is not to achieve fully integrated data, but rather data that is integrated to the extent possible, aligned so that we can contrast and understand the differences, and reconciled so that we can meet the needs of both the departments and the enterprise from a trusted and auditable EDW.