
The first annual Global Data Summit will be held 2-3 October 2017 in Golden, Colorado, USA. I’m really excited about this new conference for many reasons, not the least of which is that it is being held in my backyard! The Global Data Summit – or GDS – has a new format putting DATA at the center and Business Focus at the forefront. This means the GDS is designed with equal emphasis on A) business users who leverage data, and B) data professionals who manage data. So expect to find data scientists, Chief Data Officers, healthcare professionals, marketing analysts, data modelers, cloud architects and business analysts all in the same place.

Joining the GDS from around the globe are Mike Ferguson (UK), Richard Hackathorn (USA), Rick van der Lans (The Netherlands), Stephen Brobst (The Universe), Patrik Lager (Sweden), Dirk Lerner (Germany), Brian Bradley (New Zealand), Thomas Janssen (Australia), Tjaart Riekert (South Africa), and many more.

The new format includes more time for active discussions and interactions between attendees, speakers and sponsors. In the daily schedules, note the green sections, which mark the time dedicated to interactive discussions – over three and a half (3.5) hours a day are allocated to this time. This means you can explore your topics of interest with the international experts in the field and with the vendors whose products and services you are considering.

Overview of Tracks:

[Image: overview of the GDS conference tracks]

You can see the speaker line-up as well as the logistics and registration information on the GDS website, GlobalDataSummit.com.

Data Modeling Tools

To support our data modeling functions, from conceptual to logical to physical, there is a set of commercially available data modeling tools. The more tasks you intend to entrust to the tooling, the more important the deeper functionality of the higher-end tools becomes. For example, if you are planning to collaborate on models with a broader team, to register repositories of models for production databases, or to maintain master metadata concerning your enterprise databases, then you will require a higher-end data modeling tool.

For our team the main purpose of the modeling tool is to review data models that capture, manage and communicate conceptual and logical models. We secondarily concern ourselves with the physical model; however, this is mainly for the purpose of reviewing the initial MVP.

Even so, I have always recommended, and continue to recommend, erwin Data Modeler as my preferred data modeling tool. In my experience the higher-end features have not compromised the tool’s ease of use or its innate ability to communicate. So while there are other tools on the high-end scale (Embarcadero, PowerDesigner, etc.) and a pool of mid-range or entry-level tools (Toad, Oracle SQL Dev DM, 3D, Visio, Aqua, etc.), none have proven to be as effective and usable as erwin.

Ensemble Modeling Forms:  Modeling the Agile Data Warehouse

Anchor Modeling. Data Vault Modeling. Focal Point Modeling. To name a few. In fact there are dozens of data warehouse data modeling patterns that have been introduced over the past decade. Among the top ones there is a set of defining characteristics. These characteristics are combined in the definition of Ensemble modeling forms (AKA Data Warehouse Modeling). See coverage notes on the Next Generation DWH Modeling conference here (and summary here).

The differences between them define the flavors of Ensemble Modeling. These flavors have vastly more in common than they have differences. Compared to 3NF or Dimensional modeling, the defining characteristics of the Ensemble forms follow an 80/20 rule of commonality.

  • All these forms practice Unified Decomposition (breaking concepts into component parts) with a central unique instance as a centerstone (Anchor, Hub, Focal Point, etc.); a minimal SQL sketch follows this list. 
  • Each separates context attributes into dedicated table forms that are directly attached to the centerstone. 
  • Each uncouples relationships from the concepts they seek to relate. 
  • Each purposefully manages historical time-slice data with varying degrees of sophistication concerning temporal variations. 
  • Each recognizes the differences between static and dynamic data.
  • Each recognizes the reality of working with constantly changing sources, transformation and rules. 
  • Each recognizes the dichotomy of the enterprise-wide natural business key.
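
To make Unified Decomposition concrete, here is a minimal sketch in generic SQL of a hypothetical Customer concept broken into its component parts. The table and column names are illustrative assumptions only, not structures prescribed by any of the three methods.

    -- Hypothetical Customer concept after Unified Decomposition:
    -- a centerstone for the unique instance, a separate context table
    -- carrying time-sliced attributes, and an uncoupled relationship.
    CREATE TABLE customer_centerstone (
        customer_id   BIGINT       NOT NULL,   -- one row per unique instance
        load_dts      TIMESTAMP    NOT NULL,
        record_source VARCHAR(100) NOT NULL,
        PRIMARY KEY (customer_id)
    );

    CREATE TABLE customer_context (
        customer_id   BIGINT    NOT NULL REFERENCES customer_centerstone,
        load_dts      TIMESTAMP NOT NULL,      -- time slices live with context
        customer_name VARCHAR(200),
        PRIMARY KEY (customer_id, load_dts)
    );

    -- The relationship is uncoupled from the concepts it relates.
    CREATE TABLE customer_order_relationship (
        customer_id BIGINT    NOT NULL REFERENCES customer_centerstone,
        order_id    BIGINT    NOT NULL,        -- from a hypothetical order centerstone
        load_dts    TIMESTAMP NOT NULL,
        PRIMARY KEY (customer_id, order_id)
    );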

From that foundation of commonality, the various forms of Ensembles begin to take on their own flavors. 

  • While Data Vault bases its centerstone (the Hub) on the natural business key, both Anchor and Focal Point center on a unique instance of a concept where the business key is strongly coupled to, but separate from, the centerstone (Anchor, Focal Point). 
  • Both Data Vault and Anchor aim to model Ensembles at the Core Business Concept level while Focal Point tends to deploy slightly more abstracted or generic forms of concepts. 
  • Data Vault and Focal Point utilize forms of attribute clusters (logical combinations) for context while Anchor relies on single attribute context tables. 

And there are other differentiating factors as well.

There is one thing that we can all agree on: modeling the agile data warehouse means applying some form of Ensemble modeling approach.  The specific flavor choice (Data Vault, Anchor, Focal Point, etc.) should be based on the specific characteristics of your data warehouse program. 

* Learn more on Anchor Modeling with Lars Rönnbäck here: Anchor Modeling

Data Vault Satellites

Satellites in the data vault modeling approach contain all context (context attributes, descriptive properties, defining characteristics) concerning core business concepts (Hubs) and their relationships. With EDW deployments they also contain all of the history – all of the time slices that capture changes over time.

Because Satellites split out the things that can change into separate structures, the impact of changes over time is lessened (isolated to smaller and fewer structures). This concept is high on the list of features that allow data vault EDW deployments to more easily adapt to changes (AKA agility).

But as modelers we still need to make design decisions concerning how many Satellites to create and how to split out and distribute attributes among them. We know from course materials, websites, blogs and other publications that we can split out Satellites by a) rate of change, b) type of data, and c) source system. But these are just ideas concerning factors we could use in our design. There are no set rules that guide our decision process outside of these general guidelines. But that stands to reason. After all, we are data modelers, architects and designers. We should be modeling, architecting and designing things.

In this article we want to explore an additional factor to consider while designing Satellites: the average number of Satellites per Hub.  To better analyze this factor we can look to the extremes (bookends) of possibilities:

     One Satellite per Hub      versus      One Attribute per Satellite

In the first extreme we create just one Satellite for every Hub. 

When we do this we have effectively re-created a conformed dimension, albeit with some differences, including these two primary ones:

a)      All Satellites are pre-built to track history (innately accommodate Type-2 deployments)

b)      The Business Key is separated from the Context

If we compare the agility of this structure with that of a conformed dimension, we can see that there is little difference between the two. If there are changing relationships between Hub keys, then we are still more agile with data vault modeling; but if the changes are primarily to context attributes, then the single-Satellite pattern is not significantly more agile than either dimensional or 3NF modeling. Why? Because changes to attributes will always require changes to the Satellite. These changes require modeling, database changes, ETL changes and the related testing, documentation and other development lifecycle components.
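
To see the cost concretely, here is a hedged sketch, with hypothetical table and column names, of the one-Satellite-per-Hub extreme and what a single new attribute does to it:

    -- Hypothetical single wide Satellite holding all Customer context.
    CREATE TABLE sat_customer (
        customer_sk   BIGINT       NOT NULL,   -- sequence id from the Customer Hub
        load_dts      TIMESTAMP    NOT NULL,   -- Type-2 history is built in
        name          VARCHAR(200),
        address       VARCHAR(500),
        status_code   VARCHAR(20),
        record_source VARCHAR(100) NOT NULL,
        PRIMARY KEY (customer_sk, load_dts)
    );

    -- A new business attribute forces a change to the existing structure,
    -- exactly as it would for a conformed dimension, followed by ETL
    -- changes, re-testing and documentation updates:
    ALTER TABLE sat_customer ADD COLUMN loyalty_tier VARCHAR(50);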

On the other end of the spectrum we could also contemplate one attribute per satellite.  This means that each and every attribute would have its own data structure, its own Satellite. 

Actually we could argue that this is the proper interpretation of 3NF in an EDW, because having context dependent on its key in this case means having context related to an (n+1)-part key, where n is the number of parts in the key before history and the 1 we add is the date/time stamp. In terms of context attributes with slowly changing dimensional data, this means each attribute could change independently of the status of any other attribute. * Note: there is of course the concept of an attribute cluster, where a group of attributes do change together if and when they change.

Now from an agility perspective this is a far more adaptable model. Additional attributes will never cause changes to any existing Satellites. Note that not all work is avoided even in this case, because we still need to add tables when we get new attributes.
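
Continuing the hypothetical example above, under the one-attribute-per-Satellite extreme the same new attribute becomes its own table, keyed by the (n+1)-part key, and no existing structure is touched:

    -- Hypothetical one-attribute Satellite: the new attribute gets its own
    -- structure; existing Satellites are never altered.
    CREATE TABLE sat_customer_loyalty_tier (
        customer_sk   BIGINT       NOT NULL,   -- the n parts of the key...
        load_dts      TIMESTAMP    NOT NULL,   -- ...plus 1 for history
        loyalty_tier  VARCHAR(50),
        record_source VARCHAR(100) NOT NULL,
        PRIMARY KEY (customer_sk, load_dts)
    );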

This concept is inherent to the patterns of Anchor Modeling.  I would encourage you to visit the Anchor Modeling site to learn more (there is also an interactive modeling tool available that you can use to better understand the dynamics of working with Anchor Modeling).

Data Vault Modeling

So what about Data Vault modeling? What about Satellite design and the number of Satellites? Well, with data vault modeling we have observed that somewhere between 3 and 5 Satellites per Hub appears to be common.

From an agility perspective, this approach takes us several steps in the right direction.  Things that change frequently are grouped into a common structure.  This limits the impact of changes during the lifecycle of the data warehouse.  The 3 to 5 Satellites above are commonly distinguished from each other based on differing rates of change, different types of data and also different record sources.  Let’s take a brief look at each one of these considerations:

Rates of change. There will inevitably be some attributes that change over time more frequently than others. When we apply this criterion to modeling Satellites we need to consider the 80/20 or even the 90/10 rule. That is to say, we generally group rapidly changing attributes together and then leave the rest. We may even have three categories of rates of change – Rapid, Sometimes, Never – for example. But if we analyze too much, and try to use too much precision in our decision, then we may find ourselves gravitating towards the One Attribute per Satellite end of the continuum.
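
As an illustration, here is a hedged sketch of a rate-of-change split for a hypothetical Customer Hub; the groupings, names and types are assumptions, not rules:

    -- Rarely changing attributes grouped together...
    CREATE TABLE sat_customer_profile (
        customer_sk  BIGINT    NOT NULL,
        load_dts     TIMESTAMP NOT NULL,
        birth_date   DATE,
        home_country VARCHAR(100),
        PRIMARY KEY (customer_sk, load_dts)
    );

    -- ...and rapidly changing attributes isolated in their own Satellite,
    -- so frequent loads touch only this narrower structure.
    CREATE TABLE sat_customer_status (
        customer_sk  BIGINT    NOT NULL,
        load_dts     TIMESTAMP NOT NULL,
        status_code  VARCHAR(20),
        balance_tier VARCHAR(20),
        PRIMARY KEY (customer_sk, load_dts)
    );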

Types of Data.  There are natural delineations or groupings concerning the types of context data that we have available to describe Hubs.  So groups of attributes may fall into types such as profile attributes, descriptive attributes, tracking attributes, status attributes, base attributes, geographic attributes, physical attributes, classification attributes, etc.  There are no hard and fast rules on these groupings.  Each should become apparent when you as the designer are working with the business to define these Hub keys.

Record Sources.  Even though we are designing the EDW data vault based on the central view (specifically not based on source systems or departmental views) we do have groupings of attributes that come from certain areas within the organization.  Often this is in line with the design we were contemplating based on the Types of Data factor.  This is because certain groupings of Types of Data tend to also be related to specific business processes and so also certain source systems.  As a first step then we should look to see how in line these two factors are for our given EDW scenario. 

People often ask if we should make it a rule to just split by record source to avoid this issue.  But this is not a viable rule.  With some Hubs being loaded by 20, 50, even 120 different source systems, the sheer number of Satellites would be staggering.  So no, not a good candidate for a general rule.  In most cases we have a goal of integrating data into a common central structure.  Keeping a structure for each separate source does not help us reach this goal.

Bottom Line

So we have three (3) main factors to consider (differing rates of change, different types of data and different record sources), each of which is a general consideration and not a specific rule.

Perhaps the most compelling thing about Satellite design is the degrees of freedom that you have. You can in fact model across the entire range from One Satellite per Hub to One Attribute per Satellite. With either extreme you are still vaulting.

The most important thing about Satellite design is to entrust the design process to a qualified data vault data modeler – one who understands the implications of the design decisions and the modeling patterns applied.

Data Vault Core Components

The data vault consists of three core components, the Hub, Link and Satellite.  While we are all discussing certain exceptions and details around Data Vault deployments, I thought it would be useful to “get back to basics” on the main building blocks.  In the end, committing to these constructs in the most pure form should be our main goal.

The Hub represents a Business Key and is established the first time a new instance of that business key is introduced to the EDW. It may require a multi-part key to assure an enterprise-wide unique key; however, the cardinality of the Hub must be 1:1 with the true business key. The Hub contains no descriptive information and contains no FKs. The Hub consists of the business key only, plus a warehouse machine sequence id, a load date/time stamp and a record source.
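
A minimal sketch of a Hub following this description; the entity, names and types are illustrative assumptions:

    -- Hypothetical Customer Hub: business key only, plus the standard
    -- sequence id, load date/time stamp and record source.
    CREATE TABLE hub_customer (
        customer_sk     BIGINT       NOT NULL,  -- warehouse machine sequence id
        customer_number VARCHAR(50)  NOT NULL,  -- the enterprise-wide business key
        load_dts        TIMESTAMP    NOT NULL,
        record_source   VARCHAR(100) NOT NULL,
        PRIMARY KEY (customer_sk),
        UNIQUE (customer_number)                -- 1:1 with the true business key
    );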

A Link represents an association between business keys and is established the first time this new unique association is presented to the EDW. It can represent an association between several Hubs and other Links. It maintains a 1:1 relationship with the business-defined association between that set of keys. Just like the Hub, it contains no descriptive information. The Link consists only of the sequence ids from the Hubs and Links that it relates, plus a warehouse machine sequence id, a load date/time stamp and a record source.
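
A matching sketch of a Link relating the hypothetical Customer Hub to an assumed Order Hub:

    -- Hypothetical Link: only the sequence ids being related, plus the
    -- standard columns; no descriptive information.
    CREATE TABLE lnk_customer_order (
        customer_order_sk BIGINT       NOT NULL,  -- warehouse machine sequence id
        customer_sk       BIGINT       NOT NULL REFERENCES hub_customer,
        order_sk          BIGINT       NOT NULL,  -- from an assumed hub_order
        load_dts          TIMESTAMP    NOT NULL,
        record_source     VARCHAR(100) NOT NULL,
        PRIMARY KEY (customer_order_sk),
        UNIQUE (customer_sk, order_sk)            -- 1:1 with the unique association
    );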

The Satellite contains the descriptive information (context) for a business key. There can be several Satellites used to describe a single business key (or association of keys); however, a Satellite can only describe one key (a Hub or a Link). There is a good amount of flexibility afforded the modelers in how they design and build Satellites. Common approaches include using the subject area, rate of change, source system, or type of data to split out context and design the Satellites. The Satellite is keyed by the sequence id from the Hub or Link to which it is attached plus the date/time stamp, forming a two-part key.

Note that the Satellite then is the only construct that manages time slice data (data warehouse historical tracking of values over time). 
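
And a sketch of a Satellite on the hypothetical Customer Hub, with a query showing how the two-part key carries the time slices; the attribute names are assumptions:

    -- Hypothetical address Satellite: keyed by the parent sequence id
    -- plus the load date/time stamp, forming the two-part key.
    CREATE TABLE sat_customer_address (
        customer_sk   BIGINT       NOT NULL REFERENCES hub_customer,
        load_dts      TIMESTAMP    NOT NULL,
        street        VARCHAR(200),
        city          VARCHAR(100),
        postal_code   VARCHAR(20),
        record_source VARCHAR(100) NOT NULL,
        PRIMARY KEY (customer_sk, load_dts)
    );

    -- The most recent time slice per key yields the current context.
    SELECT s.*
    FROM sat_customer_address s
    JOIN (SELECT customer_sk, MAX(load_dts) AS max_dts
          FROM sat_customer_address
          GROUP BY customer_sk) latest
      ON s.customer_sk = latest.customer_sk
     AND s.load_dts = latest.max_dts;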

These three constructs are the building blocks for the DV EDW.  Together they can be used to represent all integrated data from the organization.  The Hubs are the business keys, the Links represent all relationships and the Satellites provide all the context and changes over time.

Data Warehouse Layers

This post concerns the various layers that we describe as part of the data warehouse architecture. Basically, this is MDM for the business terms used in the data warehousing industry.

Ronald Damhof wrote some excellent posts on his blog (http://tinyurl.com/6huc2ps) and kicked off a much-needed discussion on this topic. The layers that we actually deploy in DV-based data warehouses today – given a common categorization – probably don’t vary as much as we think. In any case, they would probably all fit into a small set of valid variations.

The focus of this discussion has been the Raw Data Vault, or Raw Vault. In fact the term has been used with different meanings. As Ronald points out, the Raw Vault from an “auto-generated” perspective is of limited value in the overall DV architecture as it sits (just as the sources do) on the left side of the semantic gap. The Data Vault is based on Business Keys (Hubs), and these are defined by the business. The keys we derive from the sources are not these same keys.

The Raw Vault in the “auditable, DV-based, Business Key aligned” sense represents a layer that moves towards semantic integration, but does so only to the point that it can remain traceable back to the sources without the need for soft business rules. Because the DV-based data warehouse is charged with auditability diligence (an integrated mirror of the sources), this layer needs to be persisted before soft business rules are applied.

So the layers that we work with, going back to Ronald’s blog, include (1) Staging – either persistent or not, (2) Data Vault, (3) Staging out/EDW+/CDW+, and (4) Data Marts. Let’s look at each one a bit more closely:

(1) Staging. Persisted or not. This layer is a copy of the sources, primarily utilized to support the process of moving data from the various sources to the data warehouse. This layer is 1:1 with the source systems, typically in the same format as the sources, has no rules applied, and is commonly not persisted. Alias: System of Record, SoR, and Stage-in.

(2) Data Vault. The core historized layer, aligned with business keys (to the extent possible); all in-scope data is loaded and persisted, and auditability is maintained. At the heart of all data warehousing is integration – and this layer integrates data from multiple sources around the enterprise-wide business keys. Alias: EDW, CDW, Raw Vault, data warehouse.

(3) Staging out/EDW+/CDW+. This layer represents the data following the application of the soft business rules that may be required for a) alignment with the business keys, and b) common transformations required by the enterprise. This layer makes the final move to the enterprise-wide business view, including gold-record designations, business-driven sub-typing, classifications, categorizations and alignment with reference models. Alias: EDW, CDW, Business Data Vault (BDV), Business Data Warehouse (BDW), and Mart Stage.

(4) Data Marts. These represent the presentation layer and are intended to be requirements-driven, scope-specific subsets of the data warehouse data that can be regenerated (so they are typically not persisted). With the DV data warehouse approach, these are very flexible and NOT restricted by federated mart paradigms (limited to dimensional models, maintaining conformed dimensions, persistence, etc.). While this layer tends to be deployed mainly using dimensional modeling, marts can also be flat files, XML, and other structures. Alias: DM, Marts.
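
To illustrate layer (3), here is a hedged sketch of a soft business rule applied on top of the Data Vault layer; the codes, names and rule itself are hypothetical:

    -- Soft business rule applied downstream of the Data Vault: standardize
    -- status codes from two source systems into one enterprise
    -- classification. The raw, auditable values remain untouched in layer (2).
    CREATE VIEW edw_customer_status AS
    SELECT customer_sk,
           load_dts,
           CASE status_code
               WHEN 'A'   THEN 'ACTIVE'   -- coding used by source system 1
               WHEN 'ACT' THEN 'ACTIVE'   -- coding used by source system 2
               WHEN 'X'   THEN 'CLOSED'
               ELSE 'UNKNOWN'
           END AS status_class
    FROM sat_customer_status;             -- a Satellite in the Data Vault layer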

A quick look at the alias possibilities shows that the terms are not universally applied. However, the delineation of the core layers represents a common understanding. Deployments that differ from this common view represent exceptions. In most cases, these exceptions are valid and represent acceptable and encouraged data warehousing practices. Of course they could also represent an issue that will cause problems in the long run. But by having a standard understanding of the core layers, we will continue to have a reference point for our analysis and discussions.