Satellites in the data vault modeling approach contain all context (context attributes, descriptive properties, defining characteristics) concerning core business concepts (Hubs) and their relationships. With EDW deployments they also contain all of the history – all of the time slices that capture changes over time.
Because Satellites can split out the things that can change into separate structures, the impact of changes over time is lessened (isolated to smaller and fewer structures). This concept is high on the list of features that allow data vault EDW deployments to more easily adapt to changes (AKA Agility).
But as modelers we still need to make design decisions concerning how many Satellites to create and how to split out and distribute attributes among them. We know from course materials, websites, blogs and other publications that we can split out Satellites by a) rate of change, b) type of data, and c) source system. But these are just ideas concerning factors we could use in our design. There are no set rules that guide our decision process outside of these general guidelines. But that stands to reason. After all we are data modelers, architects and designers. We should be modeling, architecting and designing things.
In this article we want to explore an additional factor to consider while designing Satellites: the average number of Satellites per Hub. To better analyze this factor we can look to the extremes (bookends) of possibilities:
One Satellite per Hub versus One Attribute per Satellite
In the first extreme we create just one Satellite for every Hub.
When we do this we have effectively re-created a conformed dimension. With some differences including these two primary ones:
a) All Satellites are pre-built to track history (innately accommodate Type-2 deployments)
b) The Business Key is separated from the Context
If we compare the agility of this structure with that of a conformed dimension we can see that there is little difference between the two. If there are changing relationships between Hub keys then we are still more agile with data vault modeling but if the changes are primarily to context attributes then the single-Satellite pattern is not significantly more agile than either dimensional or 3NF modeling. Why? Because changes to attributes will always require changes to the Satellite. These changes require modeling, database changes, ETL changes and related testing, documentation and other development lifecycle components.
On the other end of the spectrum we could also contemplate one attribute per satellite. This means that each and every attribute would have its own data structure, its own Satellite.
Actually we could argue that this is really the proper interpretation of 3NF in an EDW. Because in fact having context dependent on its key in this case is actually having context related to an n+1 part key (where n is the number of parts for the key before history and the 1 we add is date/time stamp). In terms of context attributes with slowly changing dimensional data, this means each attribute could change independent of the status of any other attribute. * Note: there is of course the concept of an attribute cluster where a group of attributes do change together if and when they do change.
Now from an agility perspective this is a far more adaptable model. Additional attributes will never cause changes to any existing Satellites. Note that all work is not avoided even in this case because we still do need to add tables when we get new attributes.
This concept is inherent to the patterns of Anchor Modeling. I would encourage you to visit the Anchor Modeling site to learn more (there is also an interactive modeling tool available that you can use to better understand the dynamics of working with Anchor Modeling).
Data Vault Modeling
So what about Data Vault modeling? What about Satellite design and the number of Satellites? Well with data vault modeling we have observed that somewhere between 3 and 5 Satellites per Hub appears to be common.
From an agility perspective, this approach takes us several steps in the right direction. Things that change frequently are grouped into a common structure. This limits the impact of changes during the lifecycle of the data warehouse. The 3 to 5 Satellites above are commonly distinguished from each other based on differing rates of change, different types of data and also different record sources. Let’s take a brief look at each one of these considerations:
Rates of change. There will inevitably be some attributes that change over time more frequently than other attributes. When we apply this criteria to modeling Satellites we need to consider the 80/20 or even the 90/10 rule. That is to say we generally group rapidly changing attributes together and then leave the rest. We may even have three categories of rates of change – Rapid, Sometimes, Never – for example. But if we analyze too much, try to use too much precision in our decision, then we may find ourselves gravitating towards the left of this continuum, towards the One Attribute per Satellite.
Types of Data. There are natural delineations or groupings concerning the types of context data that we have available to describe Hubs. So groups of attributes may fall into types such as profile attributes, descriptive attributes, tracking attributes, status attributes, base attributes, geographic attributes, physical attributes, classification attributes, etc. There are no hard and fast rules on these groupings. Each should become apparent when you as the designer are working with the business to define these Hub keys.
Record Sources. Even though we are designing the EDW data vault based on the central view (specifically not based on source systems or departmental views) we do have groupings of attributes that come from certain areas within the organization. Often this is in line with the design we were contemplating based on the Types of Data factor. This is because certain groupings of Types of Data tend to also be related to specific business processes and so also certain source systems. As a first step then we should look to see how in line these two factors are for our given EDW scenario.
People often ask if we should make it a rule to just split by record source to avoid this issue. But this is not a viable rule. With some Hubs being loaded by 20, 50, even 120 different source systems, the sheer number of Satellites would be staggering. So no, not a good candidate for a general rule. In most cases we have a goal of integrating data into a common central structure. Keeping a structure for each separate source does not help us reach this goal.
So we have three (3) main factors to consider (differing rates of change, different types of data and also different record sources) each of which is a general consideration and not a specific rule.
Perhaps the most compelling thing about Satellite design is the degrees of freedom that you have. You can in fact model across the entire range from one Satellite per Hub to One Attribute per Satellite. With either extreme you are still vaulting.
The most important thing about Satellite design is to entrust the design process to a qualified data vault data modeler. One who understands the implications of the design decisions and the modeling patterns applied.