What will data vault be in 2020? Since we are dealing with the future, I am going to avoid exact predictions and instead put forward three lines of thought for us to contemplate.
Thought 1) Adoption curves in the DWBI landscape.
If we look at how all of DWBI went from theory to early adopters to common practice, we can see a general pattern with about a 10-15 year adoption curve. Some international regions tend to pick things up early, as do certain industries. This was also true for Dimensional Modeling within the DW space. For data vault modeling, the leading region globally in terms of adoption is the Netherlands. Today the hot zones around the world point to early adoption being complete within a couple of years, with the next 3-5 major regions likely to be in full adoption within that same timeframe (Nordics, Western Europe, Down Under, and perhaps stronger regional centers in the US).
This is no surprise to local readers of this blog, since the move is well underway today. If data vault reaches common adoption in the major regions in the 2016-2017 time frame, then data vault modeling will likely be the commonly accepted leading modeling approach for data warehousing throughout the world in 2020. While people will of course utilize other forms of modeling, data vault modeling would be seen as the standard approach (just as 3NF is today for operational systems and Dimensional modeling is today for data marts).
Thought 2) Data Vault modeling will be different in 2020.
A couple of students from a recent CDVDM class in Eindhoven were familiar with the progress and the history of data vault in the Netherlands. As we worked on the interactive modeling exercises, some of this historical background helped to identify things that have changed over the years. While some things were simply different now, there were also changes in the general emphasis of particular features. We always knew, though perhaps did not present it as strongly, that you can’t build an EDW central data model by focusing your analysis on a couple of source systems. Also, the effort involved in defining the central integration point with common enterprise semantics is nothing short of, well, huge. Many other factors have to do with the changing dynamics in the industry. The move to operational data warehousing, unstructured data integration, huge data volumes, and the emphasis on agility have caused the entire industry to re-think what it is doing (and to react).
The point is that over the next 7-8 years we can expect that changing industry dynamics, new business requirements, and lessons learned from the field will cause us to re-think components of data vault modeling. Could there be a new standard construct? Will agility pressures drive modeling pattern changes for even greater adaptability? Will operationally governed processes coexist on our same platform, causing a ripple effect on the role of the business data vault? Since we can all, even today, relate to some or all of these examples, I think we can agree that it is sound reasoning to assume something will change by 2020.
Thought 3) None of this will matter since we won’t be modeling in 2020.
Krish Krishnan (among others) was working in the mid-2000s on what Bill Inmon had already started doing: unstructured data ETL. Back then it was something you still had to convince people to try to understand. Today you can’t swing a dead cat without hitting somebody talking about the broader topic of Big Data. To avoid the potential semantic gap with our industry’s latest top abused term: Big Data here means any [forms or volumes or burst rates or rates of change] of data that we can’t readily address using current approaches, tools and techniques. From a practical, present-day perspective that means unstructured text, semistructured data, NVP/KVP, document-style data, XML, and so on (leaving images, sound, video, etc. for another day). To address these things today we mainly turn to NoSQL, Hadoop, and document-style (MongoDB) solutions. As Krish Krishnan puts it, we are here working with schema-on-read versus schema-on-write.
Schema-less. Model-less. AKA, no data modeling. But of course there has to be a way to understand this data. So we eventually must somehow parse and index and correlate and integrate this data. At least for the data that we are going to use. And these activities, from parsing to integrating, still need us to define some central semantics and meaning. So at the very least, information modeling must occur to some level before we can use this data.
This is all interesting for the 80% of data that we estimate is in unstructured or semistructured format. But what of the 20% of data that is already in a structured format? Well, you can relax: even though the concept of multi-structured data would theoretically encompass both, as of today all of the major vendors (Teradata, Oracle, IBM, Microsoft) have published target architectures that include both a traditional EDW component and a Big Data component working side by side (or over/under, or upstream/downstream).
But wait, we are now talking about 2020. Could this change in 7-8 years? We have heard it asked more than once: Is data modeling dead? I suppose the question here is: will data modeling be dead in 2020? But as we consider the future of data modeling we should not forget that information used at a central level must be integrated around something. That something will require a level of central meaning and common semantics (remember, we already discussed above that this is huge). When the integration happens, however it happens, it will have to deal with sources that don’t share the same view. To deal with this, we will need some kind of model. A flexible and agile model…
Back to the Krish Krishnan statement about schema-on-read: there is still a schema. By removing the schema as a prerequisite to a write in our data warehouse, we are effectively doing the same thing as we do when we create highly generic physical models: we are separating the body from the head. We still need a head if we want to use it. Whether this head is a semantic model, logical model, conceptual model, information model, or other form that captures our central view and meaning, data vault modeling techniques might indeed be found to meet these needs.
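To make the body-and-head picture a little more concrete, here is a minimal sketch in Python of what schema-on-read could look like. Everything in it is hypothetical and for illustration only: the names `raw_store`, `write_raw`, `CUSTOMER_VIEW` and `read_with_schema`, as well as the sample field names, are assumptions and do not come from any particular tool or from the data vault standard. The write accepts whatever arrives; the central meaning is applied only when we read.

```python
import json

# Hypothetical raw store (the "body"): payloads are kept as-is, no schema enforced.
raw_store = []

def write_raw(payload: str) -> None:
    """Schema-on-read write path: append the payload without validating it."""
    raw_store.append(payload)

# Hypothetical central view (the "head"): maps each common attribute name
# to the source field names that may carry it. These names are assumptions.
CUSTOMER_VIEW = {
    "customer_id": ["cust_id", "customerNumber", "id"],
    "name": ["name", "full_name"],
}

def read_with_schema(view: dict) -> list:
    """Apply the view at read time; unmapped fields are ignored, missing ones become None."""
    rows = []
    for payload in raw_store:
        record = json.loads(payload)
        row = {}
        for target, candidates in view.items():
            # Take the first source field that is present in this record.
            row[target] = next((record[c] for c in candidates if c in record), None)
        rows.append(row)
    return rows

# Two sources that do not share the same view of a customer.
write_raw('{"cust_id": 17, "full_name": "Ada", "tier": "gold"}')
write_raw('{"customerNumber": "A-42", "name": "Grace"}')

customers = read_with_schema(CUSTOMER_VIEW)
```

Note that the mapping in `CUSTOMER_VIEW` is itself a small model: somebody still had to decide that `cust_id` and `customerNumber` mean the same thing. The schema did not go away, it just moved from the write to the read.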
What does 2020 hold for data vault? I suppose the only thing we do know is that it will be different from today. That, and we also know that we will all be party to what does happen. So above are three trains of thought, none of which is mutually exclusive of the others, but all of which have already left the station. Let’s enjoy the ride.