What will data vault be in 2020?  Since we are dealing with the future, I am going to avoid exact predictions and instead put forward three (3) lines of thought for us to contemplate.

Thought 1) Adoption curves in the DWBI landscape.

So if we look at how all of DWBI went from theory to early adopters and on to common practice, we can see a general pattern with roughly a 10-15 year adoption curve.  Some international regions tend to pick things up early, as do certain industries.  This was also true for Dimensional Modeling within the DW space.  For data vault modeling, the leading region globally in terms of adoption is the Netherlands.  Today the hot zones around the world point to early adoption being complete within a couple of years, and the next 3-5 major regions likely to be in full adoption within that same timeframe (the Nordics, Western Europe, Down Under, and perhaps stronger regional centers in the US).

No surprise to local readers of this blog, but this move is actually well underway today.  If data vault reaches common adoption in the major regions in the 2016-2017 time frame, then data vault modeling will likely be the commonly accepted, leading modeling approach for data warehousing throughout the world in 2020.  While people will of course utilize other forms of modeling, data vault modeling would be seen as the standard approach (just as 3NF is today for operational systems and Dimensional modeling is today for data marts).

Thought 2) Data Vault modeling will be different in 2020.

A couple of students from a recent CDVDM class in Eindhoven were familiar with the progress and history of data vault in the Netherlands.  As we worked on the interactive modeling exercises, some of this historical background helped to identify things that have changed over the years.  While some things are simply different now, there have also been changes in the general emphasis of particular features.  We always knew, though it was perhaps not as strongly presented, that you can't build a central EDW data model by focusing your analysis on a couple of source systems.  Also, the effort involved in defining the central integration point with common enterprise semantics is nothing short of, well, huge.  There are also many factors that have to do with the changing dynamics of the industry.  The move to operational data warehousing, unstructured data integration, huge data volumes, and the emphasis on agility have caused the entire industry to re-think what it is doing (and to react).

The point is that over the next 7-8 years we can expect that changing industry dynamics, new business requirements, and lessons learned from the field will cause us to re-think components of data vault modeling.  Could there be a new standard construct?  Will agility pressures drive modeling pattern changes for even greater adaptability?  Will operationally governed processes coexist on our same platform, causing a ripple effect on the role of the business data vault?  Since we can all, even today, relate to some or all of these examples, I think we can agree that it is sound reasoning to assume something will change by 2020.

Thought 3) None of this will matter since we won’t be modeling in 2020.

Krish Krishnan (among others) spent the mid-2000s working with what Bill Inmon had already started doing: unstructured data ETL.  Back then it was something you still had to convince people to try and understand.  Today you can't swing a dead cat without hitting somebody talking about the broader topic of Big Data.  To avoid the potential semantic gap with our industry's latest top abused term… Big Data means any [forms or volumes or burst rates or rates of change] of data that we can't readily address using current approaches, tools and techniques.  From a current and practical perspective that means unstructured text, semistructured data, NVP/KVP, doc-style data, XML, etc. (with the more future-looking images, sound, video, etc. left for another day).  To address these things today we mainly turn to NoSQL, Hadoop, and doc-style (MongoDB) solutions.  As Krish Krishnan puts it, we are here working with schema-on-read versus schema-on-write.
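To make that distinction concrete, here is a minimal Python sketch, with invented payloads and field names.  With schema-on-write, the structure is enforced before anything lands in the warehouse; with schema-on-read, raw data lands as-is and each reader imposes its own structure at query time.

```python
import json

# Schema-on-write: the structure is enforced before the data is stored.
# Rows that do not match the declared schema are rejected at load time.
CUSTOMER_SCHEMA = {"customer_id": int, "name": str}

def write_with_schema(row: dict, table: list) -> None:
    for field, ftype in CUSTOMER_SCHEMA.items():
        if not isinstance(row.get(field), ftype):
            raise ValueError(f"schema violation on field '{field}'")
    table.append(row)

# Schema-on-read: raw payloads are stored untouched; structure is imposed
# only when the data is read, and each reader may impose a different one.
def read_with_schema(raw_payloads: list) -> list:
    rows = []
    for payload in raw_payloads:
        doc = json.loads(payload)
        rows.append({
            "customer_id": doc.get("id") or doc.get("customer_id"),
            "name": doc.get("name", "<unknown>"),
        })
    return rows

raw = ['{"id": 1, "name": "Acme"}', '{"customer_id": 2, "extra": "x"}']
print(read_with_schema(raw))  # structure decided at read time, not load time
```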

Schema-less. Model-less.  AKA, no data modeling.  But of course there has to be a way to understand this data.  So we eventually must somehow parse, index, correlate, and integrate this data, at least the data that we are going to use.  And these activities, from parsing to integrating, still need us to define some central semantics and meaning.  So at least some level of information modeling must occur before we can use this data.

This is all interesting for the 80% of data that we estimate is in unstructured or semistructured format.  But what of the 20% of data that is already in a structured format?  Well, you can relax: even though the concept of multi-structured data would theoretically encompass both, as of today all of the major vendors (Teradata, Oracle, IBM, Microsoft) have published target architectures that include both a traditional EDW component and a Big Data component working side by side (or over/under, or upstream/downstream).

But wait, we are now talking about 2020.  Could this change in 7-8 years?  We have heard it asked more than once: is data modeling dead?  I suppose the question here is: will data modeling be dead in 2020?  But as we consider the future of data modeling, we should not forget that information used at a central level must be integrated around something.  That something will require a level of central meaning and common semantics (remember, we already discussed above that this effort is huge).  When the integration happens, however it happens, it will have to deal with sources that don't share the same view.  To deal with this, we will need some kind of model.  A flexible and agile model…

Back to the Krish Krishnan statement about schema-on-read: there is still a schema.  By removing the schema as a prerequisite to a write in our data warehouse, we are effectively doing the same thing as we do when we create highly generic physical models: we are separating the body from the head.  We still need a head if we want to use the data.  Whether this head is a semantic model, logical model, conceptual model, information model, or some other form that captures our central view and meaning, data vault modeling techniques might indeed be found to meet these needs.
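As a rough illustration of that body/head separation, consider a highly generic physical model, sketched here in Python with invented identifiers: an attribute-value style store (the body) whose rows mean nothing until a separate semantic model (the head) is applied at read time.

```python
from collections import defaultdict

# The "body": a highly generic physical model where every fact is stored
# the same way, so the rows themselves carry no business meaning.
generic_rows = [
    ("entity:42", "attr:101", "Jane Doe"),
    ("entity:42", "attr:102", "NL"),
]

# The "head": a separate semantic model that assigns meaning to the
# generic identifiers. Without it, the rows above are unusable.
semantic_model = {
    "attr:101": "customer_name",
    "attr:102": "country_code",
}

def interpret(rows, head):
    # Re-attach meaning at read time, much as schema-on-read does.
    entities = defaultdict(dict)
    for entity, attr, value in rows:
        entities[entity][head.get(attr, attr)] = value
    return dict(entities)

print(interpret(generic_rows, semantic_model))
# {'entity:42': {'customer_name': 'Jane Doe', 'country_code': 'NL'}}
```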

What does 2020 hold for data vault?  I suppose the only thing we do know is that it will be different from today.  That, and we also know that we will all be party to what does happen.  So above are three trains of thought, none of which is mutually exclusive of the others, but all of which have already left the station.  Let's enjoy the ride.

EDW: All Data is Unstructured

While I was preparing to speak at Bill Inmon's Advanced Architecture conference the other day, I had a couple of epiphanies, or "sudden realizations of great truth"… two of them in fact.

They both deal with enterprise data warehousing (EDW), meaning that they relate to real data warehousing, including the classic characteristics of integrated, non-volatile, time-variant, etc.  So here is the first one:

1) All data is Unstructured to an EDW.

To be more specific, all of the data that we source into the EDW can be considered unstructured.  For all of you who just said "unstructured data is a misnomer, all data has some structure" – please use semi-structured or multi-structured instead.  For now, let's use the label n-structured for the superset of these categories.

The concept of “structure” in the world of data effectively translates to context, consistency and predictability.  A structured source of data has table definitions that contain attributes with field definitions and some representation of context.  In this way we have predictability (we know what to expect), there is consistency (all data for this source arrives in the same structures as defined), and we have some level of associated context (from the simple association of entity keys defined by their attributes to more comprehensive metadata and possibly domain values for validation). 

The contemporary concepts of n-structured data stem from the idea of working with data that somehow does not fit the above description of structured data.  This is to say that this broad category of data falls short somewhere among the concepts of context, consistency and predictability.  To carry this further, this data may not have table definitions with set attributes and field definitions.  We often don’t know what to expect, the data does not arrive consistently, and there is little to no associated context.  Examples include text blobs (contracts, emails, doctor notes, call logs, blogs, social media feeds, etc.), multi-media files (scans, images, videos, sound files, etc.), as well as key-value pair or name-value pair (KVP, NVP) data, XSD-free XML feeds, and other similar types.      
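As a small, invented illustration of these categories, here is one and the same fact in three shapes; only the first carries its context, consistency, and predictability with it.

```python
# Structured: a known record layout with set attributes; the schema itself
# supplies predictability, consistency, and context.
structured_row = {"patient_id": 7, "visit_date": "2013-05-01", "diagnosis": "J45"}

# Semi-structured (KVP/NVP): pairs arrive, but which keys appear, and in
# what order, can vary from record to record.
nvp_record = [("patient", "7"), ("seen_on", "2013-05-01"), ("dx", "asthma")]

# Unstructured: any context has to be extracted from the free text itself.
doctor_note = "Patient 7 seen May 1st; asthma confirmed, continue inhaler."
```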

We recognize this type of data exists and that it should also be included in the scope of our data warehouse.  But the assertion here is that all data is Unstructured to an EDW.

Consider that an EDW, by design, a) integrates data from multiple different sources, and b) maintains the history associated with this data.  We know that the source systems do not share the same structures or context.  We also know that source systems will change over time.  So when the sources are contemplated together, and over time, they do not have context, consistency and predictability.  Since there are changes over time, the source data as a whole does not have consistent table definitions with set attributes and field definitions.  We don't know what to expect over time, data does not arrive consistently, and there is little to no associated enterprise-wide context.

So, from several disparate systems, and over time, data is not structured.  Since the EDW integrates data from several disparate systems, and maintains history over time, the EDW sees all data as not structured.  In this regard, all data is Unstructured to an EDW.
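To make that concrete with an invented example: each extract below is perfectly structured in isolation, but taken across sources and across time there is no single table definition the EDW can rely on.

```python
# The same source feed, extracted a year apart. Each extract is structured
# on its own, but over time the structure does not hold still.
extract_2012 = {"cust_no": 42, "name": "Acme", "region": "EMEA"}
extract_2013 = {"customer_id": 42, "legal_name": "Acme B.V.",
                "country": "NL", "segment": "SMB"}

# A second source describing the same customer, with its own structure.
crm_record = {"CustomerKey": "A-0042", "DisplayName": "Acme"}

# From the EDW's point of view (many sources, many points in time),
# the union of these shapes behaves like n-structured data.
```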

2) With an EDW, data integration is impossible.

One of the core concepts of data warehousing is data integration.  On this point nearly everyone in the industry will agree.  Data integration implies that we put data together with like data so that we can support a higher-level, central view of the data.  But there is a problem.  To integrate data we need to integrate around something: some form of integration point.  With an EDW, the integration point is some form of central, enterprise-wide concept or key.  This enterprise-wide key should then represent the enterprise view of that concept or key.  In other words, the integration point is not source-system centric, is not department centric, but rather is organization-wide centric.
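In data vault terms that integration point is typically a hub.  Here is a minimal sketch, with invented key and source names, of how a hub keyed on an enterprise-wide business key lets disparate sources integrate around one concept rather than around any one system's key:

```python
from dataclasses import dataclass
from datetime import date

# A hub row: the enterprise-wide business key is the integration point,
# deliberately independent of any one source system.
@dataclass(frozen=True)
class HubCustomer:
    customer_bk: str    # the enterprise-wide business key
    load_date: date     # when the key was first seen by the EDW
    record_source: str  # which source introduced it (audit, not meaning)

hub: dict[str, HubCustomer] = {}

def load_hub(customer_bk: str, source: str) -> None:
    # One row per enterprise-wide key: the first source to present the
    # key creates it; every later source integrates around it.
    if customer_bk not in hub:
        hub[customer_bk] = HubCustomer(customer_bk, date.today(), source)

load_hub("ACME-42", "billing")
load_hub("ACME-42", "crm")  # same key: integrates, does not duplicate
```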

These integration points should then be consistent with ongoing MDM initiatives, the business glossary, and other data centralization initiatives.  The problem is that no such initiative has been fully completed and adopted in any company.  In fact, we can expect that true semantic integration at the organizational level will never happen.

So if we don't have a defined integration point, we can't integrate to it.  Which means that in an EDW, data integration is impossible.

Then why do we continue to try to integrate data in the EDW?  There are two answers.  First, we can do our best to integrate data around keys that have already been defined, and second, we can target an expanded concept of integration, alignment and reconciliation.  This second point implies that we integrate where possible, then align keys where they remain separated, and at the same time provide for the ability to reconcile these differences.

If we put off the data warehouse until all central meaning has been defined and adopted, then we will never have a data warehouse.  So by adopting this concept of integrate / align / reconcile, sketched below, we can start now and be part of the process of moving towards central context and meaning.
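A small sketch of integrate / align / reconcile, with all names invented.  Records that share an already-defined key are integrated; keys that remain separate are aligned through an explicit mapping rather than merged; and because both source values are kept side by side, their differences stay visible and reconcilable.

```python
# Integrate: both sources carry the already-defined enterprise key.
billing = {"ACME-42": {"name": "Acme"}}
crm     = {"ACME-42": {"name": "Acme Corp."}, "C-0099": {"name": "Widgets"}}

integrated = {k: (billing.get(k), crm.get(k)) for k in billing.keys() | crm.keys()}

# Align: keys that remain separate are mapped, not merged; the mapping
# records that two keys are believed to denote the same concept.
same_as = {"C-0099": "WID-07"}  # crm key aligned to a billing-side key

# Reconcile: both source values sit side by side, so differences stay
# visible and auditable instead of being silently overwritten.
for key, (bill_val, crm_val) in integrated.items():
    if bill_val and crm_val and bill_val != crm_val:
        print(key, "differs:", bill_val, "vs", crm_val)
```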

And I believe that this is our charge.  We should be part of the process of moving towards an integrated and centralized view of an organization's data.  At the same time, however, we should recognize that the end goal is not to achieve fully integrated data, but rather data that is integrated to the extent possible, aligned so that we can contrast and understand the differences, and reconciled so that we can meet the needs of both the departments and the enterprise from a trusted and auditable EDW.