
EDW: All Data is Unstructured

While I was preparing to speak at Bill Inmon’s Advanced Architecture conference the other day, I had a couple of epiphanies, or “sudden realizations of great truth”… two of them, in fact.

They both deal with enterprise data warehousing (EDW), meaning real data warehousing with the classic characteristics of being integrated, non-volatile, time-variant, and so on.  So here is the first one:

1) All data is Unstructured to an EDW.

To be more specific, all of the data that we source into the EDW can be considered unstructured.  For all of you who just said “unstructured data is a misnomer, all data has some structure” – please use semi-structured or multi-structured instead.  For now let’s use the label n-structured for the superset of these categories.

The concept of “structure” in the world of data effectively translates to context, consistency and predictability.  A structured source of data has table definitions that contain attributes with field definitions and some representation of context.  In this way we have predictability (we know what to expect), there is consistency (all data for this source arrives in the same structures as defined), and we have some level of associated context (from the simple association of entity keys defined by their attributes to more comprehensive metadata and possibly domain values for validation). 

The contemporary concepts of n-structured data stem from the idea of working with data that somehow does not fit the above description of structured data.  This is to say that this broad category of data falls short somewhere among the concepts of context, consistency and predictability.  To carry this further, this data may not have table definitions with set attributes and field definitions.  We often don’t know what to expect, the data does not arrive consistently, and there is little to no associated context.  Examples include text blobs (contracts, emails, doctor notes, call logs, blogs, social media feeds, etc.), multi-media files (scans, images, videos, sound files, etc.), as well as key-value pair or name-value pair (KVP, NVP) data, XSD-free XML feeds, and other similar types.      
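
To make the contrast concrete, here is a minimal Python sketch. The table definition, the attribute names and the sample records are all hypothetical, invented only for illustration. A record counts as “structured” here when it arrives with exactly the attributes and types that the definition leads us to expect.

```python
# Hypothetical table definition: attribute name -> expected type.
# None of these names come from a real system.
CUSTOMER_DEF = {"customer_id": int, "name": str, "country": str}


def conforms(record, definition):
    """Return True when the record carries exactly the defined attributes
    and every value has the defined type (predictable and consistent)."""
    if set(record) != set(definition):
        return False
    return all(isinstance(record[attr], typ) for attr, typ in definition.items())


# A row from a structured source: we know what to expect, and it arrives that way.
structured_row = {"customer_id": 42, "name": "Acme", "country": "US"}

# A key-value pair record: no agreed attributes, no types, little context.
kvp_row = {"cust": "42", "note": "called about invoice"}

print(conforms(structured_row, CUSTOMER_DEF))
print(conforms(kvp_row, CUSTOMER_DEF))
```

The first record passes the check and the second does not. That is roughly what falling short on consistency and predictability looks like in practice.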

We recognize this type of data exists and that it should also be included in the scope of our data warehouse.  But the assertion here is that all data is Unstructured to an EDW.

Consider that an EDW, by design, a) integrates data from multiple different sources, and also b) maintains the history associated with this data.  We know that the source systems do not share the same structures or context.  We also know that source systems will change over time.  So when the sources are contemplated together, and over time, they do not have context, consistency and predictability.  Since there are changes over time, the source data as a whole does not have consistent table definitions with set attributes and field definitions.  We don’t know what to expect over time, data does not arrive consistently, and there is little to no associated enterprise-wide context.

So, from several disparate systems, and over time, data is not structured.  Since the EDW integrates data from several disparate systems, and maintains history over time, the EDW sees all data as not structured.  In this regard, all data is Unstructured to an EDW.
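
A small Python sketch can illustrate the argument. Everything in it is an assumption made up for the example: the two source systems, the years and the column names. Each source is perfectly structured at any one point in time, yet nothing stays common once we look across sources and across time.

```python
# Hypothetical schema snapshots: (source system, year) -> column names.
SNAPSHOTS = {
    ("crm", 2010): {"cust_id", "name", "phone"},
    ("crm", 2014): {"customer_id", "full_name", "phone", "email"},
    ("billing", 2010): {"acct_no", "cust_name", "balance"},
    ("billing", 2014): {"acct_no", "cust_name", "balance", "currency"},
}


def stable_columns(snapshots, source):
    """Columns of one source that survive every version of that source."""
    versions = [cols for (src, _year), cols in snapshots.items() if src == source]
    return set.intersection(*versions) if versions else set()


def common_columns(snapshots):
    """Columns shared by every source at every point in time."""
    return set.intersection(*snapshots.values()) if snapshots else set()


print(stable_columns(SNAPSHOTS, "billing"))  # what billing keeps over time
print(stable_columns(SNAPSHOTS, "crm"))      # what crm keeps over time
print(common_columns(SNAPSHOTS))             # what the EDW can rely on
```

With these made-up snapshots, billing keeps three columns over time and crm keeps only one. The set of columns shared by all sources over all time is empty, and that last view is the one the EDW has.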

2) With an EDW, data integration is impossible.

One of the core concepts of data warehousing is data integration.  On this point nearly everyone in the industry will agree.  Data integration implies that we put data together with like data so that we can support a higher level, central view of the data.  But there is a problem.  To integrate data we need to integrate around something – some form of integration point.  With an EDW, the integration point is some form of central, enterprise-wide concept or key.  This enterprise-wide key should then represent the enterprise view of that concept.  In other words, the integration point is not source-system centric and not department centric, but rather organization-wide.

These enterprise-wide keys should then be consistent with the ongoing MDM initiatives, business glossary, and other data centralization initiatives.  The problem is that no such initiatives have been fully completed and adopted in any company.  In fact we can expect that true semantic integration at the organizational level will never happen.

So if we don’t have a defined integration point, we can’t integrate to it.  This means that in an EDW, data integration is impossible.

Then why do we continue to try to integrate data in the EDW?  There are two answers to this.  First, we can do our best to integrate data around keys that have already been defined, and second, we can target an expanded concept of integration, alignment and reconciliation.  This second point implies that we integrate where possible, then align keys where they remain separated, and at the same time provide for the ability to reconcile these differences.
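
As a rough illustration of integrate / align / reconcile, here is a minimal Python sketch. The records, the `tax_id` shared key, the steward-maintained alignment pairs and the function name are all hypothetical, chosen only to show the three steps side by side. It is one possible reading of the idea, not a prescribed implementation.

```python
# Hypothetical records from two source systems.
crm = [
    {"key": "C-100", "tax_id": "111", "name": "Acme Corp"},
    {"key": "C-200", "tax_id": None, "name": "Globex"},
    {"key": "C-300", "tax_id": "333", "name": "Initech"},
]
billing = [
    {"key": "B-9", "tax_id": "111", "name": "ACME Corporation"},
    {"key": "B-7", "tax_id": None, "name": "Globex Inc"},
]

# Assumption: pairs of (crm key, billing key) that someone has declared to
# mean the same customer, even though no shared enterprise key exists yet.
ALIGNED = {("C-200", "B-7")}


def integrate_align_reconcile(left, right, aligned_keys):
    right_by_tax_id = {r["tax_id"]: r for r in right if r["tax_id"] is not None}
    right_by_key = {r["key"]: r for r in right}
    alignment = dict(aligned_keys)  # left key -> right key

    result = {"integrated": [], "aligned": [], "unmatched": [], "differences": []}
    linked_right = set()

    for rec in left:
        partner, how = None, None
        if rec["tax_id"] is not None and rec["tax_id"] in right_by_tax_id:
            # Integrate: both systems already share a defined key.
            partner, how = right_by_tax_id[rec["tax_id"]], "integrated"
        elif alignment.get(rec["key"]) in right_by_key:
            # Align: keys stay separate but are linked side by side.
            partner, how = right_by_key[alignment[rec["key"]]], "aligned"

        if partner is None:
            result["unmatched"].append(rec["key"])
            continue

        result[how].append((rec["key"], partner["key"]))
        linked_right.add(partner["key"])
        # Reconcile: keep the disagreement visible instead of hiding it.
        if rec["name"] != partner["name"]:
            result["differences"].append(
                (rec["key"], partner["key"], rec["name"], partner["name"])
            )

    result["unmatched"] += [r["key"] for r in right if r["key"] not in linked_right]
    return result


report = integrate_align_reconcile(crm, billing, ALIGNED)
for section, rows in report.items():
    print(section, rows)
```

In this made-up data, one pair is integrated on the shared key, one pair is only aligned, and one record stays unmatched. The name differences are kept in the report rather than overwritten, so both the departmental and the enterprise views remain auditable.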

If we put off the data warehouse until all central meaning has been defined and adopted, then we will never have a data warehouse.  So by adopting this concept of integrate / align / reconcile, we can start now and be part of the process to move towards central context and meaning.

And I believe that this is our charge.  We should be part of the process to move towards an integrated and centralized view of an organization’s data.  At the same time, however, we should recognize that the end goal is not to achieve fully integrated data, but rather data that is integrated to the extent possible, aligned so that we can contrast and understand the differences, and reconciled so that we can meet the needs of both the departments and the enterprise from a trusted and auditable EDW.