Reimagining your data modeling approach to better leverage analytics reporting
A modern approach to modeling your data that can reduce the complexity and cost of building your reporting data model
January 26, 2021
Data modeling has been around for several decades and has evolved to help organizations structure their data for storage and consumption. Data modeling techniques in the business intelligence space (like Kimball and Inmon) have played a pivotal role in how organizations gather data and organize it into data warehouses and data marts. This structure also enables enterprise reporting tools to access the data and deliver insights to end users.
The issue
As these enterprise systems were built, they started largely as on-premises databases, while reporting was limited to a canned, reactive view of the organization’s data. All the complexity was built into the underlying data structures using database dependencies and complex ETL processes, and the reporting tool was simply a mechanism to deliver a static report with limited ad-hoc capability.
Here we introduce a new data modeling concept that leverages advancements in modern reporting and analytics tools to reduce the cost, complexity, and maintenance burden of the data platform.
The current approach
The foundational concept driving reporting data modeling for the past 20 years is the star schema, first proposed by Ralph Kimball in 1996. A deep dive on how the star schema works is available here, but the primary concept is that tables are categorized into dimensions and facts. Dimensions describe actual entities within your organization (customers, stores, employees, accounts, etc.) and facts describe events that occurred (sales transactions, G/L transactions, orders, etc.).
The star schema gets its name from how the tables in the data model are laid out in an entity relationship diagram (ERD). The fact tables are the centers of the various stars, and dimension tables form the spokes of the stars.
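To make the layout concrete, here is a minimal sketch of a star schema using pandas. The table and column names (dim_customer, dim_store, fact_sales, and so on) are hypothetical and purely illustrative, not drawn from any particular implementation.

```python
import pandas as pd

# Dimension tables describe entities in the organization (the spokes of the star).
dim_customer = pd.DataFrame({
    "customer_key": [1, 2],                # surrogate key
    "customer_id": ["C-1001", "C-1002"],   # natural business key
    "customer_name": ["Acme Corp", "Globex"],
})
dim_store = pd.DataFrame({
    "store_key": [1, 2],
    "store_id": ["S-01", "S-02"],
    "region": ["East", "West"],
})

# The fact table records events (the center of the star) and references
# each dimension by its key.
fact_sales = pd.DataFrame({
    "customer_key": [1, 1, 2],
    "store_key": [1, 2, 2],
    "sale_amount": [120.00, 75.50, 310.25],
})

# A typical report joins the fact to its dimensions and aggregates.
report = (
    fact_sales
    .merge(dim_customer, on="customer_key")
    .merge(dim_store, on="store_key")
    .groupby(["customer_name", "region"], as_index=False)["sale_amount"]
    .sum()
)
print(report)
```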
However, given the state of storage and reporting technologies when the star schema concept was born, significant focus was given to enforcing dependencies and error checking within the ETL code responsible for loading the data model. Concepts like surrogate keys and load ordering were necessary to ensure referential integrity of the data, and reporting tools expected data to arrive with those pieces intact. This has made ETL processes complex and monolithic, where a single error tends to halt the entire load. As a result, the ETL process has historically been the largest cost component of reporting implementations, both in initial implementation and in ongoing support.
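As a rough illustration of that bookkeeping (using the same hypothetical names as above), a traditional load has to generate surrogate keys in the dimension step, run the fact step afterwards, and patch any fact row whose surrogate key cannot be resolved:

```python
import pandas as pd

# Step 1: the dimension load must run first and assign surrogate keys.
incoming_customers = pd.DataFrame({
    "customer_id": ["C-1001", "C-1002"],
    "customer_name": ["Acme Corp", "Globex"],
})
incoming_customers["customer_key"] = range(1, len(incoming_customers) + 1)

# Step 2: the fact load runs afterwards and resolves each natural key
# to its surrogate key.
incoming_sales = pd.DataFrame({
    "customer_id": ["C-1001", "C-1003"],   # C-1003 has no dimension row yet
    "sale_amount": [120.00, 99.99],
})
fact_sales = incoming_sales.merge(
    incoming_customers[["customer_id", "customer_key"]],
    on="customer_id",
    how="left",
)

# Step 3: unresolved keys (here, C-1003) must be handled -- either an
# inferred member is inserted into the dimension, or the fact row gets a
# placeholder key such as -1 and is re-processed later.
fact_sales["customer_key"] = fact_sales["customer_key"].fillna(-1).astype(int)
print(fact_sales)
```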
The solution
Fortunately, with a slight modification to our approach, we can now greatly simplify our ETL processes while still retaining the spirit and the fundamentals of the star schema.
The loose dimensional model concept still leverages the idea of facts and dimensions, but it eliminates the use of surrogate keys to tie everything together. Instead, all joins between dimensions and facts are done using natural business keys. When the business key consists of multiple source elements, we can link those elements together to form a single composite key.
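For example, if a dimension’s business key spans two source elements, the composite key can be formed by concatenating them on both sides of the join. A minimal sketch follows, with hypothetical field names (company_code and account_number) chosen only for illustration:

```python
import pandas as pd

dim_account = pd.DataFrame({
    "company_code": ["100", "100", "200"],
    "account_number": ["4000", "5000", "4000"],
    "account_name": ["Revenue", "COGS", "Revenue"],
})
fact_gl = pd.DataFrame({
    "company_code": ["100", "200"],
    "account_number": ["4000", "4000"],
    "amount": [2500.00, 1800.00],
})

# Link the source elements into a single composite business key on both sides.
for df in (dim_account, fact_gl):
    df["account_key"] = df["company_code"] + "|" + df["account_number"]

# Facts join to dimensions directly on the natural composite key --
# no surrogate key generation or lookup required.
report = fact_gl.merge(
    dim_account[["account_key", "account_name"]],
    on="account_key",
    how="left",
)
print(report)
```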
This modification to the data modeling approach greatly simplifies our data load processes in the following ways:
- The ETL process no longer needs to be concerned about generating and keeping track of surrogate keys.
- The load processes for each table in the data model no longer need to be sequenced in order to allow surrogate keys to be resolved.
- Missing dimension records no longer require a complex process to handle. Traditionally, either a placeholder record (sometimes called an inferred member) would be inserted into the affected dimension table, or the affected fact records would be inserted with a special surrogate key (normally -1 or -2) representing the missing dimension record. Both approaches require complex logic to generate the inferred member or to re-process fact records carrying invalid dimension surrogate keys. With surrogate keys eliminated, all tables can be loaded as normal without worrying about referential integrity: the downstream reporting tool can handle referential integrity violations, and late-arriving dimension records resolve themselves as part of the normal recurring ETL process (see the sketch after this list).
- Given the elimination of dependencies, the scheduled recurring ETL process can be more modular and parallelized, and a failure in a single table load will no longer bring the whole ETL process to a complete halt.
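A sketch of what this looks like in practice, again with hypothetical names: each table loads independently and can run in parallel, and a fact row whose dimension has not arrived yet simply shows up as unmatched in the reporting layer until a later load fills it in.

```python
import pandas as pd
from concurrent.futures import ThreadPoolExecutor

def load_dim_customer():
    # Self-contained loader: no surrogate keys, no required load order.
    return pd.DataFrame({
        "customer_id": ["C-1001", "C-1002"],
        "customer_name": ["Acme Corp", "Globex"],
    })

def load_fact_sales():
    # C-1003 is a late-arriving dimension member; the fact loads anyway.
    return pd.DataFrame({
        "customer_id": ["C-1001", "C-1003"],
        "sale_amount": [120.00, 99.99],
    })

# Independent loads can run in parallel; a failure in one table's load
# would not block the others.
with ThreadPoolExecutor() as pool:
    dim_customer, fact_sales = pool.map(
        lambda loader: loader(), [load_dim_customer, load_fact_sales]
    )

# At report time, a left join tolerates the missing dimension row; the
# reporting tool shows it as blank/unknown until the next scheduled ETL
# run picks the customer up from the source system.
report = fact_sales.merge(dim_customer, on="customer_id", how="left")
print(report)
```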
With a much simpler ETL architecture, we can reduce the cost, timeline, and risk associated with the ETL portion of the platform build. Those hours can be shifted into more tangible, business-focused value areas (such as the final reports or KPIs).
The appeal of this approach is that it works with most modern toolsets on the market today. If we use a modern reporting toolset that supports data modeling (such as Tableau or Power BI), we can take advantage of the reporting tool’s ability to handle the data modeling complexity internally and remove that complexity from our ETL process.
Whether you use a more traditional cloud or on-premises ETL tool with custom development (Azure Data Factory, SSIS, Informatica PowerCenter, or any of the other ETL tools on the market) or a more modern metadata-driven platform (such as West Monroe’s own Intellio® DataOps), the loose dimensional modeling approach can be a great fit.
Conclusion
What we propose in the loose dimensional model is not a departure from the foundational data modeling concepts that have served us well for more than 20 years. We are still leveraging fact and dimension tables in our final data model, and our requirements-gathering approach does not change as a result. We are also not recommending a specialized toolset that would require additional investment to ramp up team skills.
Since today’s reporting toolsets can handle the complexity of entity relationships internally, there’s no longer a need to tightly control those relationships in the ETL layer.
We can still leverage the basics of dimensional modeling that have worked well for many years and keep risk low, while making a tweak to our technical implementation that reduces development cost and risk.