Is the Data Warehouse Dead?
Data Warehouses Are A Bad Investment For Most Companies
In this article, Part 1 in a series on The Evolution Of Data Architectures, we’ll take a look at the data warehouse. Data warehouses have been around for decades, but in the last decade or so they have achieved widespread adoption across all industries, as cloud-based platforms like AWS Redshift, Google BigQuery, and Snowflake have become available. Tens of thousands of companies use some form of data warehouse architecture, and while newer architectures like the data lake and the lakehouse exist (more on these later in the series), that isn’t going to change overnight. But why are DWs so common when, as we’ll see in this article, they so consistently fail to meet companies’ needs, despite their high costs? Let’s look first at the history of Data Management (DM) practices to understand how we got to this point.
Data Warehouses Are Ubiquitous
Spoiler alert: the answer to the question posed by this article, for better or worse, is NO. Let me be careful to say that traditional data warehouse systems have their place, and will be a part of the data landscape for a long time to come (not only because so many companies are heavily invested in them, but also because, for some companies’ situations, they do make sense). But all too many companies are investing money and time in systems which simply do not make sense for them, and which could be replaced with much better alternatives.
Over the last several decades, the landscape of digital data management has developed immensely, in ways that would have been hard to foresee in the days of mainframes which stored data on tape and were programmed using punch cards. Allow me to take a moment to discuss the evolution of data management tools over the years as a way to illustrate my point about the common unsuitability of data warehouses.
Brief History Of Data Management Approaches
The introduction of RDBMS (Relational Database Management Systems) revolutionized business computing. With an RDBMS, users specify a data schema, defining entities and the relationships between them. That schema imposes a high degree of structure on the data, which makes it much easier and faster to perform the kind of data processing used in many common applications like e-commerce. RDBMS systems are ubiquitous, and probably will be for a long time to come, since the advantages are manifest.
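To make the idea of a predefined schema concrete, here is a minimal sketch using Python’s built-in sqlite3 module: two related tables and a join, the kind of structured, transactional access an RDBMS is built for. The table and column names are invented for illustration, not taken from any particular system.

```python
# Illustrative only: a tiny relational schema using Python's built-in sqlite3.
# Table and column names are hypothetical, not from any real system.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Define the entities and the relationship between them up front (the schema).
cur.execute("""
    CREATE TABLE customers (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL
    )
""")
cur.execute("""
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(customer_id),
        amount      REAL NOT NULL
    )
""")

cur.execute("INSERT INTO customers VALUES (1, 'Acme Corp')")
cur.execute("INSERT INTO orders VALUES (101, 1, 49.99)")

# The predefined structure makes transactional lookups fast and unambiguous.
cur.execute("""
    SELECT c.name, o.amount
    FROM orders o JOIN customers c ON c.customer_id = o.customer_id
""")
print(cur.fetchall())  # [('Acme Corp', 49.99)]
conn.close()
```

The schema is fixed up front, which is exactly what makes this model fast and reliable for transactions, and, as we’ll see, rigid for much else.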
But with increasing amounts of data, and the growing need for businesses to be able to make sense of it, came the need for analytics (see my discussion of this evolution in the article Are DataBricks and Delta Lake Right For You?). RDBMS systems are not optimized for analytics, and a strong need arose in the market for newer systems which could support analytics workloads, but which could still be used in a similar way to an RDBMS (by querying against structured data grouped into tables). Data Warehouse platforms (Redshift, BigQuery, Snowflake) enabled much more powerful analytical operations, and in this sense they met the need that had arisen in the market. But as with many technologies, the drawbacks did not become apparent for a while - it took many years of use by thousands of companies for a consensus to emerge that these DW platforms were not the panacea they had been marketed as. The world of data has advanced very rapidly in the last ten years (particularly in the areas of AI/ML and streaming), faster in fact than the market could respond (since it takes years for a major new product to be developed, tested, released, and used under real-world conditions).
So although data storage and processing systems have come a long way, and tools like AWS Redshift, Google BigQuery, and Snowflake incorporate many recent sophisticated developments in computer science, data warehouse-style databases still in large part resemble the systems that were available thirty or forty years ago. They often use proprietary storage formats, are designed for use primarily with structured data, and couple the scaling of compute and storage. Having spent years doing data science, analytics, and data engineering, I know that working with these systems often entails immense frustration. I can’t begin to count the number of times, over the last decade, that I found myself working on a data warehouse project that left me thinking, “Why in the world is this so complicated? There has to be a better way.”
The Many, Many Limitations Of The Data Warehouse
Traditional data warehouses have a number of significant limitations, and these are not marginal or minor; in fact, as we’ll discuss next, many of these drawbacks are failings that DW systems were, at least in theory, specifically designed to avoid. Most of these limitations are inherent to the design of the systems and can, in practice, be crippling and leave companies failing to realize ROI (Return On Investment):
Redundancies abound
Because of the architecture of traditional DWs, they often result in the generation of several redundant versions of the same data. A 2021 study by Dremio found that enterprises are, on average, maintaining twelve copies of their data. Think about that for a moment. In a word, speaking non-technically, that’s nuts. In fact, it makes my head hurt just thinking about it. If you’re working with a dozen copies of the data across your organization, you are certain to frequently encounter inconsistencies between these versions. Even assuming individual inconsistencies are small, in the big picture they add up to a situation in which the reliability of data is very low. In addition to unreliability, this creates serious issues with costs. Even though it is often bandied about that data storage is essentially free these days, that is of course absurd. Storage costs have fallen dramatically in recent years, but they definitely still exist. And volumes of data are rising rapidly, perhaps more rapidly than costs are falling. Multiple copies of data mean that a lot of companies are spending much more than they need to on storage. A company I worked with recently was spending $2000/mo on their Redshift service, for a cluster that was not very large by enterprise standards. That’s a lot of money for one line item in a cloud budget for a relatively small company.
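To put rough numbers on what that redundancy costs, here is a back-of-the-envelope sketch. The data volume and per-terabyte price below are assumptions chosen purely for illustration, not quotes from any vendor’s price list; only the copy count comes from the Dremio figure cited above.

```python
# Back-of-the-envelope: what a dozen copies of the same data cost in storage
# alone. Data volume and per-TB price are assumptions for illustration.
TB_OF_CORE_DATA = 50        # assumed size of the "real" dataset, in terabytes
COPIES = 12                 # average number of copies per the Dremio study
USD_PER_TB_MONTH = 23.0     # assumed warehouse storage price per TB-month

one_copy = TB_OF_CORE_DATA * USD_PER_TB_MONTH
all_copies = one_copy * COPIES
redundancy_per_year = (all_copies - one_copy) * 12

print(f"one copy:            ${one_copy:,.0f}/month")
print(f"twelve copies:       ${all_copies:,.0f}/month")
print(f"spent on redundancy: ${redundancy_per_year:,.0f}/year")
```

And storage is the cheap part; the bigger cost is usually the engineering time spent reconciling those copies, which is much harder to put a number on.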
Inconsistency and inaccuracy of data
Closely related to the point above, data is often incomplete or just plain wrong. Besides redundancy, there are a number of causes of inaccurate data. Ineffective maintenance practices, poorly designed ETL processes (which are common, given their complexity), and other factors can result in downstream data diverging from the ground truth. Data management systems can be very complex, and this complexity can result in misuse that yields unreliable results. This situation is a decision-maker’s nightmare, because the quality of decisions in a business is heavily dependent on the quality of the data going into them (or, put another way, garbage in, garbage out). Add to that the fact that DW systems are often very expensive, and you have many companies investing large amounts of money and time in systems which are giving them bad results, because decision makers are using low-quality data as the basis for their decisions.
Inordinately high maintenance costs
DWs are highly complex systems, and companies often get sticker shock from the pay-as-you-go billing model when maintaining large DW deployments. It is technically true with many cloud services that “you only pay for what you use,” but companies should always keep in mind that the cloud vendors are making extraordinarily high profit margins renting out services like virtual machines and object storage. The billing models can be complex, and many companies do not take the time to calculate what they actually expect to pay (or what the equivalent on-premise option would cost). There is certainly a convenience to renting services rather than buying all the hardware and software up front, and managers and accountants are often eager to minimize capital expenditure (which, in the cloud model, is traded for operational expenditure). But this means that many companies (94% in fact) are significantly overspending on cloud. This is perhaps the biggest issue in the industry. In the first years of cloud adoption, companies were eager to switch to the cloud, which minimized initial cost outlays and allowed easy access to services that otherwise would be difficult or impossible to deploy. But in recent years, many companies are realizing that going all in on the cloud might not be the best investment. And data warehouses are one of the worst offenders in this cost overrun. DW systems are complex, and this complexity creates high costs both in terms of the direct cost of the service and the cost of the large amount of person-power required to develop and maintain them.
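If you do want to estimate what you’ll actually pay, the arithmetic doesn’t have to be sophisticated. Here is a hypothetical sketch comparing a pay-as-you-go bill against an up-front purchase over a planning horizon; every number in it is an assumption you would replace with your own quotes and usage estimates.

```python
# A simple capex-vs-opex comparison over a planning horizon. All figures are
# hypothetical placeholders, not real vendor pricing.
CLOUD_DW_MONTHLY = 2_000      # assumed monthly managed-warehouse bill (USD)
ONPREM_UPFRONT = 60_000       # assumed one-time hardware + licensing cost (USD)
ONPREM_MONTHLY_OPS = 1_000    # assumed ongoing power, hosting, admin time (USD)
YEARS = 5

cloud_total = CLOUD_DW_MONTHLY * 12 * YEARS
onprem_total = ONPREM_UPFRONT + ONPREM_MONTHLY_OPS * 12 * YEARS

print(f"cloud over {YEARS} years:   ${cloud_total:,}")
print(f"on-prem over {YEARS} years: ${onprem_total:,}")
```

The exact numbers will differ wildly from one company to the next; the point is simply that a ten-minute calculation like this one is done far less often than it should be.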
Lack of self-service access and delayed time to insight
Because these systems are difficult to develop, requiring layers of ETL and orchestration and scheduling tools that are beyond the competency of most data analysts (who already have enough work on their plate writing complex SQL queries that often join multiple tables and perform complex grouping and aggregations), teams of data engineers are required to develop and maintain them. I worked with a tech company that employed 500 people and, because of the nature of their niche in the industry, dealt with a large volume of data. Their data architecture diagram looked like a couple of handfuls of spaghetti had been thrown at the wall to see what stuck, and they employed a team of ten data engineers. Data engineers make $125k and up, so with other benefits their total compensation is often $200k even for entry- and mid-level DEs. That means this company spent two million dollars a year or more just on data engineers, and that wasn’t even the whole data team - on top of that they had a team of several analysts (who don’t make as much but are still well paid) and a couple of data scientists (who also make the big bucks). This company was spending perhaps five million dollars a year just on the cost of labor for their total data team, and that doesn’t even count the cost of cloud bills for all the services required to support their complex data architecture (AWS Redshift, Firehose, Athena, Aurora, Kinesis, etc.).
And this company is in some ways considered a success story, because at least they have a fair amount of ‘self-service’ reporting (can you call it self-service if it takes a team of twenty people to build the data infrastructure and compose the reports on an ongoing basis?). But even those self-service report pipelines often broke and had to be fixed by a DE on call at 2AM, and the data teams were under unrelenting pressure to keep them working. Even this company, with all its dedicated resources, had to field frequent internal requests for reports that did not already exist, and those could take weeks or months to build, particularly if architectural changes were required. When that happens, the reports often end up lacking in value, because by the time the report is available the question is no longer relevant. As you can see, even in organizations that have ‘well-developed’ data operations and cultures (at least as defined by amount of money and time spent), developing and maintaining the data warehouse is often extraordinarily costly and makes employees miserable.
Vendor lock-in
Companies invest huge amounts of time and money developing data warehouse systems, and this creates a situation where they feel like they have no other alternative. If you’ve spent (as in the previous example) several million dollars a year developing and maintaining a ‘solution’, and you feel that the only alternative is a system of similar cost and complexity from a different vendor, what are you going to do? You’re not going to go to the huge trouble of rebuilding your architecture on a different cloud provider, migrating all the data, reconnecting all the reports, and retraining employees for a marginal cost saving. At that point you are stuck - the vendor has your data and you can’t easily take it elsewhere, so you’re going to keep paying the cloud bill to operate all of this heavy machinery (technical and organizational), because there’s simply no other choice. Even if there is a new and better technology available, it would have to be significantly cheaper and better to justify the overhaul, which (despite the hype) is rare in the tech industry (though there are exceptions like DataBricks).
Data Ownership and security
When you use a cloud-based data warehouse, you are trusting your cloud vendor to handle the security of your data. Managed cloud services come at a price, and for many companies this is a good trade-off, since they can pay Google or Amazon or some other vendor to deal with making sure their DW is secure. But this is not a magic wand that makes the security issue go away. The risks of data theft or leakage are very real and threaten companies’ reputations, IP, and bottom lines. Many companies assume that, since they’re paying large fees to their service provider, the security of their DW is ensured; but security and privacy still require companies to design effective data management policies, which have physical, technical, and administrative components. This Study of Privacy Issues Associated with Data Warehousing gives a deeper dive into the problem and proposes solutions that include hiring qualified personnel, controlling authentication, and implementing adequate governance and monitoring policies. But these are not easy or cheap solutions, and they mean that the data warehouse itself is not sufficient to ensure data privacy and ownership.
Data Rigidity
This overlaps somewhat with maintenance costs but is an issue in its own right. DWs typically work with highly structured data, meaning data that is well-defined by a preset schema. Data in raw form is often unsuitable for reporting without being combined with other data and further refined. This process, called ETL (Extract, Transform, Load), can be very complex and costly. It is common for data to need multiple steps of ETL processing before it can be used in reporting. This additional complexity is expensive and leaves a lot of room for error. DW systems themselves (like Redshift) do not usually have robust ETL capability built in; usually another tool like dbt (and perhaps an orchestration tool like Airflow) is needed. Another problem this contributes to is lack of timeliness, because these layers of ETL take time for a data engineer or analyst to build.
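To show what even one refinement step looks like, here is a deliberately tiny ETL sketch in plain Python: extract raw rows, apply a data-quality rule and an aggregation, and load the result into a reporting table. Real pipelines stack many such steps, often as dbt models orchestrated by something like Airflow; the column and table names here are hypothetical.

```python
# A deliberately tiny ETL sketch: extract raw rows, transform them, load them
# into a reporting table. Names and sample data are hypothetical.
import csv
import io
import sqlite3

RAW_CSV = io.StringIO("order_id,region,amount\n1,east,10.0\n2,west,\n3,east,25.5\n")

# Extract: read the raw records.
rows = list(csv.DictReader(RAW_CSV))

# Transform: drop malformed rows and aggregate revenue by region.
totals = {}
for row in rows:
    if not row["amount"]:          # a simple data-quality rule
        continue
    totals[row["region"]] = totals.get(row["region"], 0.0) + float(row["amount"])

# Load: write the refined result into a table the reporting layer can query.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE revenue_by_region (region TEXT, revenue REAL)")
conn.executemany("INSERT INTO revenue_by_region VALUES (?, ?)", totals.items())
print(conn.execute("SELECT * FROM revenue_by_region ORDER BY region").fetchall())
```

Multiply this by dozens of sources, hundreds of transformations, and a scheduler that has to run them in the right order, and the cost and fragility described above start to make sense.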
Quantifying the Shortcomings of the Data Warehouse
When looking at that list, you might think to yourself, “Why did the data warehouse ever achieve so much widespread adoption in the first place, then?” And that’s a very fair question. I didn’t make any attempt to exaggerate the shortcomings of DWs - I’m just speaking from long experience. Many or most of the problems listed above are in fact ones that the data warehouse model was developed to address. Issues like redundancy, security, inconsistency, and lack of self-service access are supposed to be strong points of the data warehouse architecture. The question of how, despite so much technical development, the data warehouse model has become so inadequate for addressing many of the core concerns of data management is a topic for another article. Suffice it to say here that the data warehouse is a badly dated model, and companies looking to select a data management model could do much better than the data warehouse. I’ll have more to say on that in future posts on architecture. Companies that already have data warehouses have made a significant investment in them, and if they find themselves unsatisfied with the result they have to make the difficult choice of investing more time and money in implementing a new DM (data management) approach, or continuing to spend on an approach that might not be paying off.
Report: Only 20% Of Companies Achieve ROI With Data Warehouses
You don’t have to take my word for it that the data warehouse is an outdated paradigm that does not meet the needs of many companies. Let’s take a look in more detail at the report linked above, by Dremio:
Data Warehouse Costs Soar, ROI Still Not Realized
That report is valuable because it grounds analysis of data warehouse effectiveness in terms of Return On Investment (ROI). The report offers some important and even shocking findings:
- in the aggregate, American companies spent $73 billion on data management software in 2020. Data management is an entire large industry unto itself.
- only 22% of companies have fully realized ROI in the last two years, and most (56%) have no way to measure it. In other words, most businesses are stuck paying for DM solutions that aren’t delivering, and many of them are flying blind when it comes to understanding whether they’re getting their money’s worth.
- data leaders are frequently concerned about out-of-control data warehouse costs and are eager to find newer architectures that deliver better technical and financial performance. That means there are many opportunities to help companies break out of a rut of low-performing data solutions and implement better solutions which are cheaper, easier to maintain, and better able to meet modern data needs like operationalizing machine learning pipelines.
- as discussed above, the average company makes twelve copies of its data, meaning that companies are spending time and money on redundant data, and introducing unnecessary complexity which creates ambiguity and inconsistency in the data, lowering the quality of decision-making.
- 82% of companies say that their end users have used inconsistent copies of the same dataset at the same time due to their ETL processes, undermining data integrity and slowing down the decision-making process.
- 94% of data warehouse users report concerns about data warehouse costs.
- 79% report concerns about scaling their architecture.
- 84% say it is normal for analysts to work with a partial, potentially incomplete dataset.
- only 16% expect fresh data in minutes or hours; most expect to wait weeks. In the modern digital environment, with so much connectivity and so many tools available, that’s a long time to wait. And for many companies it is too long, since change happens fast (particularly in tech) and the decision the report was meant to inform is sometimes irrelevant by the time it is delivered.
- 76% say they are locked in to certain vendors due to closed formats and systems, so only a quarter of companies have freedom to explore new tools and architectures. Even though so many companies are aware that they have adopted poorly performing solutions, they simply don’t feel like they have any other choice.
Why Have Data Warehouses Become The Norm?
These key findings quantify the list of DW shortcomings earlier in the article, and they help to further paint a dismal portrait of the state of the data warehouse in modern business. Surely companies wouldn’t spend so much money on solutions that aren’t working, you might think. But many studies show that that is exactly what is happening. Why?
Inertia (Change Happens Slowly, Even In Tech)
One answer to why this is the case is inertia. The lifecycle of the data warehouse is long (many years) given the large investment, and many companies are now stuck with data warehouse architectures that they put in place five or ten years ago. Ten years ago the data warehouse was new and exciting territory, with (for example) AWS Redshift having just been released and promising to revolutionize data management, giving companies easy access to a powerful new way to perform analytics. Now, in 2023, the pitfalls of those systems are much more apparent, but many companies are locked in to ‘solutions’ that haven’t solved much.
But it’s not just companies that made the decision a decade ago that have adopted data warehouses; companies continue to adopt these outdated architectures even though enough information is out there that they should know better. I worked with a company just last year that was only then getting around to implementing a Redshift cluster. Even though DWs are no longer a great solution, companies continue to adopt them because they see so many other companies using them. We are at a strange point in the market cycle where lots of companies are using data warehouses, but most of them are not happy with them. Meanwhile, companies with less data management experience see how widespread data warehouses are and conclude that they must be a good solution, because they’re an ‘industry standard’. So companies continue to pour money into these platforms. And if you’re investing in a data warehouse solution in 2023, you might be setting yourself up for failure, because there are much better options available.
Data Warehouses Are Very Profitable - For The Vendors
Another part of the answer to, “Why, God, why the data warehouse?” is marketing. The rise of data warehouses tracks closely with the rise of cloud computing, and DW systems are a huge moneymaker for cloud vendors - as the Dremio study explains, DM software was a $73 billion industry in 2020. Going down the above list of DW limitations, many of those issues actually represent selling points for cloud companies. If the underlying architecture is unwieldy and costly to maintain, that’s a feature and not a bug for cloud service providers, particularly since (until recently) there was no easy alternative to the DW, and the choice was limited to deciding which vendor’s overly expensive data warehouse solution to adopt.
But over the last decade, as data warehouses have continued to see widespread adoption, the world of data has changed massively, much more than in any previous decade. AI/ML is one of the most important areas of recent development. When many data warehouse systems were conceived and developed in the 2000s and early 2010s, AI and ML were a pipe dream for most organizations, a mysterious and fascinating new area of research but one which only large organizations like Google and Microsoft and Netflix were working on or had access to. Smaller organizations that adopted AI/ML tended to do so only for very limited and specific applications. But now AI is becoming accessible to every company (witness the rise of ChatGPT) and companies feel pressure to adopt their own AI solutions.
For companies wanting to build their own AI/ML models and train them on their own data, there are more tools available now than ever. But the data warehouse isn’t a great place to start for this purpose. For many ML applications, building a pipeline straight out of the data warehouse is not a good choice for a high-performing solution, because these systems are optimized for analytics workloads, which have different computational requirements than ML workloads. So building an effective ML pipeline often involves moving data out of the warehouse and into another system. MLOps will require an entire set of cloud services in addition to the data warehouse to achieve performant ML pipelines, which again raises the question of why a data warehouse should even be used if it’s not suitable for the modern applications that companies demand.
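As a concrete illustration of the “export, then train” pattern, here is a minimal sketch that assumes the relevant table has already been unloaded from the warehouse to a local file (a hypothetical churn.csv with invented feature columns and a ‘churned’ label). It uses pandas and scikit-learn, which is just one of many possible tool choices.

```python
# A minimal sketch of the "move the data out, then train" pattern. Assumes a
# hypothetical churn.csv has already been exported from the warehouse; the
# feature and label column names are invented for illustration.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

df = pd.read_csv("churn.csv")                 # exported from the warehouse
X = df[["tenure_months", "monthly_spend"]]    # hypothetical feature columns
y = df["churned"]                             # hypothetical label column

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

print("holdout accuracy:", accuracy_score(y_test, model.predict(X_test)))
```

Notice that nothing here touches the warehouse after the export; for this kind of workload the warehouse ends up serving as little more than an expensive staging area.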
Big Takeaway: DWs Are Here To Stay, But By Themselves Are Limited
It’s important, again, to acknowledge that there’s a reason why so many companies have invested in data warehouses. For conventional analytics and business intelligence applications, these systems can give good results (despite their cost and complexity). Many companies still struggle to achieve solid ROI, but others have gotten great results from data warehouses.
The point that I’d like to make is that in an ever-more complex world of data, the data warehouse by itself is increasingly an architectural anti-pattern. Ten years ago, before big data (with its myriad AI/ML applications, and increasing velocities of data requiring a dedicated streaming paradigm) became much more common, the modern fleet of data warehouses like Redshift were cutting edge. They allowed virtually any company access to enterprise-grade, almost supercomputer-like capability for analytics workloads without having to build a giant cluster up front on-site. And for some companies, something like the Redshift-dbt-Tableau combination for processing, ETL, and reporting still works really well.
But more and more companies are taking advantage of AI/ML, and it is important to remember that for all of the prepackaged AI products hitting the market, there is still often no substitute for performing some dedicated in-house data science (cleaning, feature engineering, model selection/training/testing). And it is increasingly common to deal with high-velocity streaming data, which represents a distinctly different paradigm that involves another layer of tooling and integration to work alongside the data warehouse.
If Not the Data Warehouse, or Only the DW, Then What?
Responding to the limitations of the data warehouse, the market has produced new solutions. The more recent rise in adoption of the data lake architecture did not really change this picture very much. While data lakes somewhat de-emphasized analytical database services like Redshift and Snowflake, they still entail complex architectural arrangements that require the use of multiple cloud services, often in Rube Goldberg-esque combinations. So even though the data lake did represent conceptual movement forward, and solved some of the issues facing DWs, the picture was largely still an either-or between warehouses and lakes, with both requiring a significant amount of vendor lock-in: once you’re committed to an inordinately complex architecture on one cloud platform (or combination of them), you have purchased the ticket for a very long (and often bumpy) ride that is difficult to get off of without waiting for the roller coaster to arrive back at the station (which hopefully isn’t Bankruptcy Junction).
DWs are not just profitable for cloud vendors; they’re also big cash engines for third-party vendors and developers, who make lots of money selling peripheral tools that address some of the fundamental architectural issues with DWs, and (dare I say) for consultants, who love data warehouses because they’re difficult to configure, develop, and run, meaning that companies often need lots of outside help to figure out how to make the DW work for them.
In the next article in the architecture series, we’ll discuss those data lakes, which represent an evolutionary step forward from data warehouses, and were designed to address many of the limitations of data warehouse platforms. Data lakes use open formats and object-based cloud storage to store data in its raw form, so they deal better with unstructured data, making them more flexible and suitable for modern applications like AI/ML and streaming. Data lakes also help avoid lock-in (since they use open formats), and (if properly architected) are more likely to allow companies to store data in a single, ‘ground truth’ location (instead of maintaining multiple redundant copies of the data). Thanks for reading.