Once upon a time, there was the enterprise data warehouse. The idea seemed to make sense: put all the data you want to analyze in one place, and then you can do all the analysis you might ever want to do. Economies of scale will make analysis easier and cheaper than having to figure out how to look at data sitting in lots of little piles scattered all over the place, which is messy and inefficient. The elegance of a single, optimized solution was very appealing.
And so we built massive monuments to data science. More and more data was sucked into the data warehouse, with complex ETL jobs to change it into the One True Format the data warehouse required. Data flowed into the data warehouse from all over the organization, and a future of endless insights gleaned from Big Data seemed inevitable.
But there was a problem. New datasets kept getting created outside of the data warehouse. New applications would be written, and they would generate data, but that data would sit close to the application. Sometime later we’d decide that we needed to include this new data in our analysis, but that analysis had to be done on data in the data warehouse. To get it there we’d spin up a project to write ETL jobs and make copies of the data and attempt to massage it into the One True Data Warehouse Format. And then we’d do it all again for the next app, and the one after that.
And maybe our company would acquire another one, and then we’d have two data warehouses. Or a business unit would decide that they needed to do things a bit differently, so they’d create another data warehouse that was also, confusingly, an enterprise data warehouse.