👋 Hi folks, thanks for reading my newsletter! My name is Diogo Santos, and I write about data product principles, the evolution of the modern data stack, and the journey to data mesh (the future of data architecture).
In today’s article, I’ll discuss the biggest challenges in the modern data stack: how we got here, what the main problems are, and how to address them. Please consider subscribing if you haven’t already. Reach out on LinkedIn if you ever want to connect.
Tags: Modern Data Stack | Data Products | Data Strategy | Data Engineering
How data looked in the early days
A decade ago, many companies’ data aspirations were mainly limited to business intelligence (BI). They wanted the ability to generate reports and dashboards to manage operational risk, respond to compliance, and ultimately make business decisions based on the facts, at a slower cadence.
In addition to BI, classical statistical learning was used for business operations in the insurance, healthcare, manufacturing, and finance industries. These early use cases, delivered by highly specialized teams, have been the most influential drivers for many past data management approaches.
In summary, this is how data looked in the early days:
Data was used for specific case analysis, not strategic decisions
Data was mostly used to automate reporting
Statistical modeling for the business was just beginning
The Traditional Data Stack
The traditional data stack (TDS) is another name for on-prem data systems.
Organizations managed their own infrastructure and hardware, which was a burden in terms of fragility (high resistance to any change), high maintenance costs (very manual-intensive upkeep), lack of scalability (hard to provision new infrastructure when the stack needed it), rigidity (due to bottom-up maintenance), and complicated root cause analysis.
The data landscape started changing
Around 2010, the rise of big tech and the overall growth in technology created new challenges for the data stack. Organizations now had to handle the following:
Increased Data Volume - This required moving from rigid governance and heavy modeling in the data warehouse to a less controllable data lake environment where data could be stored. The cost of storing so much data also became a big challenge to manage.
New Types of Data - New types of data emerged, such as text, images, and audio. Most organizations at the time didn’t even know how they would take advantage of this unstructured data.
More Business Use Cases - Organizations could now leverage the volume and new types of data to build more accurate models to improve decision systems. NLP, Computer Vision, and Recommender Systems became more accessible to all organizations.
The Modern Data Stack
With these new challenges, the data stack had to evolve. That is when the modern data stack (MDS) arrived.
The biggest achievement of the MDS was the shift to the cloud, which made data more accessible, recoverable, and easier to manage from a technical perspective. The MDS facilitated data collection and ingestion, supported high-velocity data streams, and offered high scalability at low cost.
The MDS is a collection of multiple tools connected to each other to enable an active flow from physical data to business insights.
These tools are characterized by their simple deployment in the cloud, scalability, and component-based nature. Each tool solves a distinct challenge in data, such as the separation of compute and storage (Snowflake, Databricks), data lineage (Stemma, Alation), transformation (dbt), job orchestration (Airflow, Prefect), schema management (Protobuf), streaming (Kafka), monitoring (Monte Carlo, Bigeye), and many more.
Which problems have emerged?
The MDS emerged as a disconnected group of tools that gave rise to heavy pipelines and data dumps into a central lake, creating unmanageable data swamps across industries. These tools were never built to work together across the data value chain.
Moreover, data has outgrown the initial use case of providing a dashboard for an executive and grown into hundreds of models and dashboards in a few short years. This has resulted in a series of problems:
With data being ingested from many different sources, understanding the context of data became harder: the data warehouse could no longer be a replication of the real world, where entities and tables connect to each other.
Many data initiatives reuse the same data with a different name or reference unmaintained tables.
Good testing is rare. Debugging is challenging.
Teams struggle to understand the source of truth for important data and start creating their own bespoke tables for ad-hoc questions resulting in ‘data debt’ (more about data debt in future posts).
Data teams spend months generating feature sets for ML models, creating metrics, running experiments, and data munging.
Critical datasets break down all the time and lack accountability and ownership.
We are witnessing an increase in data debt, more bugs to handle on a daily basis, and a huge lack of control over the data warehouse, which lost importance in the last decade despite being the most important data asset in the organization.
Most organizations are either witnessing these problems now, because their data stack has evolved, or about to witness them as they continue to invest in their data journey.
What can be done now?
The Modern Data Stack ended up solving mostly engineering challenges related to cost and performance, while generating new problems around how data is used to solve business problems.
The primary objective of leveraging data was and will be to enrich the business experience and returns, so that’s where our focus should be.
Below are some ideas of what can be done to reduce the bottleneck between data production and data consumption:
Data Warehouse as the foundation for all analytics
Building a semantic mapping between all the data being ingested from different sources will enable a truly effective data warehouse and a much better experience for data consumers. Much more time needs to be invested in understanding how different data sources connect to each other and map to the real world.
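As a toy illustration of such a mapping, consider a table that translates source-specific field names into shared warehouse entities; every source, field, and entity below is invented for the example:

```python
# Hypothetical semantic mapping: each raw source field is tied to a
# canonical warehouse entity and attribute, so consumers see one
# consistent view of "customer" regardless of the originating system.
SEMANTIC_MAP = {
    ("crm", "cust_id"): ("customer", "customer_id"),
    ("billing", "client_ref"): ("customer", "customer_id"),
    ("crm", "email_addr"): ("customer", "email"),
    ("billing", "invoice_total"): ("invoice", "amount"),
}

def to_canonical(source: str, record: dict) -> dict:
    """Rename a raw record's fields to their canonical entity.attribute form."""
    canonical = {}
    for field, value in record.items():
        entity, attribute = SEMANTIC_MAP.get(
            (source, field), ("unmapped", field)  # flag fields with no mapping
        )
        canonical[f"{entity}.{attribute}"] = value
    return canonical

print(to_canonical("billing", {"client_ref": 42, "invoice_total": 99.9}))
# {'customer.customer_id': 42, 'invoice.amount': 99.9}
```

The point is not the dictionary itself but the discipline: two systems calling the same real-world entity by different names resolve to one warehouse concept.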
Connecting Software Engineers (SWE) to data workflow
Software Engineers produce the majority of data used in business reports, experiments, and models. Yet, they have no idea how their data is being consumed.
Data engineers end up as middlemen, spending more time fixing pipelines because something changed upstream (in back-end or front-end services) than creating new ones to power business opportunities. That being said, some changes need to occur in this process:
Data Contracts First - Data contracts are expectations on data. These expectations can be in the form of business meaning, data quality, or data security and governance. They ensure data engineers understand upstream data and avoid pipeline breaks due to upstream changes (a minimal sketch follows this list).
Engineering deeper into the data generation lifecycle - Engineering teams need to understand how the data they are producing will be used in downstream processes. This ensures they take downstream use into consideration when planning changes.
Data quality is owned by the engineering team as they are the ones producing the data, at least until the data is loaded into the lake.
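To make the contract idea concrete, here is a minimal sketch in plain Python; the schema, field names, and thresholds are all hypothetical, and real contracts usually live in a schema registry or a dedicated tool rather than application code:

```python
# A hypothetical data contract: the producing team declares the schema
# and quality expectations; the pipeline validates every batch against
# it before the data is allowed downstream.
ORDERS_CONTRACT = {
    "schema": {"order_id": int, "amount": float, "currency": str},
    "required": ["order_id", "amount"],   # must never be null or missing
    "max_null_rate": {"currency": 0.05},  # tolerated nullability per field
}

def validate_batch(batch: list[dict], contract: dict) -> list[str]:
    """Return a list of contract violations for a batch of records."""
    violations = []
    for i, record in enumerate(batch):
        for field in contract["required"]:
            if record.get(field) is None:
                violations.append(f"record {i}: required field '{field}' is missing")
        for field, expected_type in contract["schema"].items():
            value = record.get(field)
            if value is not None and not isinstance(value, expected_type):
                violations.append(
                    f"record {i}: '{field}' is {type(value).__name__}, "
                    f"expected {expected_type.__name__}"
                )
    for field, max_rate in contract["max_null_rate"].items():
        null_rate = sum(r.get(field) is None for r in batch) / len(batch)
        if null_rate > max_rate:
            violations.append(f"'{field}' null rate {null_rate:.0%} exceeds {max_rate:.0%}")
    return violations

print(validate_batch([{"order_id": 1, "amount": 10.0, "currency": None}], ORDERS_CONTRACT))
```

The key design choice is that the contract is owned by the producing team: when they plan a schema change, the failing validation tells them, not the downstream consumers, that expectations were broken.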
Data engineering closer to business context
Data engineers are responsible for setting up and managing the data platform and its workflows. They act as the middleman between software engineers and data scientists and analysts. Yet, most of the time, they are building pipelines with no business context and no clue about the final goal of the tables they’re delivering.
Without business context, data engineers are prevented from understanding how different data points should be connected and fail to build a data warehouse that maps to the real world.
Data Product Thinking
The data team needs to ensure that the data products they are building solve a real user problem. Approaching a data product from a merely technical perspective is no longer enough. Ensuring a fit between the data product and the user problem must become part of the data workflow.
You can read more about data product thinking in my previous article.
New Data Modelling Framework
Traditional data modeling faces challenges such as heavy governance, very rigid processes, difficulty iterating, and longer time to insight. Data modeling design worked well at a time when data was under control and teams could ensure that the data ingested would fit a specific schema.
But with the increase in data volume and data sources, it became harder to apply data modeling, and that’s when data warehouses started losing their main purpose.
The data ecosystem needs to think about Data Modelling 2.0 for the Modern Data Stack. A decentralized data architecture, where data is distributed across domains, can be the solution to this problem.
Define data governance policies
Most of the investment in data initiatives over the last decades was focused on more and better technology. Not much time was invested in processes and data management. Here are some practices to ensure data is managed effectively, securely, and ethically:
Data Ownership - Who owns and is responsible for specific data. This ensures accountability.
Data Quality Standards - A set of criteria to ensure data is accurate, complete, consistent, and timely. These criteria ensure data is reliable and trustworthy.
Data Catalogs - A repository of all data products within an organization, providing a single location for metadata, documentation, and data lineage, making it easier to discover, understand, and use data (see the sketch after this list).
Data policies and procedures - A set of procedures defining how data should be collected, stored, processed, validated, and used in the organization.
Data Lineage - Provides a clear understanding of the origin and movements of data.
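As a loose illustration of how ownership, cataloging, and lineage reinforce each other, here is a tiny in-memory catalog; the dataset names, owners, and descriptions are invented for the example:

```python
# A hypothetical, minimal data catalog: each entry records who owns a
# dataset, what it means, and which upstream datasets it derives from,
# so lineage questions can be answered by walking the graph.
CATALOG = {
    "raw.orders": {"owner": "checkout-team", "upstream": [],
                   "description": "Raw order events from the checkout service"},
    "staging.orders_clean": {"owner": "data-eng", "upstream": ["raw.orders"],
                             "description": "Deduplicated, validated orders"},
    "marts.revenue_daily": {"owner": "analytics", "upstream": ["staging.orders_clean"],
                            "description": "Daily revenue aggregated from clean orders"},
}

def lineage(dataset: str) -> list[str]:
    """Return the full upstream lineage of a dataset, nearest first."""
    chain = []
    for parent in CATALOG.get(dataset, {}).get("upstream", []):
        chain.append(parent)
        chain.extend(lineage(parent))  # walk recursively back to the raw sources
    return chain

print(lineage("marts.revenue_daily"))
# ['staging.orders_clean', 'raw.orders']
```

With ownership attached to every node, a broken dashboard stops being an anonymous failure: walking the lineage tells you exactly which dataset, and which team, to talk to.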
Is Data Mesh a valid solution?
In my mind, it’s not the framework that matters the most, but rather our capability to ensure that data can successfully enrich the business.
Data mesh covers several of the challenges I’ve mentioned in this article, but it’s not the solution for everything, and it can’t be blindly applied to every organization. Besides, it’s a framework in its early days, and adaptations of what data mesh really is will start to emerge as businesses adopt it.
In closing, despite my seemingly negative view of the Modern Data Stack, I am very optimistic about the data industry's future and our capability to adapt and improve.
If you want to read more content, make sure to follow me on LinkedIn for more weekly posts, and if you liked my article, please consider subscribing.
Thank you so much for reading and let’s talk again soon.