This post is long overdue, largely because there is a lot of confusion around data lineage, data observability, and how the two depend on each other.
As many of us know, Data Lineage is one of the most discussed topics today and so is data observability. This article covers the applicability of data lineage in Data Observability.
PS: If you are specifically looking to understand lineage better, these are a few of my favorites on the topic.
Building Data Lineage: Netflix & Leveraging Apache Spark for Data Lineage: Yelp
Data lineage tracks the changes and transformations impacting any data record. It tells us where data is coming from, where it is going, and what happens to it as it flows from data sources and pipeline workflows to downstream data marts and dashboards.
In short, it enables better data governance by giving you a more complete, top-down picture of your data and analytics ecosystem.
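To make this concrete, here is a minimal sketch of lineage represented as a directed graph in Python. The asset names and graph structure are purely illustrative, not any particular tool's model.

```python
# A minimal sketch of lineage as a directed graph: each entry says
# "this asset is derived from these upstream assets". Names are illustrative.
lineage_edges = {
    "crm.contacts_raw":        [],                        # source system extract
    "staging.contacts_clean":  ["crm.contacts_raw"],      # cleaned/deduplicated
    "marts.customer_360":      ["staging.contacts_clean"],
    "dashboards.churn_report": ["marts.customer_360"],
}

def upstream(asset, edges):
    """Return every asset that `asset` depends on, directly or transitively."""
    seen, stack = set(), list(edges.get(asset, []))
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(edges.get(parent, []))
    return seen

print(upstream("dashboards.churn_report", lineage_edges))
# {'marts.customer_360', 'staging.contacts_clean', 'crm.contacts_raw'}
```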
Data lineage is essential for understanding the health of the data because it ensures transparency around the source of data, ownership, access, and its transformation. These are excellent indicators of the reliability and trustworthiness of the data. However, unlike other data quality indicators such as completeness, validity, accuracy, consistency, uniqueness, and timeliness (read here), data lineage cannot be metricized.
The insights provided by lineage enable data users to solve all kinds of problems encountered within mass quantities of interconnected data: data troubleshooting, impact analysis, discovery and trust, data privacy regulation (GDPR and PII mapping), data asset cleanup/technology migration, and data valuation.
Data Observability is a set of measures that can help predict, identify, and resolve data issues, often by leveraging statistical analysis and machine learning.
Data Observability aims to reduce the mean time to detect (MTTD) and mean time to resolve (MTTR) data issues.
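To make the two metrics concrete, here is a small, hypothetical calculation for a single incident; the timestamps are invented for illustration.

```python
from datetime import datetime

# Hypothetical incident timeline (all timestamps are invented for illustration).
occurred = datetime(2023, 5, 1, 14, 0)   # bad data lands in the warehouse
detected = datetime(2023, 5, 1, 17, 30)  # an alert fires / a user complains
resolved = datetime(2023, 5, 2, 9, 0)    # backfill completed, dashboards correct

time_to_detect  = detected - occurred   # averaged over incidents, this is MTTD
time_to_resolve = resolved - detected   # averaged over incidents, this is MTTR
print(f"time to detect: {time_to_detect}, time to resolve: {time_to_resolve}")
```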
Image source: Telm.ai
Data lineage can be very effective in achieving good observability outcomes. The insights provided by lineage enable data users to quickly find the root cause of issues, conduct impact analysis, and invoke remediation. So your data observability tooling should leverage lineage information.
Today, most data observability companies leverage data lineage to reduce mean time to resolve (MTTR) issues by doing two things:
Root cause analysis (RCA)
Once a monitoring system detects an issue, lineage can help investigate the previous steps in the pipeline and any recent changes. Specifically, if you are using data warehouse monitoring tools like Monte Carlo, Metaplane, etc., you can use table/view lineage to shrink root cause analysis time.
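Here is a sketch of what lineage-assisted RCA could look like: starting from the table that triggered an alert, walk upstream and surface any parent whose latest run failed or whose schema changed recently. The lineage graph, run statuses, and schema-change log below are hypothetical stand-ins for whatever metadata your tooling exposes.

```python
# Hypothetical lineage and pipeline metadata, purely for illustration.
lineage_edges = {
    "marts.revenue_daily": ["staging.orders", "staging.payments"],
    "staging.orders":      ["raw.orders"],
    "staging.payments":    ["raw.payments"],
}
last_run_status       = {"staging.payments": "FAILED", "staging.orders": "SUCCESS"}
recent_schema_changes = {"raw.orders": "column `amount` renamed to `total_amount`"}

def suspects(alerted_table):
    """Walk upstream from the alerted table and collect likely root causes."""
    findings, stack, seen = [], [alerted_table], set()
    while stack:
        table = stack.pop()
        if table in seen:
            continue
        seen.add(table)
        if last_run_status.get(table) == "FAILED":
            findings.append(f"{table}: last pipeline run failed")
        if table in recent_schema_changes:
            findings.append(f"{table}: {recent_schema_changes[table]}")
        stack.extend(lineage_edges.get(table, []))
    return findings

print(suspects("marts.revenue_daily"))
# ['staging.payments: last pipeline run failed', 'raw.orders: column `amount` renamed to `total_amount`']
```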
Downstream Impact Analysis
Often, data lineage can help find the downstream impact and invoke a remediation flow. Remediation is by definition a reactive approach and can become a nightmare.
If you have a data mesh architecture with multiple consuming products using the data, leveraging lineage helps you identify which products are impacted and who owns those products/systems, so you can systematically notify those users. Another use case is using lineage to identify the impacted dashboards.
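A minimal sketch of that flow is below: given an asset with a known issue, follow downstream lineage to every consuming product and notify its owner. The asset names, the owner registry, and the notify() stub are all hypothetical.

```python
# Hypothetical downstream lineage and owner registry, for illustration only.
downstream_edges = {
    "marts.customer_360":      ["product.recommendations", "dashboards.exec_kpis"],
    "product.recommendations": [],
    "dashboards.exec_kpis":    [],
}
owners = {
    "product.recommendations": "recs-team@example.com",
    "dashboards.exec_kpis":    "analytics-team@example.com",
}

def notify(owner, message):
    print(f"[notify {owner}] {message}")   # stand-in for email/Slack/pager

def alert_downstream(asset, issue):
    """Find every consumer of `asset` via lineage and notify its owner."""
    stack, seen = list(downstream_edges.get(asset, [])), set()
    while stack:
        consumer = stack.pop()
        if consumer in seen:
            continue
        seen.add(consumer)
        notify(owners.get(consumer, "unknown-owner"), f"{consumer} impacted: {issue}")
        stack.extend(downstream_edges.get(consumer, []))

alert_downstream("marts.customer_360", "partial load detected upstream")
```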
The above methods are definitely useful but still suboptimal for data observability.
Data Lineage for RCA and Impact analysis
Data Observability, by definition, should be proactive. Data engineers should use lineage to check the downstream impact before making any changes. For example: am I making schema changes? Let me automatically check the downstream implications.
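Here is a sketch of such a pre-change check, assuming column-level lineage is available: before applying a schema change, list the downstream assets that reference the affected column so the engineer (or a CI job) can review the blast radius first. The column lineage map and asset names are hypothetical.

```python
# Hypothetical column-level lineage: (table, column) -> downstream dependents.
column_lineage = {
    ("staging.orders", "order_ts"): [
        ("marts.revenue_daily", "order_date"),
        ("dashboards.sales_trend", "x_axis"),
    ],
}

def preflight_check(table, column):
    """Report which downstream assets would be impacted by changing a column."""
    impacted = column_lineage.get((table, column), [])
    if impacted:
        print(f"Changing {table}.{column} would impact:")
        for asset, col in impacted:
            print(f"  - {asset}.{col}")
        return False   # e.g. fail the CI job until the change is acknowledged
    return True

preflight_check("staging.orders", "order_ts")
```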
For data in motion, a full pipeline monitoring approach will help not only with MTTR (resolve) but also with MTTD (detect). Design a metric monitoring system that monitors every step in the data pipeline, so users get alerted when there is a drift or outlier at a specific step, and multiple data points will help detect issues even before they have downstream impact.
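A minimal sketch of step-level metric monitoring is below: track a simple metric (row count) at every pipeline step and flag a step when today's value drifts far from its recent history, here with a basic z-score check. The step names and numbers are illustrative, not from any real pipeline.

```python
import statistics

# Last 7 daily row counts per pipeline step (illustrative numbers).
history = {
    "extract_crm":    [10_120, 10_340, 9_980, 10_210, 10_400, 10_150, 10_290],
    "clean_contacts": [10_050, 10_260, 9_900, 10_130, 10_320, 10_070, 10_210],
    "load_warehouse": [10_050, 10_260, 9_900, 10_130, 10_320, 10_070, 10_210],
}
today = {"extract_crm": 10_180, "clean_contacts": 4_950, "load_warehouse": 4_950}

for step, past in history.items():
    mean, stdev = statistics.mean(past), statistics.pstdev(past)
    z = (today[step] - mean) / stdev if stdev else 0.0
    if abs(z) > 3:
        print(f"ALERT: {step} row count {today[step]} drifted (z={z:.1f})")
# Alerts fire at 'clean_contacts' and everything after it, pointing at the
# earliest affected step rather than only the final warehouse table.
```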
Example: get alerted that the CRM system delivered partial data at 2pm PST. Such an alert can automatically be sent to the CRM system owner, so data engineers don't have to find out about these issues inside Snowflake or BigQuery and reverse-track them back to the CRM system. Often this approach will also help orchestrate the data pipeline flow (see the circuit breaker pattern).
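As a rough illustration of the circuit breaker idea applied to a pipeline: if the CRM extract looks partial, "open the circuit" and skip downstream loads instead of propagating bad data. The threshold, row count, and task names here are hypothetical.

```python
# Hypothetical circuit-breaker style check inside a pipeline orchestration step.
EXPECTED_MIN_ROWS = 9_000   # illustrative threshold for a "complete" CRM extract

def crm_extract_complete(row_count):
    return row_count >= EXPECTED_MIN_ROWS

def run_pipeline(crm_row_count):
    if not crm_extract_complete(crm_row_count):
        # Circuit open: alert the CRM owner and halt downstream tasks.
        print("Circuit open: CRM extract looks partial, halting downstream loads")
        return
    # Circuit closed: safe to continue the normal flow.
    print("Circuit closed: loading warehouse and refreshing dashboards")

run_pipeline(crm_row_count=4_950)   # partial extract -> downstream loads skipped
```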
A good analogy is monitoring vs. tracing and logs. If you have a lot of monitoring, you may not need tracing and log review, because tracking at every level will highlight the same problems, just sooner; all of these methods surface the same issues from different angles. So I would say that lineage is a good tool for root cause analysis and impact analysis (MTTR), but your observability tools should focus more on reducing mean time to detect (MTTD) issues, i.e., be more proactive.
Data lineage is a crucial aspect of data reliability. However, to effectively reduce both MTTD and MTTR for data issues:
1: Leverage lineage to understand the downstream impact before making changes.
2: Monitor important data metrics at every step of the pipeline and leverage lineage to identify exactly where (which step of the pipeline) a metric has drifted.
This approach enables users to reduce both time to detect and time to resolve data issues.