Oct 18, 2022
30 min

S01 E06: Understanding Full-Stack Data Observability with Salma Bakouk, Co-founder of Sifflet Data

Data quality issues have existed for as long as businesses have used data to drive business initiatives. 'Data observability' as a category is gaining a lot of attention and maturing pretty fast. To understand the evolution and the current rise of data observability, we have Salma Bakouk with us, who, with her team, is building a tool that helps both data engineers and data consumers navigate data reliability and data quality issues.

Available On:
Spotify
Google Podcasts
YouTube
Amazon Music
Apple Podcasts

About the guest

Salma Bakouk
CEO and Co-founder

Salma Bakouk is the CEO and co-founder of Sifflet Data, a full-stack data observability platform. Before starting Sifflet, Salma was an executive director at Goldman Sachs, where she led various data initiatives and helped steer the company towards being data-driven. While working at Goldman Sachs, Salma encountered numerous data quality issues, which eventually led her to start Sifflet Data. Salma holds an engineering degree in applied mathematics and statistics, and she describes herself as a mathematician by design, a data nerd, and a recovered investment banker.

In this episode

  • How Salma and her team started Sifflet Data.
  • Understanding full-stack data observability.
  • Why data lineage is a tough problem to solve.
  • Tackling alert fatigue.
  • Rise of data observability.

Transcript

00:00:00
Hello everyone, and welcome back to another episode of the Modern Data Show. For today's episode, we have Salma Bakouk with us, who is the CEO and co-founder of Sifflet Data, a full-stack data observability platform. Before starting Sifflet, Salma was an executive director at Goldman Sachs, where she led various data initiatives and helped steer the company towards being data-driven. While working at Goldman Sachs, Salma encountered numerous data quality issues, which eventually led her to start Sifflet Data. Salma holds an engineering degree in applied mathematics and statistics, and she describes herself as a mathematician by design, a data nerd, and a recovered investment banker. Welcome to the show, Salma.
00:00:38
Thank you so much, Aayush. Thank you for the introduction.
00:00:41
Salma, let's start with a little bit about you. Tell us a little bit about your journey from Goldman Sachs to now starting this company called Sifflet Data. Help us understand a little bit more about that journey.
00:00:52
Yeah. So as you mentioned, I started my career at a US investment bank, Goldman Sachs. I initially joined as an analyst in the equity sales and trading division, and as you can imagine, the trading floor is a highly data-intensive environment, an environment where decisions are made in real time, big decisions that involve a lot of data and a lot of other elements. So I was very quickly thrown into the kind of environment that relies on data to make decisions in real time and, more importantly, where data quality issues can very quickly lead to a lot of problems. I had a front-row seat to everything that can go wrong as far as data quality, or the lack thereof, goes.

I was in equity sales and trading, and I was responsible for a sales team. So I was a revenue leader, and then I turned into a technology and data leader, where I led a couple of key initiatives to help make the team and the sales and trading division more data-driven. The bank was investing a lot in infrastructure and engineering in general, and we had an amazing engineering team and great infrastructure. That said, access to data, at the risk of using the buzzword, was reserved for a very small group of technical people. There wasn't any data democratization, as we call it these days. That was the thought process behind the initiatives I led, which were aimed at making business users more responsible for their data: accessing it and being able to leverage it day in and day out to make smarter decisions.

I started Sifflet because, again, as a data consumer and then a business leader, I was exposed to data quality issues a lot. It was costing the firm a lot of resources in terms of lack of efficiency. Errors would happen every now and then, and there were a lot of headaches internally and a lot of intense conversations. This was in 2017, 2018. We had a pretty decentralized type of architecture; it was a mesh without actually calling it a data mesh. So I started looking for solutions for my team that could help us get better at managing the quality of our data assets. One thing led to another, I got even more passionate about the topic, and I ended up leaving Goldman to start Sifflet with my two co-founders, whom I have known for a decade. That's where Sifflet came from.

Wow, amazing. And correct me if I'm wrong, you guys are based out of Paris?

We are. We're based in Paris. We're a remote-first company, so we have employees all over, and obviously we work with organizations from all over the world.
00:03:25
Okay. And how big is the Sifflet team right now?
00:03:27
We're 20-ish. I'm afraid if I say the wrong number, people are gonna be pissed at me, so I'm just gonna say 20-ish.
00:03:34
Okay.
00:03:35
Okay. So, yeah, and growing super fast. We were three co-founders less than a year ago.
00:03:41
Wow, that's great. So Salma, tell us a little bit more about it. As I've seen on the website, Sifflet Data is what you call a full-stack data observability platform. Help us break down those four words: "full stack" and "data observability." Help us understand these things.
00:03:59
For sure. So let me start with observability. Observability is a concept that, as I'm sure you know, comes from software, and before that from control theory. The idea in software observability, with what companies like Datadog and New Relic have created over the past decade, was to help software engineers get a better sense of what's going on inside their applications and to monitor the health status of the different components of those applications. Fast forward to now, where data adoption is bigger than ever, where companies rely on tens if not hundreds of external data sources, and where data is becoming central to every modern organization. Data engineers and data practitioners turn to software engineering to learn a lot from its practices, it being the more mature brother, or more mature cousin I guess, in the family of engineering in general.

So when we started, from my experience at Goldman, but also my co-founders' experience working for companies like Uber, Amazon, et cetera, as data practitioners we were looking for things to improve the reliability of our data assets and improve the level of visibility we had over what was going on inside the data infrastructure in general. The funny thing about us as a founding team is that I come from more of a data consumer slash business leader kind of background, whereas my co-founders come more from pure analytics engineering, data engineering, and software engineering backgrounds. So when we decided to launch Sifflet, we had different perspectives on what we wanted the tool to look like, which ended up being a superpower for us, because we are building a tool that can help both data engineers and data consumers navigate data reliability and data quality issues.

So, long story short, and back to observability in data particularly: our approach with Sifflet emulates a lot of the approach in software observability, in the sense that software observability is based on three main pillars: metrics, logs, and traces. Data, in my opinion, is a much more complex type of environment. It has a lot of infrastructure elements, for sure, but at the same time there are a lot of other things thrown into the mix that make it even more complex for data practitioners to navigate, and for which a framework that's solely based on infrastructure is not enough. And that's where data observability comes into the picture. So our framework with Sifflet is based on three main pillars: metrics, metadata, and lineage.

For metrics, we look at your basic data quality monitoring metrics, or what constitutes the basis of an anomaly detection framework. You wanna look at anomalies within the data itself, you wanna look at anomalies at the metadata level, and you wanna look at anomalies at the infrastructure level. And I think that's also one of the key differentiators for Sifflet, because we want to make data quality monitoring a bit more proactive. Cuz when you start detecting things like volume problems, a schema change problem, a statistical outlier, or something like that, our opinion is that it's already too late in the process, right? In an ideal scenario, you wanna be able to detect signals at the infrastructure level that could potentially lead to a data anomaly or a metadata anomaly.
So we have that dimension as well. The second pillar is lineage, because, in my opinion, if you wanna have full observability, if you wanna have full visibility over your data assets and understand how the different assets are related and how they connect and communicate with each other within a complex ecosystem, you need to have lineage. And so we invested a lot in our lineage capabilities, and that's where the full-stack name comes from. We sit on top of an existing data stack, and we have connectors from ingestion to consumption, whether consumption is BI, analytics, reverse ETL, ML, or whatever. And we can compute the lineage automatically throughout the whole data stack, so we can follow the whole journey of the data assets, to essentially give context to the anomalies that we detect.

And then the last pillar, metadata, is important to get a better understanding of how different components or different attributes of the data assets interact with each other, to be able to get even better visibility and better overall observability. There's also a pillar that we don't talk about a lot, which we use in our framework as well, which is logs. This is very similar to the definition of logs in software engineering, and that's what ties back to the infrastructure monitoring element: you wanna make monitoring more proactive, and you wanna get insights from the different applications where the data is running to be able to detect what could lead to a data anomaly before it's processed and dealt with. So that's the framework at a high level.
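To make the three-pillar framework a bit more concrete, here is a minimal sketch of what data-level and metadata-level checks might look like in practice. This is not Sifflet's implementation, just an illustration; the table name, thresholds, and snapshot values are all hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical snapshot of a table's state, as a warehouse's information
# schema might report it (names and numbers are illustrative only).
snapshot = {
    "table": "analytics.daily_revenue",
    "row_count": 14_200,          # data-level signal: volume
    "null_rate_amount": 0.002,    # data-level signal: field quality
    "last_loaded_at": datetime.now(timezone.utc) - timedelta(hours=26),
    "columns": ["order_date", "amount", "currency"],  # metadata: schema
}

expected_columns = ["order_date", "amount", "currency"]
historical_avg_rows = 15_000

alerts = []

# Data-level checks: volume drift and null-rate spikes.
if abs(snapshot["row_count"] - historical_avg_rows) / historical_avg_rows > 0.3:
    alerts.append("volume anomaly: row count deviates >30% from history")
if snapshot["null_rate_amount"] > 0.01:
    alerts.append("quality anomaly: null rate on 'amount' above 1%")

# Metadata-level checks: freshness and schema change.
if datetime.now(timezone.utc) - snapshot["last_loaded_at"] > timedelta(hours=24):
    alerts.append("freshness anomaly: table not refreshed in 24h")
if snapshot["columns"] != expected_columns:
    alerts.append("schema change: column list differs from expectation")

for a in alerts:
    print(a)
```

The proactive, infrastructure-level signals Salma describes would sit upstream of checks like these, flagging, say, a failed load job before the stale table is ever queried.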
00:08:36
Those are amazing points, Salma. One thing that I would want you to dive deeper into is lineage. Lineage is, by its fundamental nature, a tough problem to solve. And it's not just a tech problem; it's more of a people problem, a process problem. There are a few moving elements that are beyond the control of technology. How are you tackling that? Is there any level of automation when it comes to discovering those lineages? How are you handling it?
00:09:06
Absolutely. That's such a great question, and I agree with you a hundred percent that lineage is a very complex problem. It's a concept that still draws a lot of confusion, and it's actually not an easy problem to solve. Anecdotally, when we started Sifflet, the very first lines of code we wrote were for the data lineage, because we knew from the get-go, and that's why we call ourselves a full-stack, or full-data-stack, observability platform, that if you wanna have a bird's-eye view of all your data assets, you need to know how they're connected. That's because of the growing complexity of data infrastructures and the growing expectations from stakeholders and the business around what data teams and data infrastructures do. Cuz we're not gonna lie, data infrastructures are very expensive to create and maintain, so you wanna make sure that you get the best ROI on your data investments. And so we knew from the get-go that we didn't want to just build a simple anomaly detection framework that sends you an alert when something breaks. We've been in the data practitioner seat in the past, and we know that is not enough; that's how you get to a point where you have alert fatigue and people start ignoring problems. So we started by building lineage, and it still is today one of the key pillars of our product.

Now, back to your question. A hundred percent, lineage should be automated; otherwise, you shouldn't even be calling it lineage. We speak to some customers who tell us about their pain points around lineage and how they've been maintaining it with different CSV files. And I hear horror stories about people computing lineage manually. We sat through a presentation once from a customer that showed us how their lineage was done in PowerPoint: every time somebody created a new data asset, they documented it manually, and that was their lineage. It definitely shouldn't be like this. That's not scalable, that's not maintainable. And, back to your comments, it doesn't help nurture, create, and foster a data-driven culture. Dealing with data becomes a problem and a huge hassle, and so people are less and less incentivized to deal with it in the first place.

So, we compute lineage in a variety of different ways. As I said, we invest a lot in our lineage; we have a team dedicated to it specifically. It's the backbone of our product, and we make sure that our algorithms are as deeply connected as possible so that we can reverse-engineer any type of information we get from connecting to the data platform. Concretely speaking, we have connectors across the whole data stack that collect log information, that reverse-engineer the code, that have parsers and so on, and that look at the metadata and a lot of other elements. They compute lineage information automatically and update it in near real time, so that users never have to worry about how the data assets are related, or, God forbid, somebody deleting something and breaking everything for people downstream.

But I feel like lineage is an ever-evolving topic, and it's an area where there's still a lot of innovation to be done, in my opinion. For example, we wanna be able to expand the lineage even to the infrastructure layer. You wanna be able to go a bit deeper into the orchestration, transformation, and modeling layers. You wanna go into the semantic layer.
So there's an infinity of ways you can do lineage and improve it. But to your point about automation: you can't even call it lineage if it's not automated.
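Salma mentions connectors that reverse-engineer code with parsers to compute lineage automatically. As a rough illustration of that idea, and not of how Sifflet actually does it, here is a sketch that uses the open-source sqlglot parser to extract table-level lineage edges from a single INSERT ... SELECT statement. The table names and the choice of dialect are made up for the example.

```python
import sqlglot
from sqlglot import exp


def table_level_edges(sql: str, dialect: str = "snowflake") -> dict:
    """Parse one INSERT ... SELECT and return a target -> [sources] edge."""
    tree = sqlglot.parse_one(sql, read=dialect)

    def fq_name(table: exp.Table) -> str:
        # Build a fully qualified name from whatever parts are present.
        return ".".join(p for p in (table.catalog, table.db, table.name) if p)

    target = fq_name(tree.this.find(exp.Table))  # the INSERT target
    sources = sorted({fq_name(t) for t in tree.expression.find_all(exp.Table)})
    return {target: sources}


edges = table_level_edges("""
    INSERT INTO analytics.daily_revenue
    SELECT o.order_date, SUM(o.amount * r.rate)
    FROM raw.orders AS o
    JOIN raw.fx_rates AS r ON o.currency = r.currency
    GROUP BY o.order_date
""")
print(edges)  # {'analytics.daily_revenue': ['raw.fx_rates', 'raw.orders']}
```

A production lineage engine would run something like this across every query log entry, dbt model, and BI extract, then merge the edges into one continuously refreshed graph; the hard part is the breadth of dialects and layers, which is exactly the investment Salma describes.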
00:12:40
Yeah. So, Salma, you briefly mentioned this term "alert fatigue," and that's a real thing. I've spoken to a couple of data leaders, and one of the initial concerns and pushbacks that comes with implementing a data observability or data quality monitoring tool is this whole fear of alert fatigue. How do you filter out signals from the huge spectrum of noise that might get generated by anomaly detection, the tons of notifications and all that kind of thing? So my first question: you mentioned that you address a certain level of alert fatigue by using lineage, because once you understand the relationship between data assets, you can give more contextual information when it comes to alerting. First, how are you dealing with alert fatigue apart from lineage? And then a follow-up question: how do you get internal buy-in and build the case for a data observability tool within an organization?
00:13:42
Another great question. So, alert fatigue. The easiest proxy we can make about alert fatigue and its implications, and at least that's how we learned to improve our model to fight alert fatigue, was looking at cybersecurity. Cause that's an area that's slightly more mature than, broadly speaking, data governance and data observability, and where a lot more attention is given to alerts, because no company today is safe from a cybersecurity attack. So we looked a lot at that space, and we tried to learn from it how to make sure that our anomaly detection model was not creating extra work and extra alerts for the user, and that it was actionable. We love the word "actionable" at Sifflet; we use it a lot. And yes, lineage helps a lot, in the sense that it gives context to the anomalies that a tool detects and tells you what you can do in response to the alert: where is it coming from, how is it impacting the user, who's looking into it, et cetera. So you can build an incident report where you can follow each incident and know exactly what can be done to remediate it and, more importantly, to keep it from propagating and from happening again in the future.

The second big element in fighting alert fatigue, and this is still about technology; I'll also give you an argument that's less related to technology, which in my opinion is often overlooked. But still on the technology element: a lot of tools and a lot of approaches to anomaly detection wanna focus on ML, applying it to get better, smarter, and more automated at anomaly detection. And that's great, because you can cover a broader variety of use cases and automate a lot of the workflows, but you can't do ML-based anomaly detection if your ML model is not robust enough and equipped with a very solid feedback loop. Otherwise, it's a recipe for disaster. That's how you create a lot of false negatives and false positives, and just a bunch of random alerts that are not filtered or presented to the user in a way that helps them trust the tool that's doing the monitoring for them.

It's very funny and very related to human psychology. Cuz when you get an anomaly detection tool or an observability tool, you use it to achieve more trust in your data and your data infrastructure, right? But if you don't trust the performance of the tool, then you're not gonna use it, and adoption is gonna be very poor. And for you to trust the tool, you need to experiment with the alerts you get from it and see how efficiently they help you monitor your data assets. So again, without making it about Sifflet specifically, we invested a lot of time in our ML-based anomaly detection engine precisely because we wanted to avoid landing in the trap of alert fatigue. It obviously has a very strong feedback loop, but it's also built in a way that it gets smarter: it learns from the anomalies that the user confirms and from the actions that the user takes in response to the alert, back to the lineage part. So overall, it makes sure that all the alerts are dealt with and that all the alerts are relevant.

The final point is more related to people and internal evangelism.
If you build the culture internally, and this is the job of data leaders and business leaders, a culture that celebrates small wins around data quality and makes data quality almost part of the culture within the data team and the broader organization, then people are more incentivized to take data quality issues, and the alerts they get from anomaly detection, seriously. Unfortunately, there is no shortcut to achieving a good and healthy data culture within an organization. It's a lot of work and a lot of small initiatives, done repeatedly, to ensure that people are incentivized. You can't just rely on technology; you also need strong adoption internally and strong messaging about the importance of ensuring good-quality data.

And, fortunately or unfortunately, there's a variety of ways to compute the ROI of data quality initiatives, and this is back to the question of how to get the buy-in. There has also been, unfortunately, a big number of highly publicized data catastrophes: public companies that paid fines or reported huge losses and so on, which you can find with a simple Google search. And you'll see that if data quality is not taken seriously, it can get very serious and have serious repercussions on the business. So about the buy-in: I think it's a matter of, first of all, aligning the business objectives with the data team's objectives. That's where a lot of data leaders get it wrong, cuz they go and invest a lot in a modern data stack and buy all the fancy tools, but they often lack that connection to the business that tells them, okay, this is exactly what we need from the data team, and this is why data quality is important. And again, there's a variety of ways to go about that and get internal stakeholder buy-in. But from my experience, and this helped me a lot in my previous role because I was a hybrid business slash technology leader, and I see it play out quite nicely with a lot of the customers we're lucky to work with at Sifflet, the initiatives that I see succeed as far as data quality and data governance go are the ones where stakeholders from both the business and technology are involved in the discussion and in picking the data observability tool.
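Stepping back to the feedback-loop point: below is a toy sketch of the idea that user feedback should tune a detector's sensitivity. It's a plain z-score monitor, not an ML model and certainly not Sifflet's engine; the point is only that marking alerts as false positives relaxes the threshold, so the same noise stops paging people, while confirmed incidents tighten it.

```python
import statistics


class FeedbackTunedDetector:
    """Z-score anomaly detector whose threshold adapts to user feedback.

    Toy illustration of a feedback loop: confirmed incidents tighten the
    threshold, false positives relax it, so alert volume tracks what
    users actually consider actionable.
    """

    def __init__(self, history, threshold=3.0):
        self.history = list(history)
        self.threshold = threshold

    def check(self, value):
        """Return True (raise an alert) if `value` is an outlier vs history."""
        mean = statistics.fmean(self.history)
        stdev = statistics.stdev(self.history) or 1e-9  # avoid divide-by-zero
        z = abs(value - mean) / stdev
        self.history.append(value)
        return z > self.threshold

    def feedback(self, was_real_incident):
        # Real incident: be a bit more sensitive next time.
        # False positive: back off to reduce alert fatigue.
        self.threshold *= 0.95 if was_real_incident else 1.10


# Hypothetical daily row counts for some monitored table.
detector = FeedbackTunedDetector(history=[100, 102, 98, 101, 99])
if detector.check(135):  # a suspicious spike
    print(f"alert raised at threshold {detector.threshold:.2f}")
    detector.feedback(was_real_incident=False)  # user dismisses the alert
print(f"new threshold: {detector.threshold:.2f}")
```

The lineage-driven context Salma describes would sit on top of a loop like this, routing each surviving alert to the right owner with its upstream cause and downstream impact attached.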
00:19:29
Yeah, that's a brilliant answer, Salma. Another thing that keeps coming up often: the whole evolution of the data stack has been very prominent in the past few years. What do you think has changed in the industry, or from a technology perspective, that data observability has suddenly become a hot topic? You have a lot of companies raising a lot of money, a lot of investor interest in this area, and a lot of customer interest as well. Why has this happened suddenly? What has changed?
00:20:00
Yeah, the category, although it's new, is getting a lot of attention, and it's maturing pretty fast, which is great to see, obviously, as the founder of a company in the space. But to your question about what caused this proliferation of tools all of a sudden: to be completely honest and transparent with you, and I'm sure you can agree and relate to this, data quality issues have existed for as long as data infrastructure has existed, or as long as people have used data to drive business initiatives, right? What happened over the past decade, though, is that there has been a huge evolution, a revolution even, in the data engineering, data infrastructure, and tooling space, which took a lot of the processes that used to demand a lot of time and manual effort from data engineers and made them automated and possible. The growing adoption of the modern data stack is a lot to thank for that, because now you don't need to write a manual script to create a pipeline anymore. You don't need to make your dashboards manually. Now you even have reverse ETL; you can loop the data back into the business. A lot of fantastic technologies have emerged and automated a lot of the workflow for data engineers, shifting their focus from thinking about where to even start building a data infrastructure to being like, okay, I know I can buy this tool and that tool, I can still build this, and I will have a functioning data stack that takes data from its raw format to insights. Now I can focus on how to make this more reliable, more efficient, and more cost-friendly.

And I think that's where there was a shift in the mentality of data leaders, because now they have the bandwidth to look at how to make their stacks more reliable, how to improve the ROI of their data infrastructure, et cetera. So that's one part, and I think that's what motivated a lot of business and data leaders to start looking for ways to automate data quality and data observability. Because, if we're honest, nobody's starting from scratch when it comes to data quality. Everybody uses basic testing, you test the pipelines, and there are a lot of amazing open-source solutions you can adopt to get started. You start to have real problems and a need for data observability solutions when you start to scale. When you get to the level where you are scaling and things start to break a lot, and lineage becomes important, for example, then, because you know that you're covered on all the other compartments of the data stack, it becomes a question of, okay, how do we optimize this setup? And that's where data observability comes into the picture.

Now, back to the technology. Again, there have been a lot of solutions in the past to tackle data quality issues, your Informatica, Talend, IBM, et cetera, that were adapted to the technology prevalent at the time. Now, with the evolution of the modern data stack and the way we're doing things, it has become easier to come up with solutions that can detect anomalies in a modern data stack environment. And I think that's why there is an explosion of amazing technologies in the space trying to tackle this problem.
Because if you stick to, as I like to call it, a Snowflake kind of data stack, then you can relatively easily do some basic observability and testing with dbt and other tools, and it becomes just a matter of automating that. The real struggle, and I think what's gonna drive the future of the category in general, is covering the very complex use cases, which is still, in my opinion, more than 70% of companies' data stacks globally. People who are still between modern and legacy, people who rely on a big variety of data sources, people who are still between data warehouse, data lake, and data mart kinds of infrastructure. Data mesh and decentralization is a big topic right now. So I see more and more data leaders and modern organizations wanting to embrace the complexity, and I think that's gonna be the challenge and that's gonna shape the future for data observability as a category, because you're gonna have to adapt to that as a vendor if you wanna grow and solve the problems for most organizations.
00:24:14
Yeah. Salma, another thing we keep hearing a lot, and tell me if it's a myth or if it's real: we keep hearing that data observability tools, any tools like Sifflet, are more suited for big enterprises, that data observability or data quality monitoring is probably not a P1 or P2 priority for a mid-sized or small company. Is that a myth, or do you disagree with that?
00:24:41
I think it depends. If we just stick to the concept of observability and what it means, which is achieving full visibility and reliability of your data assets and having a good level of confidence in them, then the concept itself, without looking at any vendor, is something that any company, regardless of its size, should strive for. Now, how do you go about achieving observability? Whether you do it with, as I said, open-source solutions, building something in-house, or investing in a full-stack data observability tool like Sifflet, that depends on the scale and the growth rate of the organization and its overall objectives, right? So a tool might not be for every size of company, and I agree with you on that, but the concept itself applies to any company that deals with data. It becomes a question of figuring out and making a trade-off between the resources and the level of sophistication you need from a solution. We do not just work with large enterprises. We work with any company that has a data team and scale to a certain degree.
00:25:48
Right, right. So, another thing, Salma, that we are very keen to understand from you: you started with data quality monitoring, and then you branched into adjacent areas like metadata management and lineage, and there are a few companies and tools that are explicitly for metadata management or, let's say, specifically for lineage. As one of the parting questions for this interview, Salma, what I want to understand from you is: what do you think is the future for Sifflet Data from a product and offering perspective? What are the next big things coming from Sifflet? Are you looking to branch out further into adjacent areas and offer the full-stack kind of solution you already talked about, or would you want to stay very specific around data quality monitoring and data observability?
00:26:46
Well, this is a great question. So without giving away too much: our DNA with Sifflet, at least for the time being, and it will remain so for the next two or three years, is monitoring and observability, right? We're an observability solution. We're full stack. We sit on top of an existing data stack, and we ensure that everything is reliable, trustworthy, and visible. We have this concept that we are trying to bring to the world, and if you follow us, you'll see that it's the name of our newsletter and our conference: we call it data entropy. So broadly speaking, Sifflet is on a mission to reduce data entropy. And entropy in data can manifest itself in a variety of different ways. Entropy can mean a lot of data quality problems for a company. Entropy can mean a lack of visibility over the data assets. Entropy can mean that you're still using a PowerPoint for lineage, for example. So Sifflet as a company is on a mission to reduce data entropy within organizations.

Now, back to what you mentioned specifically about metadata management and such: overall, we're not going to go and pretend to do one thing better than another, or branch out to other products, at least not for the time being. But as part of our product vision and roadmap, we will definitely invest in every compartment that helps achieve better overall observability of the data and the data assets and reduce the entropy within the data and the data infrastructure. I don't know if that answered your question, but I'll be happy to dive deeper into this.
00:28:29
No, yeah, I think that did answer my question. And thank you so much for giving all of these candid answers.
00:28:38
My pleasure.
00:28:38
I had such a lovely time having this conversation.
00:28:40
Me too. Thank you.
00:28:41
So thank you. Thank you again so much for your time, Salma, and we are looking forward to publishing this episode.
00:28:47
Amazing. Thank you so much for having me.