Mar 28, 2023 · 32 min
S02 E06 Breaking Down the Buzzword: What Data Mesh Really Means for Organizations with Colleen Tartow, Director of Engineering at Starburst Data

Since the introduction of the data mesh concept, a lot of people have been trying to wrap their heads around the term. In the latest episode of the Modern Data Show, Colleen Tartow, Director of Engineering at Starburst Data, gives a comprehensive explanation of what data mesh actually is, its socio-technical aspect, and the fundamental shift it brings to the way data is produced and governed within an organization.

Available on: Spotify, Google Podcasts, YouTube, Amazon Music, Apple Podcasts

About the guest

Colleen Tartow
Director of Engineering

Colleen Tartow is Director of Engineering at Starburst, with a total of 20 years of experience in data, advanced analytics, engineering and consulting. Colleen is a true data leader. She has a wealth of experience in helping organizations derive value from a data-driven culture and has successfully led data, engineering and analytics streams throughout her career. She has helped organizations unlock the value of distributed data by making it fast and easy to access no matter where it lives.

In this episode

  • Advantage of having Trino managed via Starburst Galaxy
  • What is a data mesh
  • Socio-technical aspect of data mesh
  • Where data mesh doesn't work
  • What are data contracts

Transcript

00:00:00
Welcome to the Modern Data Show, where data leaders come to discuss the real-world challenges and opportunities of data management and analytics. Today we are thrilled to have Colleen Tartow as our guest. With over 20 years of experience in data, advanced analytics, engineering and consulting, Colleen is a true data leader. She has a wealth of experience in helping organizations derive value from a data-driven culture and has successfully led data, engineering and analytics streams throughout her career. We are excited to have her with us today to share her insights and experience, and how she's helping organizations unlock the value of distributed data by making it fast and easy to access no matter where it lives. Welcome to the show, Colleen.
00:00:43
Thanks, Aayush. It's great to be here.
00:00:45
So, Colleen, let's start with a very basic question. Can you tell us a little bit more about your background, especially your background in astrophysics and how it led to your current role in data and engineering?
00:00:55
Yeah, this is a fairly common question I get, because it seems like astrophysics and data leadership are not that related. So I was an observational astrophysicist in college and then in graduate school, and I got my PhD, actually, in starburst galaxies. I worked on dwarf starburst galaxies, which are in the nearby universe, and they're galaxies that have explosions in the form of bursts of stars forming all at the same time. And then they age, and then they all explode at the same time, and they create these outflows into the universe. It's very interesting. But what I was doing was going to telescopes, collecting a ton of data and then analyzing that data. So it's data and analytics at its core, and it really gets into the idea of data storytelling. You have to do all of the same things that you're doing these days in data and analytics, and that was true back then. You would go to the telescope and you would literally fill your suitcase with physical tapes of data that you would then compile and load onto your machine, which you were the sysadmin for, and you would then have to clean and process the data. And then you would load it into some target format and do some analytics on it and try to tell stories about the universe using that data. So it's actually not that different from what all data practitioners are doing, right? You're taking some source data and trying to get some value out of it to make insights, whether it's about your business or the universe. So that's how they connect. My career's kind of taken a winding journey. I sort of randomly ended up in the B2B enterprise data software space. And yeah, it's been a really interesting ride. For the last few years I've been focused on engineering leadership and a data thought leadership position at Starburst Data, which has been really fun, and it gives me a perspective on data that I think comes from a different place but is very applicable.
00:03:03
Right, and I have tons of questions around Starburst Data and data mesh. But before I jump into that, why don't we start with a very quick overview of your current role as Director of Engineering at Starburst? Tell us what Starburst Data is, what your current role is, and what kind of things you're currently working on.
00:03:24
Yes, absolutely. So Starburst is really focused on accelerating the time to insight from data. And so that's what we do for our customers: help them build insights from their data with fewer steps between the data source and the data target. And Starburst is built off of the open-source Trino, formerly known as PrestoSQL, which is this best-in-class SQL-based query engine. So with Trino you can query your data wherever it lives, whether that's Teradata in a legacy on-prem system on bare metal, or you've got some cutting-edge Iceberg cloud-based object storage, or anywhere in between. And that data can be queried by the engine without actually building pipelines to move the data. And so if you think about it, that's incredibly powerful, because you're reducing latency, you're reducing risk, you're not copying the data as much, and you're just reducing the complexity of that data life cycle. And so the Starburst Enterprise Platform is just that: it's an enterprise platform built off of Trino. So we take Trino, and that's the base, and then we add all of this functionality on it. I like to call it Trino on steroids. But we build this entire platform on top of it. So we're really leveraging the power and the performance of Trino to query the data. And then we add a whole data platform on top of it, which is really cool. And then the other question you asked is what I do at Starburst. So I lead our enterprise engineering organization, which is really fun because I get to work with some of our founders. I get to work with really incredible engineers, some of the best engineers I've ever worked with. And I get to do a mix of strategic and tactical work. And I really just focus on making sure that the Starburst Enterprise platform is the best choice for our customers.
And so that means I'm working with technical leadership, product, our founders, marketing, sales, like everyone, just to make sure that we are all on the same page about how our software is differentiated in the market. And then leveraging that differentiation to build the best product we can for our customers.
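To make the "query the data where it lives" idea concrete, here is a minimal sketch in Python. It is not Trino: it uses SQLite's ATTACH as a small stand-in for a federated engine, and the store names (`legacy`, `lake`), tables and rows are all hypothetical. Trino itself addresses tables as catalog.schema.table, so the single SQL statement below plays the role of a cross-catalog query.

```python
import sqlite3

# Assumption: SQLite stands in for a federated query engine. Each attached
# database plays the role of a separate system (e.g. a legacy warehouse and
# a data lake); nothing is copied between them before querying.
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS legacy")
conn.execute("ATTACH DATABASE ':memory:' AS lake")

# Hypothetical source data living in two different stores.
conn.execute("CREATE TABLE legacy.customers (id INTEGER, name TEXT)")
conn.execute("CREATE TABLE lake.orders (customer_id INTEGER, amount REAL)")
conn.executemany("INSERT INTO legacy.customers VALUES (?, ?)",
                 [(1, "Ada"), (2, "Grace")])
conn.executemany("INSERT INTO lake.orders VALUES (?, ?)",
                 [(1, 10.0), (1, 5.0), (2, 7.5)])

# One SQL statement joins across both stores, with no pipeline in between.
FEDERATED_SQL = """
SELECT c.name, SUM(o.amount) AS total
FROM legacy.customers AS c
JOIN lake.orders AS o ON o.customer_id = c.id
GROUP BY c.name
ORDER BY c.name
"""
totals = conn.execute(FEDERATED_SQL).fetchall()
print(totals)
```

The point of the sketch is only the shape of the query: the consumer writes one join, and the engine is responsible for reaching into each source.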
00:05:35
Okay, amazing. And one question from a technical perspective: is Starburst just a managed version of Trino? What would invalidate that statement?
00:05:53
So that's a good point. So I just described Starburst Enterprise, which is for our enterprise customers. So huge companies that have a data infrastructure team. They have the capability and the drive to install either Trino or Starburst Enterprise and get it into production. And it's a very complex system. So say you're a bank, right? That's what you're gonna use. But not everyone has a data infrastructure team, right? And not everyone wants to do that. And say you're a more modern company, you're in the cloud, right? So what we have is a managed and hosted version of Trino as a service, and that's called Starburst Galaxy, and that's our SaaS offering. So we actually have two products: Starburst Enterprise, which is what I lead, and then we also have Starburst Galaxy, which is a newer product, and it's just amazing, right? It's just really incredible, because you go to our website, Starburst.io, and you click on Try Galaxy, and within minutes you can be connecting to your data and running SQL on it within our built-in query editor, or connecting to any third-party integration you have, with Tableau or something. Within just minutes you're querying data from object storage, from Postgres, from wherever you've got it in the cloud. And so Starburst actually does offer both the managed service and the self-hosted, self-managed service.
00:07:17
Right, and diving a little bit into Starburst Galaxy: what were the core technical challenges that lay ahead of your teams in being able to provide Trino at scale to your customers, and what did you have to overcome to do that? A lot of people think that you have this open-source platform that you can host, develop on and run. But when it comes to building and scaling this kind of system as production-level infrastructure, there are challenges that come up. So if you were to tell people what the advantage is of having Trino managed via Starburst Galaxy versus hosting your own instance, what would be your point?
00:08:12
I think the managing and maintenance of hosting open-source software yourself, that's a fair amount of effort, right? I don't know if the listeners have ever done that, but it's not the simplest thing in the world. And then scaling it, you're relying on the community to help you out, right? And Trino has an incredible community, don't get me wrong. But you're still relying on the time of others. And so with Starburst, obviously we have award-winning customer support and professional services and all that good stuff. But then we also have literally all of the people who built Trino working at Starburst, including the original creators. So we are really the Trino company. And so we've used that knowledge to build this SaaS platform with Trino specifically in mind. And I think that would be challenging for just anyone to build and manage and maintain and host that environment. But for us, because we know Trino so intimately, we actually have very specific knowledge that allows us to do that at scale in a really well-performing way. And Galaxy has really taken off. Our customers are really excited about it, and the speed with which you can get up and running is really an important point. That's what people love about some of the other cloud data warehouses out there, like your Snowflakes and all of the other ones: it's minutes to get it up and running. And so that's what Starburst Galaxy offers as well. And then you don't have to think about maintaining a cluster and your catalogs and all that. You can focus on the insights that you're getting out of your data.
00:09:54
And you just mentioned that most of the core builders of Trino are actually on your team, part of the Starburst team. How did that happen? How did it all start? Give us a little backstory on that.
00:10:09
Yes, with the caveat that I wasn't there. So there were four founders who were experiencing the problems that everyone experiences with the idea of ETL and ELT at Facebook. And it's my understanding that their pipelines would go down and it would take them days to recover, which is probably a fairly familiar story to a lot of listeners. And I personally have been in that situation. It's very stressful. It's unfortunate, but there's actually a business hit that you take when that happens. And so the idea is to apply a standard MPP architecture to the data: have a centralized coordinator, and then have a distributed system that your query engine runs on. And so they built that at Facebook in, I think, 2012. They open sourced it in 2013; I think it just hit 10 years. And it was open source, Presto. And then Facebook had a fork, which was PrestoDB. And then the true open-source fork was PrestoSQL, which is what Starburst is built off of. And then fast forward: the people who worked on the project initially at Facebook now work at Starburst. Martin, Dain, David and Eric, they're great. I was talking to Dain yesterday; they're all wonderful. And so it's been really cool, because a lot of our other founding engineers are also early adopters of Trino who really understand it incredibly deeply. And so it's great working with the hive mind of the world's leading experts on this product.
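The coordinator-plus-workers shape described above can be sketched in a few lines of Python. This is only an illustration of the general MPP pattern under simplifying assumptions (threads instead of machines, an in-memory list instead of tables); the function names are hypothetical and this is not how Trino is implemented.

```python
from concurrent.futures import ThreadPoolExecutor


def worker_partial(split):
    """Worker: scan one split and return a partial aggregate (sum, count)."""
    return sum(split), len(split)


def coordinator_average(rows, n_workers=3):
    """Coordinator: plan the splits, fan them out, merge the partial results."""
    # Plan: divide the rows into one split per worker.
    splits = [rows[i::n_workers] for i in range(n_workers)]
    # Execute: each worker aggregates its own split in parallel.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(worker_partial, splits))
    # Merge: combine partial sums and counts into the final answer.
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count if count else None


avg = coordinator_average(list(range(1, 101)))
print(avg)
```

Each worker only ever sees its own split; the coordinator never scans the data itself, it just plans and merges.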
00:11:47
Amazing. And I remember there was some controversy with Facebook around the names Presto and PrestoSQL, and the name was changed. It's good; I think people now know the Trino name equally well. Congratulations on that. Pulling that off was a tough thing, so congratulations to you and your team for being able to do that. One of the other things you mentioned to me is that you also wrote a book on data mesh, right? And data mesh is probably a term that gets as much love and hatred as the term modern data stack itself gets. You have lovers, you have haters, right? So I feel your pain as well. Let's start with the basics. Explain it like I'm a five-year-old: what is data mesh?
00:12:41
Alright. I have an almost five-year-old, and I can tell you I do not think he understands data mesh. So, backing up a moment: there's no end to the amount of data we are collecting, right? And it grows exponentially every day. You can imagine the curve that you see on that chart that everyone always posts, which is the growth of data year over year. Organizing, maintaining and managing that data is an incredibly challenging problem, right? It's not just how you get insights and how you keep the quality of the data through the stack and things like that. It's more around who owns the data, right? Who is responsible for it? And that's very clear with products, but it's less clear with data. And so the idea is that if you start to treat data as a product, you'll actually get the value you need out of data. I read a statistic that 87% of data and analytics projects never make it to production. I don't know if that number's true. I don't know how they're measuring that. I'm always sceptical of data, but I don't think that's far off. I've seen a lot of failed projects over the years, or projects that just take years longer than they should because it's so complex. And so the idea of data mesh and data as a product is to really treat data as a product, which means applying product thinking to it. Normally, most data paradigms, like data fabric or centralized data warehouses or whatever you have, are focused on the idea that the data consumers are the ones who are getting value out of the data. Everybody else just throws the data at them, and then they use the data to make these amazing insights. So instead, let's apply product thinking and make the actual producers of the data responsible for the end use of the data. So they're treating data as a product.
And so, flipping that on its head, you're like, okay, that means we need to organize the people who are actually producing and consuming the data into these business units or lines of business or organizations called domains. And that's just a group of people with similar interests in the product. And then you also need to think about the fact that they're not necessarily gonna be experts in infrastructure. So you might need a central IT team, and you're really thinking about the idea of a self-service data infrastructure. Not self-service analytics, but self-service data infrastructure. And that can be Starburst, it can be any number of things. As a side note, one thing I like about Starburst is that we haven't pushed the Starburst aspect of it. We're more interested in the data mesh itself. But that said, from there you start to think about governance, because that's always the thing you have to think about, right? So really it's the idea of federating that governance and the responsibilities around that governance, putting some of the onus on the data producers, but then some of the onus on the organization as a whole. And that federated model really comes out in a data mesh. And so data mesh is really those four principles: domain-driven architecture and organization, data as a product, self-service data infrastructure, and federated computational governance. And so if you're really building a data architecture and organization, both the people and the technology, around those ideas, you end up with this thing called the data mesh.
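One way to picture how the four principles fit together is a small sketch in Python. Everything here is hypothetical and heavily simplified: the `DataProduct` fields, the policy functions and the `Catalog` class are illustrative names, not part of any data mesh standard or of Starburst.

```python
from dataclasses import dataclass, field


@dataclass
class DataProduct:
    name: str
    domain: str                 # principle 1: every product belongs to a domain
    owner: str                  # principle 2: a producer is accountable for it
    description: str = ""
    tags: set = field(default_factory=set)


# Principle 4, federated governance: some rules are set by the organization...
def has_owner(product):
    return bool(product.owner)


def has_description(product):
    return bool(product.description)


# ...and some rules are added by an individual domain (hypothetical example).
def pii_reviewed(product):
    return "pii-reviewed" in product.tags


GLOBAL_POLICIES = [has_owner, has_description]
DOMAIN_POLICIES = {"finance": [pii_reviewed]}


class Catalog:
    """Principle 3: a self-service platform where domains publish products."""

    def __init__(self):
        self.products = {}

    def register(self, product):
        """Publish the product if it passes all policies; return the failures."""
        policies = GLOBAL_POLICIES + DOMAIN_POLICIES.get(product.domain, [])
        failed = [policy.__name__ for policy in policies if not policy(product)]
        if not failed:
            self.products[(product.domain, product.name)] = product
        return failed


catalog = Catalog()
ok = catalog.register(DataProduct("page_views", "marketing", "ana", "Daily page views"))
rejected = catalog.register(DataProduct("invoices", "finance", "li", "Issued invoices"))
print(ok, rejected)
```

The governance checks are "computational" in the sense that they run automatically at publish time, rather than relying on a central team to review each product by hand.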
00:16:11
What you have explained is basically more the sociotechnical aspect of data mesh. It's a fundamental shift in mindset. It's a fundamental shift in the way data is produced and governed within an organization. What changes from a technical perspective?
00:16:31
Ah, interesting. Yeah, I love the word sociotechnical, because it really does get at both aspects of it; the people are a key part of it. And then from a technical perspective, what you're really saying is you're no longer centralizing your data. You're not trying to have some central data warehouse where you're throwing all of the data and then trying to curate it and pull it all together, with this centralized team of people who understand all of the domains, because that is not scalable and it doesn't work. It doesn't even work at small companies, let alone enterprises. And so the idea is that, from a technology standpoint, each domain uses what makes sense for them, right? And each domain understands its use case, but they're all presenting data products in what Zhamak Dehghani, the founder of the data mesh, calls the mesh experience plane. And so the idea is that the consumer sees a unified experience: they see their data products being presented to them, governed for them, in this infrastructure or architecture, with a tool that makes sense. But on the other hand, the domains underneath don't necessarily need to be using that same infrastructure, right? They use whatever makes sense for them. So some might have streaming data, some might have legacy data, and some might be cloud-native, right? So it depends on what's best for the domain.
00:17:54
What would be the cases where you think data mesh doesn't work for an organization? Probably one axis to that would be the stage of the company itself; it wouldn't make sense for a very early-stage company to use it. But at a comparable stage, what would be the cases where you think people are fine with a traditional ELT architecture, where you have data sources, you pull them all into a data warehouse, and you plug a BI layer on top to produce data for the consumers, with a centralized data product or data platform team owning this infrastructure? And what are the cases where that does not work?
00:18:38
Yeah, you've really hit on one of my favorite questions, to be honest. So I've worked at tiny startups, right? And in tiny startups, you have maybe one person, two if you're lucky, who are focusing on infrastructure for data. And so they're not gonna build a data mesh. A data mesh is not appropriate for that kind of company, right? And a data mesh is not for everyone. But say that company grows. You hire more data engineers and more analysts, you hire some data scientists, and you start to have a larger and larger organization. At what point do you start to say the centralized team isn't working? And I think that's unique for every organization, but I've been at companies where I've been leading a data team and I've been like, there's just too much knowledge for us to handle at this point, and I need to start embedding the curation within the domains. And I think organizations need to look out for that moment where it's gotten to be too much and you're no longer able to deliver insights on a time scale where they're actually actionable for the business. And so that's the point where I do believe you need to start thinking about decentralizing the data team. And so it's gonna be different for every business, and it depends how many domains you end up with and how many sources of data are actually coming in. And then it depends on the insights you're trying to get out. If you're doing data science, that's very different than if you're an old-school organization using Cognos for reporting, right? So I think it depends on the business goals as well as the team serving that. But there will come an inflection point where you can no longer do that, and that's when you start to think about data as a product, and then the data mesh falls out of that paradigm.
00:20:29
Okay. So now, within this particular context, there is another term that keeps doing the rounds, and I'm not sure a lot of people have clarity on it. What is a data contract? What the hell is that?
00:20:46
Yeah. Okay. So, data contracts. Chad Sanderson has done a ton of work recently on this, and he was actually speaking at Datanova, which is the Starburst data conference, last week. And he and I were having a great discussion about this, because he just has so many cool ideas about it. But a lot of people have come up with the idea of data contracts, and in some aspects they've been around forever, and in some aspects this is totally new. And it's basically the idea that the data consumers and the data producers need to talk to each other and actually agree on what the consumers need and what the producers are producing, and that they should actually work together instead of just never talking to each other. And it's ridiculous that they never talk to each other, but I've seen a lot of organizations where that's true. And they might not even know who the other people are. But this gets at the sociotechnical part of it, right? You need to actually remember that there are humans involved here. And so if you're telling the producers, this is what we need, and the producers are saying, okay, this is what I can give you, the data contract is the physical thing saying, okay, we've agreed on that. And so when you're treating data as a product, that's really important. In some ways, things like behavioural data have done this for a while, right? Clickstream data has always come in a format, and you're assuming it's gonna be in that format so you can do your market analysis on it. But a lot of other types of data just haven't had that relationship between the consumption and the production of the data, and that understanding that the consumption relies on the production of this kind of data. So in some ways it's a schema, in some ways it's a contract, and in some ways it's just plain old people talking to each other and coming to an agreement. But I do think it's really important to think about the idea that we're all in this together and we need to have these conversations.
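The "in some ways it's a schema" part of that answer can be sketched as follows. This is a minimal, hypothetical example in Python: the clickstream field names and the `violations` helper are invented for illustration, and a real contract would usually cover far more than field types (ownership, semantics, freshness and so on).

```python
# A hypothetical contract for clickstream events: the fields the consumers
# rely on and the types the producers have agreed to deliver.
CLICKSTREAM_CONTRACT = {
    "user_id": str,
    "url": str,
    "timestamp": float,
}


def violations(record, contract):
    """Return a list of ways a single record breaks the agreed contract."""
    problems = []
    for field_name, expected_type in contract.items():
        if field_name not in record:
            problems.append(f"missing field: {field_name}")
        elif not isinstance(record[field_name], expected_type):
            problems.append(
                f"wrong type for {field_name}: expected {expected_type.__name__}"
            )
    return problems


good = {"user_id": "u1", "url": "/pricing", "timestamp": 1679961600.0}
bad = {"user_id": 42, "url": "/pricing"}

print(violations(good, CLICKSTREAM_CONTRACT))
print(violations(bad, CLICKSTREAM_CONTRACT))
```

The check is the easy part; the agreement between producers and consumers on what goes into the contract is the part that needs the conversation.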
00:22:40
And isn't it kind of a chicken-and-egg problem? Because the consumption of the data comes much later than the production of the data. And in a lot of cases, one of the things that we keep hearing is that unless you have the data, you really don't know what you're looking for. So isn't the social aspect of data, and the way people deal with data, counterintuitive to this whole idea of data contracts?
00:23:10
Yeah, and I think there's definitely a maturity curve there. In the early stages of analysis, whether it's basic analysis or you're doing true data science, there's a discovery phase, right? And so the data contract might just be, hey, I'm updating the data daily and I'm giving you everything I have, right? It could be a very simple data contract. But then later on, when that data is being used in production to drive a model that drives the business, there's a much stricter contract. Because if that gets violated, data science might not be able to deliver, which means your product could be affected. So it depends on the maturity of the analysis and the consumption of the data, and on the maturity of the production of the data.
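The maturity curve described above could look something like this in Python. It is a sketch under stated assumptions: the `Contract` fields, the one-day and one-hour freshness windows and the example rows are all hypothetical, chosen only to contrast a loose discovery-phase contract with a stricter production one.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Contract:
    max_age: timedelta                        # freshness promise from the producer
    required_fields: frozenset = frozenset()  # empty during the discovery phase


def check(contract, batch, last_updated, now):
    """Return the ways a delivered batch violates the contract."""
    problems = []
    if now - last_updated > contract.max_age:
        problems.append("stale")
    for i, row in enumerate(batch):
        missing = contract.required_fields - set(row)
        if missing:
            problems.append(f"row {i} missing {sorted(missing)}")
    return problems


# Discovery phase: "I update daily and give you everything I have."
discovery = Contract(max_age=timedelta(days=1))
# Production phase: a model depends on these fields arriving every hour.
production = Contract(
    max_age=timedelta(hours=1),
    required_fields=frozenset({"user_id", "score"}),
)

batch = [{"user_id": "u1"}]
updated = datetime(2023, 3, 28, 6, 0)
now = datetime(2023, 3, 28, 12, 0)

print(check(discovery, batch, updated, now))
print(check(production, batch, updated, now))
```

The same batch passes the discovery contract and fails the production one, which is the point: the data did not change, the promises around it did.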
00:23:52
Amazing. So Colleen, another question that we often hear comes from companies around, let's say, the Series A stage: companies who have just started to find their product-market fit and are seeing the explosion of data coming in within the business, within the services, within the platform. What's your advice for people who are pretty mature in their journey as a company, but are still starting out on their journey with data? When do they start, how do they start, and what's the best way of starting? Because what happens is that one of those early decisions that you make as a company sticks for a long time. What would be your advice to people in those early stages on whether to go with something much simpler, where you have an ETL pipeline like Fivetran, put the data into a data warehouse and just build your BI tools, versus starting to think about something more advanced, like data contracts and data mesh? What would you advise?
00:24:58
I can't imagine there are a lot of Series A companies that need to worry about data mesh at this point, but you never know. The thing I would say is to focus on optionality. And what I mean by that is, don't lock yourself into any one architecture, because that's where I think people struggle: when they get to that inflection point where they need to decentralize or rethink their data architecture. To keep it in startup language, say you're at Series C or D and you wanna treat data as a product. At that point you might be locked into an architecture or technologies that worked at Series A, but they get incredibly expensive or are in a proprietary format that you can't get out of easily. And when you're a later-stage startup, you're not gonna want to spend a huge amount of money on a digital transformation like some enterprise would, right? You're not gonna wanna completely revamp your architecture. But the space is evolving so quickly that I think there are a lot of axes on which it makes sense to not lock yourself into someone else's infrastructure, someone else's architecture, someone else's format, right? And that speaks a lot to why people love open source, right? Because it evolves quickly, but it also gives you the option to stay flexible in how you're doing these things. So, you know, if your data's in S3, that's great, and you can make different choices about how you store that data or the formats you're using. It's really your choice what you're doing with it. Whereas certain external vendors that are SaaS platforms are great because they handle a lot of the infrastructure for you, but you're locked in. I don't love the idea of giving my data away to anyone. I wanna retain ownership of my data. I think there are different pieces to that, but I think optionality is the key.
00:26:50
That's great advice. So Colleen, another thing that we are seeing in the industry is that we had this Cambrian explosion of tools back in 2021 and 2022. And as the funding winter came, you're seeing consolidation, you're seeing the modern data stack becoming not so modern now. But still, in these phases, I think we have seen a lot of interesting paradigms emerge. We have seen interesting businesses emerge. Of late, what are the few things around the modern data stack that excite you, that you are super bullish on?
00:27:37
For me it's anything that treats data as a product. And depending on where you're coming from, that can be a variety of different things. But I think there are two things I like to think about. One is shortening the time and the complexity of the pipeline of data from the source to the target, right? If you look at the modern data stack landscape, that image that goes around every year, I forget who creates it, but it used to be a hundred tools that made up the modern data stack, and now it's hundreds if not a thousand different things, right? And naively you would be like, I need to pick one thing from every category. That's a lot, right? It's too much to manage, and it's also incredibly complex. So finding the tools that really help you answer the business questions you need to answer as quickly and as seamlessly as possible is really the point. And I think there are things like data contracts that are really important. I think there are ways to consolidate multiple categories that you can think about, and I'm not gonna pitch for Starburst, but I do think Starburst allows you to do several of those different steps together at once, which I think is really important. And obviously, we're focused on simplifying that pipeline or getting rid of it altogether in certain cases.
00:28:57
Colleen, as we move towards wrapping up this interview, there's one question that we would love to hear your thoughts and opinion on: what would be your advice to young engineers, individual practitioners who are just starting their journey with data?
00:29:22
Oh, that's a great question. I think learn SQL, right? Understand SQL, because if you understand SQL, you understand data. SQL is the lingua franca of data right at this point. And maybe Python, but definitely SQL. And I think if you don't know SQL, you're definitely missing out on getting your hands dirty with data. So SQL and then also Joe Reis and Matt Housley just wrote a book, Fundamentals of Data Engineering. It's an O'Reilly book. I think that's it on my stack right back there. Joe gave me a copy last week. But it's a great discussion of the stack and why we're doing all of these things and really, what is data engineering all about? Why is it different than software engineering? What is the thought process that you go through with that? So I think that's a great book that's already becoming a classic, but then also there's just so much information out on YouTube. So find people like yourself and other podcasters and conferences that you can get for free online. Just absorb, right? Listen and absorb.
00:30:36
Amazing. And before we let you go today, Colleen, one last thing. Where can people reach out to you? What would be the best way for people to reach out to you for, anything?
00:30:47
Yeah, LinkedIn is the easiest. Probably. I'm all over LinkedIn. I've always got a window up. Yeah, I'm just Colleen Tartow on LinkedIn. I don't know what my
00:30:57
We'll share the socials with the episode. Colleen, thank you so much for being a part of this episode. It was such a pleasure to have you as a guest. There are a lot of things that we learned about. I've been trying to get my head around data mesh and contracts, so thank you for clearing all of that up. Thank you for your time, Colleen.
00:31:17
Thank you so much Aayush. It was a pleasure. It was delightful.
00:31:19
Same here.