Feb 21, 2023 · 37 min

S02 E01: A deep dive into the world of Data Streaming with Kai Waehner, Global Field CTO at Confluent

In this episode of the Modern Data Show, host Aayush Jain is joined by Kai Waehner, the Global Field CTO at Confluent, to discuss all things about Apache Kafka, Confluent, and event streaming. Confluent is a complete event streaming platform and fully managed Kafka service used by tech giants, modern internet startups, and traditional enterprises to build mission-critical scalable systems. During the podcast, Kai discusses the benefits of using Confluent over deploying Kafka, the role of a global Field CTO, and the company's complete data streaming platform.

Available On: Spotify, Google Podcasts, YouTube, Amazon Music, Apple Podcasts

About the guest

Kai Waehner

Kai Waehner is the Global Field CTO at Confluent, and an expert on event streaming infrastructures. Confluent is a complete event streaming platform and a fully managed Kafka service, which is used by tech giants, modern internet startups, and traditional enterprises to build mission-critical scalable systems. He's also an evangelist, keynote speaker and a trusted advisor to customers and partners across the world.

In this episode

  • About the role of Global Field CTO
  • When not to use Kafka
  • Data in motion
  • ksqlDB vs Flink vs Samza vs Beam
  • Confluent data infrastructure

Transcript

00:00:00
Welcome to the Modern Data Show. On today's episode, we are joined by Kai Waehner, the Global Field CTO at Confluent and an expert on event streaming infrastructures. Confluent is a complete event streaming platform and a fully managed Kafka service, which is used by tech giants, modern internet startups, and traditional enterprises to build mission-critical, scalable systems. He's also an evangelist, keynote speaker, and a trusted advisor to customers and partners across the world. We're excited to have Kai on the show to discuss all things Apache Kafka, Confluent, and event streaming. Welcome to the show, Kai.
00:00:34
Yeah, thanks a lot. Great to be here. We have a lot of great topics to discuss today.
00:00:38
Okay, so let's jump into it. So first of all, could you tell us a little bit more about Confluent? Like is Confluent simply a managed version of Apache Kafka, or is there more to it than that?
00:00:49
Yeah, that's an awesome question. Confluent was founded by the inventors of Apache Kafka, but obviously it's much, much more today. When I joined Confluent around six years ago, in the early stages, it really was selling support and Kafka expertise. Today it's really a complete platform, so it's much more than just using data streaming for messaging; it's the whole ecosystem around integration and data governance, and then obviously also the move to the cloud in the last years. The fully managed service is a key piece of that. How we typically explain it is that Confluent is data streaming, complete, everywhere, and in a cloud-native way. That combination is what Confluent is doing and how it differentiates from just using open-source Kafka and managing it by yourself.
00:01:36
And what is the typical rationale, you know, that you have seen your customers having for using Confluent's services versus going with their own deployment of Kafka?
00:01:48
Yeah, there are a few of them. I mean, the number one, especially in the cloud, is simply that people have huge issues operating big data clusters. That's not just true for Kafka, right? It's also true for other technologies like Spark and so on. So when you use a fully managed service, which is really completely fully managed, you can focus on the business logic or integration logic and just hand the rest over to the experts. That includes not just operations, to reduce the risk and the cost, but also the 24/7 support part, because most customers that come to us run operationally critical systems that have to run 24/7, often for transactional workloads. So on the one side it's really about offloading the operations effort and cost, but also about reducing the risk. And then, as I said in the beginning, it's really much more than just Kafka itself; it's the whole ecosystem. Because as soon as you use Kafka for more than maybe ingesting into a data warehouse, you have questions about data governance, about security, about encryption, and all these kinds of questions around that. We provide that both from a tooling side and, of course, with the expertise, because the tools alone don't solve the business problems.
00:02:56
Right. Right. And you know, another interesting thing that I wanted to talk to you about is your role as a Global Field CTO at Confluent, right? And Field CTO is not a very common term, and I'm pretty sure that not a lot of people would be aware of it. So can you help us understand what exactly a Field CTO is?
00:03:19
That's a fantastic question and I get it all the time. So you are right, today people don't know that much about it, but it's coming up more and more as a new kind of job profile. It's not just at Confluent, right? You can look at many other software vendors in that space, like Snowflake or Databricks or Cloudera, VMware, and many more that are creating this role. And what does it mean? In the end, it's really a customer-facing role where you work together with a lot of customers, and with that you see their use cases, their architectures, but you also help with their strategies and their roadmap initiatives. That's one side of the role. I talk to many customers. In my case it's really global; some other people cover only a specific region, which also depends on the company. With that expertise I can then share the stories with other customers. But it's also really about doing thought leadership. If you take a look at my blog, for example, I'm sharing these case studies and lessons learned, so not just what's good, but also the challenges and trade-offs and pros and cons, and sharing all of this with others. And "others" is not just our customers; this is really thought leadership for the broader data streaming community in my case. Or if you go to Snowflake, they talk more about the data warehouse, for example, but it's a similar story. So it's really about this collaboration with customers, with partners, with internal teams, so that we all get better in data streaming, because it's still the early stage for these cutting-edge technologies, and people both internally in our company, like new hires, and also our prospects, customers, and the community need education about what they can do with it. And as the last part of my kind of Field CTO role, I'm also a public spokesperson. So in addition to blogging or doing presentations, I'm really also an official spokesperson for the company for doing interviews, and for example I'm also working with research analysts like Gartner and Forrester to make sure that they also understand what data streaming is, so that maybe we get a dedicated category for data streaming in the future. That's in summary what a Field CTO is doing. And interestingly, because this question comes up so much, right now, while we are recording, I'm also writing a detailed blog post about it, and maybe we can link to that in the podcast later, because it's really a super interesting question and also a very interesting job role.
00:05:34
Oh, absolutely. We would love to do that. And is solution consulting a part of the kind of overall profile or the purview of a Field CTO?
00:05:45
Yeah, so, I mean, the difference is that what I'm doing is really not supporting the projects in a deep dive, right? That's why we have consultants, or we also work with system integrators, with partners that do the projects, and we only provide the data streaming expertise. So not on that level, but in the end you need consulting on every level. And as a Field CTO, in my role I solve one of the biggest challenges, which is that I can speak to both the business and the decision makers and to the technology people. Very often in my conversations and meetings with a customer there's the CIO level, there are the lead architects, there are the developers, and there are the business people who really need to solve their problems, and all of them speak a different language, and getting these nuances right is the big challenge. This is also my capability as a Field CTO, because I rarely go deep into things like Kafka or Confluent Cloud. Instead, what I do is talk about case studies which are interesting for the customer. So when I talk to a retail customer, I share case studies from other retail customers that do omni-channel sales, for example. When I talk to a bank, I explain to them how another fintech customer built a real-time trading app, for example. And this is at a high level, where everyone in the company understands it, both the decision makers and business, but also the developers. Then with the specific teams I can do a follow-up and go deeper, or if we go even deeper, I bring in other people, because I'm not the deep expert; that's our consultants and other people. But on this level, that's really where I do the consulting and engagements to help customers understand where they can leverage the technologies we have. And, very important, also where you shouldn't use them. This is a very critical question, and it's what you often don't find in the marketing materials of the software vendors, right? Because every vendor has the best tool and does everything, right? But this is really also where people know that I'm trustworthy. I explain to them not just where we can help, but also where we need other technologies and how they are complementary. And this is what people really like when I come to them: they know they can trust me, and I will not only explain where to use it, but also where not to use it.
00:07:55
Right. And I will come to that point in a while, in terms of when not to use Kafka. But before I do that, I have a question on how Confluent as an organization is supporting the broader Apache Kafka community. I know that the founders of Confluent were the initial contributors to the Apache Kafka project, but as of now, systematically, what are the initiatives you have within Confluent to support the broader Kafka community?
00:08:27
Yeah, that's a great question. For a company like Confluent, this wouldn't work without the community, right? First of all, our product, Confluent Cloud, is based on the same Apache Kafka that you use in open source, and therefore what we contribute to Kafka goes not just into our product, but also to the open source community, like the huge investment of the ZooKeeper removal, for example. That was a multi-year project where most of the work was implemented by our team. It's super hard and critical because it touches the core of how Kafka is operated. And these kinds of things are not just for Confluent, but are contributed back to the community. So that's the code level. Also, for example, a few open source vendors have very strange strategies: when they have a security fix, they only add it to their own product and then maybe half a year later to the community. That's not what we are doing. For that we also have a community edition, which you can use with added features from us, but even if you use open source Kafka from the community, you get the fixes. So that's the technology level. In addition to that, there is the normal community work you expect from open source and from cloud services: we are doing meetups worldwide. During the pandemic it was all virtual, but now it's back to in-person, and data streaming with Kafka has really become such a de facto standard in the meantime around the world. So we are doing these meetups around the world to meet people. And this is, again, not just about Confluent; these are talks that interest everyone. A Kafka meetup is very technical, so we talk about the Kafka roadmap, about best practices, and we also work together with partners, with customers, and even with competitors who speak there, like, let's say, Red Hat or Amazon, which also have a Kafka offering. This is the big contribution, and in the end it's also the business strategy, because most of our customers already use Kafka and then come to us, right? Because they learn about the advantages of not operating it by themselves, getting support, getting additional features. So without the Kafka community we would be in trouble. It's a business strategy, of course, but it's also a win-win for everyone, because you can always also just use Apache Kafka. That's totally fine; that's the community adoption. Only a small percentage of the Kafka community in the end adopts Confluent or another enterprise service. It's getting more and more in the cloud, but still not everyone is using a cloud service, of course, and that's totally fine. And this is the great thing about such an open source community.
00:10:50
Right. So that's an impressive thing. So basically what we are saying is that the Kafka within the Confluent ecosystem and the open source version are almost always up to date with each other, right? They are almost always in sync. Is that a fair assumption?
00:11:09
It's even better than that for the community, because in the cloud things changed a lot. On premise, with our Confluent Platform, it is exactly like that. First, the Kafka community releases Kafka, right? We have around 80% of the project's commits, but the community also includes IBM and others. Then, after it's released, we need another four weeks before we release our platform, because we need to add additional integration tests to integrate it into our platform. But the Kafka you find in our platform is exactly the same, and it's more or less up to date; only the integration tests need a few more weeks. In the cloud it's even better than that, because in the cloud we can do rolling upgrades; it's operated by us and we have the control, right? We don't have to wait for the customer. And it goes even further. I'll give you the example of the ZooKeeper removal, which is a super hard thing. Instead of just rolling it out so that customers try it out in production, we first try out the ZooKeeper removal in our own Confluent Cloud services, first of all in development and then in some test clusters, and then we roll it out in our cloud offering. The customer doesn't even see it and doesn't have to care, because it's fully managed by us. Only after we have battle-tested it in our cloud service, where we have complete control, do we hand it back to the community, where they know it's already battle-tested. So the cloud has really shifted a lot, and it's even better for the community, because we battle-test things first and then ship them to the community.
00:12:40
Wow. And tell us a little bit more about the ZooKeeper change. What was this all about? What was the motivation, and how will it impact people now?
00:12:49
Yeah. So, in the end, in the last, I would say, 10 years, most distributed open source projects used ZooKeeper as a key-value store for metadata management. That's Hadoop, that's Spark, that's Kafka, all these kinds of technologies, right? The big problem with that is that you don't have just one distributed system, you have two: a ZooKeeper cluster, which manages the metadata and in the end also coordinates the Kafka brokers, which are another distributed cluster. And ZooKeeper was never really built for the scale and reliability you need in Kafka. So in most of the operations issues where you really had a P1 downtime, in many of these cases it was actually not a problem of the Kafka brokers, but of ZooKeeper. If you ask customers or have your own experience operating Kafka clusters, you will realize that most of the challenges are not about operating Kafka, but ZooKeeper. So there are two big advantages of the ZooKeeper removal. Number one, operations get much easier, because you only have one distributed cluster: you only have Kafka, which takes over the capabilities of ZooKeeper and does it in a much better way, because it uses the Kafka log and all these features from Kafka, which scale better and so on. So operational simplicity is the one big benefit. The second one is that it also scales much better. A big limitation of Kafka in the past was that you could only have a specific number of partitions, right? Like, we typically recommend not using more than a hundred thousand partitions in a Kafka cluster, and something like maybe five to ten thousand per broker. It depends on the deployment, but that's the basic rule of thumb. With the ZooKeeper removal, you can now also do millions of partitions with a single Kafka cluster, because how it works under the hood, with just Kafka brokers, is simply optimized and improved. And this is, by the way, now the big game changer compared to many of these other distributed streaming systems. A few years ago we had discussions about, hey, shouldn't we stop using Kafka and maybe use Apache Pulsar, because what Pulsar was built for is, in theory, even more extreme scale. But what they have built is not just two distributed systems, but three of them: ZooKeeper, the Pulsar brokers, and Apache BookKeeper. That's three systems, and already two is super complex; operating three is even harder. So we went the other way around and architected it so that in Kafka you only need one distributed cluster, but can now still scale to millions of partitions. And this is much easier to operate. Even if you're using a cloud service, it's much easier, because it needs fewer resources than three clusters. With that you have huge benefits, no matter if you operate it yourself or if you're just the end user.
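To make the ZooKeeper-less mode a bit more tangible, here is a minimal configuration sketch for a single combined broker/controller node running Kafka in KRaft mode. The node id, ports, and log directory are illustrative assumptions, not a production setup, and a real deployment would use separate controller nodes.

```properties
# Minimal KRaft (ZooKeeper-less) single-node sketch; values are illustrative only.
process.roles=broker,controller                  # this one process acts as broker and controller
node.id=1
controller.quorum.voters=1@localhost:9093        # the Raft quorum replaces the ZooKeeper ensemble
listeners=PLAINTEXT://localhost:9092,CONTROLLER://localhost:9093
controller.listener.names=CONTROLLER
inter.broker.listener.name=PLAINTEXT
listener.security.protocol.map=CONTROLLER:PLAINTEXT,PLAINTEXT:PLAINTEXT
log.dirs=/tmp/kraft-combined-logs
```

Before the first start, the storage directory is formatted once with the kafka-storage.sh tool that ships with Kafka; after that, there is no separate ZooKeeper ensemble to provision, monitor, or patch.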
00:15:29
Wow, that's amazing. So one thing that has always captured my interest is what happens after you have put data into Kafka, right? In most cases, in the early stages of adoption, Kafka mostly acts as a message broker. But slowly it then evolves into a complete, full-fledged stream processing platform, where you're not just relaying messages from one application to another, but you're also processing that data and taking actions on top of it. So first of all, tell us, it's all over the Confluent website, data in motion, right? Help people understand: what is data in motion? How is it different from data at rest? And why it matters, that's the more important thing.
00:16:24
Yeah, so this is really crucial to understand, and there are a lot of misconceptions and misunderstandings here, right? One of the first things I always explain is that Kafka is not a message broker. But even before that, as you mentioned, it's important to point out the difference from data at rest, which in the end means that you store data on a disk or in a database and keep it there, and then you query it, with a SQL query, with a web service, or whatever. That's good for reporting or for training an analytic model. But with data in motion, you can continuously act on data while it's being ingested. I typically recommend to really start not from a technical discussion, but from the business problem, if that helps. If you ask the business, whatever the business is, it really doesn't matter, the business will always tell you that real-time data beats slow data. So if you ask the business whether it's better to use data now and act on it, whatever the action is, an alert, a payment, even just a report, now is better than later, no matter if later means seconds, minutes, or days later, right? That's the game changer of data in motion with a streaming platform. That's the high-level difference. You still use a data warehouse for reporting; it's perfect for that. You still use a big data platform for training models, right? But for many use cases, acting now is more business value, which can be reducing risk, increasing revenue, making customers happy, depending on the use case. That is the difference of using data in motion while it's happening. And with that, now speaking about the technology, this is the big difference to a message queue or message broker. A message broker is only there to send data from A to B. That's great, right? But that alone does not add the business value. The business value is when you also use the data in real time, and that's not what you do with a message broker. So I explain Apache Kafka as four different components. Number one is the messaging component. That's what everybody understands and what people are using. However, number two, and that's the thing people most underestimate, is the storage of Kafka, because with Kafka you also decouple the systems very well. You put the data into the Kafka log, and then every single consumer can consume at its own pace, because the reality is that yes, real-time data beats slow data, but most systems are not real time, and some will never be real time. So you get data into Kafka once, from a real-time messaging system, from a web service request-response, or from a batch workload. And then you have it in Kafka once, and everybody can consume it: one in real time, one in near real time, one request-response, and one in batch. They are all decoupled because of the storage of Kafka. This is the biggest game changer compared to a message queue: you cannot just do messaging in real time, you also provide data consistency across different systems, because most systems are not real time. This is the biggest value of Kafka, often even more important than the real-time capability. And in addition to the messaging and storage combination, which is the core of Kafka, you have Kafka Connect for data integration, and you have Kafka Streams or KSQL for stream processing, for correlating the data.
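To make the messaging and storage points concrete, here is a minimal sketch using the standard Apache Kafka Java clients: one producer writes an event into the log once, and a consumer group reads it at its own pace. The topic name, broker address, and group id are illustrative assumptions, not details from the conversation.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class PaymentsPipelineSketch {

    public static void main(String[] args) {
        // Producer: writes an event once into the durable Kafka log.
        Properties producerProps = new Properties();
        producerProps.put("bootstrap.servers", "localhost:9092"); // assumption: local broker
        producerProps.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        producerProps.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps)) {
            producer.send(new ProducerRecord<>("payments", "customer-42", "{\"amount\": 99.90}"));
        }

        // Consumer: one of possibly many consumer groups, each reading the same
        // log at its own pace (real time, near real time, or batch).
        Properties consumerProps = new Properties();
        consumerProps.put("bootstrap.servers", "localhost:9092");
        consumerProps.put("group.id", "fraud-detection"); // illustrative group name
        consumerProps.put("auto.offset.reset", "earliest");
        consumerProps.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        consumerProps.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps)) {
            consumer.subscribe(List.of("payments"));
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            for (ConsumerRecord<String, String> record : records) {
                System.out.printf("offset=%d key=%s value=%s%n", record.offset(), record.key(), record.value());
            }
        }
    }
}
```

A second consumer group with a different group.id would read the very same log independently, which is the decoupling Kai describes.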
And while you are absolutely right that most of our customers or open source end users start with building an ingestion pipeline, even there, even if you don't do stream processing yet, which is more advanced, you should from the beginning use Kafka Connect for that, right? Because it's part of the Kafka ecosystem for doing integration with databases, with other message queues, with data warehouses. That's built into the platform, and you don't need yet another ETL tool or cloud service for it, because even the integration and processing capabilities are built on Kafka: scalable, reliable, real-time decoupling, guaranteed ordering. All of that is built into one platform. This is really what makes Kafka so unique in the market. With a message broker, you need to add another ETL tool with another code base and infrastructure, you need to add another storage system, and you need another correlation engine. With the Kafka ecosystem you get all of that in one platform, end to end, which makes operations, scalability, and support much, much easier. And this is really why Kafka is so successful in the market.
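As an illustration of the Kafka Connect point, here is a sketch of a source connector configuration that streams rows from a relational database into a Kafka topic. It assumes the commonly used JDBC source connector; the connector class, connection details, table, and topic prefix are placeholders, and exact property names can vary by connector and version.

```json
{
  "name": "postgres-orders-source",
  "config": {
    "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
    "tasks.max": "1",
    "connection.url": "jdbc:postgresql://localhost:5432/shop",
    "connection.user": "connect",
    "connection.password": "********",
    "mode": "incrementing",
    "incrementing.column.name": "order_id",
    "table.whitelist": "orders",
    "topic.prefix": "db."
  }
}
```

A configuration like this is typically posted to the Kafka Connect REST API (POST /connectors) on a self-managed Connect cluster, or handled as a fully managed connector in a cloud service, so no separate ETL code base is needed.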
00:20:34
Wow, that's very interesting. And you know, while all four of these components are very obvious, and you have explained them in a very nice way, there's one thing I want to dive deeper into, and that is the very last component, stream processing. You talked about ksqlDB, right? So first of all, help us understand: how is ksqlDB different from, let's say, Flink or Samza or Beam? What's the fundamental difference there?
00:21:02
Yeah, so let's first talk about what's the same, right? Because it's really important to clarify that stream processing in general means that you correlate data. You get data from different systems, not all in real time, right? So maybe one is sensor or log data, real time, high volume. Another one is a database, like Oracle. And the third one is an ERP system like SAP with a web API. You get data in and out of these systems, and stream processing correlates the data after you get an event in. This can be stateless, where you just look at one event, or it can be stateful, where you have sliding windows, like always monitoring the last 30 minutes or 30 seconds. This is where stream processing really is data in motion, right? You act when an event is happening, and this is the added value, no matter which stream processing framework you choose. Now, however, the question comes up: which one should you choose for your project? The short answer is that from a feature perspective, most of these frameworks, at least the modern ones like KSQL, Kafka Streams, Flink, maybe even Spark Streaming, overlap in 70 to 80 percent of their features. So all of them are pretty good, and they have some nuances where they are better or worse. How I typically recommend customers look at it is this: in general, because these are often critical pipelines at scale, the more components or infrastructures you add, the harder it is to operate them end to end without data loss and with low latency, and also regarding cost and operations. The fewer systems you have, the easier it is, and this is, in my opinion, the biggest advantage of using Kafka Streams or KSQL (Kafka Streams is Java, KSQL is SQL code): you can do stream processing with just one ecosystem, because it's built on Kafka, so you only have one distributed infrastructure in the end. Flink or Spark Streaming are separate systems. But they also have pros, right? Flink is very strong in doing stream and batch processing; that's where Kafka is not the right tool in the end. And Flink, for example, has Flink SQL, which is ANSI SQL, so it's really the same as what you would write against Oracle. That's not what KSQL is. And then Spark, well, it's not real time, but it's good if you already have Spark or Databricks in place. So all of them have their trade-offs. In summary, I would evaluate it in these categories: if you just want to do stream processing, think about whether you need another technology, or whether just Kafka with Kafka Streams or KSQL is enough; then it's the easiest and most cost-efficient option. Otherwise, I think Flink is really the other standard in the market. The adoption is huge, and it's a great framework, so it's the other framework where we see vendors emerging around it, cloud services are available, and the open source community is growing like crazy. That's definitely a great option. And then, I mean, Databricks is also doing more streaming these days, but Spark, I would say, is mainly for when you already have Spark workloads and want to combine them in the same cluster with some real-time data. These are typically the options. There are, again, products and cloud services around all of this, but I would say these are the standards in the market that you should take a look at, because, like we discussed in the beginning, this is where the communities are, right? This is where the adoption is.
There are different competitors for that, and this is a win-win for the end user. I would only take a look at other niche stream processing engines if they are built for a specific problem you have. But in general, these are the right ones to look at and evaluate.
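As a small illustration of the Kafka Streams option discussed above, here is a minimal Java sketch of a stateful, windowed aggregation that continuously counts events per key in 30-second windows. Topic names, the application id, and the use of String serdes are illustrative assumptions.

```java
import java.time.Duration;
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.TimeWindows;
import org.apache.kafka.streams.kstream.Windowed;

public class PaymentsPerCustomerApp {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "payments-per-customer"); // illustrative app id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");     // assumption: local broker
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();

        // Count payment events per customer key in 30-second windows,
        // continuously updated as events arrive (data in motion).
        KStream<String, String> payments = builder.stream("payments");
        KTable<Windowed<String>, Long> counts = payments
                .groupByKey()
                .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofSeconds(30)))
                .count();

        // Emit every updated count to an output topic instead of waiting for a batch job.
        counts.toStream()
              .map((windowedKey, count) -> KeyValue.pair(windowedKey.key(), String.valueOf(count)))
              .to("payments-per-customer-counts");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```

Because the whole application runs on the ordinary Kafka client library, there is no second distributed framework to deploy, which is the trade-off Kai contrasts with running a separate Flink or Spark cluster.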
00:24:34
Right. And in terms of ksqlDB, how is it different from a materialized view? Is the underlying technology the same, like the way you have materialized views in Postgres, or the databases built around materialized views like Materialize (materialize.com)? Is it a similar technology under the hood, or is there something fundamentally different with ksqlDB?
00:24:57
The fundamental difference is that here we are talking about data in motion, while something like Postgres is data at rest. That's the fundamental difference, right? With views, in the end, you take a look at the data as it is today, and in a database that's very normal behavior: we store data in tables, we put some views on top of that, and an end user can query them with SQL or a web API. But that's data at rest. With KSQL you do exactly the same, and the same with Flink, right? You keep the state in the streaming application, and then you can query it, for example from a mobile app via a REST API, and get the current state. The fundamental difference is that the state is continuously updated in real time, so when you do a query, you can make sure you get the right information, or you can turn it around and do push notifications when something changes. You shouldn't query all the time. This is the big difference with stream processing: you don't poll for a change, because that doesn't scale well; you get notified when something changes. And here, again, it's not that one is better than the other; it really depends on the use case. If you have a business intelligence tool like Tableau and just want to query data, then a database is probably perfect for that, right? But if you want to get notifications in real time, like when you use your ride-hailing app and order a taxi, you don't want to do requests all the time for millions of users; that doesn't scale. Instead, when an event is happening, like the taxi arriving at the location, or when you make a payment and do fraud detection, you continuously process that in real time. This is what stream processing is built for, and only stream processing scales well for that in real time. A database doesn't scale for that, and it's not built for that.
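To make the push-versus-pull distinction concrete, here is a small ksqlDB-style sketch. It assumes a pre-existing event stream (rides_stream) and uses illustrative names; it is a sketch of the idea, not a statement about any specific deployment.

```sql
-- Continuously maintained, materialized state (data in motion):
CREATE TABLE rides_per_driver AS
  SELECT driver_id, COUNT(*) AS completed_rides
  FROM rides_stream
  GROUP BY driver_id;

-- Pull query: ask for the current state once, like a classic database lookup.
SELECT completed_rides FROM rides_per_driver WHERE driver_id = 'driver-42';

-- Push query: subscribe and get notified on every change instead of polling.
SELECT driver_id, completed_rides FROM rides_per_driver EMIT CHANGES;
```

The table is updated as each new ride event arrives, so the pull query reads fresh state, and the push query turns the polling pattern around into notifications.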
00:26:41
Right. Absolutely. So, coming back to the previous point we talked about: what are the typical cases where you should not use Kafka? I see that you have written a good post about it; we would love for you to share that with the listeners as well, in terms of when not to use Kafka.
00:27:01
Yeah. So that's very important. When I go to a customer or prospect, this is one of the first things I discuss, right? Because nobody is happy if, a year in, you find out too late that the technology is the wrong one. So use the right tool for the job and combine them. If a vendor tells you they can solve every problem, and some try to do that, it's not correct, right? Because it cannot work. And with Kafka it's pretty easy. That's what I did in my blog post and presentation about when not to use Kafka: you can easily qualify it out for some use cases. Number one: Kafka is not for complex analytics. Kafka is for real-time data; you can store it, and you have some simple query mechanisms, like a key-value lookup or replaying historical data in guaranteed order. But complex analytics, like you do with an ANSI SQL query with complex joins, that's not Kafka. It's that easy. For that you use an Oracle database, MongoDB, a time-series database; use the right tool for the job. So it's not for doing complex analytics; it's complementary to these databases. That's why most people use Kafka Connect for getting data into Kafka in real time, and then each consumer can decide by itself: some consume in real time directly, for alerting, but others need to get the data into a database for complex analytics, right? So that's the number one case, the easiest and most important one. Besides that, there are other ones. Kafka often does not do the last-mile integration. What does that mean? Last-mile integration is, for example, when you have hundreds of thousands of cars on the street, right? The last mile there is typically MQTT. Maybe it's a web API if you are in the smartphone gaming sector. Or if you are, for example, in manufacturing with IoT and robotics, then Kafka is also not the last-mile integration; that's an IoT protocol then. So typically Kafka is for the server side, for the backend. It can connect to clients and mobile apps, but not at scale. The basic rule of thumb: as soon as you need to connect to thousands of client applications, it's not Kafka. For example, with Confluent, and also in our community edition, we have a REST and HTTP proxy, so you can connect from a mobile app to Kafka, with produce and consume, but not if it's thousands of connections. That's the other thing to easily qualify out. And the third one, where people also sometimes try to solve the wrong problem with the technology: yes, we're talking about real time with Kafka, right? But always define what real time means; it's that important. Real time with Kafka means that you can do end-to-end low latency around 10 milliseconds or slower, which is good enough for 99.9% of use cases. But if you want to build a microsecond trading app like NestEgg, right, that's not Kafka, and that's also not Flink or any competing technology; that's not Pulsar, that's nothing like that. Or if you're doing hard real time for safety-critical operations, like building the next engine for your plane or doing real-time robotic systems for human collaboration, that's hard real time; that's C or Rust, right? Those are very different technologies. Kafka integrates with them to get the data out of there and correlate it with SAP, for example, but it's not for building safety-critical systems. So in summary: Kafka is not for complex analytics. It's not for last-mile integration if it's more than a thousand connections.
And it's also not for hard real time or microsecond latency. That's, in the end, an easy way to qualify it out, because that's not what Kafka or similar technologies are built for.
00:30:35
Wow, that was very insightful in terms of when not to use Kafka. So, Kai, looking back on your career and your work with Confluent and event streaming technologies, what have been some of the most rewarding or memorable experiences for you?
00:30:54
Yeah. So as I said, I've now been with Confluent for around six years. Back then, most people didn't know Kafka at all. Today everybody's using Kafka, around a hundred thousand organizations around the world, so it's really used everywhere and has become the de facto standard. Now the question is really how you use it and how you do advanced use cases. I really see two key innovations that are a game changer. Number one, as we discussed already, is the cloud, where you use it as a truly fully managed service so that you can focus on the business logic. The critical piece to understand is that there are many cloud services for Kafka in the market, with a lot of marketing, but most of them are not really fully managed. They still just provision the infrastructure and hand it over to you to support, and that's not fully managed. So really do the right evaluations and read the terms and conditions. But fully managed, and again not just Kafka but the whole ecosystem, that's really the game changer, including the connectors to other cloud services like Databricks or Snowflake or MongoDB Atlas. So that's game changer number one. And then, from a use case perspective, the real secret sauce in data streaming is the stream processing part. Most customers start with building pipelines because it's the easier part and still adds a lot of value, for example connecting your on-premise applications to the cloud, or connecting your mobile apps and sensors and getting the data into a database; super valuable, and it works with Kafka at any scale. But then building use cases with stream processing, like customer recommendations in real time while the customer is in the web store, right? We all know the use case: you bought this item on Amazon, maybe take a look at that one. But that's more of a batch use case. What customers are doing with stream processing is real-time clickstream analytics while the customer is clicking, instead of putting it all into Hadoop or Spark, running a batch workload, and sending an email with a recommendation a day later. That doesn't work, because the customer already bought it somewhere else. Doing this kind of decision making in real time with stream processing, that's the real game changer. And that's true for any industry: retail for upselling, for example; the telco industry for sending alerts to customers about network outages and predictions; predictive maintenance in IoT; fraud detection in payments, right? And that's what every industry needs, because every industry has some kind of payments. If you do fraud detection in batch in your data lake, you will still detect the fraud, but it's too late, because the fraud already happened, right? You need to detect it within 10 milliseconds, before the payment is accepted. So across industries, the secret sauce really is stream processing. You typically don't start with it because it's more advanced, but when you have built your pipeline first, it's easy to add one or two more applications that do the stream processing later. Those are really the innovations we see: the cloud, and then the stream processing.
00:33:45
Amazing. And, you know, Kai, before we wrap up today's episode, I have one last question for you. What are the challenges that you see still persisting in data in motion? What are some big unsolved problems that are still there, and what are you working on to solve them in the near future?
00:34:14
Yeah, so that's a great question. There are, again, two different kinds of answers. The number one problem and challenge is that people learned how to develop applications with data at rest. 90% or more of the developers and architects in the market, over the last 10, 20, 30 years, built a web service and a database, whatever the technology is or was. Web services, REST APIs, it doesn't matter, right? It's an API or SQL against a database, or nowadays a data lake or lakehouse, but it's data at rest, and that's how they understand how to build applications. Data in motion, no matter if it's Kafka or Flink or anything else, is a paradigm shift in how you develop applications, because the patterns are different, the best practices are different, the technologies are different, the APIs are different. So the number one challenge still today is education. People coming from university get it, because they already learn that it's different. But for most people in the market, education is the biggest challenge. That's why we have resources like our developer site, where we train people on how to do that. And of course, on top of that, we also build tools, like a visual coding tool where you can drag and drop pipelines together like you did with your favorite batch ETL tool, but then it's streaming under the hood, automatically and fully managed. So that's the one thing: making it easier and helping people through this paradigm shift. And then, for the people who are already using Kafka and understood it for the first use cases, the real challenges are the more advanced ones, like security, compliance, and data governance. These days many people are talking about the new concept of a data mesh, right? Building independent data products, separation of concerns, domain-driven design, microservices. That's great for decoupling everything; each business unit can do their own thing with their own technologies, and Kafka is the perfect data hub for that, because it's real time, scalable, and decouples things. But with that, organizational challenges come up. Who owns the data? How can you enforce the right API contract? Who has access to the data? Do you have audit logging for that? How can you enforce encryption end to end? These are the questions that typically don't come up when you have your first pipeline from one database to another, but they come up as soon as more business units are using it. And because this has come up so much in the last years, many of our advanced, mature customers have built their own solutions on top. But because every customer now needs this, this is really where we are solving the problem with products on top. If you take a look at Confluent Cloud today, one of our biggest pillars of investment is data governance, with the same things you actually know from your data lake or data warehouse: a data catalog, governance, data lineage, right? These are not new concepts, but they are mapped from data at rest in the past to data in motion, where it's all real time and there are different challenges. That's, in the end, how we help, with education and with products, to solve these problems.
00:37:15
That was very insightful, Kai. Thank you so much for that. So, Kai, as we wrap up the episode for today, thank you again so much for being a part of this show. It was such a pleasure having you here and learning so many things from you. So thank you for your time.
00:37:29
Yeah, it was great to be here. Thanks a lot.