May 02, 2023 · 29 min

S02 E10: From On-Prem to the Cloud: Managing ClickHouse with DoubleCloud with Natalia Shuliak, COO at DoubleCloud

When working with open-source technologies, you benefit from the community's creations, but you also have to do a lot of admin and support work as the technologies tend to break, and support usually falls on yourself. This is where DoubleCloud's platform comes into the picture. In this latest episode of the Modern Data Show, Natalia Shuliak talks about how DoubleCloud saves you from administrative work and allows you to focus on data pipeline development and management, while providing backup, security, and support.

Available On:
Spotify
Google Podcasts
YouTube
Amazon Music
Apple Podcasts

About the guest

Natalia Shuliak
COO

Natalia Shuliak is the Chief Operating Officer at DoubleCloud, a platform that helps data-driven companies build cost-effective, sub-second analytics on proven open-source technologies like ClickHouse and Kafka. She has previously worked with major companies like Microsoft, Databricks, and Splunk.

In this episode

  • Evolution of the data industry from the era of Hadoop and Spark to the Modern Data Stack.
  • How DoubleCloud helps build sub-second analytics.
  • Administrative problems of open-source tech.
  • DoubleCloud's pricing.
  • Changes in the licensing of open-source technologies.

Transcript

00:00:00
Hello everyone and welcome to another episode of the Modern Data Show, where we delve into the most recent trends and technologies in data and analytics, engaging with some of the most brilliant minds in the industry. Today we welcome Natalia Shuliak, Chief Operating Officer at DoubleCloud, a platform that helps data-driven companies build cost-effective, sub-second analytics on proven open-source technologies like ClickHouse and Kafka. She has previously worked with major companies like Microsoft, Databricks, and Splunk. Welcome to the show, Natalia.
00:00:31
Thank you very much. Happy to be here.
00:00:34
So Natalia, I'll start with the most basic question. Tell us a little bit more about your personal journey. How did you end up in DoubleCloud?
00:00:42
Yeah, that's a good question, and the journey was an adventure in itself. By origin I'm from Belarus, and I've been living in France for more than 10 years; I always joke that I'm adopted French. I was lucky, I think, in that I always wanted jobs that were intellectually rewarding, and that was also my motivation: I always wanted to be in something new, something that looks into the future. There was a period in my life, right after the dot-com bubble burst, when some investors approached me and said, you know what, we have these fantastic devices, Palm and Pocket PC, and we've actually created some applications for them. Would you want to do a go-to-market for us in the US, Germany, or the UK? And I said, wow, that sounds amazing. I had no clue what an application even was; I had seen a Palm but never played with one. But why wouldn't I try? And that's how I joined the whole technology world. After that I went into the cloud, reading lots of reports and research saying, okay, this is the next big thing. This was at a time when Microsoft was still all about Windows and nothing else, and I was lucky to be part of the team launching Microsoft Azure here in France. It was still called Windows Azure, just to give you a perspective on how it was. And when you're already in the cloud world, what's the next big thing that comes along? Obviously data. When I was doing my Global Executive MBA at INSEAD, there were several courses that everybody absolutely wanted to attend, and they were all about big data and AI. We looked into how you code with Python, and into how you should incorporate data into the whole strategy of your company. And for me it was like, okay, this is the next big thing I want to be part of.
So I graduated from my MBA and joined Databricks, and that was really how I got into the big data world. It was extremely exciting. I was lucky because at that time absolutely every CTO had a budget for AI projects, and Databricks was in front of them to talk about that. It was amazing; everybody was dreaming about how things could be. After that I continued to Splunk, which is the operational side of data usage. And that's the moment when DoubleCloud knocked on my door. I was so convinced by the development team. They are doing amazing things; they have years of experience managing open-source data technologies, and years of experience managing ClickHouse, one of the key things we have. And I thought, okay, I'm sold. This is the company I'm joining as my next step, because I believe in what they're doing.
00:03:37
Okay, and we will dive deeper into what you're doing at DoubleCloud. But before we get into that, tell us a little bit about your days back at Databricks, when Hadoop was still the shiniest new thing in the market, with Databricks probably one of the pioneers of Spark and the Spark ecosystem. What do you think has fundamentally changed between your days at Databricks and what you're seeing now in the form of this whole modern data stack?
00:04:15
I think a few things changed fundamentally. One is expectations in terms of time. Previously it was okay to analyze your data and have results once a day. The world accelerated so much that this is no longer the case. McKinsey did research, I think last year, showing that when 90% of business leaders say "I want my dashboards to give me business analytics fast," what they mean is less than 10 seconds. So that changed completely. The second major change is the following. For years there was a huge promise in data, and companies were buying tools and experimenting; everybody had budgets and so on. Now, in 2023, the world of free money has ended. The world of promises has ended. Everybody wants to see results, and they want to see them now. What that means is: either decrease the cost of your tools radically and keep playing with them, or give me a tool that actually produces a return on investment. And that is linked with problems like, okay, I have too many tools, how do I optimize that? There are still issues with performance, and the business leaders, the partners and companies I work with, want something extremely fast for analytics. How can I increase performance without being ruined? And the third: how can I find the talent? Everybody is hiring data engineers, data analysts, and so on. People are either stretched thin or they do not produce enough value. Then the biggest question: how can I make them produce enough value without overstretching them? That's another major change. So we moved from a playground with data technologies to, okay, let's do some real stuff. That's, I think, what changed.
00:06:14
Okay. So now with that, help us understand what does DoubleCloud do?
00:06:18
Okay. So DoubleCloud gives you a platform where you can build your data pipelines, data analytics, or modern data stack with managed open-source technologies. And we try to keep in mind what I've just said. Today the world wants to build real-time or near real-time analytics, for different scenarios. End-user analytics, or customer-facing and partner-facing analytics, cannot be slow. Previously everybody was building data warehouses, and it was okay for queries to take 10 or 20 seconds; but when you are providing data to your customers or partners, it cannot take that long. That's the first point. Second, the licensing model cannot be what it was before. So we keep all of that in mind. To give you an idea, one of our customers provides data sets to big retail groups. Now imagine the most frequently queried data set goes to the CIO of a retail group. The CIO starts querying it, and it takes 10 seconds. Ten seconds is extremely long: one, two, three... it's a long time. And then imagine several CIOs querying the same data sets, so it stops responding. That's why they started working with us. They needed technologies like ClickHouse and Kafka so that it becomes a very different story for them. Now it's really sub-second: the CIO gets results within a second, the queries perform as they should, and that builds the company's reputation. That's basically what DoubleCloud does: we help you build sub-second analytics.
00:08:12
And why couldn't those companies do it themselves? Why did they need you?
00:08:17
They are still doing it themselves. There are always two aspects to working with open-source technologies. One, you benefit from what the community out there is creating, and it's amazing. But you also do lots of admin work and lots of support work, because technologies have a tendency to break. Somebody needs to support them, and with open source, that somebody is usually you. So what we do is take this administration, this headache, away from you, so you can focus on what you love doing and where your expertise is: working with the data and making sure your data pipelines work exactly how they should. We provide backup, security, and support, plus features that respond to your very specific use case, and you just focus on creating and making data work.
00:09:15
And one thing I'm very curious about: what you're describing is essentially the underlying premise of OLAP databases, right? Instead of using a traditional OLTP system for analytics workloads, you have these OLAP workloads. Why ClickHouse? And that's tied to another question: how did you even start DoubleCloud? Was the goal to build a sub-second analytics system for companies, and then to find the best tool for that, so you said, okay, ClickHouse is one of the tools, let's build on top of it? Or is there a backstory related to ClickHouse?
00:10:07
There is a backstory, and there is also how things started. The backstory is that some of the developers on our team are contributors to the open-source code, and they know ClickHouse; our CTO built the first managed ClickHouse ever in the world. That was tested on quite a lot of customers, and then he moved to DoubleCloud. So there is that expertise. But we are also a data company, meaning we talk a lot with companies out there, trying to understand what's really happening and what the real need is. ClickHouse is one thing, but the real need, you said it yourself: there is OLTP and there is OLAP. If you have Postgres, for example, it doesn't mean you forget about a fantastic database like Postgres. No, you keep it, but you can also offload your analytics to ClickHouse. Then you have a very specific question: how do you replicate the data seamlessly from Postgres to ClickHouse? We have a tool for that. There are tools out there, lots of ELT tools; even on your website I think four or five ELT tools are mentioned. The problem is that these tools are okay when you're working with a small amount of data or when time is not crucial. But when you need to replicate databases from one place to another, whether it's Postgres, Snowflake, or any other database, to a database like ClickHouse because you need analytics to be sub-second, these tools cannot do it. Then what tool do you use? This is our second expertise: we look at the use case from an end-to-end perspective, and that's why we provide the tool for it. And then we have a customer called Yango Deli. What they do is food delivery across African countries; they operate in 16 African countries today.
So, for example, say you decided to create your own family business delivering food, but you need some technology to do this, and ideally somebody gives you the operational know-how: what to track and so on. Yango Deli's partners requested dashboards to start with. They wanted to look at things and see how the business was functioning. The problem is that the whole data analytics stack at Yango Deli was built as batch analytics, and the dashboards were refreshed daily, once a day. They looked at traditional tools, not DoubleCloud, and realized two things: first, if we give our partners a dashboard, it will take us almost one year to change the whole batch analytics to near real-time; and second, it will ruin us, it won't work with our margins, because most visualization tools have a per-user licensing model. So they reached out to us. They started with our free visualization tool, which reads ClickHouse natively, and then they offloaded the data from their Postgres to ClickHouse to analyze it. With that they got almost near real-time dashboarding that they give to their partners. And they managed to do all of that within one week. This is a real use case today; it changed the ecosystem completely. That's where our expertise lies: we talk a lot with companies out there, we believe we understand what the real use case is, and we provide tools for that.
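The Postgres-to-ClickHouse offload described above boils down to repeatedly moving rows that changed past a watermark. Here is a minimal sketch of that idea in Python, with plain in-memory lists standing in for both databases; all names here are illustrative, not DoubleCloud's actual transfer service.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Row:
    id: int
    updated_at: int  # epoch seconds; the replication watermark column
    payload: str

def incremental_sync(source: List[Row], replica: List[Row], watermark: int) -> int:
    """Copy rows whose updated_at is past the watermark; return the new watermark."""
    changed = [r for r in source if r.updated_at > watermark]
    index = {r.id: i for i, r in enumerate(replica)}
    for row in changed:
        if row.id in index:
            replica[index[row.id]] = row   # row already replicated: update in place
        else:
            replica.append(row)            # new row: append
    return max((r.updated_at for r in source), default=watermark)

source = [Row(1, 100, "a"), Row(2, 150, "b")]
replica: List[Row] = []
wm = incremental_sync(source, replica, watermark=0)   # initial full copy, wm == 150
source.append(Row(3, 200, "c"))                        # new row arrives
source[0] = Row(1, 210, "a2")                          # existing row updated
wm = incremental_sync(source, replica, wm)             # only rows 1 and 3 move
```

A real tool would read changes from the database's replication log rather than scanning a timestamp column, but the watermark bookkeeping is the same shape.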
00:13:55
Okay, that's super helpful, and it makes sense of what you explained earlier about why you use Kafka. So basically what you're saying is you have built a change data capture kind of technology, which is able to asynchronously read all the changes happening in those operational databases and push them into ClickHouse, which is an OLAP database, right? Quick question there: when you say you also manage the pipeline, that's what you mean, right? You're basically building those real-time pipelines only for database replication, and you're probably not building sources and connectors like Fivetran, which would allow you to ingest data from Salesforce or HubSpot or any of those CRM systems. Is that understanding right?
00:14:47
No, actually not really, because our transfer service shares a portion of its connectors with open-source Airbyte. So whatever Airbyte can do, we'll do as well in terms of connectors, and then we are adding the connectors that are not there ourselves. For example, one very popular use case we've seen: several companies come to us and say, we have Snowflake as a data warehouse, we love it, but it doesn't work for this customer-facing, end-user-facing analytics, so we want to offload things to ClickHouse, please help us. This is where we created the Snowflake connectors so that our transfer service can actually do this for them. So that's one of the things.
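The change-data-capture pattern discussed here can be pictured as applying a stream of insert/update/delete events to a keyed replica. This is a toy illustration of the general technique, not DoubleCloud's implementation:

```python
# Apply change events (op, primary_key, value) to a replica keyed by primary key.
def apply_cdc(events, table):
    for op, key, value in events:
        if op in ("insert", "update"):
            table[key] = value        # upsert the latest version of the row
        elif op == "delete":
            table.pop(key, None)      # tolerate deletes for unseen keys
    return table

table = {}
events = [
    ("insert", 1, {"name": "alice"}),
    ("insert", 2, {"name": "bob"}),
    ("update", 1, {"name": "alicia"}),
    ("delete", 2, None),
]
apply_cdc(events, table)
# table is now {1: {"name": "alicia"}}
```

In a production pipeline the event stream would come from something like the Postgres write-ahead log via Kafka, and the "table" would be a ClickHouse destination, but the apply logic is conceptually this simple.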
00:15:30
Amazing. And let's talk a little bit about scale. For your biggest customer, without naming them, what's the size of the data you're handling for them?
00:15:44
Overall with ClickHouse, I think the biggest clusters in the world hold petabytes of data, even more. So there are no limits. And we have different customers. We have the really small ones: you want to launch your startup and you want something extremely cost-efficient that you can create within one week. These are the type of customers that would go to DoubleCloud. And then there are also big ones with huge amounts of data, petabytes. The reason they work with us is that they want predictable pricing and performance, because otherwise their business would be at risk. So that's the scale: it depends on your use case, and if the use case works, it doesn't matter how big you are. Both things can work.
00:16:45
Okay, and talking a little bit about your offerings: is it completely on cloud, or is it on-prem, or do you do a kind of hybrid of both?
00:16:56
So, our offerings are fully on cloud. Imagine you're in a company that has something on-premise: you have ClickHouse on-premise, but for whatever reason you need to use Kafka to put events into ClickHouse. You can still use our Kafka, which will be on the cloud, with your ClickHouse. Or, to continue the ClickHouse example: you have ClickHouse self-hosted somewhere on your servers and you decide, okay, I need to back something up into the cloud. You would use our transfer service from ClickHouse to ClickHouse in the cloud and back it up there. We also have features like ClickHouse over S3, meaning you use it as hybrid storage: you have an unlimited amount of data stored in S3 at the price of S3, so it costs almost nothing, right? Then you analyze only what's needed and put that data on SSD, which decreases the cost significantly. So there are cases like that. Some companies use us just for backup, in case something happens: their own service goes down or, I don't know, the electricity is cut.
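The "ClickHouse over S3" hybrid storage described above corresponds to ClickHouse's tiered storage: a TTL rule moves older data parts from a local SSD volume to an S3-backed volume. As a sketch, the snippet below just generates such DDL; the `hybrid` policy name and `cold` volume name are illustrative assumptions, since the storage policy itself would be declared in the ClickHouse server's storage configuration.

```python
# Hedged sketch: build ClickHouse DDL for a table that keeps recent data on
# local SSD and ages older parts out to an S3-backed volume via a TTL rule.
# Assumes a storage policy named 'hybrid' (SSD "hot" disk + S3 "cold" volume)
# already exists in the server config; both names are illustrative.
def hybrid_table_ddl(table: str, hot_days: int = 30, policy: str = "hybrid") -> str:
    return (
        f"CREATE TABLE {table} (\n"
        "    event_date Date,\n"
        "    user_id UInt64,\n"
        "    payload String\n"
        ") ENGINE = MergeTree\n"
        "ORDER BY (event_date, user_id)\n"
        f"TTL event_date + INTERVAL {hot_days} DAY TO VOLUME 'cold'\n"
        f"SETTINGS storage_policy = '{policy}'"
    )

ddl = hybrid_table_ddl("events", hot_days=30)
```

With a layout like this, only the recent "hot" window sits on expensive SSD, while the long tail of historical data lives at S3 prices, which matches the cost argument made above.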
00:18:16
And how does the pricing for DoubleCloud work?
00:18:21
It's pay as you go. We literally wanted to be extremely transparent with everybody. There was a very famous meme recently across the whole internet when somebody accidentally spent 303 seconds on BigQuery, and that's the other side of serverless. We don't do this. We do very transparent pricing: pay as you go, monthly billing, you know in advance what it is going to cost you, and there are no surprises. It won't be per scan, for example; it's based on the data that is stored, not on scans, and not on licensing per user. It's the real thing: storage, not ingestion, not what the other companies would do.
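The difference between storage-based and per-scan billing mentioned here is easy to see with back-of-the-envelope arithmetic. All rates below are made-up placeholders, not DoubleCloud's or anyone's actual prices:

```python
# Storage-based billing: cost depends only on what you store, not on how often
# you query it, so the monthly bill is predictable.
def storage_cost(stored_gb: float, rate_per_gb_month: float) -> float:
    return stored_gb * rate_per_gb_month

# Per-scan billing: cost grows linearly with query volume, which is where the
# serverless bill-shock stories come from.
def scan_cost(scanned_gb_per_query: float, queries: int, rate_per_gb: float) -> float:
    return scanned_gb_per_query * queries * rate_per_gb

stored = storage_cost(500, 0.10)           # 500 GB stored      -> 50.0 per month
light  = scan_cost(5, 100, 0.005)          # 100 queries/month  -> 2.5
heavy  = scan_cost(5, 100_000, 0.005)      # 100k queries/month -> 2500.0
```

The point of the comparison: for a customer-facing dashboard that is queried thousands of times a day, a per-scan model makes the bill a function of your users' behavior, while a storage model keeps it fixed.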
00:19:08
Nice, good to know. So my next question: in the last couple of years we have seen a huge proliferation of fragmented tools across the modern data stack. You could literally break DoubleCloud's offering down into, let's say, 10 or 15 different startups: BI tools, change data capture tools, ETL tools, data warehousing tools, and so on. And I think we're seeing a consolidation happening in the industry, where bigger players are branching out into adjacent territories and building a consolidated offering, right? What are your plans for that? You already have a good OLAP layer, a good ETL layer, a database replication layer. What's next for you guys?
00:20:02
Yeah, that's a fantastic question, and it goes into what DoubleCloud is. As I said at the very beginning, we are a platform of managed open-source technologies. There are several use cases, and one of the most typical is real-time or near real-time end-user analytics. But what we really want is to give you a platform where you can build, or cover, most of the data use cases, and they will be different. There is a famous diagram of what the modern data stack looks like, right? And there is everything in it. Depending on what you want to do, your data use cases will take different modules, and we want to give you those modules. Gartner says that by 2026 most companies will try to move their data use cases onto one platform. It means the market will be about consolidation; it will be a platform kind of story for lots of companies, and that's what we want to do. As a result, we will continue adding modules to our platform. Please stay tuned, but we are actually soon releasing a managed Airflow. Even if you're not doing real-time analytics but near real-time analytics, which is still batch analytics but extremely fast, you need a scheduler, and Airflow is the scheduler for that, right? That's one of the necessary modules. And we'll continue doing this: we talk to companies, we understand what they are really doing with data and what their needs are, and we will be releasing these things. But the promises will always be the same: be one of the most cost-efficient providers on the market in performance and cost, sub-second analytics, and on the cloud. Those are basically the three promises we're keeping, and obviously open-source. That doesn't change.
00:22:03
All right. So talking a little bit about open-source: you mentioned you have built on top of ClickHouse, you have Kafka, you mentioned Airbyte. And just to add a little context, we're seeing a lot of open-source technologies changing their licenses in terms of how the technology can be used. A very good example is Airbyte itself, which started with a license that was much more permissive, but the license they have now is much more restrictive, right? What risk do you see to your business, as DoubleCloud, in leveraging these open-source technologies and building a business on top of them?
00:22:49
Yeah, it's a great question. I think it's a double-edged sword for absolutely everybody. On one side it's a risk for the businesses that are building on top of open-source technologies. On the other side, it's also a community risk. Yes, absolutely any open-source technology can change its license, but then the community will start switching to something else, and there will always be new alternatives. Nothing is constant in this world, and that's why you can no longer be a mono-technology company; you absolutely must protect yourself by having different offers for the market. And the same goes vice versa: I'm talking from a vendor perspective, but from the perspective of a company, migration is constant. Be ready for it. You built your perfect modern stack today; things change tomorrow, it no longer works for your use case or from a performance and cost perspective, and you will have to migrate. That's why it's great to always build your systems from the perspective that migration will happen one day; then it will be much less painful. Today I talk a lot to CTOs and data leaders of companies, and 60-70% of them are thinking about migration, but they are only thinking about it, because it's hard.
00:24:14
That's amazing advice. My next question, Natalia: outside of DoubleCloud, what particular tools or technologies in the broader modern data stack excite you personally? What's one thing you're super bullish about?
00:24:34
Wow, that's a good question. Obviously right now there is a huge hype, and more than that, about generative AI, right? And that's amazing, because I've seen several hype cycles around different technologies so far. There was a first wave about AI: everybody was talking about it, but then companies didn't find the use case and it died, and new things came instead. Then it was blockchain: huge hype, somebody earned money, somebody lost money, and there are still hopes something will come out of it, but the use cases are still to be born. Generative AI, I believe, is a completely different story. I myself do two things with it constantly. When I don't have time and don't want to Google and dig through information, I go and ask ChatGPT. I haven't used all the tools, but this is the one I'm using. And second, when I'm tired of writing and rewriting pieces of content, I give it bullet points and receive some meaningful text, right? That, I think, is the really interesting thing. It tells us maybe there's less hype and more real use cases, real things we can do with it. And it will impact quite a lot of spheres of our life. So that's what I see.
00:26:14
Okay, so you're bullish on generative AI. That brings us to the next point: where do you think it will impact data teams the most? In which areas of data, or of working with data, do you see an obvious use case, something that has to come as soon as it can?
00:26:33
Yeah, one of the things we're actually experimenting with ourselves, comparing it against our own expertise: I think it can rewrite queries pretty well. If you migrate, let's imagine, from BigQuery to ClickHouse, there is always a big question about your queries. The same story goes from Postgres to ClickHouse, or whatever. So what we are experimenting with is whether ChatGPT can actually create the queries and work with complex queries. Simple queries are easy, but complex queries, can it do them nicely? And I think the answer will be yes, it can. It means that the migration we've just talked about will become a much easier use case for absolutely any company, so there will be lots of migrations happening. Second, right now data analysts are overstretched in their jobs, and they suffer from quite a lot of things. If they want a data visualization tool, quite often they depend on their data engineers, and they wait for weeks to get some infrastructure or some data sets to work with. Some of this can be resolved just by changing tools and democratizing your dashboarding, which makes it easier, and we do that ourselves with our visualization tool. But the next thing, I think, will be a replacement of some data analyst work, because ChatGPT can actually read the data pretty efficiently and explain, in plain comments, what you see there. And that will be amazing.
00:28:10
Nice, amazing. Natalia, as we inch closer to the end of the episode, what would be the best way for our listeners to reach out to you and perhaps explore working with you?
00:28:27
The easiest is always LinkedIn. I'm a modern person; I've moved completely from email to tools like Slack, WhatsApp, other messaging systems, and LinkedIn. So if you want to reach out, please do it via LinkedIn. It's the best way for me.
00:28:47
Perfect. Thank you so much for your time, Natalia. This was an amazing episode for us; I learned a lot about the work you and DoubleCloud are doing. Thank you for sharing all of that with us.
00:28:59
Thank you very much and thanks for hosting me today.