Meet Dunith: a developer advocate @ Startree, whose blogs you might have come across on topics ranging from analytics and stream processing to the modern data stack in general. In this interview, you’ll learn about some of the awesome things he’s working on @ Startree, the unbundling of the streaming stack, his advice to budding writers, and his data journey so far!
Tell us about yourself, Dunith
I graduated in 2010 and worked as a Java developer for around two years. Then I joined WSO2, which is a Sri Lankan open-source middleware company. I got into their big data team and was one of the key members who drove the data analytics product vision. Over time, I moved into consulting and ended up as a solutions architect at WSO2. After that, I started writing about my data experience, and my blogs started gaining good traction on the internet. I worked there for around 8 years, was eventually discovered by Startree, and joined them as a Developer Advocate. Currently, I am working closely with the Apache Pinot developer community, helping people adopt Pinot, educating people about it, and creating content about not only Pinot but also analytics as a whole, especially the modern data stack, stream processing, and streaming databases.
Can you tell us more about Apache Pinot and the core hypothesis behind it?
Before answering that question, let me explain two things.
On one hand, we have traditional OLAP data warehouses like BigQuery and Redshift. On the other hand, we have stream processors like Apache Flink.
Pinot sits in between these two technologies.
By definition, Pinot is a real-time OLAP database, or a streaming database. You can think of Pinot as a black box that can ingest incoming events from a streaming data source and make them ready for querying in a matter of seconds. Pinot also lets you query the ingested data with low latency and high throughput.
How exactly is Pinot different from ksqlDB or any other materialised databases in terms of the basic architecture?
In Pinot, we write data into columnar formatted files called “segments”, and they are scattered across multiple nodes. We query these segments through the “broker”, which knows where to find them and propagates the query in a fast and efficient manner. Queries are executed at the node level, and the results are “gathered” by the broker and sent back to the client. In short, Pinot queries work on the “scatter-gather” principle.
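The scatter-gather idea described above can be sketched in a few lines of Python. This is only a toy illustration and not Pinot's actual code or API: the `SERVERS` layout, `execute_on_server`, and `broker_query` are hypothetical names invented for this example, and segments are plain in-memory lists rather than columnar files.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical toy data: each "server" hosts a few segments (lists of rows).
SERVERS = {
    "server-1": [[{"country": "LK"}, {"country": "US"}], [{"country": "US"}]],
    "server-2": [[{"country": "LK"}, {"country": "IN"}]],
}


def execute_on_server(segments, column):
    """Run the query locally on one server: count rows per value of `column`."""
    partial = Counter()
    for segment in segments:
        for row in segment:
            partial[row[column]] += 1
    return partial


def broker_query(servers, column):
    """Scatter the query to every server, then gather and merge the partial results."""
    with ThreadPoolExecutor() as pool:
        # Scatter: each server computes its own partial result in parallel.
        partials = list(
            pool.map(lambda segs: execute_on_server(segs, column), servers.values())
        )
    # Gather: the broker merges the partial counts into the final answer.
    total = Counter()
    for partial in partials:
        total.update(partial)
    return dict(total)


result = broker_query(SERVERS, "country")
print(result)
```

The point of the design is that the expensive work (scanning segments) happens where the data lives, while the broker only merges small partial results.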
Conversely, ksqlDB and Materialize work on the principle of incrementally updated materialized views where new data is constantly passed through them.
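To make the contrast concrete, here is a minimal sketch of an incrementally updated materialized view, under the assumption that the view is a simple per-key aggregate. It does not reflect how ksqlDB or Materialize are implemented; the `OrderStatsView` class and the event fields are hypothetical.

```python
class OrderStatsView:
    """A toy materialized view: order count and revenue per restaurant."""

    def __init__(self):
        self.state = {}

    def apply(self, event):
        # Update only the affected key as each event arrives;
        # past events are never rescanned.
        row = self.state.setdefault(event["restaurant"], {"orders": 0, "revenue": 0.0})
        row["orders"] += 1
        row["revenue"] += event["amount"]

    def lookup(self, restaurant):
        # Reads are cheap because the answer is already computed.
        return self.state.get(restaurant, {"orders": 0, "revenue": 0.0})


view = OrderStatsView()
events = [
    {"restaurant": "r1", "amount": 10.0},
    {"restaurant": "r2", "amount": 5.0},
    {"restaurant": "r1", "amount": 2.5},
]
for event in events:
    view.apply(event)

print(view.lookup("r1"))
```

Here the query is fixed up front and its result is kept current as data flows in, whereas in the scatter-gather model the stored data is scanned when each query is issued.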
Can you share with us an example of a company using Apache Pinot for a specific use case?
A very popular and easily understandable example is LinkedIn. On LinkedIn, you get a notification in real time when someone views your profile. You can also see the people who recently viewed your profile in a dashboard, and this dashboard is powered by Pinot.
Another example is the merchant dashboard provided by Uber Eats. That dashboard allows restaurant owners to see real-time metrics about their orders, including current order count, total revenue, trending menu items, and customer satisfaction metrics.
In one of your articles you mentioned that “there needs to be an unbundling of the streaming stack”. What would that unbundling look like?
There’s a misconception in the streaming space that a single vendor platform can solve all the problems of streaming analytics. That is a bold claim, and it is no longer realistic.
A streaming analytics platform (or a stack) consists of multiple components, each of which plays a specific role in deriving insights from streaming data. We need to identify these components and their roles before planning out a new streaming stack or converting an existing batch analytics system.
The article you mentioned is an attempt to discover the typical elements, their roles, and the potential vendors in the space to implement them.
In summary, we can “unbundle” or “decompose” a streaming stack into a few critical components as follows.
What would be your advice for someone who would want to write content in and around data?
Good writers are not born. You can become one by mastering the art of transforming what’s inside your mind into a flow of words. I’m still trying to master it.
Writing is about turning the thought process in your mind into words so that someone can easily understand it. This is how I define writing. So before writing anything, you first have to identify what you are trying to say.
If you are a beginner, I would suggest you start by writing about one of your recent projects. Think from the reader’s point of view: how would this information be useful to my audience? You should always define your target audience, that is, who you are writing for.
There should be a good connection between your first word and your last word. Write at least 100 or 200 words on a daily basis. Keep exercising your brain.
What problems do you think are still unsolved in the modern data stack ecosystem?
I think we are yet to improve data reliability, discoverability, and governance in the streaming data space. The current technologies in the market are capable of getting you faster data. But you still have to make sure that the data is accurate and reliable.
Also, as your streaming platform evolves, you need to pay attention to data governance, things like how to share, enforce, and evolve event schemas. We also need the ability to trace event lineage, that is, to observe an event from its origin to its destination over time.
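As a rough illustration of what "enforcing and evolving event schemas" can mean in practice, the sketch below checks whether a new schema version is backward compatible with an old one. The rule used here (new fields need a default, shared fields keep their type) is a simplified assumption in the spirit of common schema-registry checks, not the behavior of any specific product; the schema layout and the `is_backward_compatible` helper are hypothetical.

```python
def is_backward_compatible(old, new):
    """Return True if a consumer using `new` can read events written with `old`."""
    for name, spec in new.items():
        if name not in old:
            # A field missing from old events must have a default to fall back on.
            if "default" not in spec:
                return False
        elif old[name]["type"] != spec["type"]:
            # A shared field must keep its type.
            return False
    return True


# Hypothetical versions of an order event schema.
v1 = {"order_id": {"type": "string"}, "amount": {"type": "double"}}
v2 = {**v1, "currency": {"type": "string", "default": "USD"}}  # safe evolution
v3 = {**v1, "currency": {"type": "string"}}  # breaks readers of old events

print(is_backward_compatible(v1, v2), is_backward_compatible(v1, v3))
```

Running a check like this before a producer ships a new schema version is one way to catch breaking changes before they reach downstream consumers.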
RAPID-FIRE
What is that one tool or platform that you think has really changed the whole modern data stack?
From what I have read, I think it started with Redshift. Redshift changed the way people think about data warehouses: it brought data warehousing to the cloud and opened up the space for others to build.
What is your go-to place for learning about all things data? Is there any particular book, newsletter, or publication that you would recommend?
I follow data influencers like Benn Stancil and Gunnar Morling, and several newsletters including Data Engineering Weekly. Also, I like to read the engineering blogs of companies like LinkedIn, Uber, and Netflix because they put out all the new stuff.
What is one thing that you love about your job and one thing that you hate about it?
The best part about being a developer advocate is educating people. Teaching is something that comes naturally to me. I love creating content that caters to different audiences.
There’s no such thing as hate, but yes, I definitely get tired sometimes, because it takes a lot of energy to learn something, digest it, and make it educational for the rest of the world.