Meet Juan, who turned his Ph.D. research into a successful company, Capsenta, and is now a Principal Scientist at data.world. He believes data management needs a paradigm shift, and that shift is bringing the social aspect into the business. In this interview, we touched on his journey, how the data scientist role has changed over time, the ABCs of building data products, and the socio-technical approach to data. Read about his journey and his work, which are as amazing as his podcast.
Tell us how you went from being a web developer to a Principal Scientist at data.world.
My parents are Colombian, but I was born and raised in the Bay Area, in San Jose. When I was 10 years old, I moved to Colombia and that’s where I started my undergraduate education. In 2004, a guest professor, Oscar Corcho, came to give a seminar on the Semantic Web, and I was fascinated by the topic - what’s the next generation of the web, and how can the web be smarter? That’s what caught my attention. After that, I transferred and moved to Austin to finish my undergraduate degree at the University of Texas.
I've always been interested in research, but at the same time, I always liked entrepreneurship. In 2005, I started my first company. I had an engineering base in Colombia, my business partners were in Switzerland, and I was in Austin. Later in 2006, I met Professor Dan Miranker who asked me: What is the relationship between relational databases and semantic web technology? And that question changed my life. I really got interested in that problem. So I applied for a PhD at the University of Texas, got accepted, and left the startup.
The thought behind the application of the research was: how are semantic technologies going to take off? People have their data in so many different relational databases like Oracle, SQL Server, etc. How are they going to put all this together? Well, we had an answer from our research. There was a startup opportunity there, and that’s what we did. In 2014, I officially started the company Capsenta with my PhD advisor, Dan Miranker. Capsenta was doing semantic data virtualization: you have your semantic layer and map it back to all your different databases. This was before dbt was a thing.
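To make the idea concrete, here is a minimal sketch of what a semantic-layer mapping can look like: business concepts declared once and translated back to the physical tables underneath. The concept, table, and column names here are hypothetical illustrations, not Capsenta's or data.world's actual format.

```python
# A minimal sketch of the idea behind semantic data virtualization:
# a declarative mapping from physical tables to business concepts,
# so requests are written against the semantic layer and translated
# back to the underlying databases. All names here are hypothetical.

SEMANTIC_MAPPING = {
    "Customer": {                       # business concept in the semantic layer
        "source": "crm.dbo.customers",  # physical table in one of many databases
        "fields": {
            "customer_id": "cust_id",   # semantic attribute -> physical column
            "full_name": "name",
            "signup_date": "created_at",
        },
    },
}

def translate(concept: str, attributes: list[str]) -> str:
    """Rewrite a semantic-layer request into SQL against the source system."""
    mapping = SEMANTIC_MAPPING[concept]
    cols = ", ".join(f"{mapping['fields'][a]} AS {a}" for a in attributes)
    return f"SELECT {cols} FROM {mapping['source']}"

print(translate("Customer", ["customer_id", "full_name"]))
# SELECT cust_id AS customer_id, name AS full_name FROM crm.dbo.customers
```

The point of the sketch is the separation of concerns: consumers only ever see the business concept, while the mapping keeps track of where the data physically lives.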
data.world was actually one of our customers and one of the founders of data.world, Bryon Jacob, did the tech due diligence for one of my investors. So I knew him very early on and I'm like, “who is this guy asking the really hard and really smart questions?” We really hit it off and just kept talking for a long time, sharing ideas. Then it came to a point when we decided we should be working together. Both companies were a match in terms of the mission, the vision, and the technology. In 2019, Capsenta was acquired by data.world. That’s how I ended up here as a Principal Scientist.
In one of your blogs, you wrote that “data management can’t just be about technology. Data management is a socio-technical phenomenon.” Why do you think so?
As technologists, we don't like to talk to people. That's the problem. If you look at the articles that people wrote back in the mid-90s about the data problems they were having, you'll realize that these are the same problems that I heard yesterday. So we’ve had 30 years of dealing with the same problems. Why?
I would argue that with all the technology push we have done up to now, we've hit a barrier. If we continue to think that we just lack the right technology, we're literally going to drive ourselves insane. We need a paradigm shift, and that paradigm shift is to start bringing more of the social aspects into data management. If you think about it, we, as technologists, define success from a technological perspective. But what about social success? At the end of the day, organizations need to answer critical business questions in order to make money and save money.
Technology is a means to an end. I think we have focused a lot on technology, but haven’t really zoomed out to understand what those ends are, which is the social aspect. We really need to understand the people, the process, and all the context around that stuff. I would argue that the successful technologists are the ones who become more social.
What would be your advice to data practitioners and data owners so they will be able to adapt to this socio-technical approach to data?
First of all, we talk about data literacy all the time. I think we should start talking about business literacy. That means not just the data people, but everybody in the company needs to understand how the business works. For me, that's the number one thing: business literacy.
Number two: Get comfortable outside of your comfort zone. If you live in your bubble and feel great, but you want to be successful, you have to get out of that bubble and feel uncomfortable and learn from that discomfort.
The third thing is to understand the problem holistically. Keep asking “why?” Successful data practitioners are the ones who are a lot like therapists. They ask a lot of questions. You have to do the same with the folks around you. When someone asks you to pull certain data, ask, “Why do you need this data?” The most common answer you’re going to get is that they want to democratize data so they can be more data-driven. Of course, we need to be all of that. But what’s unique here? At some point you may realize, oh, that person really doesn't know. That's why it's important for everyone to understand how the business works.
Next is knowledge. We live in a world of just data. We need to move to a world of knowledge and understanding of the people, context, and relationships surrounding data. There should be a balance between efficiency and resilience. We are all so focused on having short-term solutions that are very efficient, but by doing that, we are not setting the foundations for being resilient. Think about when the pandemic started - we ran out of toilet paper. Why? Because we had an efficient supply chain, but not a resilient supply chain. The Suez Canal is super efficient to go through, but when a boat changes its angle and gets stuck, the entire economy of the world shuts down! Hence, the supply chain was not resilient.
How has the role of data scientists changed from when you first started studying computer science back in 2004 to now? And what's your take on the roles of data scientist and data engineer?
Let’s go back in time and look at it decade by decade. The web was invented by Tim Berners-Lee in the early 1990s, and it became very popular, especially with the rise of e-commerce. This is when we started generating data on the web, which took us into the big data era. Now you have a lot of data, and in order to manage these big data clusters, the role of data engineer was established.
Then in the 2010s, the cloud became popular along with AI/ML and deep learning. This is when we started getting insights from all the data that we had harvested. With this, machine learning also became a commodity. Anybody can become a data scientist, because all they have to do is learn various frameworks. You don't really go and develop these models. During the 2010s, we heard a lot about the 80/20 principle: spending 80% of your time cleaning the data and 20% of your time doing analysis.
We went from ETL to ELT - dump the data without understanding what it means, and then try to make sense of it. We still haven't done that knowledge modeling and semantic work. So then we started seeing data scientists complaining about this, and that's what happened from 2010 up to now. The shift now is that people are starting to realize that they need to understand whether they can rely on this data or not, and what the data actually means.
The folks at dbt have really pushed the idea of transformations. With the rise of analytics engineers, we're starting to understand more about the meaning of data. The data engineer of the 2000s (Hadoop times) is different from the data engineer of today. It's different because we are no longer using Hadoop, but the principles are the same. From the principles perspective, we need to understand the inputs and the outputs, and forget about the name of the technology.
In ETL, the E and the L are extracting and loading. The principle is that you're moving data. You need very strong people who make sure the data is moving reliably, with the help of technologies like data observability and monitoring.
The second thing is to make sense of that data; that’s where data transformations and modeling come into play. Analytics engineers are responsible for this task, and I like to call them knowledge scientists.
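As a rough illustration of that split (not any specific tool's API), here is a minimal Python sketch using hypothetical table and column names: the extract-and-load step just moves rows and checks they arrived, while the transform step is where the raw data gets business meaning.

```python
# A minimal sketch of the E/L vs. T split described above, with
# hypothetical table and column names. The "move data reliably" part
# verifies the load; the "make sense of it" part is the transformation.

import sqlite3

def extract_load(rows: list[tuple], conn: sqlite3.Connection) -> None:
    """E and L: move raw data as-is and verify it arrived intact."""
    conn.execute("CREATE TABLE IF NOT EXISTS raw_orders (id INTEGER, amount_cents INTEGER)")
    conn.executemany("INSERT INTO raw_orders VALUES (?, ?)", rows)
    loaded = conn.execute("SELECT COUNT(*) FROM raw_orders").fetchone()[0]
    assert loaded == len(rows), "row counts must match: basic observability check"

def transform(conn: sqlite3.Connection) -> None:
    """T: the knowledge-scientist step, where raw columns get business meaning."""
    conn.execute("""
        CREATE VIEW IF NOT EXISTS order_revenue AS
        SELECT id AS order_id, amount_cents / 100.0 AS revenue_usd
        FROM raw_orders
    """)

conn = sqlite3.connect(":memory:")
extract_load([(1, 1250), (2, 4999)], conn)
transform(conn)
print(conn.execute("SELECT * FROM order_revenue").fetchall())
# [(1, 12.5), (2, 49.99)]
```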
So if we look at history, we focused on the start and the end of the data: where does this data come from, and where does it go? This left us with a big gap in terms of understanding the data. And now we have finally realized that we have this gap.
You have built data products for your entire career. So what are your thoughts on creating an amazing user experience while building a data product?
When you say the word ‘amazing,’ for me, it is my experience buying anything on Amazon. I can go and click a couple of buttons and get anything shipped anywhere in the world. 99% of the time, when I buy something on Amazon, I'm happy. We could not do this five or ten years ago. So why can't we have that amazing experience with data within our organizations?
Now, Amazon knows me as the end consumer. As an end consumer, I don't care about the data pipeline. We as data people live in this bubble of data, where people think they're the greatest thing. Within our bubble we are, but let's be honest, outside of it, we probably aren't. The majority of people don't care about the work we do because it’s a means to an end. So that's the first thing - we need to get off our high horse, stop thinking that we are the greatest thing in the world, and learn about the business.
Secondly, we need to understand who our consumers are. We need to provide an amazing platform to search for products. That's what a data catalog is. I can have the best data catalog, but if I don't have products that can be sold or found, the catalog doesn't work.
So how do I get those data products? Organizations must understand business users’ problems. What are their needs? Who are the types of users they have? Doing market research and understanding your consumers is incredibly important.
Tim Gasper, a colleague of mine at data.world, and I do a podcast together called Catalog and Cocktails, and we came up with a framework called the Data Product ABCs: Accountability, Boundaries, Contracts and Expectations, Downstream Consumers, and Explicit Knowledge.
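As an illustration only, the ABCs could be captured as structured metadata that travels with a data product. The field names and values below are hypothetical examples, not a data.world schema.

```python
# A minimal sketch of recording the Data Product ABCs as metadata that
# ships alongside a dataset. Every name and value here is hypothetical.

from dataclasses import dataclass

@dataclass
class DataProductABCs:
    accountability: str               # who owns and answers for this product
    boundaries: str                   # what is in and out of scope
    contracts_and_expectations: dict  # schema, freshness, and quality guarantees
    downstream_consumers: list[str]   # who depends on it
    explicit_knowledge: str           # documented meaning and business context

orders_product = DataProductABCs(
    accountability="analytics-engineering@acme.example",
    boundaries="Confirmed e-commerce orders only; excludes refunds",
    contracts_and_expectations={"freshness": "hourly", "nulls_in_order_id": 0},
    downstream_consumers=["finance_dashboard", "churn_model"],
    explicit_knowledge="revenue_usd is net of discounts, gross of tax",
)
```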
So that's the experience that we should be having with data at the end of the day. The data catalog is like that shopping experience. There are two lenses of a catalog. The lens that I just described is a consumer lens. They don’t need to go into the details of the data quality. All they want to know is if I can go use this for my use case. The second lens of the catalog is for the technical folks. They are going to get all this data from different places and are going to have to deal with bad data. They really need to be able to understand where the data comes from, because they're the ones who are trying to make sense of what this data means.
Recently, there has been an explosion of tools in MDS. What are the areas that you think are not very well served by the whole modern data stack?
My honest answer is that everything's there. There's already a lot, and it’s great. What we need right now is to focus on the social aspect of things, and for that, we don’t need any more technology or any more stack. We're seeing every little feature almost become its own category and its own set of companies. That's too much. We have enough technology, and I would say we don't need more.
RAPID FIRE
What's your favorite tool in the whole modern data stack?
Snowflake
Any recommendations on newsletters or podcasts or any good resources to brush up data knowledge?
I obviously need to say Catalog and Cocktails, our honest, no-bs, non-salesy data podcast. My main source is LinkedIn. I just follow a bunch of people on LinkedIn who have amazing ideas to share, and then the LinkedIn algorithm recommends a bunch of people that I can follow. I recommend following all the guests that we’ve had on our podcast.
Finally, I always recommend reading the book Software Wasteland by Dave McComb. This is a must read book by every single data professional in the world because it describes the root of data problems: an application-centric architecture.
One thing you love and hate about your job
I am extremely lucky to be exposed to so many problems, and this is what excites me. I get to talk to so many customers and understand what people's problems are.
The thing that I dislike - well, let me be cliché: Love what you do and do what you love. You’ll never work a day in your life, and hopefully you’ll get paid for it.