Meet Andrew, Chief Data Scientist at RasgoML, who went from solving complex math problems as an applied math professor to building machine learning models that solve real-life business problems. He has watched the data space evolve closely over the years. In this interview, read about his fascinating journey from professor at Towson University to Chief Data Scientist at RasgoML, and why he believes the boundary between data scientist and data engineer has diminished in recent times.
Tell us a bit about your background, Andrew
I've been in the data space for a long time. I graduated with a Ph.D. in systems and industrial engineering from the University of Arizona in the early 2000s. This was well before data science even existed. Then I took a job as an applied math professor at a school in Maryland, Towson University, and while working with some of the undergraduates I got involved in machine learning. I found that interesting, but I also found business problems more interesting than the academic mathematics problems I was solving. In 2006, SAS Institute, the statistical software company, contacted me and offered me a role in a relatively new group they were setting up. This was even before people were talking about data science, and we were building neural network models that identified fraud in real time on mainframes, before any of the modern stacks really existed. I did that for a number of years and then moved on, looking for new challenges and different kinds of problems to solve. I spent five years at DataRobot, which is one of the largest vendors in the data science space. For the last two years there, I was leading their go-to-market efforts in the entertainment space. Then in August 2021, Rasgo reached out to me with an offer to come over as their Chief Data Scientist.
How would you compare the role of an analytics/data engineer to that of a data scientist?
I've always been leery of that division because when I started, no such division existed. In that sense, I'm a “full-stack data scientist”. The reason you're seeing divisions is that, as a community, we've told data scientists to focus on the modeling part of things rather than the engineering part of getting the right data to the right place. It's really hard to teach that kind of engineering in academia, and it tends to be ignored. So people come out with the skill set to do data science, the machine learning part, but they don't have the software engineering skills to build robust production data pipelines. I see the data engineer or the analytics engineer coming along because a lot of data scientists just don't have the skill set to understand how to build these robust pipelines. It's also an easy transition for software engineers who know how to build other systems to move into analytics engineering. It's a much harder transition for those same engineers to move into the machine learning side of things.
Let’s talk about feature engineering, a term that you would only hear from companies like Netflix, Airbnb, and Lyft, which have built complex machine learning models and systems. How would you explain feature engineering to someone working at a small company, and what does it mean for them?
When most people talk about feature engineering, they mean simply converting the data into a format that machine learning models can handle. Let's take a step back and talk about the purpose of a machine learning model: it is to find signals in your data. There's something you want to predict, some behavior you're looking for, and you're trying to identify that behavior amid a lot of noise. So machine learning models try to filter the signal from the noise. As a data scientist, I see feature engineering as helping that modeling process and identifying signals by building things that help you capture interesting behaviors. This means feature engineering is more than just converting data into features usable by the algorithms. It is the process of capturing insights from the business and the data that allow the model to better identify the signal. These can be features capturing existing business knowledge, or features based on patterns the data scientist found in the data.
In addition, a lot of data has a time series component. In some sense, it's a history (e.g. a customer history or a machine log history). Machine learning algorithms need a single flat file with one record per observation, which means the data scientist needs to flatten that history somehow. Fundamentally, that flattening is feature engineering. It can be as simple as calculating the mean or other statistics, up through complicated mathematics such as calculating the Fourier transform of the sequence.
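To make that flattening step concrete, here is a minimal pandas sketch (the table and column names are hypothetical, chosen purely for illustration) that turns a per-transaction customer history into one row per customer with a few simple aggregate features:

```python
import pandas as pd

# Hypothetical transaction history: one row per event, many rows per customer.
transactions = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2],
    "amount": [20.0, 35.5, 12.0, 99.0, 5.0],
    "timestamp": pd.to_datetime([
        "2021-07-01", "2021-07-15", "2021-08-02",
        "2021-07-20", "2021-08-05",
    ]),
})

# Flatten the history: one record per customer, each column a feature.
features = transactions.groupby("customer_id").agg(
    txn_count=("amount", "size"),      # how many events in the history
    total_spend=("amount", "sum"),     # simple aggregate statistic
    avg_spend=("amount", "mean"),      # "as simple as calculating the mean"
    last_seen=("timestamp", "max"),    # recency-style feature
).reset_index()

print(features)
```

More elaborate flattenings (rolling windows, Fourier transforms of the sequence) follow the same pattern: many rows of history in, one feature row per observation out.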
Feature engineering in the modern data stack
Feature engineering in extreme low-latency use cases is a really hard problem and that’s why feature stores have been invented.
At Rasgo, one of our theories is that the more feature engineering you can push into the database, the better off you will be. In fact, that's what a lot of data engineers do: they take Python feature code from Jupyter notebooks or Python scripts and rewrite it in SQL. Honestly, that's one of the scariest things for me, because then I have to go validate that their SQL version matches what I did in my Python code. It's very time-consuming and difficult, as there are many edge cases. I've seen mistakes made all the time. That's why, if you can do it in SQL, I think you're better off, but I'm not a great SQL programmer.
A lot of data scientists aren't great SQL programmers, and honestly, SQL makes some things much harder than they need to be, at least from a data science perspective. But the advantage of moving to that SQL environment is that I can now get to near real time as data comes in. Especially with tools like Snowflake, which can process data really fast, I can actually do relatively complicated things on a hundred-millisecond timescale.
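As a rough sketch of the rewrite-and-validate problem described above, here is what the same hypothetical feature might look like in both forms. The table and column names are illustrative only, and the SQL is generic rather than tied to any particular warehouse:

```python
import pandas as pd

# In-Python version: pull the raw rows out of the warehouse and aggregate locally.
def avg_spend_in_pandas(transactions: pd.DataFrame) -> pd.DataFrame:
    return (
        transactions.groupby("customer_id")
        .agg(avg_spend=("amount", "mean"))
        .reset_index()
    )

# Pushed-down version: express the same feature as SQL so the warehouse
# (e.g. Snowflake) does the aggregation and only the finished feature rows move.
AVG_SPEND_SQL = """
SELECT
    customer_id,
    AVG(amount) AS avg_spend
FROM transactions
GROUP BY customer_id
"""
```

Proving that the two versions agree, including edge cases such as missing values and empty groups, is exactly the time-consuming validation work that makes the hand rewrite so scary.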
Web latency is not an issue anymore. Back when I was building fraud models in 2006, the bank contractually had to make a decision in a hundred milliseconds, and the fraud model was only a small part of that. A lot more of that time was spent on decisions like: Do they actually have the money? Is there a credit risk? Is everything technically correct? The fraud models were aiming for one-millisecond latency.
We were doing complicated feature engineering, trying to figure out how to do that with in-memory processing so we didn't have to move data, because moving data is expensive.
At high latency, you can keep doing feature engineering wherever you already do it. Although there are benefits to moving it into a SQL engine, with high enough latency you can still pull the data into a different system, process it, and score it. That approach can be useful because it avoids the testing that has to happen when the code is rewritten. At moderate and near-real-time latency, you have to move the feature engineering into the database. This is where the modern data stack really shines.
At low latency, this is where tools like Tecton come into the picture.
What is the most challenging part of building and scaling a data team?
It's really hard to hire data scientists. While it's relatively easy to identify technical skills, I've found that the most successful people on my data teams are problem solvers more than they are technologists, and those people can be really hard to find. From my perspective, when I try to hire, I don't care about technology skill sets. As long as the individual has experience in a related technology, they can learn the skills we need quickly enough.
The other problem I run across when hiring data scientists is that they want to build models, and that's a small part of our job. As a data scientist, you can't be successful without understanding what the input data is. It's your model, so in the end you have to learn about the political pain of change management. You have to be able to explain it, or at least be involved in the explanation. As a leader, that's partially my job, but you have to be involved in explaining to business leaders how to use this and what it means. The hardest part is managing business expectations, especially for junior data scientists: you're going to have to do a lot of low-level data work, and you have to be happy doing that, because it just needs to be done as part of your job.
What do you think are some problems in the entire data lifecycle, from production to consumption, that are still unsolved?
I first want to highlight how amazing the change from ETL to ELT has been, because before, I was stuck with data that had been transformed for other purposes and wasn't quite right for what I wanted to do. So when the Fivetrans and Airbytes of the world said, “Hey, we'll just load all of that data into Snowflake for you,” I loved that.
Currently, I still think the data catalog problem is unsolved. I've just never seen a data catalog work. Maybe someone will come up with a silver bullet, but I think the solution is more likely to come from political and business decisions than from technology.
RAPID-FIRE QUESTIONS
What is the one tool or platform that you think has really changed the life of data engineers, analytics engineers, or data scientists, and why?
I think it is Snowflake. Snowflake just blows me away. When I got introduced to Snowflake, it was one of those moments: “Wait! I couldn't even process this data before, and you just gave me an answer!” It's just amazing. And the way they're growing is tremendous.
What is your go-to place to learn all things data? Are there any blogs or publications that you regularly follow?
I follow a number of them, like Towards Data Science, KDnuggets, and Data Science Central. I don't find one that's consistently better than the others.
What is the one thing that you love about your job, and one thing that you just hate about it?
I love solving problems. I love seeing the value I've produced. The hardest part is the politics that go into that: we still run into situations where someone else owns the data and I have to beg for it. Neither of those is that enjoyable, but they are necessary parts of the job.