Spotify is the world’s most popular audio streaming service with 433M users, and it probably needs no further introduction. As the company has grown over the past several years, so has the need for fast and scalable infrastructure to support that growth. At our last Heroes of Data Meetup, Spotify’s Sonja Ericsson joined us to talk about how they are migrating from Luigi to Flyte to build a next-generation workflow platform that powers all of their 20,000+ daily workflows.
Sonja Ericsson has been working as a Backend Engineer at Spotify for four years and has a Master’s in Computer Science and Engineering from KTH Royal Institute of Technology in Stockholm, Sweden. Prior to joining Spotify, her experience includes software engineering at Epidemic Sound as well as data analytics and integrations at Zimpler, a Stockholm-based fintech company.
Sonja’s team is responsible for the platform that handles scheduling, orchestration and deployment of all data pipelines at Spotify — that’s 20,000+ batch data pipelines running daily, defined in 1,000+ repositories, owned by 300+ teams. For many years, most of these pipelines have relied on a tool called Luigi, which was built in-house by Spotify and open-sourced in 2012. In essence, it is a client-side orchestration framework (with a server scheduler) used to build data pipelines in Python.
In the old stack, users write their workflow code in the Luigi framework, using platform tasks that Spotify and open-source Luigi provide as libraries. They then build their workflow image on top of a base image, also provided by Spotify, and add their own dependencies. The complete workflow, including all tasks and dependencies, gets packaged into that image and is finally scheduled for deployment on Kubernetes.
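To make the "client-side" part concrete, here is a minimal sketch in plain Python. It is a toy model and not the Luigi API: the class names, the `build` function, and the example tasks are all hypothetical. What it illustrates is that the tasks and the logic that resolves their dependencies live in the same user-owned code, and therefore in the same image.

```python
# Toy model of client-side orchestration. NOT the Luigi API; every name
# here is a hypothetical stand-in chosen for illustration.
from typing import List, Set


class Task:
    """A unit of work that names the tasks it depends on."""

    def requires(self) -> List["Task"]:
        return []

    def run(self, log: List[str]) -> None:
        # A real task would read inputs and write outputs; here we only record the name.
        log.append(type(self).__name__)


class ExtractPlays(Task):
    pass


class AggregatePlays(Task):
    def requires(self) -> List[Task]:
        return [ExtractPlays()]


class PublishReport(Task):
    def requires(self) -> List[Task]:
        return [AggregatePlays(), ExtractPlays()]


def build(task: Task, log: List[str], done: Set[str]) -> None:
    """Run upstream tasks first, and each task type at most once."""
    name = type(task).__name__
    if name in done:
        return
    for upstream in task.requires():
        build(upstream, log, done)
    task.run(log)
    done.add(name)


def run_pipeline(root: Task) -> List[str]:
    """Resolve and run the whole dependency graph inside the caller's process."""
    log: List[str] = []
    build(root, log, set())
    return log
```

Because dependency resolution happens inside the user's process, changing how tasks are resolved or executed means changing a library that is baked into every workflow image.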
Luigi has served Spotify well over the years and has been widely adopted as a workflow orchestration standard in the data engineering community. In recent years, however, Spotify has identified areas where Luigi was struggling to meet the company’s large-scale orchestration demands. To stay competitive, it’s increasingly important for Spotify to have tooling that stays fast and scalable as the organization grows.
After evaluating different alternatives to Luigi for about a year, Spotify decided to go with Flyte, which was built by Lyft and open-sourced in 2020. Flyte’s orchestration framework offered the extensibility to integrate with Spotify’s tooling and meet its needs, great scalability, and support for multiple languages.
With Luigi, Spotify faced four main challenges: low feature penetration, dependency conflicts, inaccurate platform insights, and limited extensibility. Let’s go through these one by one and see how Flyte solves them for Spotify.
In Luigi, one complete workflow would be frozen within one Docker image, meaning any upgrade requires that image to be rebuilt. In turn, around 1,000 pull requests, roughly one per workflow repository, would have to be opened for a single upgrade to happen. Opening these PRs was usually automated, but since many of them were never merged, upgrades reached only some of the workflows, which caused low feature penetration.
✅ In Flyte, tasks can be executed by backend plugins. This allows for upgrades without user intervention, improving feature penetration.
In Luigi, all of the tasks were packaged within one Docker image and consequently shared dependencies. These tasks usually have complex dependencies, which often result in dependency conflicts for Luigi users.
✅ In Flyte, each task can be isolated in its own image or executed by a backend plugin without involving a container. This reduces the problem of dependency conflicts.
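The difference between a shared image and per-task images can be sketched with a small model. The task names, packages, and version pins below are invented for illustration and do not come from the talk. The sketch only shows why pins that clash in one shared image stop clashing once each task gets its own image.

```python
from collections import defaultdict
from typing import Dict, Set

# Hypothetical tasks and version pins, invented for illustration.
TASK_PINS: Dict[str, Dict[str, str]] = {
    "train_model": {"numpy": "1.21", "pandas": "1.3"},
    "export_report": {"numpy": "1.19", "requests": "2.28"},
    "load_events": {"requests": "2.28"},
}


def conflicts(pins_by_task: Dict[str, Dict[str, str]]) -> Dict[str, Set[str]]:
    """Return the packages that are pinned to more than one version within one image."""
    versions: Dict[str, Set[str]] = defaultdict(set)
    for pins in pins_by_task.values():
        for package, version in pins.items():
            versions[package].add(version)
    return {pkg: found for pkg, found in versions.items() if len(found) > 1}


# One shared image: the pins of every task must be satisfied together.
shared_image = conflicts(TASK_PINS)

# One image per task: each image only has to satisfy its own task's pins.
per_task_images = {name: conflicts({name: pins}) for name, pins in TASK_PINS.items()}
```

In this model the shared image reports a conflict on `numpy`, while every per-task image is conflict-free. The trade-off is that there are more images to build and maintain.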
In Luigi, there is also no easily retrievable, structured entity information about workflows or tasks, which makes it hard to know what happens within an image. Instead, Spotify’s platform team has to parse code to figure out how tasks and arguments are used, which leads to inaccurate platform insights.
✅ In Flyte, entities are structured, type-safe and versioned. This supports features such as usage tracking, workflow introspection, and caching.
✅ Entities can also easily be shared and reused. Spotify is planning to have a task catalog of reusable tasks instead of shipping tasks as libraries.
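As a rough illustration of why structured, versioned entities help, the sketch below models tasks and workflows as plain metadata. It is not Flyte's data model or its caching algorithm: `TaskSpec`, `WorkflowSpec`, `usages`, and `cache_key` are hypothetical names, and the hashing scheme is an assumption made for the example.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class TaskSpec:
    """Hypothetical stand-in for a registered, versioned task entity."""
    name: str
    version: str
    inputs: Tuple[Tuple[str, str], ...]   # (argument name, type name)
    outputs: Tuple[Tuple[str, str], ...]  # (result name, type name)


@dataclass(frozen=True)
class WorkflowSpec:
    """Hypothetical stand-in for a registered workflow entity."""
    name: str
    version: str
    tasks: Tuple[TaskSpec, ...]


def usages(workflows: List[WorkflowSpec], task_name: str) -> List[str]:
    """List the workflows that use a task, read from metadata instead of parsed code."""
    return [wf.name for wf in workflows if any(t.name == task_name for t in wf.tasks)]


def cache_key(task: TaskSpec, arguments: Dict[str, object]) -> str:
    """Derive a key from task name, version, and inputs (assumed scheme, not Flyte's)."""
    payload = json.dumps([task.name, task.version, arguments], sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Example entities, invented for illustration.
dedupe_v1 = TaskSpec("dedupe", "v1", (("rows", "int"),), (("kept", "int"),))
dedupe_v2 = TaskSpec("dedupe", "v2", (("rows", "int"),), (("kept", "int"),))
daily = WorkflowSpec("daily_plays", "v1", (dedupe_v1,))
weekly = WorkflowSpec("weekly_charts", "v1", ())
```

With metadata like this, finding every workflow that uses `dedupe` is a lookup rather than a code search. A cached result can be reused only when the task name, version, and inputs all match, so bumping the version invalidates old results.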
Luigi is a client-side framework, but Spotify often needs direct server-side interactions for managing jobs in different systems without relying on containers. Spotify also frequently needs Java support, while Luigi only supports Python. To work around this, they used to maintain a Java orchestration framework as well. However, any feature shipped in one language then had to be shipped in the other, which led to a lot of maintenance problems.
✅ Thanks to its protobuf interface, a Flyte SDK can be implemented in any language. This makes it possible to use different languages for different use cases and to mix tasks written in different languages within one workflow.
✅ The Flyte backend is extensible through plugins, many of which are available as open source.
✅ Moving from a client-side framework to a platform approach provides more control, more extensibility, and the ability to introduce abstractions.
Spotify is currently using Flyte to run ~4,000 workflows each day, and ~100,000 executions across 175 teams. They are continuously working on integrating Flyte with Spotify’s internal ecosystem and achieving feature parity with Luigi. Spotify has also been working on features like authorization and lineage. The goal is to successfully migrate all of their 20,000+ workflows to Flyte.
Heroes of Data is an initiative by the data community for the data community. We share the stories of everyday data practitioners and showcase the opportunities and challenges that arise daily. Sign up for the Heroes of Data newsletter on Substack and follow Heroes of Data on LinkedIn. This article was written by Emil Bring as a summary of a presentation by Sonja Ericsson from Spotify at a Heroes of Data meetup in September 2022.