Hi from Data Management team at Eureka 👋
Last year I wrote about our migration to our new modern data platform, Metis. In November 2022, we officially started using Metis in production data workflows. It was a long journey with a lot of effort and many achievements, and we learned a lot during the migration!
This time I decided to focus on a more practical aspect of data platforms, with tips for data platform engineers and managers. Here is a compilation of tips that we either implemented from the start, learned along the way, or currently have in the works for Metis.
Note: in this article, by ‘users’ I mean users of an internal data platform, most likely they are company employees. By ‘customers’ I mean users of a product or a service that the company is providing.
This, in my opinion, is the most impactful thing you can do for your data platform users.
When a company builds an application or a service, it builds a product. Such a product has a name, a target market, a target audience and a set of functionality it provides.
Now, what if we apply the same way of thinking to an internal data platform?
If we can apply product characteristics to a data platform, then why not build the data platform as a product instead of creating a set of connected technologies that make sense only to data platform teams?
Consider the data platform an internal product, with your company as its target market and your employees as its target audience. Create a name and a logo for the data platform and advertise it internally under that name instead of the generic “data platform” term or the name of your data warehousing solution. Bring all your data tools under this name and refer to them as its functionality.
For example, if your data platform product is named “Helios” (Greek word for “sun”), then your Tableau Server is the data visualization functionality of Helios. From the perspective of a user, they will be using a “Helios dashboard that aggregates Helios data”. It’s much easier for everyone to remember one product name and get the idea of its functionality, rather than trying to figure out what different data technologies do and how they are interconnected. When using Google Spreadsheets, you generally don’t think about what modules, resources and APIs it consists of, right?
I also recommend creating a landing page for your data platform product. A simple internal webpage with the data platform’s name, logo and a list of links to its functional components is sufficient. For better access and visibility, assign it a simple internal URL (using the same Helios example from above, something like internal-domain.com/helios or helios.internal-domain.com would look great). Create a user manual for your data platform product and put the link on the landing page.
The most important thing here is having a single entry point to all functionality of your data platform. No need to research countless internal wiki articles and ask teammates for the information — anyone can start using the data platform just by visiting a link.
At Eureka, we used this method to build a new modern data platform product that we named Metis.
Thanks to the creation of Metis, we now have a simple, fast, resilient, reliable, cost-efficient and easy-to-use internal data platform. Our data teams actively use the name “Metis” instead of “data platform” or “BigQuery” or “GCP”, and many employees who do not directly work with the data platform have at least heard about Metis, its functionality and role in the company.
Check out last year’s article for a (slightly technologically outdated) overview of Metis!
Where tip #1 is the most impactful thing you can do for your users, this one, in my opinion, is the most impactful thing you can do for yourself as a data platform engineer or manager.
Traditionally, when data engineers create data pipelines, they write instructions (code) that declare what operations to execute. Imagine a daily data pipeline in Airflow or a cron-scheduled shell script that gets data from source A, transforms it into file B, uploads the file’s data to table C, aggregates it into table D and runs a refresh via an API on server E. We have a clearly defined process consisting of 5 operations and an expected output of this process.
As many data engineers know, dealing with such pipelines is a cumbersome process that costs a lot of time and effort and makes data engineering unnecessarily hard.
What if instead we put our focus not on what operations to execute, but on what we want to exist? What if we embrace the declarative approach to data and start declaring data assets instead of pipelines?
Much like Terraform’s approach to infrastructure resources (discussed in tip #3), we can simply define what data asset (a table in a data warehouse, a file, an ML model, etc.) we want to create, how to create it and what dependencies it has on other data assets. This makes each data asset its own entity, so we can work with each one independently. When we want to materialize several data assets at once, we just specify which ones to materialize, without worrying about their relationships. Since dependencies are already defined, each data asset knows what other data assets to pull data from for processing, and what data assets to push processed data to.
Moreover, declared dependencies enable automatic data lineage and an operational data catalog without any extra effort. The operational data catalog is the one that the data platform team uses to work with data assets and their lifecycle.
dbt is one popular tool that enables the asset-based approach in the data warehouse. Instead of creating data pipelines, you just write a SELECT SQL query for the table you want to create, give this table (model) a name, create an optional YAML configuration and specify what other tables it depends on. Then just schedule your jobs and let dbt do the rest. Models are simple to create and maintain, and you get data lineage out of the box.
But I believe the real breakthrough in this field is Dagster with its Software-Defined Assets approach. The general idea is similar to that of dbt, but it can be applied to any data asset. Whether it’s a table in a database, a data extract from an API, a file in a cloud storage bucket or a machine learning model — you just need to define it in Python code and specify its dependencies. Create a job with a selection of assets you want to materialize and add a schedule or a sensor. Dagster will automatically materialize the selected assets while preserving the order of dependencies. Moreover, you can import data assets from other tools, including dbt!
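To make the idea concrete, here is a minimal sketch of what software-defined assets look like in Dagster. The asset names (raw_purchases, daily_purchase_summary), the stub logic and the schedule are hypothetical placeholders, not part of Metis; the point is that a dependency is declared simply by naming the upstream asset as a function parameter, and a job materializes the selected assets in dependency order.

```python
from dagster import (
    AssetSelection,
    Definitions,
    ScheduleDefinition,
    asset,
    define_asset_job,
)


@asset
def raw_purchases():
    """Pull raw purchase records from a source system (stubbed for illustration)."""
    return [{"store": "shibuya", "amount": 1200}, {"store": "umeda", "amount": 800}]


@asset
def daily_purchase_summary(raw_purchases):
    """Aggregate the raw purchases; the parameter name declares the dependency."""
    return {
        "total_amount": sum(row["amount"] for row in raw_purchases),
        "row_count": len(raw_purchases),
    }


# A job that materializes the selected assets in dependency order, on a daily schedule.
daily_job = define_asset_job("daily_assets_job", selection=AssetSelection.all())
daily_schedule = ScheduleDefinition(job=daily_job, cron_schedule="0 3 * * *")

defs = Definitions(
    assets=[raw_purchases, daily_purchase_summary],
    jobs=[daily_job],
    schedules=[daily_schedule],
)
```

From here, swapping the stubs for real extract and load logic (or importing dbt models as assets) keeps the same declarative shape: you describe what should exist, and Dagster figures out the order.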
Dagster and dbt are probably the two most powerful tools we have implemented in Metis. They replaced complex and fragile Airflow pipelines, custom shell scripts and a couple of detached data tools. We got clear observability over our data assets thanks to the automated operational data catalog, our daily data operations got much easier, and we visibly increased the reliability of our new data platform.
I really recommend spending some time studying Dagster and understanding the asset-based approach to data to see how it can improve your current data workflows. It was such a huge upgrade for us that it’s hard to imagine going back to our previous ways of doing data engineering. There are detailed explanations of all features in the docs and in blog posts such as this one.
If your tech stack allows it, it pays off to define data platform resources and automate their lifecycle in an Infrastructure as Code (IaC) tool, such as Terraform. The amount of clarity, simplicity and peace of mind this approach brings is difficult to overestimate.
It’s easy to start with the IaC approach to data platform resources. Spend some time studying the tool, configure a CI pipeline that runs it and import your existing resources. The next time you need to add, modify or delete something in your data platform, just change the code, create a Pull Request, merge the approved changes and confirm that the deployment was a success.
Note that, as with everything, balance is important. For example, it most likely won’t make sense to define a resource for absolutely every table that you want to create in your data warehouse. Instead, consider letting other data tools (such as dbt and Dagster) create table resources on the fly. Going too granular adds a huge operational overhead, while not putting the main resources in code makes the data platform less reliable, messier and prone to incidents. But done right, such an IaC repository becomes both your data platform’s technical inventory and its management plane.
In the case of Metis, we went with defining all technical infrastructure resources (down to the dataset level) and access controls in Terraform, while keeping BigQuery table definitions in our dbt repository. This creates a simple abstraction: when we need to add infrastructure or configure access controls — we do it in the Terraform repository. When we need to add data to Metis, it’s either the dbt repository (for SQL-generated data) or the Dagster repository (for non-SQL-generated data).
When you’re in charge of a data platform, you’re also in charge of data security, data access and data processing costs, among other things. To keep the data safe and costs down, make sure to set proper access controls, security policies and cost limits such as quotas.
As always, balance is important. Restricting your internal users too much will halt a lot of work, while giving them too much freedom creates security and financial risks.
I recommend starting with moderately strict limits and creating a simple way to relax them when needed. Changing current restrictions might need approvals from upper management or other stakeholders, so consider how to make that communication simpler as well.
Also consider how to apply the controls above. For example, you can control access to data on a data warehouse, dataset, table and/or column level. Consider which option makes the most sense for your organization and implement accordingly. In our case, we found that controlling access on a dataset level while protecting certain columns with column-based policy tags works well as a balanced solution.
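As an illustration of column-level protection, here is a minimal sketch of attaching an already existing policy tag to a sensitive column with the BigQuery Python client. The project, dataset, table, column name and taxonomy path are all placeholders, and this assumes the policy tag itself was created beforehand (for example, in Terraform or the console).

```python
from google.cloud import bigquery

client = bigquery.Client()
table = client.get_table("my-project.analytics.customers")  # placeholder table

# Placeholder resource name of a policy tag created beforehand in Data Catalog.
PII_TAG = "projects/my-project/locations/us/taxonomies/123/policyTags/456"

new_schema = []
for field in table.schema:
    if field.name == "email":  # the sensitive column to protect
        field = bigquery.SchemaField(
            field.name,
            field.field_type,
            mode=field.mode,
            policy_tags=bigquery.PolicyTagList(names=[PII_TAG]),
        )
    new_schema.append(field)

table.schema = new_schema
client.update_table(table, ["schema"])  # only users granted the tag can read the column
```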
For cost controls, a moderate total daily data scan limit that users know about is a good starting point. Separate on-demand workloads into their own quota pool and set the limit accordingly. On-demand workloads in a data platform are usually not critical, so the quota will protect your data platform budget. And when a critical use case that needs to scan a lot of data comes up, you can always temporarily increase the quota limit.
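Project-level custom quotas are configured in the cloud console, but as a small complementary illustration of a cost guardrail, a per-query cap on billed bytes makes a runaway query fail fast instead of silently scanning too much. A minimal sketch with the BigQuery Python client; the table name and the roughly 1 TB limit are placeholder assumptions.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Fail any query that would bill more than ~1 TB instead of quietly running up costs.
job_config = bigquery.QueryJobConfig(maximum_bytes_billed=10**12)

query = """
    SELECT city, SUM(amount) AS total
    FROM `my-project.analytics.purchases`  -- placeholder table
    WHERE DATE(created_at) = "2022-12-18"
    GROUP BY city
"""
for row in client.query(query, job_config=job_config).result():
    print(row.city, row.total)
```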
When these controls are implemented right, your security and finance teams will be very grateful, and your data operations will be easier, safer and cheaper.
This tip is written based on BigQuery, but I believe other data warehouses have the same or similar functionality.
Partitioning and clustering tables is one method that is easy to implement and gives you huge performance and cost-saving benefits right away. When you have a lot of data in a table, there is no reason not to use either or both.
Partitioning allows you to logically split the data into different partitions based on a date or a number. By partitioning a table by day using a DATETIME column (for example, created_at), the data for, let’s say, December 18, 2022 (2022-12-18) will automatically land in its own partition. Then, when you want to retrieve the data for this day and specify the condition WHERE DATE(created_at) = DATE("2022-12-18"), the SQL engine knows that it needs to scan only the "2022-12-18" partition for data. Since only one partition of the whole table needs to be scanned, the query becomes much faster and much cheaper.
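As a sketch of what this looks like in practice, the snippet below creates a day-partitioned table and runs a query that prunes to a single partition, using the BigQuery Python client. The project, dataset, table and column names are placeholders.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Create a day-partitioned table (placeholder project/dataset/table names).
client.query("""
    CREATE TABLE IF NOT EXISTS `my-project.analytics.events`
    (
        event_id   STRING,
        created_at DATETIME,
        payload    STRING
    )
    PARTITION BY DATE(created_at)
""").result()

# This query scans only the 2022-12-18 partition, not the whole table.
rows = client.query("""
    SELECT COUNT(*) AS events
    FROM `my-project.analytics.events`
    WHERE DATE(created_at) = DATE("2022-12-18")
""").result()
```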
Same goes for uploading data into a table. No more time-consuming and costly UPDATE or MERGE operations — when you need to rewrite data for a specific date, just delete the partition and re-upload the data!
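Here is a minimal sketch of that delete-and-reload pattern with the BigQuery Python client; the table, bucket path and date are placeholders. (Loading into a partition decorator such as events$20221218 with a WRITE_TRUNCATE disposition is another way to overwrite a single partition atomically.)

```python
from google.cloud import bigquery

client = bigquery.Client()

# 1. Delete the rows for the date (on a day-partitioned table this drops one partition's data).
client.query("""
    DELETE FROM `my-project.analytics.events`
    WHERE DATE(created_at) = DATE("2022-12-18")
""").result()

# 2. Re-upload the corrected data for that date from Cloud Storage (placeholder path).
job_config = bigquery.LoadJobConfig(source_format=bigquery.SourceFormat.PARQUET)
client.load_table_from_uri(
    "gs://my-bucket/events/2022-12-18/*.parquet",
    "my-project.analytics.events",
    job_config=job_config,
).result()
```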
If you have data in Cloud Storage, you can also similarly partition the data using Hive partitioning layout.
You can also partition existing tables by temporarily stopping data ingestion into a table, moving all data into a newly created partitioned table, deleting the old table, and renaming the new table to the previous name.
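A rough sketch of that migration, again with placeholder names and assuming ingestion has been paused: CREATE TABLE ... AS SELECT copies the data into a new partitioned table, after which the old table is dropped and the new one renamed.

```python
from google.cloud import bigquery

client = bigquery.Client()

# 1. Copy all data into a newly created partitioned table.
client.query("""
    CREATE TABLE `my-project.analytics.events_partitioned`
    PARTITION BY DATE(created_at)
    AS SELECT * FROM `my-project.analytics.events`
""").result()

# 2. Drop the old table and give the new one the original name.
client.query("DROP TABLE `my-project.analytics.events`").result()
client.query("""
    ALTER TABLE `my-project.analytics.events_partitioned`
    RENAME TO events
""").result()
```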
What if you have a lot of data in a single partition, or a big table without a column to partition on, but still want to improve performance and reduce costs? Clustering allows you to further split the data into clusters. Let’s say you have a table with purchase transactions per store in different cities, and your most common use case is to calculate total transactions per specific city per store. By clustering the table by the city and store_name columns, you automatically optimize all queries with WHERE city = X AND store_name = Y conditions, since they now scan only specific clusters of the table. Combine that with partitioning and you have very efficient queries without rewriting any SQL!
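To tie the two together, here is a hedged sketch of a table that is both partitioned and clustered, with placeholder names: queries that filter on the partition column plus the clustering columns scan only a small slice of the table.

```python
from google.cloud import bigquery

client = bigquery.Client()

# A partitioned AND clustered table (placeholder names mirroring the example above).
client.query("""
    CREATE TABLE IF NOT EXISTS `my-project.analytics.transactions`
    (
        created_at  DATETIME,
        city        STRING,
        store_name  STRING,
        amount      NUMERIC
    )
    PARTITION BY DATE(created_at)
    CLUSTER BY city, store_name
""").result()

# Partition pruning plus clustering keeps the scanned bytes small for queries like this.
rows = client.query("""
    SELECT store_name, SUM(amount) AS total
    FROM `my-project.analytics.transactions`
    WHERE DATE(created_at) >= "2022-12-01"
      AND city = "Tokyo"
      AND store_name = "Shibuya"
    GROUP BY store_name
""").result()
```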
Before Metis, we only had a manual partitioning method and no clustering configuration on our tables. By partitioning and clustering large and heavily used tables, we achieved a huge boost in performance and huge cost savings. In one case, we managed to bring down the cost of a specific use case by 99.8% and reduce query runtime by 75%!
That said, there is no need to partition and cluster absolutely every table. If you have a table with 1000 rows of data that you don’t expect to grow, you won’t notice any benefits from partitioning or clustering. Identify the tables that will give you the most gains and start with them.
There is no tool that fits all use-cases. Whether it’s a data warehouse solution, ETL or visualization tool, the choice of which one to use will depend on the goals, needs and current situation in your organization and data department.
Save your team and company from future troubles by thoroughly evaluating your situation and available tools in advance.
Most likely there will be a need to make a trade-off — consider the one that has the least impact. For example, existing ETL tools such as Fivetran or Airbyte might easily handle many simple batch data sync workflows, but they might not be able to process huge amounts of data quickly enough. Research whether there is a solution that can handle both use cases, evaluate whether it makes sense to add a custom solution, or go with a completely different option.
This is the part where it’s easy to go wrong and then suffer from the consequences of that choice when the tool already has too many dependencies in your stack. A well thought out choice from the start can save you from a costly and lengthy migration later.
The operational data catalog mentioned in tip #2 is very useful for the data platform team, but it’s not suitable to be shared with users. After all, it’s probably not a good idea to hand users the keys to managing all data assets.
Instead, it’s a good idea to create a simple data catalog that focuses on describing the data you have and enables data self-discovery for your users. Having a centralized data catalog will greatly reduce the number of “Where is X data?” questions and, thanks to data lineage and descriptions of your data assets, give a complete visual picture of all the data that exists in your company.
What is the difference between the two types of data catalogs? The operational catalog is a working tool for the data platform team to manage data assets and their lifecycle, while the user-facing catalog describes the data itself (descriptions and lineage) so that users can find and understand it on their own.
Getting started with a user-facing data catalog is a straightforward process: choose a catalog tool, connect it to your data warehouse to import metadata, and add descriptions for the most important data assets.
Don’t forget that this is an active catalog — it needs to be maintained, cleaned and categorized periodically. It might look like an additional operational burden on the data team (in some cases requiring putting on a completely new ‘data steward’ hat), but in reality it can greatly help to reduce operational overhead and the number of data related questions. It can also improve overall data operations and knowledge about the state of data inside the company.
Do you have a screen you can open to understand the current health of the data platform at a glance? It’s common for infrastructure teams to have dashboards for checking an application’s or service’s infrastructure health, so it should be just as common to do the same for data infrastructure.
Creating a unified monitoring center for a data platform doesn’t take much time, but provides great observability, makes data operations much easier and helps prevent or recover faster from incidents.
The monitoring center can be set up either manually or defined in code (with tools like Terraform). Now you not only have automated notifications when something goes wrong, but can always check the current status (and get peace of mind) just by opening the monitoring center’s dashboard.
Data platforms are complex structures, and the bigger the company, the more complex they get. Many stakeholders across different teams and departments directly or indirectly use the internal data platform. Make this complex structure as easy for them to use as possible!
There are many ways to simplify a data platform, and the right ones largely depend on the company: its current tools, methods and approaches. The secret is to think from the perspective of your internal users — the same way a product’s UX is built from the perspective of a customer. Some ideas:
- Use consistent naming conventions. For example, add dim_ or fact_ prefixes to tables if you are using a star schema design. Do the same for code, for example having all source assets follow a naming pattern like <source>_<region>_<table_name> or similar.
- Consolidate date-sharded tables, such as events_20221218 tables that were created with a date suffix for each day. Imagine how many tables you need to go through when you have a handful of such tables and a couple of years of data, each with a suffix! Replace them with a single events table partitioned by day. It’s much easier for users to use the data from just one table, and for the data management team to manage data assets.

Each company is different, so get to know your internal users well and think about how you can make their data operations easier!
This is a universal tip for any process where you manage a collection of something — data assets in a data platform, personal notes, task lists, or just items in your fridge. Check the inventory of things that you have and remove the ones that are no longer needed. It’s normal to throw away an expired can of beans found in the fridge — why not do the same and delete all those test_<something> tables in the data warehouse that you created a couple of years ago and which are no longer needed?
As a data manager, create a list of data assets that don’t appear to be used, and schedule periodic sessions with data platform users (or just ask them on Slack/Teams) to check whether those assets are still needed or can be deleted.
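One way to build such a candidate list in BigQuery is to check which tables have actually been referenced by recent queries via INFORMATION_SCHEMA; anything that never shows up is worth asking about. A rough sketch, assuming the us region and a 90-day lookback window (both placeholders).

```python
from google.cloud import bigquery

client = bigquery.Client()

# Tables referenced by queries in the last 90 days; tables missing from this list
# are candidates to review with their owners.
query = """
    SELECT
      ref.project_id,
      ref.dataset_id,
      ref.table_id,
      MAX(creation_time) AS last_queried_at
    FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT,
         UNNEST(referenced_tables) AS ref
    WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 90 DAY)
    GROUP BY 1, 2, 3
    ORDER BY last_queried_at
"""
for row in client.query(query).result():
    print(row.dataset_id, row.table_id, row.last_queried_at)
```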
To keep your development environment clean and tidy, another useful approach is to attach a Time-To-Live (TTL) configuration to personal datasets or other resources to automatically delete stale tables. Just make sure that your data platform users know about it!
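For example, in BigQuery a dataset-level default table expiration acts as such a TTL. A minimal sketch with the Python client, using a placeholder sandbox dataset; note that the default expiration applies only to tables created after the setting is changed.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Placeholder personal sandbox dataset; new tables are deleted 7 days after creation.
dataset = client.get_dataset("my-project.sandbox_alice")
dataset.default_table_expiration_ms = 7 * 24 * 60 * 60 * 1000
client.update_dataset(dataset, ["default_table_expiration_ms"])
```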
I hope the tips above help you create a more efficient, simpler and better data platform. They worked for us, and I believe they can work for many other companies. However, remember that every situation is different, so consider the current state of your data platform and the situation in your company before implementing these tips.
Bonus tip: No matter how technically good your data platform is, in the end communication is the key to success. You must advertise the data platform and explain the benefits of using it to get buy-in from users and make it a successful internal product. Good luck!