How to Optimize Your Data Warehouse Strategy with Cloud Amplifier

Cloud Amplifier makes it simpler to transform, visualize, and move your data. It can also help save money by making these tasks run smoother in your cloud warehouse.

Jul 8, 2024 - 05:27

Domo’s Cloud Amplifier is changing the way people can pull together data from different systems, so they can make a real impact with less hassle. Cloud Amplifier works with the data infrastructure you already use, making it simpler to transform, visualize, and move your data around.

This tool can also help save money by making these tasks run smoother in your cloud warehouse. Let’s talk about how Cloud Amplifier might affect your cloud warehouse costs, plus strategies for optimizing them.

How cloud warehouse providers typically charge

Cloud warehouse providers typically charge based on how many compute clusters, also known as virtual warehouses, are used to process data queries and how long those clusters run. Each cluster is allotted a specific amount of CPU power, memory, and temporary storage for running queries.

These clusters run for a minimum set period once started and have a set idle timeout. In other words, if no queries are running, they automatically shut down after a certain time. The total running time, including that idle period, is what determines the cost of using cloud warehouse services.

Just a single query can get a compute cluster running, which costs money. That’s why database administrators and data engineers keep a close eye on when and how queries are made. They want to prevent spending money on unnecessary compute clusters and find opportunities where fewer resources could get the same job done. This careful monitoring helps avoid unnecessary costs and keeps everything running efficiently.

Case study with Snowflake

Task: Two queries that each take about five seconds

Scenario 1: Dual warehouses

Let’s imagine using Snowflake with two queries, each taking about five seconds. If each query runs on its own extra-small warehouse (i.e., compute cluster) and those warehouses turn off after being idle for 60 seconds, you end up paying for more time than you might expect. With no other queries to keep them busy, each warehouse runs for 65 seconds in total: 5 seconds of query time plus 60 seconds of idle time before it shuts down. Because two separate warehouses are used, you pay for 130 seconds of compute time altogether.

Scenario 2: Single warehouse

In this second scenario, imagine running the same two queries, but this time using just one virtual warehouse. Since this warehouse has enough CPU, memory, and storage to handle both queries at the same time, they are processed together. This setup only uses 65 seconds of warehouse time for both queries, effectively cutting the cost in half compared to using two separate warehouses.
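The arithmetic behind the two scenarios can be sketched in a few lines of Python. This is a simplified model, not a provider's billing formula: it assumes all queries on a warehouse start at the same moment, that the warehouse then idles for the full auto-suspend period, and that a 60-second minimum applies per start. The function name and parameters are hypothetical.

```python
def billed_seconds(queries_per_warehouse, auto_suspend=60, minimum=60):
    """Estimate total billed seconds.

    queries_per_warehouse: one list of query durations (seconds) per warehouse.
    Assumes every query on a warehouse starts at the same time.
    """
    total = 0
    for durations in queries_per_warehouse:
        if not durations:
            continue  # a warehouse that never starts costs nothing
        # The warehouse stays up for its longest query, then idles until auto-suspend.
        running = max(durations) + auto_suspend
        total += max(running, minimum)
    return total


# Scenario 1: each 5-second query on its own warehouse.
print(billed_seconds([[5], [5]]))  # 130
# Scenario 2: both queries share one warehouse.
print(billed_seconds([[5, 5]]))    # 65
```

Most of the billed time in both cases is idle time before shutdown, which is why consolidating queries onto a warehouse that is already running tends to be cheaper.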

Choosing your warehouse strategy: Cost allocation vs cost savings

So, why would you want to have multiple compute clusters running at the same time? When managing data, using multiple compute clusters can make sense for a few reasons:

Warehouses can serve as excellent proxies for cost centers and budget allocation.
You may have different pools of compute clusters for different types of jobs.
You may have caps that limit the amount of time certain compute clusters can run.

We now have a tradeoff to consider. If your company’s strategy is primarily to reduce costs wherever possible, it’s smart to use clusters that are already running. You’re already paying for the CPU, memory, and storage when a compute cluster is running. Why not make full use of them? This way, you maximize what you’re already investing in.

If your strategy prioritizes having a clear understanding of costs for specific projects or departments, you may choose to set up separate compute clusters tailored to these needs.

This approach improves your visibility into expenditures, aligning spending with the areas you’re monitoring closely. However, the risk is running multiple compute clusters that aren’t fully utilized and operate below their capacity, which is an inefficient use of resources.

So, what do I recommend? Often, it’s more cost-effective to utilize existing compute clusters instead of setting up new ones specifically for Cloud Amplifier. While there might be good reasons to create dedicated clusters for Cloud Amplifier connections, these can often lead to higher costs.

On the other hand, reusing existing clusters can provide the same benefits at a reduced cost, making Cloud Amplifier a more economical choice without compromising on value.

Optimizing with Cloud Amplifier

Cloud Amplifier freshness checks and TTL cache

Now that you understand how warehouses charge for queries, you can be more deliberate about using Domo’s data freshness checks, balancing the need for up-to-date data against compute costs. By default for Snowflake, Cloud Amplifier will send a query that looks something like this:

Cloud Amplifier’s default query to Snowflake

A single query will be made for all tables in a schema. If you have registered DataSets in Domo for tables spread out across 10 different schemas, you will see 10 freshness check queries (one for each schema).
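To estimate how many freshness check queries a setup would generate per cycle, you can group registered tables by schema. The sketch below is illustrative only: the schema and table names are made up, and the function is hypothetical rather than part of any Domo API.

```python
from collections import defaultdict


def tables_by_schema(registered_tables):
    """Group (schema, table) pairs by schema; each key implies one freshness query."""
    grouped = defaultdict(list)
    for schema, table in registered_tables:
        grouped[schema].append(table)
    return dict(grouped)


# Hypothetical registered tables, as (schema, table) pairs.
tables = [("SALES", "ORDERS"), ("SALES", "CUSTOMERS"), ("FINANCE", "INVOICES")]
grouped = tables_by_schema(tables)
print(len(grouped))  # 2 freshness check queries per cycle, one per schema
```

Consolidating registered tables into fewer schemas, where that fits your data model, reduces the number of freshness check queries sent each cycle.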

Most Cloud Amplifier engines run a similar query to check for data freshness (though BigQuery uses an API, so no query to INFORMATION_SCHEMA is necessary). This data freshness query requires a compute cluster to start if one isn’t already running. The results of this freshness check query are used for several operations:

Determining whether an ETL needs to execute now that new data is available
Evaluating any DataSet alerts now that new data is available
Determining whether Domo needs to query the cloud warehouse next time data is requested (i.e., a card in Domo is viewed) or if that same data already exists in Domo’s cache.

NOTE: Domo caches the results of individual queries, not the entire data table. The cache TTL is set to 15 minutes by default.
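To make the idea of a per-query cache with a TTL concrete, here is a minimal sketch. It is not Domo’s implementation: the class, its method names, and the injectable clock are all hypothetical, and it models only TTL expiry, not invalidation triggered by a freshness check.

```python
import time


class QueryResultCache:
    """Hypothetical cache keyed by query text, with a time-to-live per entry."""

    def __init__(self, ttl_seconds=15 * 60, clock=time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.clock = clock  # injectable so the behavior can be tested without waiting
        self._entries = {}  # query text -> (stored_at, result)

    def get(self, query, run_query):
        now = self.clock()
        entry = self._entries.get(query)
        if entry is not None and now - entry[0] < self.ttl_seconds:
            return entry[1]  # cached result is within the TTL: no warehouse query
        result = run_query(query)  # miss or expired: this is what wakes a warehouse
        self._entries[query] = (now, result)
        return result
```

Because entries are keyed by query rather than by table, two different cards reading the same table would each trigger their own warehouse query in this model.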

Compute cluster costs from freshness checks

At the time of writing, these freshness check queries execute every 15 minutes and usually require a compute cluster to be running in the cloud warehouse (again, BigQuery being an exception). If a compute cluster isn’t already running, one will be started. This represents a potential cost to your business, depending on whether or not you already have a compute cluster running.

So, consider carefully how often your data is already updating. Setting the freshness check interval to be more frequent than your data’s update schedule could lead to unnecessary compute cluster costs.

Another consideration is whether you want the freshness check queries to run at off-hours when people aren’t necessarily viewing or using the data. For example, continuing to perform freshness check queries at night-time hours when people aren’t typically looking at Domo can lead to hidden compute cluster costs that add up quickly.
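A rough, worst-case estimate of what freshness checks alone can cost is sketched below. It assumes no other workload keeps the warehouse running, so every check wakes it up; the 2-second query duration, the 60-second auto-suspend, and the 60-second billing minimum are illustrative assumptions, and the function name is hypothetical.

```python
def daily_freshness_seconds(interval_minutes=15, active_hours=24,
                            query_seconds=2, auto_suspend=60, minimum=60):
    """Worst-case billed seconds per day caused only by freshness checks."""
    checks = int(active_hours * 60 // interval_minutes)
    interval_seconds = interval_minutes * 60
    # Each check runs the query, then the warehouse idles until auto-suspend.
    # If checks arrive faster than that, the warehouse simply never suspends.
    per_check = min(interval_seconds, max(minimum, query_seconds + auto_suspend))
    return checks * per_check


print(daily_freshness_seconds())                 # 5952 (96 checks around the clock)
print(daily_freshness_seconds(active_hours=12))  # 2976 (checks limited to 12 hours)
```

Under these assumptions, pausing checks during off-hours halves the compute time attributable to freshness checks, and lengthening the interval reduces it further.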

Navigating Cloud Amplifier and cost considerations with Domo

Cost considerations can seem daunting, but Domo is here to support you with the right tools to handle these discussions effectively. We’re committed to assisting you through this process and will continue to publish blog posts to guide you.

If you have specific needs or questions, don’t hesitate to reach out to your Domo account team to explore the details of your use case.