How to reduce BigQuery costs: 10 proven life hacks for optimization

Google BigQuery is a powerful tool for analyzing large amounts of data, but as your data grows, it becomes increasingly important to optimize its performance.

Here are a few tricks to help reduce BigQuery costs and ensure efficient and cost-effective data analysis.

Avoid SELECT * in queries

In traditional row-based databases, the choice between SELECT * and a list of specific fields often didn’t matter much for cost: the entire dataset was read anyway (and you paid for it in full), and the data was then filtered and returned to you.

In BigQuery, the choice of columns directly affects your costs. By selecting only the fields you need, you can significantly reduce query costs and improve query performance.

Instead:

SELECT *

FROM `project.dataset.events`

Use:

SELECT event_date, event_name, user_pseudo_id

FROM `project.dataset.events`

By selecting specific fields, you reduce the amount of data processed and increase the speed of query execution.

Use rough estimates for large aggregations

You don’t always need to have an absolutely exact number when working with large data sets. For example, instead of the exact 10,601,340, most analyses will be just as useful with an approximation of 10,600,000. This trade-off can significantly reduce costs.

BigQuery offers approximate aggregation functions, such as APPROX_COUNT_DISTINCT and the HLL_COUNT functions (HLL_COUNT.INIT, HLL_COUNT.MERGE, HLL_COUNT.EXTRACT), which use HyperLogLog++ to estimate unique values. HyperLogLog++ is a clever algorithm that estimates the number of unique values without having to examine every row. It is fast, uses far less memory, and its margin of error depends on the precision settings but is typically around 1% for large datasets, which is reliable enough for most analytical tasks.

For example, imagine a GA4 dataset where you need to count unique users:

SELECT APPROX_COUNT_DISTINCT(user_pseudo_id) AS approx_user_count

FROM `project.analytics_123456.events_*`

WHERE _TABLE_SUFFIX BETWEEN '20240101' AND '20240131'

This query will estimate the number of unique users during this period, reducing computational overhead compared to using COUNT(DISTINCT user_pseudo_id).

Start by running queries with sample data

Often, even experienced analysts run queries that process petabytes of data, only to realize that they missed a few fields and have to run everything again.

To avoid paying for those throwaway runs, we recommend using TABLESAMPLE: you query only a small percentage of your table while you iterate, and once the query is refined, you simply remove (or comment out) a single line.

SELECT event_date, user_pseudo_id

FROM `project.dataset.events`

TABLESAMPLE SYSTEM (1 PERCENT)

WHERE event_date = '2024-01-01'

By adding just TABLESAMPLE SYSTEM (1 PERCENT), your BigQuery bill for this query drops to roughly 1% of what the full scan would cost.

We should also note that this only returns 1% of the data, so it’s great for work-in-progress queries, but not for production queries.

Implement a semantic layer

A semantic layer is an intermediate layer between your dashboards and BigQuery. It lets you define caching rules so that already-queried data is served again without calling BigQuery.

Let’s say you ran a fairly heavy query over the last 30 days of data and it scanned terabytes. BigQuery charges for every query run, so if three of your colleagues run the same query, your project is billed four times for the same result.

The semantic layer adds a caching layer between dashboards and BigQuery. You can configure it so that a query is executed only once per project, and all subsequent requests receive cached results without touching BigQuery. This saves three additional query executions and the corresponding costs. If you’re wondering how this works in practice, it creates a kind of “pseudo-warehouse” from which you get the data, rather than fetching it directly from BigQuery.

BigQuery does have its own query result cache, but it works at the level of a single user, not the project. So if multiple users run the same query, they don’t get any benefit; only when the same user repeats the same query is the cached result reused.

Looker (not Looker Studio) has its own semantic layer called LookML, where you can add a lot of context to the data to make it easier for non-technical users to get answers.

If you don’t use Looker, a few newer players do a good job here as well: dbt and Cube are both making great strides with their semantic layers.

Partition and cluster your tables

To get the most out of your GA4 data, you need to realize that the native GA4 export structure doesn’t follow BigQuery best practices. The export is sharded into a separate table for each day (events_YYYYMMDD), which prevents you from taking full advantage of partitioning, one of the most effective optimization techniques in BigQuery.

To fix this, I recommend rebuilding the GA4 shards into a single consolidated table and adding the following optimizations:

  • Partitioning: Partitioning organizes the table by a specific field, which makes it cheaper to retrieve slices of the data. For GA4, I usually partition by event_date, and for larger clients sometimes by event_timestamp (hourly). This suits queries over daily or hourly ranges. If you use a single consolidated table, make sure every query includes a partition filter to avoid full table scans and keep query costs down.
  • Clustering: Clustering groups similar values in a table. I minimally cluster by user_pseudo_id and event_name. This allows you to optimize queries when analyzing user paths or selecting specific events.

While creating the table, we also unpack a few useful data points, such as page_location, and add new dimensions such as landing_page and page_path. These additions speed up queries and provide richer insights.

Use incremental strategies

If you’re loading data into BigQuery using a third-party connector, it’s important to make sure that only new data is loaded, rather than re-loading and replacing data that is already there. This approach, known as an incremental strategy, can significantly improve performance and reduce costs.

Believe me, this is extremely important. We’ve seen situations where petabytes of data were reloaded from scratch, replacing several years of history, when all that was needed was to append yesterday’s data. By using an incremental approach, you save time, storage space, and processing costs.

Pre-aggregate data

Data like GA4 is not optimized for analytics out of the box. Pre-aggregating frequently used metrics can significantly improve performance and reduce query costs. For example, you can create summary tables that store pre-calculated metrics, such as:

  • Number of events (for example, total number of events per day)
  • Purchase revenue (e.g., total revenue grouped by day or campaign)
  • Number of new users (e.g., unique users identified on first visit)
  • Number of active users (e.g., daily or monthly active users)

While pre-aggregation works well for these metrics, be careful with metrics that depend on the number of sessions or unique users over a period of time. Aggregating already-aggregated data, such as calculating “sessions per user” per month from daily rollups, can lead to inaccuracies because the same users are counted again in each period.

This method helps simplify queries and speed up reporting for frequently used metrics, ensuring accuracy and cost-effectiveness.

Enable history-based optimizations

BigQuery can gradually improve query performance through history-based optimizations. For this feature to work, make sure it is enabled in your settings, which will allow BigQuery to use historical execution plans and statistics to improve performance.

To enable this feature at the project level, here is the general template (substitute your region and project name):

ALTER PROJECT `NAME OF PROJECT`

SET OPTIONS (

`region-LOCATION.default_query_optimizer_options` = 'adaptive=on'

);

If you have a project called growth-runner in the EU region, the statement will look like this:

ALTER PROJECT `growth-runner`

SET OPTIONS (

`region-eu.default_query_optimizer_options` = 'adaptive=on'

);

This will enable adaptive optimizations for your queries, which will increase efficiency. You can learn more about history-based optimizations in the Google Cloud documentation.

Account for physical and logical storage

BigQuery can bill storage using either a logical or a physical model (chosen per dataset), so it’s important to understand the difference between the two:

  • Physical storage is the actual compressed space your data occupies, including historical snapshots kept for time travel.
  • Logical storage is the uncompressed size of the data, including all columns, even if they are only partially filled.

The time travel feature, which allows you to query previous states of a table (including accidentally deleted data), can significantly increase physical storage. Although this is a powerful safety net, it is worth periodically reviewing whether you need the full window. For example, reducing the time travel window from the default 7 days to the 2-day minimum can cut physical storage costs without significantly impacting most workflows.

These settings are managed at the dataset level: you can change them by editing the dataset details in the BigQuery console, or in SQL, as shown below.

By being aware of these differences, you can manage your spend more effectively using BigQuery’s powerful features. For more information on time travel, see the Google Cloud documentation.

Use BigQuery slots to control costs

If your BigQuery spend exceeds $1,500 per month, it may be worth investing in BigQuery slots. Slots provide a fixed-price model that offers predictable costs and enough computing power for heavy workloads.

With slots, you reserve a specific amount of BigQuery compute capacity for a fixed monthly fee instead of paying for the bytes each query scans. For example:

  • On-demand pricing: queries are billed for the amount of data they process. For large-scale analytics, this can lead to unpredictable and high costs.
  • Capacity-based pricing: instead, you pay a fixed amount (e.g., approximately $1,500 per month for 50 slots), and your queries share that slot capacity rather than being billed per byte.

Example:

If your team runs many complex GA4 queries every day, processing hundreds of terabytes of data, your monthly costs can exceed $2,000. However, by purchasing 50 slots for ~$1,500 per month, you’ll have a predictable expense each month. Please note that with slots, queries may take a little longer to complete during peak loads as the capacity is shared among all users.

To learn more about BigQuery slots and determine the right capacity for your needs, check out the BigQuery slot pricing documentation.

Conclusion

Optimizing GA4 data analysis with BigQuery requires a combination of technical best practices and strategic planning. By applying these techniques, you can significantly improve query performance, reduce costs, and gain deeper insights from your data.

If you need any support, don’t hesitate to contact us, we’re happy to help.

