As we prepare for our workshop “A Big Data Adventure on the Google Cloud Platform” at Crunch Data Engineering and Analytics Conference, we have started to collect some tips and tricks when it comes to using Google’s Big Data platform. This first collection of advice shows how to control your costs when using BigQuery (BQ), discusses the kinds of built-in tools you have access to, and explains the related best practices. In the second blog post, we show how you can boost your BigQuery performance. Read this blog post to find out how to take control of your Bigquery costs.
What is the difficulty in BigQuery costs?
The BigQuery pricing is usage-based, you pay for the amount of data you store, query, and insert by streaming. This seems fair, although it has some consequences. When you have practically unlimited resources — as in the case of BigQuery — this pricing policy becomes a double-edged sword: it is easy to spend a lot if you are not cautious enough. When it comes to budget planning, you should know how much data you will store and process in the future and this can make it challenging to estimate your costs. Fortunately, there are some tools and best practices that can help you to control your costs.
Probably the most difficult is to plan the costs if you’ve never actually used BQ. Google provides a pricing calculator tool that helps in these situations. It estimates your future costs based on the parameters we’ve already mentioned: storage, streaming inserts, queries. Of course, to get a result, you will still have to estimate the size of the datasets involved.
How to prevent overspending
BQ provides a couple of built-in tools to continuously control your spending and stop you going over budget.
1.Query-level cost control
Check the amount of data BQ will process during the query before actually running that query. Simply click on the green exclamation mark (Validator) in the right bottom of the window. If you also add the BQ Mate extension to your Chrome browser, you will see how much the query will cost, in USD.
2. Project-level cost control
Set the project-level soft limit by requesting a billing alert. When you exceed the threshold you have set for the current month, the billing administrators will receive an email notification. Set the project-level hard limit by maximizing the number of bytes processed per day within the project. If you enable the BQ custom quotas feature, you can specify this limit in 10 TB increments. To activate this feature or to modify the value, submit the BQ Custom Quota Request form. When you exceed the quota, queries will return errors.
3. User-level cost control
Set the user-level hard limit within a project by maximizing the number of bytes processed per day by a given user. The method is the same as in case of the project-level hard limit, except that the error message is different when you exceed the quota.
4. Billing-account level cost control
If you have multiple projects, and you want to monitor the overall budget usage, set billing alerts for your entire billing account. The process is the same as point 2a.
How to monitor the costs
- There are also options to monitor your spending:
- Follow your monthly data usage on your project’s Cloud Console dashboard on the Billing panel. Be aware though, that the costs will appear only within a couple of days of usage, not in real time.
- Under the Billing/History menu on the Cloud Console, you can get a more detailed view of your spending.
- Export your detailed BigQuery audit logs, and visualize your spendings in Data Studio as Mike and Ryan suggest in this great post.
Is that all?
Not really. Beyond these built-in features, there are several best practices that can help to reduce your costs.
- If possible, always consider partitioning your data and storing it in sharded or partitioned tables. If you have chosen the right partitioning column, usually just part of the shards or partitions has to be read by a query, so the costs will significantly decrease. This way, you can save money on the storage side, too, with the built-in feature in BQ that puts those tables you haven’t changed in the last 90 days into a cheaper storage category.
- Try to avoid SELECT
- option on big datasets. Don’t be lazy to write down the name of the columns you really need. BQ is a columnar storage. You pay for the amount of data your query needs to read. Do the math: fewer columns selected means fewer columns need to be read, which means lower costs.
Set the data expiration date on datasets. After a given time, all the tables in the dataset will be deleted so you won’t have to pay for storage. This is especially useful if you are in an experimentation phase with BQ. Another benefit of this feature is that you can prevent the overgrowth of tables and datasets in your projects.