Optimizing BigQuery Slot Usage with Temporary Tables: A Cost-Saving Strategy
Managing costs efficiently in BigQuery is crucial to staying within budget. If you’re executing BigQuery queries from your code, this article will provide insights into reducing slot usage, ultimately saving you money.
Understanding the Lifecycle of a BigQuery SQL Query
To grasp how BigQuery works behind the scenes, let’s dive into the lifecycle of a SQL query:
- Query Submission: You submit a SQL query to BigQuery via a client library, CLI, or Console.
- Job ID Assignment: Upon query submission, BigQuery assigns a unique Job ID to your query job, which you receive in response.
- Query Processing: BigQuery’s distributed infrastructure optimizes and executes your query as you wait for the query to complete.
- Result Retrieval: Once the query job completes, you can retrieve the results using the associated Job ID.
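The lifecycle above maps onto a handful of client-library calls. Here is a minimal Node.js sketch of that flow, assuming the `@google-cloud/bigquery` package; the client is passed in as a parameter so the flow stays easy to test:

```javascript
// Sketch of the query lifecycle: submit, receive a Job ID, wait, fetch results.
// `bigquery` is an injected client (e.g. from `new BigQuery()` in
// @google-cloud/bigquery).
async function runQuery(bigquery, sql) {
  // 1. Submit the query; BigQuery assigns a Job ID right away.
  const [job] = await bigquery.createQueryJob({ query: sql });
  console.log(`Job submitted with ID: ${job.id}`);

  // 2.-3. Wait for the job to complete and retrieve the results.
  const [rows] = await job.getQueryResults();
  return { jobId: job.id, rows };
}
```

Every call in this flow consumes slots on the BigQuery side, which is what the rest of this article tries to keep in check.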
All these steps consume BigQuery slots and resources.
Identifying the Problem
I have a Kubernetes cronjob written in Node.js that runs daily, fetching data from BigQuery (via a complex query), processing it, and forwarding it to Pub/Sub. The job's full workflow is discussed in a separate article, but in brief:
1. Submit the query to BigQuery and obtain a Job ID.
2. Wait for the query to finish, polling on that Job ID.
3. Once it completes, read the results in paginated form using that Job ID.
4. Process the data and publish it to Pub/Sub.
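Steps 3 and 4 above, paginated reads followed by publishing, can be sketched like this. The page size and the `publish` callback are illustrative assumptions, and the exact shape of the paginated response is a simplification of the client library's:

```javascript
// Read a completed job's results page by page and hand each row to a
// publish function (e.g. a Pub/Sub topic's publishMessage). The job and
// the publish callback are injected so the loop itself stays testable.
async function drainJob(job, publish, pageSize = 1000) {
  let pageToken;
  do {
    // With autoPaginate disabled, getQueryResults accepts maxResults and
    // pageToken; the next-page token comes back alongside the rows.
    const [rows, nextQuery] = await job.getQueryResults({
      maxResults: pageSize,
      pageToken,
      autoPaginate: false,
    });
    for (const row of rows) {
      await publish(row); // process + publish each record
    }
    pageToken = nextQuery && nextQuery.pageToken;
  } while (pageToken);
}
```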
While analyzing BigQuery slot usage, I noticed an unexpectedly high bill, exceeding my budget by 1.5 times. I initially expected the cost to align with a single daily query run.
Upon investigation, I discovered that the issue stemmed from the inadvertent resubmission of queries to BigQuery: on certain days, the query was submitted twice instead of once. Digging into Kubernetes' internal workings, I found that pods running jobs can be evicted and replaced.
The problem lies in the static nature of the job's steps: when a pod is evicted and replaced, the new pod submits the same query all over again, driving up costs.
Implementing a Solution
To mitigate this problem, I checked BigQuery's documentation and learnt that once a query runs, BigQuery stores its results in a temporary table for 24 hours. This temporary table can be accessed using the Job ID that BigQuery returned.
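Because the results live in that temporary table for roughly 24 hours, a saved Job ID is enough to re-attach to the job and read its results without re-running the query. A hedged sketch, again with the client injected:

```javascript
// Re-attach to a previously submitted query by its Job ID and read the
// results from the temporary table BigQuery kept for it -- no slots are
// spent re-running the query itself.
async function fetchExistingResults(bigquery, jobId) {
  const job = bigquery.job(jobId); // handle to the existing job
  const [rows] = await job.getQueryResults();
  return rows;
}
```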
The solution is to reuse the initial Job ID so the query is never rerun. I now use Redis to store the Job ID under a key, and I check for that key's existence before submitting a new job to BigQuery. The revised workflow is as follows:
1. Check for a job associated with the Redis key (let's call it "bq-job").
2. If the key exists, retrieve the Job ID from Redis and follow steps 2–4 of the previous workflow.
3. If the key doesn't exist, submit the query to BigQuery to obtain a new Job ID, and store that Job ID in Redis under the key with a time-to-live (TTL). Note: the TTL ensures the job runs again the next day; without it, the next day's run would find the key still present and never submit the query.
4. After storing the key, proceed with steps 2–4 of the previous workflow.
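The guard described above can be put together in a few lines. The Redis client is injected (any client exposing `get` and `set` with an expiry option works; the option shape shown matches node-redis v4), and the `submitQuery` callback and 24-hour TTL are illustrative assumptions:

```javascript
// Idempotent job submission guarded by Redis. If a Job ID is already
// stored under the key, reuse it; otherwise submit the query, store the
// new Job ID with a TTL, and return it.
const KEY = 'bq-job';
const TTL_SECONDS = 24 * 60 * 60; // expire so tomorrow's run resubmits

async function getOrSubmitJob(redis, submitQuery) {
  const existing = await redis.get(KEY);
  if (existing) {
    return existing; // pod was replaced: reuse the earlier job
  }
  const jobId = await submitQuery(); // first run today: submit the query
  // node-redis v4 style expiry option; adjust for your client.
  await redis.set(KEY, jobId, { EX: TTL_SECONDS });
  return jobId;
}
```

With this in place, an evicted-and-replaced pod finds the stored Job ID and falls through to reading the temporary table instead of burning slots on a second run.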
This new approach has brought costs back in line with my expectations. I've also added an extra step that stores the page token (marking where data retrieval stopped) to prevent duplicate processing, a topic to be covered in a separate article.
This optimization strategy highlights the flexibility of accessing submitted BigQuery jobs using Job IDs. Leveraging temporary tables allows for efficient data retrieval and cost savings. You can use various storage solutions beyond Redis.
Setting TTLs is essential for seamless execution on subsequent days, but you can also avoid TTLs entirely by using a dynamic key that includes the date, so the key changes every day.
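The date-based key alternative is a one-liner. Since the key itself changes at midnight UTC, no TTL is strictly required, though one still helps Redis garbage-collect old keys. The key prefix here is illustrative:

```javascript
// Build a key that changes every day, e.g. "bq-job:2024-05-01", so each
// day's run naturally misses yesterday's key and submits a fresh query.
function dailyJobKey(prefix = 'bq-job', now = new Date()) {
  const day = now.toISOString().slice(0, 10); // YYYY-MM-DD (UTC)
  return `${prefix}:${day}`;
}
```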
With these steps, you can make the most of BigQuery’s capabilities while managing costs effectively.