Christof von Rabenau (00:04):
Hello. Welcome to today's presentation on Optimizing BigQuery. I'm your host, Christof von Rabenau. I'm a Cloud Architect here at Redapt working in our advanced analytics team.
I'm often brought into situations where organizations have adopted BigQuery and operate in the same way they would with a traditional, row-based relational database. BigQuery is a column-based database and, as a result, there are some key things to keep in mind when you're actually running queries in BigQuery.
I've got a presentation today. I created a Jupyter notebook that will take us through various approaches to optimizing BigQuery. Let's jump in. The first thing is to create a connection to BigQuery. We're going to be using the Google Cloud Python API to connect. If you're comfortable with Python and with Jupyter notebooks, it's pretty handy.
Here we're making our connection to BigQuery: we provide our project and create our client connection. Before we begin, understand that BigQuery charges for storage and for compute, which in practice means the data scanned by your queries. Storage in BigQuery isn't free, but it isn't the primary driver of cost either; the primary cost is the compute. So we're going to look at how we reduce the compute load, and therefore the dollar cost, of our queries. What I've done is create a method we'll use over and over again that relies on the dry run functionality in the API. You can do something similar with the query validator in the BigQuery console; we'll touch on that later when we get into the workspace.
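If you want to follow along outside the demo, a minimal sketch of that connection step might look like this. The project ID is a placeholder, and it assumes the google-cloud-bigquery package is installed and you've already authenticated, for example with application default credentials.

```python
# Minimal connection sketch, not the demo's exact notebook cell.
from google.cloud import bigquery

PROJECT_ID = "your-project-id"  # placeholder: substitute your own project
client = bigquery.Client(project=PROJECT_ID)
```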
I've created this Python method that we're going to reuse throughout the presentation. I've set dry run equal to true, so we're not actually running the queries; we're just asking how many gigabytes of data the query would scan if we did run it. I've also set use query cache equal to false. Caching is one of the nice, handy things about BigQuery: if you run the same query again within 24 hours, it serves the results from cache, as long as the underlying tables haven't been updated. But we don't want to run against the cache here, because we want every run, whether it's the first, second, third, or fourth, to be measured against the actual data.
The method also lets us pass in a second query to compare against. When we try an optimization, we'll compare it to the un-optimized query to make sure we actually see some value. The output of the method is rough: it answers the question, "If we run this query once a day, every day for 30 days, how much data do we scan?" The estimated costs are based on a rate of $5 per terabyte scanned. It's a blunt tool, but it gives us a consistent way to measure.
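The notebook's helper isn't reproduced here, but a sketch of the same idea, dry runs with the cache disabled, an optional comparison query, and a rough 30-day cost at $5 per terabyte, might look like this. The function name and output format are illustrative, not the presenter's exact code.

```python
# Sketch of a dry-run cost estimator along the lines described above.
from google.cloud import bigquery

PRICE_PER_TB = 5.0    # assumed on-demand rate: $5 per terabyte scanned
RUNS_PER_MONTH = 30   # "once a day, every day for 30 days"

def estimate_cost(client, query, compare_query=None):
    """Dry-run `query` (and optionally `compare_query`) and print the
    gigabytes that would be scanned plus a rough monthly cost."""
    job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)

    def gb_scanned(sql):
        job = client.query(sql, job_config=job_config)  # dry run: nothing executes
        return job.total_bytes_processed / 1e9

    gb = gb_scanned(query)
    print(f"query scans {gb:,.3f} GB -> ~${gb / 1000 * PRICE_PER_TB * RUNS_PER_MONTH:,.2f}/month")

    if compare_query:
        base_gb = gb_scanned(compare_query)
        print(f"baseline scans {base_gb:,.3f} GB -> ~${base_gb / 1000 * PRICE_PER_TB * RUNS_PER_MONTH:,.2f}/month")
```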
To begin, we're going to use a couple of tables. One of them is a fairly large table from Google Cloud's public Stack Overflow data set. This is the schema of that table; it has quite a number of fields of various types. If we want to know how many rows are in it, we can run a query to ask that question: a little over 21 million rows and about 36 gigabytes of data.
We're also going to use the Wikipedia page views table from the public data sets. It has far fewer columns, but many more rows: over 55 billion rows and more than two terabytes of data. That gives us a good range of table sizes to test our performance against. I'll run the cost-estimate method so it's available in the notebook, and then we can start looking at our tables.
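As a side note for anyone following along in the notebook rather than the console: table metadata gives you row and byte counts without scanning anything. A small sketch, using the real public Stack Overflow table and the client from the earlier sketch:

```python
# Metadata lookups are free: no query job, no bytes scanned.
table = client.get_table("bigquery-public-data.stackoverflow.posts_questions")
print(f"{table.num_rows:,} rows, {table.num_bytes / 1e9:,.1f} GB")
print([field.name for field in table.schema][:10])  # peek at the first column names
```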
I created my own copy of the Wikipedia page views table from the Google Cloud public data set. It has fewer columns, but many more rows: nearly 56 billion rows and almost 2.5 terabytes of data. Now for some baseline queries. We'll do a Select Star, every column and every row with a limit of a hundred, against the Stack Overflow table and see how much data it scans. We would expect to scan the entire table, and we do: 33.8 gigabytes. If we do the same thing against the page views table, we scan 2.3 terabytes of data. Run that every day and it's going to cost us about $338 a month. The Stack Overflow query, at around $5 a month, might not cause much stress, but $338 a month probably will.
What if we add a WHERE clause? We're selecting everything, so it seems like we should throw in a WHERE clause to reduce the amount of data we're pulling. But if we add a WHERE clause and compare it to the straight Select Star, we don't actually see a gain. If we do it against the super-large data set, there's still no gain. BigQuery still has to scan the whole table, because we're telling it Select Star; we're asking for everything. Select Star is a BigQuery anti-pattern. You do not want to Select Star, especially on a table that's terabytes in size.
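For reference, that comparison might look roughly like this with the helper sketched earlier (the filter value is illustrative):

```python
# SELECT * baseline versus SELECT * with a WHERE clause: the dry runs
# report the same bytes scanned, because every column is still read.
QUESTIONS = "bigquery-public-data.stackoverflow.posts_questions"

select_star = f"SELECT * FROM `{QUESTIONS}` LIMIT 100"
select_star_where = f"SELECT * FROM `{QUESTIONS}` WHERE view_count > 1000 LIMIT 100"

estimate_cost(client, select_star_where, compare_query=select_star)
```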
What we want is to return only the data we're interested in. But you might raise your hand and say, "Hey, Chris, what about previewing the data? I want to get a sense of what the data is. That's why I run a Select Star." BigQuery gives us a tool for that. If we go look at our demo data set and open the post questions table, we can see the schema, we can see the table details, and then we can preview the data without incurring any cost. We can see the shape of the data, the creation date stamp, and that sort of thing. We get a sense of what the data looks like without having to query it.
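From the notebook side, `list_rows` does much the same thing as the console preview: it reads rows through the table data API instead of running a query, so it doesn't bill query bytes. A quick sketch:

```python
# Preview a handful of rows without issuing a query.
table = client.get_table("bigquery-public-data.stackoverflow.posts_questions")
for row in client.list_rows(table, max_results=5):
    print(row["title"], row["creation_date"], row["view_count"])
```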
Now we know our Select Stars aren't doing us any good, and we know that Select Star is a big anti-pattern in BigQuery. Take a look at what happens when we select only the columns we're interested in, compared to the Select Star. Selecting everything, we're looking at almost 34 gigabytes of data scanned; selecting only the columns we're interested in is significantly less. The same thing happens on the Wikipedia table. The gain there isn't as dramatic, because that table only has four columns while the other has many more. Even so, it's still a reduction of about a third, which is definitely something we want.
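Here's a sketch of that column comparison, again using the helper from earlier (the chosen columns are just an example):

```python
# Selecting only the columns we need versus SELECT *.
QUESTIONS = "bigquery-public-data.stackoverflow.posts_questions"

narrow = f"SELECT title, view_count, creation_date FROM `{QUESTIONS}` LIMIT 100"
select_star = f"SELECT * FROM `{QUESTIONS}` LIMIT 100"

estimate_cost(client, narrow, compare_query=select_star)
# Only the bytes of the three selected columns are scanned.
```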
Now, what happens if we put a WHERE clause onto this query? Again, we don't see the gain we would expect. In our earlier example, the WHERE clause left the scan unchanged; here it actually makes things worse. This query scans 3.317 gigabytes of data, whereas without the WHERE clause we were only scanning 1.018 gigabytes.
The challenge for BigQuery here is that the WHERE clause references two columns that aren't in our select, comment count and tags, so it has to scan two additional columns of data. That's why we see the increase. If we instead put the WHERE clause on one of the fields we're already returning, like view count, we end up with the same amount of data scanned as before. And if we run a WHERE clause against our big table, again, we see the same scan whether the WHERE clause is there or not.
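A sketch of the two WHERE variants, one filtering on columns that aren't selected and one filtering on a column that is (the filter values are made up for illustration):

```python
# Filtering on columns we don't select (comment_count, tags) forces
# BigQuery to read those extra columns, so more bytes are scanned.
QUESTIONS = "bigquery-public-data.stackoverflow.posts_questions"

where_extra_columns = f"""
    SELECT title, view_count, creation_date
    FROM `{QUESTIONS}`
    WHERE comment_count BETWEEN 10 AND 20 AND tags LIKE '%python%'
    LIMIT 100
"""

# Filtering on a column we already select leaves the scan unchanged.
where_selected_column = f"""
    SELECT title, view_count, creation_date
    FROM `{QUESTIONS}`
    WHERE view_count > 1000
    LIMIT 100
"""

narrow = f"SELECT title, view_count, creation_date FROM `{QUESTIONS}` LIMIT 100"
estimate_cost(client, where_extra_columns, compare_query=narrow)
estimate_cost(client, where_selected_column, compare_query=narrow)
```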
Now we're beginning to see ways we can performance-tune our queries in BigQuery. First, Select Star is a big BigQuery anti-pattern. Second, recognize that a WHERE clause may not produce the payoff you expect if BigQuery still needs to scan the whole table, and it can cause you to scan more data if it filters on columns you aren't returning in your select. Just be aware of that.
So how do we get value from our WHERE clauses? That brings us to partitioned tables. BigQuery organizes data into buckets. If you don't partition a table, BigQuery decides how to organize the data on its own, but you can take control of the partitioning. You can partition on a time-unit column, say a timestamp, date, or datetime column in your table. You can partition on ingestion time, which is useful if you're streaming data. Or you can provide an integer range and partition the table on an integer column.
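As a reference, DDL for the first two options might look like this. The dataset and table names are placeholders, not the demo's actual tables.

```python
# Time-unit partitioning: monthly partitions on a TIMESTAMP column.
time_unit_ddl = """
CREATE TABLE `your-project.demo.questions_by_month`
PARTITION BY TIMESTAMP_TRUNC(creation_date, MONTH) AS
SELECT * FROM `bigquery-public-data.stackoverflow.posts_questions`
"""

# Ingestion-time partitioning: BigQuery partitions on when rows arrive.
ingestion_ddl = """
CREATE TABLE `your-project.demo.events_by_ingestion`
(event_id STRING, payload STRING)
PARTITION BY _PARTITIONDATE
"""

# Run each with client.query(ddl).result() when you actually want the tables.
```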
I've created some partitioned tables. I took the Stack Overflow questions table and partitioned it by comment count. Now we can compare a Select Star from the partitioned table against a Select Star from the non-partitioned table, and we can see there's not much gain in doing that. Likewise, if we do it against the partitioned Wikipedia table, we don't see a significant gain yet.
Now if we add a WHERE clause to our Select Star (remember, we're still selecting all the columns, we're just adding a WHERE clause), there are significant gains. Our filter is on a comment count between 10 and 20. If we look at the table details, I partitioned it on comment count starting from zero, with a range end of 200 and a partition interval of one. Questions with no comments land in one partition, questions with one comment in the next, and so on. So this query only reads the partitions holding comment counts of 10, 11, 12, and up to 20. We've reduced the amount of data we're scanning, and that's the performance gain we're seeing here.
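The integer-range version described here, plus the query that prunes it, might look roughly like this (table names are placeholders; creating the copy scans the full source table once):

```python
# Integer-range partitioning: comment_count bucketed from 0 to 200 in steps of 1.
range_partition_ddl = """
CREATE TABLE `your-project.demo.questions_by_comment_count`
PARTITION BY RANGE_BUCKET(comment_count, GENERATE_ARRAY(0, 200, 1)) AS
SELECT * FROM `bigquery-public-data.stackoverflow.posts_questions`
"""

# Filtering on the partitioning column means only the partitions for
# comment counts 10 through 20 are read.
pruned_query = """
SELECT *
FROM `your-project.demo.questions_by_comment_count`
WHERE comment_count BETWEEN 10 AND 20
LIMIT 100
"""
```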
If we run that against the larger data set, which I partitioned on views, again, look at that gain. That is definitely worth it. And remember, this is still a Select Star: I'm selecting all the columns, but only the data where views equal 100. Partitioning your data is definitely a thing to do. If we take it to the next level and select only the columns we want plus a WHERE clause, look at that gain. We're barely incurring any cost at all. The same thing on our large data set: again, very significant gains. So we can add partitioned tables to our key performance tuning.
Let's take a look at queries without limits. In all my queries so far I've used a limit of 100, because I didn't want to return a whole lot of data. What happens if we query without a limit? Nothing changes; there's no performance gain from adding a limit. If we do the same thing on our partitioned table, there's still no gain. Why? Because the limit in BigQuery only limits what's returned to you at the end. It still has to scan all the data, aggregate it the way you asked, and then hand back 100 rows. If we do it selecting only our columns on the partitioned table, we see the same thing. So don't assume a limit saves you money the way it does in a standard, row-based relational database. BigQuery scans all the data first and applies the limit last.
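A quick dry-run check of that point, using the helper from earlier:

```python
# LIMIT changes what comes back, not what gets scanned: the dry runs
# for these two report identical bytes processed.
QUESTIONS = "bigquery-public-data.stackoverflow.posts_questions"

with_limit = f"SELECT title, view_count FROM `{QUESTIONS}` LIMIT 100"
without_limit = f"SELECT title, view_count FROM `{QUESTIONS}`"

estimate_cost(client, with_limit, compare_query=without_limit)
```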
One case where you can get some value from a limit is when you combine it with an ORDER BY. If you order the data and return a limit, you'll see some performance gain. So we'll add this to our key performance tuning: a LIMIT clause only limits what is returned, and BigQuery still scans all the data. That was one of the first lessons I learned when I started using BigQuery: limit doesn't help you. Bear that in mind.
There's another way to limit the data we're scanning, and that's by putting a clustered index on a table. We've got our buckets, say we've partitioned our data, and clustering organizes the data within them so BigQuery can find it more readily. I've created a clustered table, clustering on tags, but I haven't partitioned it to begin with, and we'll compare it to a non-clustered, non-partitioned table. Remember, this table is clustered on tags and not partitioned. If we run that query, we see a nice gain. What if we add clustering to our partitioned table? There we don't see as much gain, because the partitioning has already reduced the data set, so there's less left for clustering to save. But remember clusters: we're going to add clustering to our BigQuery tuning.
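DDL for a clustered (but not partitioned) copy, with an example query that filters on the clustering column, might look like this. Again, table names are placeholders and the tag filter is illustrative.

```python
# Clustered-only copy of the questions table, clustered on tags.
cluster_ddl = """
CREATE TABLE `your-project.demo.questions_clustered_by_tags`
CLUSTER BY tags AS
SELECT * FROM `bigquery-public-data.stackoverflow.posts_questions`
"""

# Filtering on the clustering column lets BigQuery skip blocks of data.
clustered_query = """
SELECT title, view_count
FROM `your-project.demo.questions_clustered_by_tags`
WHERE tags = 'python'
LIMIT 100
"""
```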
Google BigQuery also allows wildcard table names. Here's an example from a BigQuery public data set where the data goes back to 1948, I believe. First I'm saying, "Scan all the tables from the 1900s." If we run that, we scan about 1.7 gigabytes of data. In the next query, all I've done is add a seven after that nine, so now we're only looking at the tables from the 1970s: again, a significant gain. If we name 1979 specifically, we see another significant gain. This takes us beyond "Select Star is an anti-pattern" to "the star itself is an anti-pattern." Be as specific as possible at all times when you're making a query request to BigQuery. You might really want all of the 1970s, but make sure that's exactly what you need, and not just 1979.
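To make the wildcard idea concrete, here's a sketch against the public NOAA GSOD tables, which are sharded by year. It's a stand-in for the data set in the demo, not the same one.

```python
# Each extra character in the wildcard prefix narrows how many tables are scanned.
every_19xx = "SELECT MAX(temp) FROM `bigquery-public-data.noaa_gsod.gsod19*`"   # all 19xx tables
the_1970s  = "SELECT MAX(temp) FROM `bigquery-public-data.noaa_gsod.gsod197*`"  # only the 1970s
just_1979  = "SELECT MAX(temp) FROM `bigquery-public-data.noaa_gsod.gsod1979`"  # a single year

for sql in (every_19xx, the_1970s, just_1979):
    estimate_cost(client, sql)
```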
Lastly, this is a more complex concept, but one to bear in mind as you build your queries: how are you staging your work? Let's look at a sample data set of EPA air quality measurements from stations in Ohio. Scanning that query costs 1.479 gigabytes. Now, we can cross join that with city and utility information for Columbus, Ohio, and get a result that tells us what things look like in Columbus specifically, because in the previous query all we had was Ohio as a whole. That Columbus query scans about two gigabytes of data. All fine and good. But what if we want to do the same thing for Cleveland and Cincinnati? Here's Cincinnati and here's Cleveland. Each time we run this request, it costs us two gigabytes of scanning.
Instead, we want to create a table containing only the data we're querying against, as minimal as possible, just the data we actually need to capture. In this case we're extracting the minimum year and looking for the maximum measurement. Pulling only that data into a table costs us about 10 gigabytes of scanning to create it, which is pretty hefty. But we do that one time, and then scanning it for Columbus, Cincinnati, Cleveland, or even Duluth, Minnesota costs us less than 0.6 gigabytes each time.
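A sketch of that materialize-once pattern is below. The table and column names are my approximation of the public EPA air quality data set and of the demo's query, so treat them as placeholders rather than the exact code.

```python
# One-time (roughly 10 GB) scan to build a small per-city summary table,
# then cheap repeated lookups against it.
from google.cloud import bigquery

materialize_ddl = """
CREATE TABLE `your-project.demo.ohio_air_summary` AS
SELECT
  city_name,
  MIN(EXTRACT(YEAR FROM date_local)) AS min_year,
  MAX(arithmetic_mean)               AS max_measurement
FROM `bigquery-public-data.epa_historical_air_quality.pm25_frm_daily_summary`
WHERE state_name = 'Ohio'
GROUP BY city_name
"""

per_city_query = """
SELECT *
FROM `your-project.demo.ohio_air_summary`
WHERE city_name = @city
"""

city_config = bigquery.QueryJobConfig(
    query_parameters=[bigquery.ScalarQueryParameter("city", "STRING", "Columbus")]
)
# client.query(materialize_ddl).result()                 # run once
# client.query(per_city_query, job_config=city_config)   # run per city, cheaply
```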
So if you're going to run this same query for more than about eight cities, you definitely want to materialize that data in a table first, rather than hitting the larger data set every single time. It cost us a little over 10 gigabytes to materialize the table. Running the query eight times without materializing is a little over 16 gigabytes; running it eight times against the materialized table is a little under five gigabytes. So somewhere past eight cities we earn back our initial investment of 10.14 gigabytes. Now imagine you're querying 200 cities across the United States. That savings is going to be significant.
In conclusion, here are the things to keep in mind when performance tuning BigQuery. One, Select Star is a BigQuery anti-pattern: select only the columns you need. Two, a WHERE clause might not produce the payoff you expect if BigQuery still needs to scan the whole table, but partitioned tables plus a WHERE clause are your friends, so partitioning is another key tuning step. Three, a LIMIT clause only limits what is returned to you; BigQuery still scans all the data in all the selected columns first. Four, use clustering on tables to improve performance. Five, be as specific as possible at all times. And finally, look at materializing your queries in stages, so you gain performance when you're running the same query over the same data set multiple times.
I hope this has been helpful and brings you success with your BigQuery work. I look forward to hearing comments and feel free to give feedback. Thank you.