Uncover More GSC Data Using API and BigQuery

May 14, 2024

9 min read

Google Search Console provides valuable information to improve search performance and is probably one of the main tools you use if you work in SEO. However, the GSC UI has quite a few limitations: missing click data, limited filtering options, and it only includes a 16-month history unless you save the data yourself. 

Fortunately, there are alternative ways of looking at your GSC data: the API and the bulk export to BigQuery. 

In this article, I’ll explain how they work, the main differences between them, and which one you should use based on your needs.


The GSC UI

In most cases, we can use the UI, but it comes with the following issues: 

  • The UI won’t provide more than 1,000 rows, no matter what tab you are looking at. This won’t affect you if you’re working on small projects (such as your own site), but it will make things harder for most projects with more than a hundred pages. I will return to this data limitation later in the article and explain why it happens.

  • You can’t apply more than one filter to the same dimension, such as filtering search performance data for France and Spain, without using a regular expression. And using a regular expression is not always an option. 

  • You can’t group data by week or month, as we were used to doing in Universal Analytics. Having daily data by default makes comparisons harder to scan. 

  • You can’t display results for combined dimensions. You can get a list of top queries, but it is impossible to get the top queries for more than one page in a single table, for instance, which would save a lot of time on new projects. 

Let me be clear: the UI is wonderful, but you can’t use it for everything. Sometimes, you need more granularity or advanced features, and it falls short. 

GSC API

The API is the obvious solution when you’re limited by the UI. If you’re not familiar with the steps needed to access the GSC API, take a look at this article, which covers all the required information, as that setup is not the main focus here. 

The UI actually uses the API but displays only part of the available information. This is especially obvious when you look at the documentation and realize that, by default, the API returns just 1,000 rows of data, which is the same limit I mentioned the UI has.


There is a point I want to address straight away, though: the misconception that the API can only return 50,000 rows. I honestly don’t know why I’ve read that so many times, but it is entirely false. If you do it correctly, you can get millions of rows from the API. It is true that the API returns at most 25,000 rows per call, but you can use pagination to get the rest. 
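
If you want to see what that pagination looks like against the raw API, here is a minimal sketch using google-api-python-client. It assumes a service object built with googleapiclient.discovery.build and already authenticated, which is not shown here; adapt it to your own authentication flow.

# Minimal pagination sketch against the raw Search Console API.
# Assumes `service` was created with
# googleapiclient.discovery.build('searchconsole', 'v1', credentials=...).
def fetch_all_rows(service, site_url, start_date, end_date, dimensions, row_limit=25000):
    all_rows = []
    start_row = 0
    while True:
        response = service.searchanalytics().query(
            siteUrl=site_url,
            body={
                'startDate': start_date,
                'endDate': end_date,
                'dimensions': dimensions,
                'rowLimit': row_limit,   # 25,000 is the maximum per call
                'startRow': start_row,   # offset used for pagination
            },
        ).execute()
        rows = response.get('rows', [])
        all_rows.extend(rows)
        if len(rows) < row_limit:        # last page reached
            break
        start_row += row_limit
    return all_rows

rows = fetch_all_rows(service, 'https://www.example.com/',
                      '2024-01-01', '2024-01-31', ['query', 'page'])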

That being said, the API is wonderful, but it comes with several data discrepancies that you should be aware of. I’ll explain them using my Python library, which you can use freely to run your own tests. 

Let’s create a simple request to get the traffic per day for one of the web properties I have access to. 

# 'account' is the authenticated GSC account object created with the library
# (see the article linked above for the authentication steps).
webproperty = account['https://www.example.com/']

gsc_data = (
   webproperty
   .query
   .range('2024-01-01', '2024-01-31')  # full month of January 2024
   .dimensions(['date'])               # one row per day
   .get()
)

The result looks like this: clicks / impressions per day, something we could get easily through the UI, by the way.


In that example, the total number of clicks for January 2024 is 893,691. So far, so good. Now, let’s run a similar request, but including only branded queries: 

gsc_data = (
   webproperty
   .query
   .range('2024-01-01', '2024-01-31')
   .filter('query', 'my_brand', 'contains')     # keep only queries containing the brand term
   .dimensions(['date'])
   .get()
)

And another one without the branded queries: 

gsc_data = (
   webproperty
   .query
   .range('2024-01-01', '2024-01-31')
   .filter('query', 'my_brand', 'notContains')  # exclude queries containing the brand term
   .dimensions(['date'])
   .get()
)

We get the following totals: 

  • Branded: 48,613 clicks 

  • Non-branded: 387,984 clicks 

You may have realized that there is a problem here. If we sum these two numbers, we’re far from the 893,691 total we should have. We’re missing around 50% of our clicks. 
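
To make the gap concrete, here is the quick arithmetic using the totals above:

total = 893_691
branded = 48_613
non_branded = 387_984

missing = total - (branded + non_branded)
print(missing)                    # 457,094 clicks unaccounted for
print(round(missing / total, 2))  # 0.51 -> roughly half of all clicks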

Google does provide an explanation for this discrepancy (source): “Some queries (called anonymized queries) are not included in Search Console data to protect the privacy of the user making the query. Anonymized queries are those that aren't issued by more than a few dozen users over a two-to-three month period. To protect privacy, the actual queries won't be shown in the Search performance data.”

I’m not sure we should trust an explanation stating that 50% of our traffic is basically generated through (really) long-tail queries, but that’s not my point here. The point is that when you use the API and filter based on queries, the totals won’t match. 

Will we have a similar situation using a URL path? Let’s check the numbers with a specific path: 

gsc_data = (
   webproperty
   .query
   .range('2024-01-01', '2024-01-31')
   .filter('page', 'path', 'contains')  # 'path' is a placeholder for the URL path you want to isolate
   .dimensions(['date'])
   .get()
)

  • Total clicks: 893,691 (still the same number ;) 

  • Clicks for this path: 307,587

  • Clicks without this path: 568,781

The total is now closer (876,368), but we’re still missing around 17,000 clicks (roughly 2%). 


The API is wonderful (and straightforward to use), but the data is not always accurate. Enter BigQuery. 

GSC export to BigQuery

In March 2023, Google announced a new functionality: bulk data export from Google Search Console to BigQuery. The setup is straightforward: 

  • Create a Google Cloud project 

  • Set the export in Google Search Console 

  • Data will then be added daily to your project in BigQuery.


You need to understand that this integration is not retroactive, though. While the GSC API has a 16-month data retention period, BigQuery has no such limit, but you won’t be able to get any data from before the setup. 

When you look at the data structure, you will quickly realize that there are plenty of fields not available in the API, which is great. To be fair, though, most of them won’t be useful for your project unless you are in a specific industry. 
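
If you want to check exactly which fields your export contains, here is a minimal sketch with the google-cloud-bigquery client. It assumes the default dataset and table names created by the bulk export (searchconsole and searchdata_url_impression) and a placeholder project ID; verify both in your own project.

# Inspect the fields available in the exported table.
from google.cloud import bigquery

client = bigquery.Client(project='your-gcp-project')  # placeholder project ID
table = client.get_table('your-gcp-project.searchconsole.searchdata_url_impression')

for field in table.schema:
    print(field.name, field.field_type)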


We also finally have access to the metrics for the anonymized queries, which should, in theory, remove the discrepancies we observed with the API. 


Let’s run a couple of queries to check that. As the data is stored in BigQuery, you’ll need to write SQL, use Connected Sheets, or use the GSC library I mentioned before, which has built-in functions to query your BQ dataset directly. Remember when we got the number of clicks for January 2024 (893,691 with the API)? Let’s get that figure again!

First, be aware that, unlike the API, querying your data in BigQuery has a cost. It is usually low, but if you run the following query, you’ll get an estimate of what it will cost you.

# With the library's default settings, this chain returns an estimate of the query cost rather than the data.
(
   conn
   .query
   .range('2024-01-01', '2024-01-31')
   .get()
   .group_data_by_period('M')  # aggregate daily rows into monthly totals
)

In my case, €0.0016, which is more than acceptable. Now, let’s run the actual query: 

(
   conn
   .query
   .range('2024-01-01', '2024-01-31')
   .get()
   .define_estimate_cost(value=False)  # actually run the query instead of estimating its cost
   .group_data_by_period('M')
)

The total: 923,847 clicks, 3.4% higher than what we had with the API. To be honest, I didn’t expect a difference here, but 3.4% is not enough to justify using a paid product vs. a free API. 

Let’s move on with the examples we used for the API: branded and non-branded queries. 

  • Branded: 50,189 clicks (3.2% difference vs API) 

  • Non-branded: 523,744 clicks (35% difference vs API)

  • Anonymized: 349,914 clicks (-25% difference)

  • Total: 923,847 clicks  

We can see one key difference here: while we still have anonymized queries, the percentage is lower, and we can access better data. If you rely on accurate keyword data, BigQuery is definitely the data source you need to use.
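
If you prefer to reproduce this breakdown with raw SQL rather than the library, here is a minimal sketch. It assumes the default searchconsole dataset, the property-level searchdata_site_impression table, a placeholder project ID, and 'my_brand' as a stand-in for your actual brand term.

# Branded / non-branded / anonymized click totals, straight from the export.
from google.cloud import bigquery

client = bigquery.Client(project='your-gcp-project')  # placeholder project ID

sql = """
SELECT
  CASE
    WHEN is_anonymized_query THEN 'anonymized'
    WHEN REGEXP_CONTAINS(query, 'my_brand') THEN 'branded'
    ELSE 'non_branded'
  END AS bucket,
  SUM(clicks) AS clicks
FROM `your-gcp-project.searchconsole.searchdata_site_impression`
WHERE data_date BETWEEN DATE '2024-01-01' AND DATE '2024-01-31'
GROUP BY bucket
ORDER BY clicks DESC
"""

for row in client.query(sql).result():
    print(row.bucket, row.clicks)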


Moreover, for this example, I’ve actually found that the API returned around 40,000 unique queries where the site was shown, while BigQuery showed 350,000. A huge difference that I was not expecting. 

This data exhaustiveness can impact some analyses we may want to perform. Let’s, for instance, calculate the CTR yield curve for the same set of pages using both the API and BigQuery, and see the difference. 
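
For reference, here is roughly how a CTR yield curve can be computed from query-level data with pandas. It is only a sketch: it assumes a DataFrame with clicks, impressions and average position columns, whichever source it comes from.

# Rough CTR yield curve: CTR per rounded ranking position.
import pandas as pd

def ctr_yield_curve(df: pd.DataFrame, max_position: int = 10) -> pd.Series:
    df = df.copy()
    df['position_bucket'] = df['position'].round().astype(int)  # assumes no missing positions
    df = df[df['position_bucket'] <= max_position]
    grouped = df.groupby('position_bucket')[['clicks', 'impressions']].sum()
    return grouped['clicks'] / grouped['impressions']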

CTR yield curve

As you can see, there are some significant differences between both data sources, which could impact our project prioritization based on the SEO impact we can forecast for any given action. 

This difference may surprise some of you, but it has a simple explanation: the API underreports queries and prioritizes the ones with the most clicks. Let’s look at an example. 

We recently launched a new set of pages. If I look at the data for these pages by query:

  • With the API, I only get 24 clicks and 252 impressions for 47 queries. The average CTR would be huge: around 10%. 

  • With BQ, I get 79 clicks and 44,274 impressions for 5,384 queries.

This data exhaustiveness completely changes the way we look at this information. With the API, we may think that our project is a minimal part of the market (which can make sense for a small player), while BQ tells us that the potential is 200 times higher and that we could expect to generate more traffic in the medium term if we rank better.

API vs BQ: which one to use?

We’ve seen the differences between both data sources, but here comes the million-dollar question: which one should you choose? 

My take is that even if the API has the flaws I’ve highlighted, it should still be the default data source you use. 

First, it is free and has native integrations with most visualization tools, such as Looker Studio. BigQuery is usually cheap to query, but the integration is more complex, and for big web properties, costs can rack up quickly, especially if you work with raw tables. 
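
On the cost point, note that BigQuery itself lets you estimate how much data a query will scan before running it, via a dry run. Here is a minimal sketch with the google-cloud-bigquery client, again assuming the default export dataset and a placeholder project ID; multiply the bytes scanned by your region's on-demand rate to get a price.

# Dry run: report how much data a query would scan without actually running it.
from google.cloud import bigquery

client = bigquery.Client(project='your-gcp-project')  # placeholder project ID
job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)

job = client.query(
    "SELECT query, SUM(clicks) AS clicks "
    "FROM `your-gcp-project.searchconsole.searchdata_site_impression` "
    "WHERE data_date BETWEEN DATE '2024-01-01' AND DATE '2024-01-31' "
    "GROUP BY query",
    job_config=job_config,
)

print(f"{job.total_bytes_processed / 1e9:.2f} GB would be processed")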

The data discrepancies we highlighted before could be an issue only if you’re looking for accurate absolute numbers and if you’re dealing with big web properties. You can live with the inconsistencies in most cases, to be honest. You should be aware of them, though. 

Article by

Antoine Eripret

Antoine has been working as an SEO specialist since 2016. Following three and a half years at an agency in Barcelona, he is now SEO Lead at Liligo. Antoine is most interested in the technical side of SEO and is a fan of efficient work, so that no time is lost on repetitive analyses or tasks with no added value.
