The Ultimate Guide to Finding and Fixing Index Bloat

John Caiozzo

Aug 4, 2016

What is Index Bloat?

Index bloat is one of the most common technical SEO problems that websites, especially ecommerce sites, face today.

It occurs whenever Google indexes pages that should not be indexed. Index bloat can happen to almost any website as a result of pagination issues, having secure and non-secure versions of your site indexed, or even allowing your WordPress blog categories, tags, and archives to be indexed by Google.

Ecommerce sites are the most common culprit of index bloat. Most ecommerce sites feature filter lists or widgets that allow users to quickly find products that meet their specifications. For example, Amazon has filters for “Average Customer Review” or “Lowest Price.” However, filters like these typically create new pages once the specific parameters are selected by the user. When Google visits a website it typically follows all of the links and buttons on a web page, including the filters, which can cause it to index thousands of pages that offer no unique value to Google or users.

Why is Index Bloat a Problem?

Index bloat can be a huge SEO problem for your website. For one, it’s confusing to search engines, especially when there are potentially thousands of variations of a single product category. When search engines come across a website with index bloat, they can struggle to understand which page is the most relevant to searchers and may serve up irrelevant results, which is the thing Google wants to avoid at all costs.

Index bloat also causes duplicate content problems, as these pages typically don’t have unique content or meta information. Remember, this is what Google says about duplicate content:

"Duplicate content generally refers to substantive blocks of content within or across domains that either completely match other content or are appreciably similar. Mostly, this is not deceptive in origin."

Although duplicate content isn’t grounds for Google to penalize you, it isn’t doing your site any favors. It’s far better to make your content and meta info unique, as Google prefers to show pages that offer users useful content they can’t find anywhere else. This all contributes to a better user experience.

Index bloat can also drain crawl budget and frequency, preventing Google from crawling and indexing the important pages and sections of your site. If Google focuses on the wrong pages, it could lead to a major drop in rankings, traffic, and ultimately, conversions.

How Do I Know If My Site Is Experiencing Index Bloat?

If you suspect index bloat is the culprit behind a recent loss in rankings, there’s an easy way to find out. One indication of index bloat is an excessive number of indexed pages, far more than you think Google should have indexed. If the number of indexed pages has recently fluctuated, you could be a victim.

Go to Google Search Console and click on “Index Status” under “Google Index.” You may see something like this:

In this particular example, we noticed a rapid increase in the number of indexed pages beginning at the end of April.

Finding index bloat typically isn’t this simple, however, and usually requires more investigation to confirm whether it’s really happening. Websites might not have any recent fluctuations in their index size, or they might not have a suspicious number of pages indexed. In these cases you can proceed with your investigation by conducting a site: search in Google.

Here’s an example we did for Forbes:

By using the site: operator you restrict your search purely to the website specified. In this example, you can see that there are approximately 1,300,000 pages from Forbes that have been indexed by Google. (It’s important to note that the index numbers in Google Search Console and Google.com typically don’t match up, but they’re close.)

Now that we’ve conducted a site: search, we have to go through each page of Google’s search results to find a common theme in the parameters or pages that may be causing index bloat. Sometimes you can speed up this process by skipping ahead to the last pages of your search results, as Google typically places the least relevant results there.

In this case, we found nearly a hundred pages that Google had indexed from pingdom.com where page speed tests had been saved. These pages don’t add any value to Pingdom in regards to SEO, as they have no unique title, meta information, or content (besides page load time statistics for domains). These are the kinds of pages you want to look out for in your Google Index because they needlessly increase the size of your index, drain search engine crawl resources, and confuse search engines.
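When the list of indexed URLs is long, spotting a common theme by eye gets tedious. The following is a minimal Python sketch of one way to surface patterns, assuming you have already copied the indexed URLs into a list. The sample URLs and the function name count_parameters are made up for illustration.

```python
from collections import Counter
from urllib.parse import parse_qsl, urlsplit


def count_parameters(urls):
    """Count how many URLs contain each query parameter name."""
    counts = Counter()
    for url in urls:
        query = urlsplit(url).query
        # keep_blank_values=True so a bare "?sort=" is still counted.
        names = {name for name, _ in parse_qsl(query, keep_blank_values=True)}
        counts.update(names)
    return counts


# Hypothetical sample of URLs collected from a site: search.
indexed_urls = [
    "http://www.example.com/shoes",
    "http://www.example.com/shoes?sort=price",
    "http://www.example.com/shoes?sort=rating&color=red",
    "http://www.example.com/shoes?color=blue",
    "http://www.example.com/shoes?sessionid=abc123",
]

for name, total in count_parameters(indexed_urls).most_common():
    print(f"{name}: {total} URLs")
```

Parameters that show up on a large share of indexed URLs are good candidates for a closer look, though the count alone doesn’t tell you whether those pages deserve to be indexed.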

Fixing Index Bloat

Now that we’ve identified some problem pages, we can prevent search engines from indexing these pages through several different methods, thereby reducing your site’s index bloat. It’s important to note that although sometimes just one of these methods can be used, larger websites may require a combination of them to reliably fix the problem.

Meta Robots Tag

The meta robots tag is one of the better options for quickly cutting down on index bloat, because it tells search engines explicitly which pages they are and aren’t allowed to index. Unlike canonical tags and pagination markup, which search engines treat as hints, it is an explicit directive. Keep in mind that search engines have to crawl a page to see the tag, so a page carrying it shouldn’t also be blocked in your robots.txt file. When you run across a page type that shouldn’t be indexed, simply add the following code to the <head> of the page:

<META NAME="ROBOTS" CONTENT="NOINDEX, FOLLOW">

(Note: in some cases this may need to be done programmatically.)

By specifying “NOINDEX, FOLLOW” you are telling search engines that they shouldn’t index the page, but they are free to follow any links on that page. This ensures that search engines can still access the rest of your site without indexing the page itself.
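To confirm that a page template really emits the tag, you could check its HTML with a short script. This is only a sketch using Python’s standard html.parser module. The class name RobotsMetaParser and the helper is_noindex are hypothetical, and the sketch looks only at the generic “robots” meta tag, not at crawler-specific tags or HTTP headers.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives from every <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "meta":
            return
        attributes = {key: (value or "") for key, value in attrs}
        if attributes.get("name", "").lower() == "robots":
            for directive in attributes.get("content", "").split(","):
                self.directives.add(directive.strip().lower())


def is_noindex(html):
    """Return True if the HTML contains a robots meta tag with noindex."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives


page = '<html><head><META NAME="ROBOTS" CONTENT="NOINDEX, FOLLOW"></head></html>'
print(is_noindex(page))
```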

Robots.txt File

Your robots.txt file can be used to tell search engines and other robots which areas (or parameters) of your website they are and aren’t allowed to crawl.

Parameters and URLs can be blocked using the “Disallow” directive. However, it’s important to note that when you block Google using your robots.txt file, it’s still possible for the blocked pages to be indexed.

We know what you’re thinking: “Wait, what? I thought that by using ‘disallow’ Google was blocked from those pages!”

That’s almost true. What a robots.txt file really does is prevent Google from crawling the page, but it’s still possible for the page to be indexed, especially if it is linked to from another web page that isn’t blocked by your robots.txt file. If you know where these links are, you can make them “nofollow” to reduce the chance that Google discovers and indexes the pages.
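Before deploying new Disallow rules, it can help to test them against a few URLs. Here is a rough sketch using Python’s standard urllib.robotparser module. The rules and paths are placeholders, and the standard-library parser does simple prefix matching, so it may not behave exactly like Google’s own parser (for wildcard patterns, for instance).

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules; the paths are placeholders.
ROBOTS_TXT = """\
User-agent: *
Disallow: /search
Disallow: /checkout/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def is_crawlable(url, user_agent="Googlebot"):
    """Return True if the rules above allow user_agent to crawl url."""
    return parser.can_fetch(user_agent, url)


for url in (
    "http://www.example.com/shoes",
    "http://www.example.com/search?q=red+shoes",
    "http://www.example.com/checkout/cart",
):
    print(url, "crawlable" if is_crawlable(url) else "blocked")
```

A “blocked” result only means the URL won’t be crawled. As described above, it says nothing about whether the URL can still end up in the index.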

You may find the URL Removal Tool in Google Search Console useful for removing these pages from Google’s index once the appropriate measures have been taken to ensure they won’t be reindexed.

Redirects

Some of your index bloat may be caused by old web pages that no longer exist on your site. These may resolve as 404 errors. Google will eventually drop these pages from its index, but who knows how long that could take? You can expedite the process and give Google an extra nudge to remove these old pages from its index by 301 redirecting them to the most relevant page. This will also minimize the amount of link juice you lose from these pages.
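How you set up 301 redirects depends on your server or CMS, but the underlying idea is a lookup from each retired URL to its closest live replacement. A minimal, framework-agnostic sketch in Python follows. The paths and the function name redirect_for are hypothetical.

```python
# Hypothetical mapping of retired paths to their most relevant live pages.
REDIRECTS = {
    "/old-summer-sale": "/sale",
    "/products/discontinued-widget": "/products/widgets",
}


def redirect_for(path):
    """Return (status, headers) for a retired path, or None otherwise.

    None means the request should be handled normally by the site.
    """
    target = REDIRECTS.get(path)
    if target is None:
        return None
    # 301 tells search engines the move is permanent.
    return 301, {"Location": target}


print(redirect_for("/old-summer-sale"))
print(redirect_for("/sale"))
```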

Canonicalization

The canonical tag is used to tell search engines which version of a page is the preferred URL to index. It’s especially useful when you have multiple URLs for the same content. Adding a canonical tag to the <head> of a page indicates which version of the page the search engines should index. Just make sure that all versions of the page, including the preferred page itself, point to that same preferred canonical URL.
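When canonical tags have to be generated programmatically, one approach is to strip the parameters that don’t change which page is the preferred version and emit the tag from what remains. The sketch below assumes a hand-maintained list of such parameters. The names in IGNORED_PARAMETERS are examples only, and which parameters belong there depends entirely on your site.

```python
from html import escape
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: these parameters never change which page is the preferred version.
IGNORED_PARAMETERS = {"utm_source", "utm_medium", "utm_campaign", "sessionid", "sort"}


def canonical_url(url):
    """Return url without ignored parameters and without a fragment."""
    parts = urlsplit(url)
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name.lower() not in IGNORED_PARAMETERS
    ]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


def canonical_tag(url):
    """Build the <link rel="canonical"> tag for any variant of a page."""
    return f'<link rel="canonical" href="{escape(canonical_url(url), quote=True)}" />'


print(canonical_tag("http://www.example.com/shoes?sort=price&utm_source=news"))
```

Because every variant of the page passes through the same function, they all end up pointing at the same preferred URL.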

Pagination

Pagination typically occurs when you have more than one page of product categories, blog posts, or search page results. Because these pages have the same meta information, you have to let search engines know the relationship between the pages so they aren’t identified as duplicate content.

Adding pagination markup will also reduce the number of these pages being indexed because search engines will better understand the relationship between pages and will know which ones should be indexed or not.

Adding pagination markup to the <head> of these pages is pretty simple. For example, if you have a page such as http://www.example.com/blog?category=seo&page=2 then you would add the following tags:

<link rel="prev" href="http://www.example.com/blog?category=seo&page=1" />
<link rel="next" href="http://www.example.com/blog?category=seo&page=3" />
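On a site with many paginated series, these tags are usually generated by the template rather than written by hand. A small Python sketch of that logic is shown here. The function name pagination_tags and its parameters are hypothetical, and the sketch assumes the page number is carried in a “page” query parameter as in the example above. It escapes “&” as “&amp;” inside the href attribute, which browsers and crawlers read back as a plain “&”.

```python
from html import escape
from urllib.parse import urlencode


def pagination_tags(base_url, params, page, last_page):
    """Return the rel="prev"/rel="next" tags for one page of a series."""

    def page_url(number):
        query = urlencode({**params, "page": number})
        return escape(f"{base_url}?{query}", quote=True)

    tags = []
    if page > 1:  # The first page has no previous page.
        tags.append(f'<link rel="prev" href="{page_url(page - 1)}" />')
    if page < last_page:  # The last page has no next page.
        tags.append(f'<link rel="next" href="{page_url(page + 1)}" />')
    return tags


for tag in pagination_tags("http://www.example.com/blog", {"category": "seo"}, 2, 5):
    print(tag)
```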

URL Parameter Tool

The URL Parameter Tool within Google Search Console can be used to tell Google what your URL parameters do to the content of your pages. This tool only affects Google’s search results, so it should only really be used when the previous methods have failed or are not viable options. As with many of the methods listed in this article, you must be very careful not to accidentally exclude URLs that should be indexed or specify incorrect behaviors for parameters, as this can negatively impact your SEO efforts.

Within the URL Parameter Tool, Google classifies your parameters into two general categories: active parameters and passive parameters. As you probably guessed, active parameters change what is displayed on a page, whereas passive parameters have no impact on the content displayed on the page (UTM source, session IDs, etc.).

Several actions can be associated with a particular active parameter, such as Paginating, Translating, Sorting, Narrowing, and Specifying. You can also specify which URLs and parameter values are targeted. If you are not already familiar with the tool, it’s strongly recommended that you read Google’s documentation so you thoroughly understand what each action does.

URL Removal Tool

Google’s index can be rather stubborn sometimes. Even after trying some of the methods above, you can still find pages in Google’s index that just shouldn’t be there. This happens most often when a page is blocked using robots.txt and Google indexes it anyway because it is linked to from another page on your site. Adding a nofollow attribute to that link can help prevent this from happening, but even then you may find that the pages aren’t removed from Google’s SERPs. Frustrating, no?

In situations like this, you can always fall back on the URL Removal Tool in Google Search Console. Using this tool lets you request that Google remove specific URLs from its index. Requests are typically processed the same day they are submitted, so this can be a quick way to knock out any remaining URLs that shouldn’t have been indexed if all other methods have failed.

It’s important to note that this is a temporary measure; if you haven’t done anything to prevent these pages from being indexed again, they will return to Google’s index the next time Google crawls your website.

Fixing Index Bloat Recap

You now have the tools and knowledge to not only find but solve the issue of index bloat. Now you should take a look at your own website and see if it’s experiencing the symptoms. Once you’ve identified the problem, use some or all of the following methods to fix it:

Meta robots tag
Robots.txt file
301 redirects
Canonicalization
Pagination
URL Parameter Tool
URL removal tool

Whether you use one of these methods or a combination of them, you should be able to present your site to Google in a way that satisfies its requirements and earns you the rankings you deserve.

Note: The opinions expressed in this article are the views of the author, and not necessarily the views of Caphyon, its staff, or its partners.

Article by

John Caiozzo

John Caiozzo is an SEO Analyst at SEO Inc., one of the top Search Engine Optimization companies in the world since 1997. John specializes in creating advanced technical SEO solutions and strategies to drive more traffic and conversions to client websites.