Get Your Blog Out of the Google Supplemental Index
Want to increase your search engine traffic by 20% or more? Get out of the supplemental index.
Back in April Chris Garrett wrote a great article on the supplemental index problems he was having on his WordPress blog. He got me curious, and that's when I found out that the vast majority of my pages were sitting in the supplemental index. Today I have 170 pages cached and only 38 in the supplemental index. I've made another revision of my robots.txt file, so hopefully that number will drop the next time GoogleBot indexes the site.
What is the Supplemental Index?
Lots of SEO masters believe that content that isn't worthy ends up in the supplemental index. While this is certainly true, if you're running a WordPress blog it is more likely that you're simply dealing with duplicate content issues. If you make a post today on a default WordPress setup, there are about 5 different URLs you could type in that would give you the exact same content. You can generally get to the same content via the Category, Calendar, Author, Monthly, and Page archives. Unless you know exactly what you're doing, a large share of your site is probably sitting in the supplemental index.
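To make the duplicate-content problem concrete, here is a minimal Python sketch that lists the URLs on which one post's full text could show up. The URL shapes (/category/..., /author/..., /2007/05/, /page/2/) and every slug in it are assumptions based on common WordPress permalink settings, not something read from this site's configuration.

```python
from datetime import date

def duplicate_urls(post_slug, category, author, published, page=2):
    """Return URLs where a post's full text may appear (hypothetical layout)."""
    return [
        f"/{category}/{post_slug}/",   # the permalink you want indexed
        f"/category/{category}/",      # category archive
        f"/author/{author}/",          # author archive
        f"/{published:%Y/%m}/",        # monthly archive
        f"/{published:%Y/%m/%d}/",     # calendar (daily) archive
        f"/page/{page}/",              # paged home-page archive
    ]

# Hypothetical post used only for illustration.
urls = duplicate_urls("my-first-pipe", "pipe-smoking", "nathan", date(2007, 5, 14))
for url in urls:
    print(url)
```

Only the first URL is the one you want Google to treat as the primary copy; the rest are candidates for excerpts or a robots.txt Disallow.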
How do I know if my site is in the Supplemental Index?
There are two really easy ways to figure this out. Chris's method works pretty well, but you have to remember or bookmark this Google query (site:www.yoursite.com *** -view). My new favorite way is to use Aaron Wall's SEO toolbar. It displays a load of great information about any site you visit, including Alexa rank and PageRank.
How to get out of the Supplemental Index
1. Verify your categories don't display an entire post. When you click the Pipe Smoking Category on this site it displays an excerpt of each of the posts in that category. If your blog is set up to display the entire post you're going to have major issues. To fix this you can modify archive.php. An additional benefit of excerpts is being able to write keyword-rich content that will be displayed on the Category pages.
2. Upload a good robots.txt file to the root. You're more than welcome to use mine as a starting point. It essentially blocks Google from caching all of the different URLs that would give it the same content, like the calendar or page archives. I let Google cache my content by crawling through the category module. I decided this after watching GrayWolf's great video on How to Make Wordpress SE Friendly.
Now just sit back and wait for GoogleBot to clean up the index. It took about 2-3 weeks before I saw the first drop of 130 pages from the supplemental index. I tweaked the robots.txt file to disallow /author/ URLs from being indexed. That should clean it up the rest of the way.
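Before uploading a robots.txt file, it can help to check which URLs it would actually block. The sketch below uses Python's standard urllib.robotparser for that. The rules and sample paths are hypothetical placeholders rather than the contents of my own file, and as far as I know this parser only does plain prefix matching, so it will not understand the * and $ wildcards that Googlebot accepts.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical draft rules; swap in the lines from your own file.
DRAFT_ROBOTS_TXT = """\
User-agent: *
Disallow: /author/
Disallow: /page/
Disallow: /feed/
Disallow: /trackback/
"""

def blocked_paths(robots_txt, paths, agent="Googlebot"):
    """Return the subset of paths the given agent may not fetch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [p for p in paths if not parser.can_fetch(agent, p)]

sample_paths = [
    "/pipe-smoking/my-first-pipe/",  # permalink: should stay crawlable
    "/category/pipe-smoking/",       # category archive: should stay crawlable
    "/author/nathan/",               # author archive: should be blocked
    "/page/2/",                      # paged archive: should be blocked
]
print(blocked_paths(DRAFT_ROBOTS_TXT, sample_paths))
```

If a permalink or category URL shows up in the printed list, the draft is blocking content you want indexed and needs another look before it goes live.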
Why should you care?
I saw a 20% increase in SE traffic after my pages were removed from the supplemental index.
One last point of interest. I noticed that John Chow has ~1700 pages in the supplemental index. When I asked him about it he said that he didn't care. I checked ProBlogger and he only has 6 pages in the supplemental index. I wonder who gets more SE traffic?

Shoemoney only increased his search engine traffic by 1400%


what extension is that, don’t think I’m familiar with it
That is Aaron Wall’s SEO Toolbar ( http://tools.seobook.com/firefox/seo-for-firefox.html )
hmm and how did you activate that dialog popup?
Go to the site you’re interested in.
Right Click > Select “SEO for Firefox” > “Lookup this Page”
It took me a while to figure it out too
not sure how I missed that, thanks.
Cool plugin, thanks for sharing, It’s quite useful for market research!
Wow. What a great post. Gonna check this out today! Thanks for the valuable info!
Doing this tonight. Thanks for the heads-up.
Holy crap, I have 68,400 supplemental!!! That is almost as much as I have Cached! (70,300)
You better make sure your server can handle all the extra love the googlebot is going to throw your way when those come out of the supplemental index. Glad this helps you guys out.
OK looks like the supplementals are all from my forum.. Is there any way to get Google to take notice of the forum? I noticed that forums are getting lower and lower on Google’s priority
It would be easy to get Google to stop indexing the forum but I don’t think you want to do that. You probably need to figure out how many different URLs the forum makes to a single post, then block the extras so they aren’t supplemental anymore. I haven’t dealt with removing forums from the supplemental index but I’m sure you can figure it out with a little research. Would love to hear an update on what you find.
I am going to try
Allow: /forum/
Disallow: /forum/archive/
Wow, I know this post from John Chow page
thanks for your info, I’ll try it out
Just looking over my site I see that there are 2380 cached and 2380 supplemental… hmmm. I believe there are around 1100 - 1200 posts so I guess 2 lots of each post are cached in some shape or form. Time to do some searching to find dupes I think and remove them.
Would you suggest using the sitemaps folder removal tool or just leaving google to work on it for a few weeks? I tend to have google visit a few times a day so maybe it could be a little quicker.
I think I found the pages causing the issue. I am thinking that the following might sort out my problems.
Disallow: */feed/
Disallow: */trackback/
I just let google do it naturally. If you’re feeling froggy you might use the sitemap to remove them but I’ve read a couple things that said you can totally remove your site from the index going this route.
/feed/ and /trackback/ both need to be Disallowed.
But there are also about 5 different WordPress modules that duplicate content. You need to make sure you take care of those as well.
Absolutely brilliant, Nathan!
Hmm, I have my permalinks set up as “/year/month/post-title/”. How would I go about disallowing the archive while allowing the individual posts? Would it be something like this?
Allow: /2007/01/*
Allow: /2007/02/*
# Allow etc.
Disallow: /2007/01/
Disallow: /2007/02/
# Disallow etc.
Or would I need the disallows first? Sorry for asking, I haven’t tinkered with my robots.txt file much at all, really. Thanks for the help and the great post!
Better yet, I suppose I could just do something like this, couldn’t I?
Disallow: /2007/01/$
Disallow: /2007/02/$
# etc.
Matt,
Did you just bite the dust and restructure your permalinks? Because I am looking at your site now and you are using the /post-name/ structure now. Or, do you have a second site, like me, where you made the mistake of doing the /year/month/day/post-name/ structure.
I am looking at your robots.txt file and it looks to be very descriptive. Can you comment back with a further explanation to the creation of your file so that others can learn more from you?
Also I think you are right,
Disallow: /year/month/day/$
does the trick… but unfortunately you will have to invest a ton of time into creating a full file that disallows all instances of all months within each year, and all days within each month… after it’s all said and done you are looking at 365 entries alone just to cover the days, 12 more line entries for the months of each year, times however many years you have had your blog up and running… additionally, you have committed yourself to having to constantly keep this file up to date.
I wonder, if you have an established blog using the /year/month/day format, as I have, if it would be more beneficial to bite the bullet and change the permalink structure to /post-name/ or the other common /category/post-name/
The end result will create a ton of 404’s for all the pages that are indexed… however, if supplemental indexing is a concern anyway… what’s the difference???
Ultimately, though… changing the permalink structure on a popular site… is basically wiping out the entire collection of pages, cats, and posts that are indexed in Google… so the question to ask is, “Is it worth wiping out hundreds of pages and generating a ton of 404’s when these pages suck anyway being in the supplemental index?” versus… investing a ton of time and trouble trying to invent a special robots.txt file that creatively allows for category and individual post page indexing… while disallowing a nightmare of URL structures based on year, month and day….
All in all… the ultimate lesson here… is when someone launches a WordPress blog, do modify your permalink structure! If you don’t then you are faced with a nightmare of a problem when you eventually do educate yourself on SEO issues. LOL!!
That is the boat I am in now… I have learned from my mistakes on the blog in question and no longer use this format, however, I am still faced with trying to combat these issues on my original blog. Yuck!!
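For anyone stuck with the /year/month/day/ format, the long list of dated Disallow lines described above doesn't have to be typed by hand. Here is a rough Python sketch that generates the lines for one year. It assumes the crawler honors the trailing $ anchor (Googlebot does, other bots may not), and the function name is just something made up for this example.

```python
import calendar

def dated_archive_rules(year):
    """Build Disallow lines for one year's year/month/day archive URLs."""
    rules = [f"Disallow: /{year}/$"]
    for month in range(1, 13):
        rules.append(f"Disallow: /{year}/{month:02d}/$")
        # monthrange returns (weekday of the 1st, number of days in the month)
        days_in_month = calendar.monthrange(year, month)[1]
        for day in range(1, days_in_month + 1):
            rules.append(f"Disallow: /{year}/{month:02d}/{day:02d}/$")
    return rules

rules_2007 = dated_archive_rules(2007)
print(len(rules_2007))            # one year line + 12 month lines + 365 day lines
print("\n".join(rules_2007[:4]))
```

Running it once per year the blog has existed and pasting the output under the right User-agent line keeps the upkeep down to one run each January.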
permalinks really don’t factor into this problem. The permalinks are what you want google to make your primary content. Right now those links are being added to the supplemental index because it finds a dated archive with exactly the same content as the permalink url.
You just need to run a robots.txt file very similar to mine that allows google to crawl the categories to index your content. All other wordpress modules need to be disallowed. Also you need to make your content in the categories show up as excerpts. modify your archive.php file to accomplish this.
Ehm…I am working for a really big website and I discovered that it has 418.000 supplemental results O_O…
Congrats for writing content worthy of JohnChow.com. You hit a big one.
Hi Nathan,
I can’t run a robots.txt that’s very similar to yours because my permalinks are formatted as “/year/month/post-name/”, which won’t be accessible due to the “Disallow: /2007/0*” and “Disallow: /2007/1*” lines.
Your robots.txt file works for you, because your permalinks follow the format of “/category-name/post-name/”, meaning that your individual posts will be accessible despite these lines.
As such, I’m probably better off changing the archives to be an excerpt while allowing access to the date archives unlike your setup.
260,000 pages in supplemental here (a non-blog e-commerce site). Man, i’ll be busy for the next few weeks at least!
i was wondering why our traffic has been going progressively downward - despite us adding a ton of new content every week. we had about 10,000 product pages in total this time last year - and now we have about 150,000 distinct product pages. all good , non-spammy content.
and yet , the google traffic has barely budged, and has actually started to decline. i’ve been scratching my head over this for weeks now - this supplemental stuff certainly looks like the source of the problem.
big Kudos to you sir for this one.
(and as a side note : if this results in a cleaner google index, then we ALL benefit… )
Hey,
thanks for the great info. I have exactly the same issue with duplicate content as one of my sites is built around articles (both free and PLR). So when I started out I got indexed quite quickly and received a lot of traffic, but after a while others set up similar sites with more or less the same content as mine. Now most of my stuff is in supplemental.
Thanks again for your advice on how to get me out of the supplementals and back into real life.
Tom
Why disallow Googlebot-Image?
1 - I checked Problogger and I doubleclicked on the supplemental line in SEO For FF plugin: Google lists 3770 supplemental pages!
2 - I checked Problogger robots.txt file: it’s almost empty (User-agent: *
Disallow:)
3 - I checked Problogger category display and it shows entire posts!
Is it me or Darren doesn’t care about Supplemental pages???
Well, I implemented some changes about 3 - 4 hours ago (when I last commented here) and I have already seen almost 400 pages come out of the supplemental index.
It’s now 2420 cached and 2050 supplemental. I will have to check traffic status over the next few days.
I have a ton of supplemental that at one time were duplicate content. These are webpages not blog pages. I have had 90% of them rewritten to be unique content, but am still in supplemental hell. How do I get out? It has been over ONE YEAR now
Great post. I never knew I had so many supp pages. I always thought that if you wrote quality content and stayed away from scraping/copying, you were okay.
As G.I. Joe would say “Knowing’s half the battle” Now I know
On a nutter matter, I think I found a misspeled word in your robots. txt file. It’s the word individual. It’s spelled indididual in the txt file. I’m not the spelling Nazi, trust me on this (or read my blog), I just thought the misspelling might be affecting how the bot indexed your site.
Wow!
A great post…thanks a lot!!!
I had all but 5 pages listed in the supp. Thanks a ton!
Great post Nathan! You’ve been Dugg for such a fine effort.
I am using your robots.txt. better not screw me over!
Great read, thanks
Hey Nathan,
Good post and thanks for the robots.txt
I don’t seem to have /author/ on my default WP, is it standard?
Also doesn’t disallowing /page/ mean your older posts are no longer accessible or are there other pathways? (assuming you’ve blocked all the others you have in your robots.txt.)
For 1 website of mine I went down from 929 to 872 within a few hours and today I went up again to 1070
Even with the new robots.txt
Ok this tip is proving very useful as my blog’s supplemental links have dropped from 85 to 62 after I copied your robots.txt, but there’s one thing I’m confused about
I want Google Image bot to crawl and index my images. In your robots.txt haven’t you disallowed Google image bot? Or should it remain as it is to allow Google to index my images?
Fabulous post, Nathan, and a true service to Web society.
Tell me one thing: what modification has to be done to the archive.php file?
Regards,
Lucky
I’ll be posting a follow-up to this post sometime this week. I’ll discuss in detail the changes needed to handle the post excerpts.
Nathan, great post and thanks a million for it!
According to searchtools.com, “robots read from top to bottom and stop when they reach something that applies to them. ”
This means that in your robots.txt file, Googlebot will stop after it processes the “User-agent: *” bunch of directives that occur before the “User-agent: Googlebot” directives.
Perhaps you should put all the directives under the first category.
Cheers
Lucky
Great article Nathan! I dugg it and gave it a stumble. Now I just need to follow what you said to do. Thanks so much!!
Thanks Julie. Much appreciated. If you have any questions implementing anything here please let me know.
Hi, you have disallowed googlebot from indexing pages with .php$, .js$… extensions. why only googlebot and not other SEs like yahoo as well?
A very useful post - I have around 2000 supplemental links and that’s with a robots.txt file - looks like it needs some more work - one thing that needs removing is the vbulletin forums archive links - there’s loads of them, but then that’s been going a lot longer than my blog.
Thanks for highlighting this issue, should hopefully help my site! Cheers!
I have a site that I set up about 2 years ago and being able to change my permalink structure is not an option, otherwise I am looking at losing all my links in Google as well as creating a ton of 404 pages.
My problem is that I left the permalink structure the default way:
Now, I have learned from this mistake through the past year and I no longer set up my permalinks this way. Typically I set up blogs using this permalink structure:
Now, using a robots.txt file on the new way I have been doing the permalink structure is easy. However, I can’t figure out how to write a robots.txt file when part of the URL to the individual post is also part of the directory structure I want to disallow.
As it stands, the object here is to prevent googlebots from accessing any section of your site other than your home page, category section pages and individual post pages.
However, in my situation I need to figure out how to write a robots.txt file that disallows the bots from viewing my:
year/ directory
year/month/ directory
year/month/day/ directory
but not disallow my actual post page,
year/month/day/post-page/
How do I go about doing that?
The website in question can be found at www dot blog the internet dot com
Best Regards,
Garry Conn
Second question…
I notice that Problogger.net has somewhat the same permalink structure as my site in question does.
Darren’s solution can be found here:
http://problogger.net/robots.txt
on his robots.txt file we have the following:
Seems pretty simple… nothing to it… so what does this file tell the bots? Does this command tell bots to disallow the entire site? Surely not, because I don’t notice the / trailing slash after the disallow…
So, what is his file all about, and are we overcomplicating things… because keep in mind, of all the sites that I have seen, his site is not only the most popular but also has the least amount of pages in the supplemental index.
Ok. I think I have answered my own question here. However, I still would like info on Darren Rowse’s robots.txt file.
In the mean time, this is what I discovered in Google Webmaster Tools.
If you have a blog that has traditional permalink structures such as this format:
/year/month/day/post-name/
The challenge of writing a robots.txt file that will allow bots to crawl only the category pages and the individual post pages can be difficult. So far the best way I have figured out how to do this is by doing the following:
disallow: /2007/$
disallow: /2007/01/$
disallow: /2007/01/01/$
which basically, from what I am guessing, tells the robot that you can not index the /2007/ directory but you CAN index sub directories within… furthermore, the next line I wrote tells the robot that you can not index this sub directory but you CAN index the sub-sub directory, then from there my next line says that you can NOT index this directory but you can index sub directories within.
In other words, these lines should tell robots:
You can NOT index the /year/ directory but you can index the /year/month/ directory… however, when you get there, you can’t index the /year/month/day directory… however, when you get there, you can index all the directories within… thus, being your individual posts.
Am I right or wrong about this assumption? I tested it in the robots.txt analysis wizard in Google Webmaster Tools and it seems to check out.
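One way to sanity-check that guess offline is a small matcher that mimics the two pattern extensions Googlebot documents: * for any run of characters and a trailing $ for "the URL ends here". The Python sketch below is my own approximation of that behavior, not Google's code, and it ignores Allow lines and rule precedence entirely. The function names and the post slug are made up for the example.

```python
import re

def rule_to_regex(pattern):
    """Translate a Googlebot-style Disallow pattern into a compiled regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape everything literally, then let each * match any characters.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))

def is_blocked(path, disallow_patterns):
    """True if any Disallow pattern matches from the start of the path."""
    return any(rule_to_regex(p).match(path) for p in disallow_patterns)

rules = ["/2007/$", "/2007/01/$", "/2007/01/01/$"]
for path in ["/2007/", "/2007/01/", "/2007/01/01/",
             "/2007/01/01/my-post/", "/2007/02/"]:
    print(path, "blocked" if is_blocked(path, rules) else "allowed")
```

Under this reading, each $ line blocks exactly one archive URL and nothing beneath it, so the post URL stays crawlable, but so does any dated archive that has no line of its own (like /2007/02/ here).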
Looks like you’ve been busy. I’m pretty sure the way you have the Disallows set up in this last reply will work, however you may just want to disallow /category/ and then allow the dates. I guess if you do that you have to leave the dated archives somewhere on the page so googlebot can crawl it, so this may not be desired.
I would suggest you try either way and in a day or two check out in the webmaster tools how Google interpreted your robots.txt file. Remember Google only updates the cache every 30 days so you can test out the robots file without actually removing it from the index.
Good luck and keep me posted on your progress.
Thanks for sharing, I also still on war with supplemental pages. lol
Anybody know how to get your category pages to only show excerpts? Is there a plugin?
http://www.notsoboringlife.com/blogging/displaying-post-excerpts/
As a newbie to blogging/css etc. I’m learning. I’ve read and re-read this page about a dozen times. I’ve changed my permalink structure, done a few disallows and now I wait. Since I have no idea what I’m really doing I pray I don’t crash the whole thing.
…Thanks for a great article. As I learn I’ll be coming back to reference this and fix any mistakes.
ok now I’ve got a question as I’ve been playing with this all day…when I do a search for a specific string, to see how I come up in google, my feed comes up, but not the actual page.
(here’s an example of a search string: “Will this action move me toward my life goal or distract me?”)
so, if I disallow the feed, will the actual page come up in the future? Isn’t the feed better than nothing? thanks for any help!
Yes, if you disallow Google from caching the /feed/ and /trackback/ URLs, neither of these will appear when you search for the string. So when the cache is updated you’ll see your actual post showing up. Hope this answers your question.
I checked your site and I would recommend changing your /category/ pages from displaying the entire post to just displaying an excerpt.
Just go to archive.php, find the line that calls the_content, and change it to the_excerpt.
Nathan, thanks for the help. I did go change to excerpt, but I don’t like the way it looks… it took out photos, and it is hard to see where to click to get the whole post. (I have readers who are not savvy with the internet)….sigh. now I have to go learn some css. More ideas?
A follow up from my comment above: I found a bit of code online and now actually like the way it looks … now to monitor my cached files. Thanks Nathan!
Excerpts are the way to go because WordPress has several different URLs with the same content. I think you probably found a plugin that allows HTML in the excerpts, which will make it look a little better. I’d be interested to know which plugin that is.
I’ve gone the wrong direction! I’ve got more supplemental cached than before now.. help!
No worries! Don’t change anything yet.
Let it sit for about a month. GoogleBot just did a major update last week, so I think you were changing too much stuff to see a major change. All your pages that are currently in the supplemental index will be removed on the next major update.
Patience.
If you noticed I made some major changes to my robots.txt file and it ended up putting my entire site back into the supplemental index. I’ve made some more changes and will wait another month before I update it again.
Just as I think I’ve got all this SEO cracked, I come across this post. This SEO stuff is never ending to learn.
Great post keep up the good work.
My Tech Blog http://www.britec.co.uk/techblog/
As of today your supplemental is actually higher than cached… any new insights? I’m battling this as well on one site.
I installed the tool and I can tell you the tool is not accurate.
According to the tool I have two and one of them is my main page. lol I noticed every site I entered reported the main website as supplemental.
I tried another one of my sites and it reported every page as supplemental, which is full of bull.
Hi, how do i modify the archive.php ?
how do i know which pages are in the supplemental index and which pages are in the main index, now that google is not showing supplemental results?
kindly guide
I’ve installed the tool today, but there is no ‘Supplemental’ category - has the version of the tool changed? Did they remove ‘Supplemental’?
THX, Mariusz
how can i edit my robots.txt in blogger.
please tell me at http://www.indianglitz.blogspot.com
http://www.wwwportal.blogspot.com
Nice information that you have provided. Does anybody have instructions, or can anyone point me to a URL, that explain how to modify the archive.php?
I have just started my own blog and I find this info is really useful.
My hat off to you!
I got all information with the help of this tool of seo Firefox except information of Supplemental pages. Please can u tell me how can i get the information about supplemental pages of the site ?
Thanks that’s some useful info for me to take on board and try to improve my blog’s ranking. I’m off to do some work - will report back in 3 weeks to compare my results with yours! Cheers
This is really interesting. I seem to have two robots.txt files, which is causing me grief. Any suggestions?
Hey, very interesting post, opened up my eyes sort of
There’s one thing that bugs me though: I’ve installed the Firefox SEO plugin and it did show the supplemental index the first time on this page, but now the line “supplemental” vanished from the menu. Doesn’t matter what page I check, I can’t make it reappear. It’s also not (or no longer) in the addon settings menu.
Thank you for this very interesting read. I have been struggling with this exact same problem myself. I’ll have to take your suggestions in mind and see what results come of my site, although my site PC Bytes - Network Consulting is not a blog, it is also suffering from the same problem of having too many pages sitting in supplemental index. I will admit my SEO could be improved a bit. Well thanks for the help!
Hmm
I’m getting the same result as you guys no supplemental is shown in the toolbar, either they’ve moved it or google has dropped the supplemental gameplan altogether.
awesome article. thnx alot.
Informative article. Really helped me a lot
Thanks!
Nice..Thanks for the info..I’ve change the robots.txt hope this really works..
thank a lot..
Thanks for the info.
Am I correct in thinking that the All-in-one SEO pack plugin has fixed this issue without needing robots.txt?
I was just reading John Chow’s ebook and directed me to read your blog post about Supplemental Index. Thanks. I will surely apply this on my blog.
I too have been directed to your page from John Chow’s Blog. I am looking at the information in the SEO for Firefox Site Information and I do not see a “supplemental” listing for any of the sites I type in. Has this feature been removed or does this have something to do with my IP being blocked for using the plug in too much? Is this still relevant?
Hi! I was surfing and found your blog post… nice! I love your blog.
Cheers! Sandra. R.
I love your site.
Love design!!! I just came across your blog and wanted to say that I’ve really enjoyed browsing your blog posts. Sign: ndsam
thats pretty interesting - I have never heard of the supplemental index before
Excellent info provided in the post and tell user how to bring back your site in google
A few years back Google showed which pages were supplemental, but now it no longer shows this.
Good stuff, Keep continue.
So helpful! I’ll try it!
Hi
Great post, thanks. I gather I discovered it some time after you wrote it, as I followed along to install the plugin etc and it seems the software must have been updated? There is no Supplemental in the list at all!!
can you shed any light on this - or know of another way to check pages in supplemental
many thanks
Steve