28 Sep
SEO Article: Avoiding Duplicate Content and the canonical link.
Duplicate Content and Polluted URLs
It stands to reason that search engines do not want multiple entries in their results for the same content. After all, they don’t want people deploying huge numbers of domain names all pointing to the same content in an effort to ensure visibility in search results. In fact, URLs identified doing this will often be punished by removal from the results.
So the need to avoid this is obvious and the appropriate remedy is clear. However, the accidental form of the problem, where several URLs on one site point to the same content, is much subtler and well worth considering, given the penalties for being ‘caught’.
Googlebot will index http://www.mySite.co.uk and http://mySite.co.uk as different websites.
This also extends to http://www.mySite.co.uk/index.html. Most web authors probably want all these variations to mean the same thing. However, Googlebot sees them as separate, and may well flag your website for punishment as you appear to be providing duplicate content, i.e. several URLs for one page of content.
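To make the point concrete, here is a minimal Python sketch that folds the three variants above into one preferred form. The choice of the www host as the preferred one, the constant names and the `preferred_url` helper are all assumptions made for illustration. Host names are case-insensitive, so the sketch lowercases them, and it deliberately ignores ports.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumptions for illustration: the www form is the preferred host,
# and index.html is the file served for the home page.
PREFERRED_HOST = "www.mysite.co.uk"
BARE_HOST = "mysite.co.uk"
INDEX_NAMES = {"index.html"}


def preferred_url(url):
    """Map the common variants of a URL onto one preferred form."""
    parts = urlsplit(url)
    # .hostname is already lowercased and drops any port.
    host = parts.hostname or ""
    if host == BARE_HOST:
        host = PREFERRED_HOST
    # An empty path and "/" mean the same page.
    path = parts.path or "/"
    head, _, tail = path.rpartition("/")
    if tail in INDEX_NAMES:
        path = head + "/"
    return urlunsplit((parts.scheme, host, path, parts.query, parts.fragment))


if __name__ == "__main__":
    for variant in (
        "http://www.mySite.co.uk",
        "http://mySite.co.uk",
        "http://www.mySite.co.uk/index.html",
    ):
        print(variant, "->", preferred_url(variant))
```

A helper like this is only useful for auditing your own links and sitemaps. It does not change how a search engine sees the site.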
One way to avoid this issue is to use host redirects, i.e. a 301 for permanent changes. This can be problematic with hosts who do not offer this option, or if your domain name is hosted by one organisation and the web files are hosted elsewhere. It can also be expensive, as some hosts charge for this ‘domain mapping’ service.
The best policy is to avoid the ambiguity from the start – choose your preferred URL and stick to it. Once decided upon, it should be used for everything including sitemaps, directory submissions, internal website linking etc. Of course, you cannot be responsible for how other webmasters choose to link to your site, but you do now have additional options. The new option is to use the canonical link tag.
Canonical
<link rel="canonical" href="http://www.mySite.co.uk/" />
The canonical tag permits bots to unambiguously link a URL to content. Google treats this tag as a hint rather than a directive when indexing. Bing does too, but apparently the two search engines differ as to when each decides it is appropriate to honour it. My current impression is that Yahoo doesn’t take any notice of it at the time of writing (2010).
The usefulness of this tag extends beyond the need to clarify a URL. It permits web authors to automagically clean up links that get harvested by search engines. Essentially it permits you to designate the intended page, stripped of session IDs, query parameters etc. For example, if you had the following in your web page’s header:
<link rel="canonical" href="http://www.mySite.co.uk/default.asp" />
Then Google (if it takes the hint) will convert this URL:
http://www.mySite.co.uk/default.asp?q=zippy&id=909
Into this cleaned up version:
http://www.mySite.co.uk/default.asp
But only if Google took your hint.
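If your pages are generated by a script, the tag can be produced automatically. The Python sketch below is one way to do it, under the assumption that none of the query parameters select different content, so that everything after the path can safely be dropped. The function names `canonical_url` and `canonical_link_tag` are made up for this example.

```python
from html import escape
from urllib.parse import urlsplit, urlunsplit


def canonical_url(url):
    """Strip the query string and fragment, keeping scheme, host and path."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


def canonical_link_tag(url):
    """Build the <link> element to place in the page's header."""
    href = escape(canonical_url(url), quote=True)
    return '<link rel="canonical" href="%s" />' % href


if __name__ == "__main__":
    print(canonical_link_tag("http://www.mySite.co.uk/default.asp?q=zippy&id=909"))
```

If some parameters do change what the page shows (a product ID, say), they would need to be kept in the canonical URL rather than stripped wholesale.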
URL Strategy
- Place a 301 permanent redirect from the http://mySite.co.uk version of your domain to the http://www.mySite.co.uk version. This tells search engines where the search results should point.
- Ensure web pages have their own appropriate canonical tag.
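Most people will set up the 301 in their host’s control panel or web server configuration. Where that is not available, the same behaviour can be sketched at application level. The example below is a minimal Python WSGI app, not a production setup: the host names, the `redirect_app` name and the plain "OK" response are assumptions for illustration, and it assumes the Host header carries no port.

```python
# Assumptions for illustration: the www form is the preferred host.
PREFERRED_HOST = "www.mysite.co.uk"
BARE_HOST = "mysite.co.uk"


def redirect_app(environ, start_response):
    """Answer requests for the bare host with a 301 to the www host."""
    host = environ.get("HTTP_HOST", "").lower()
    path = environ.get("PATH_INFO", "") or "/"
    query = environ.get("QUERY_STRING", "")

    if host == BARE_HOST:
        # Keep the path and query so deep links land on the same page.
        location = "http://" + PREFERRED_HOST + path
        if query:
            location += "?" + query
        start_response(
            "301 Moved Permanently",
            [("Location", location), ("Content-Length", "0")],
        )
        return [b""]

    # Placeholder for the real site content.
    body = b"OK\n"
    start_response(
        "200 OK",
        [("Content-Type", "text/plain"), ("Content-Length", str(len(body)))],
    )
    return [body]


if __name__ == "__main__":
    # Local test server from the standard library.
    from wsgiref.simple_server import make_server

    make_server("", 8000, redirect_app).serve_forever()
```

The important detail is the status code: a 301 says the move is permanent, which is the signal search engines use to decide which URL to list.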
Google’s description of this tag is blogged here, and their YouTube video discussing the use of canonical can be found here.
Summary
- Decide on one domain representation for your site and consistently use only that. This includes site submissions, internal linkage, sitemaps and site registrations across the internet.
- Use web server 301 permanent redirects to let search engines know your preferred URL for indexing.
- Use canonical tags as appropriate to cleanse URLs.

