Google canonical links and fixes

Robert Crowther Jan 2023
Last Modified: Feb 2023

Summary

Intro

Google decides its canonical URLs by some devious method. At mildest, the Google process tries to target the best URL. At strongest, it is defending against coders who try to grab clicks by generating material through low‐quality automatic methods, or by submitting multiple links for the same pages.

But the method Google uses to decide URLs is (for good reason) obscured. It takes several factors and techniques to encourage Google to use a link as canonical, and the choice is never guaranteed. Say a website has a genuine URL scheme for good material. If you do not tell Google enough about preferred URLs, it may pick another. In my experience, Google’s choice of URL usually has a ‘www.xxx’ host and a trailing slash. So not,

treasure.com

but,

www.treasure.com/
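To make that preference concrete, here is a small sketch that rewrites a URL into the form Google has, in my experience, tended to pick: a ‘www.’ host and a trailing slash. The `https` scheme is an assumption, and real sites may have paths (file-like URLs, say) that should not get a trailing slash.

```python
from urllib.parse import urlsplit, urlunsplit

def google_preferred_form(url: str) -> str:
    """Rewrite a URL into the form Google tends to pick as
    canonical: a 'www.' host and a trailing slash.
    A sketch only; assumes every path may take a trailing slash."""
    # Assume a scheme so urlsplit can find the host.
    if "://" not in url:
        url = "https://" + url
    scheme, host, path, query, fragment = urlsplit(url)
    if not host.startswith("www."):
        host = "www." + host
    if not path:
        path = "/"
    elif not path.endswith("/"):
        path = path + "/"
    return urlunsplit((scheme, host, path, query, fragment))

print(google_preferred_form("treasure.com"))
# https://www.treasure.com/
```

Running every URL you publish through a normaliser like this, before it reaches a sitemap or a meta tag, is one way to keep the variants from existing in the first place.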

It’s easy to understand why Google is cautious and independent when deciding URLs. But the result may not be a good representation of a website.

For what it’s worth, and as far as I know, these are the factors that can encourage Google to pick one canonical over another,

A sitemap

Google Help says a sitemap reference is taken as canonical

A meta link for ‘canonical’

Confirmed, Google reads and likes these

Server redirects from one URL to another

Google strongly prefers non‐redirected URLs

My overall experience is that, as advertised, Google will refer to and often respect these hints.
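Of the hints above, the ‘canonical’ meta link is the one confirmed to work, and it is a one-line addition to each page’s head. A sketch, using the article’s example domain:

```html
<!-- In each page's <head>: declare the preferred URL.
     treasure.com is this article's example domain. -->
<link rel="canonical" href="https://www.treasure.com/" />
```

The matching sitemap entry should list only that same preferred form, so the two hints do not contradict each other.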

Let’s say you don’t give the hints, so Google chooses a different URL. At first, this may not seem important. Google would not list the link if the page was not accessible. The only obvious issue is that the URL scheme is inaccurately represented on the web. That carries a small risk: if the deployment of a site, and so its URLs, is changed, the mismatch could be awkward. But it is a small issue, because websites are not supposed to change their URLs, or only seldom.

Ah, but there is a bigger issue. Here is the situation. For the sake of friendliness, a website deployment allows different constructions of a URL. The most common is to allow URLs both with and without a ‘www.’ prefix. Google may take the alternative possibilities as an attempt at duplicating URLs. If so, it will mark errors against pages,

Duplicate without user-selected canonical

So then Google will not index the page. Most website builders would regard this as bad. Another problem: inconsistencies in URL handling are not highlighted in the Google console tool, so these errors can be difficult to spot.

A reported error on a URL is a long‐term issue. See, a ‘rogue’ URL, one that returns an error, is now lodged in Google’s cache. Likely deliberately, to avoid obsessive contact from coders attempting to promote or delete links, Google provides no tool for URL removal. So the URL will sit in the cache, listed as an error, forever.

And a reported error on a URL leads to a wider problem. The rogue URL will block attempts at mass validation. Every time a validation run arrives at a rogue URL, it will do what validation does: stop. Nor is there a way to reorder the validations, for example to push a rogue URL to the bottom of the list. So now, not only do you have a rogue URL, but that URL will block attempts at validating good URLs. A single issue can jam the site‐wide search‐engine profile.

Fixing pre‐decided URLs

How do we ask Google to change a canonical URL? I’ve experimented with this, and the process was difficult and drawn out.

Let’s run through those methods of influence again. What happens if, after URLs have been decided in a way we would not prefer, we add,

A sitemap

Will not fix rogue links, in my experience. It only works if the sitemap existed before Google crawled

A ‘canonical’ meta link

Causes a rogue URL to be moved under a further error heading, ‘Duplicate, Google chose different canonical than user’

Server redirects from one URL to another

Can work, or can repeat errors. I once received this incomprehensible ‘error’: ‘Alternate page with proper canonical tag’

Not promising, is it? And most of this is nasty (not DRY). In detail…

You added a sitemap. Google will, if everything else is ok, correctly register new pages. But rogue pages will continue to display a ‘Duplicate without user‐selected canonical’ error. They will not be indexed. Trying a spot crawl with the Inspect tool will not work; Google will not update its cache. Understandable, because the Google cache is replicated across many sites across the world, so updates should take time. But I’ve waited a month for something to happen, and nothing did.

Adding meta links is more direct, and has an effect within days. Again, new pages will be correctly registered. But ‘rogue’ URLs will start to be moved from the ‘Duplicate without user‐selected canonical’ error to the ‘Duplicate, Google chose different canonical than user’ error. This is preferable to the sitemap solution because it shunts ‘rogue’ URLs into a new category of error, which clears the site for validation of good URLs. But it will do nothing for the ‘rogue’ URLs themselves. Again, I’ve waited a month for something to happen, and nothing did.

The final possibility is to introduce redirects. For some sites this is impossible, because there is no server access. Let’s say that introducing redirects is possible. I’ve tried validation and the ‘Inspect’ tool, and I have news. As far as I know, an ‘inspect’ will supply Google with new data, but the update is only treated as a ‘discover’, not a crawl. Until the page is visited by Googlebots from the web, which classifies as a ‘crawl’, the process will not move the inspected URL out of the error block. But a ‘validate’ will not update the information, so may fail again. And validates can take a month to process.

So I recommend using both: ‘inspect’ first, so the information is updated, maybe leave it for a few days, then, and only then, ‘validate’ to get an error‐correction crawl. Then… nothing will happen, not for days. In my experience, at least a week. I find I can’t argue with the Google process over this, because even if the process decides in favour of a user‐supplied URL, all the cache must be updated. After this time, and all being well, I can confirm the canonical form will be updated and URLs removed from error blocks.
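For those who do have server access, a redirect sketch. This assumes an nginx server and uses the article’s example domain; the same idea in Apache would be a `Redirect permanent` or a `RewriteRule` in the site config:

```nginx
# Send the bare host to the 'www.' form with a permanent (301)
# redirect, so Googlebot only ever crawls one variant.
# treasure.com is this article's example domain.
server {
    listen 80;
    listen 443 ssl;
    server_name treasure.com;
    return 301 https://www.treasure.com$request_uri;
}
```

A quick way to confirm the redirect behaves is `curl -I http://treasure.com/` and check the response for a `301` status and a `Location:` header pointing at the ‘www.’ form.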

The despairing solution

One way to fix this is to accept Google’s decision. Set the canonical URLs to whatever Google has decided. You will need some sophisticated software to change individual URLs within a sitemap, and potentially individual URLs within meta tags. After that, this solution will create a universe of muddle, a potential nightmare of maintenance. But if you need a fix right now, one that will work…

Fixes take time

Changing or setting redirects means tinkering with server code. First, you need to confirm the fix works, which can take a week. Even if the results are as you wish, it will take two weeks or more before a full list of URLs is processed. A validate can take a month. This is not a quick job.

Refs

Duplicate URLs, Google Help,

https://developers.google.com/search/docs/crawling-indexing/consolidate-duplicate-urls