What is Duplicate Content?
Duplicate content refers to text that is repeated on more than one web page either on your site or across different sites. Google takes the issue of duplicate content very seriously. In fact, sites with duplicate content were a key target of Google’s Panda algorithm update. As far as Google is concerned, the more URLs you build for the same page, the more it appears that you are trying to game the system, and the more likely you are going to be filtered out of the search results. The bottom line is that every different URL the search engines see on your site should display substantially different content.
Google puts duplicate content into the same category as low quality content, and it’s fair to say that if you don’t have lots of original, remarkable and unique content, you will not rank high in the Google search results. If you have content that exactly matches the content on a high authority site, your webpage will be filtered out of the search results even if it is relevant to the search query.
Issues that duplicate content can create include:
- If the search engine discovers two or more versions of the same content, which does it filter out? Which one does it show? For example, if you have posted print-only versions of webpages, these can end up getting displayed, rather than the HTML version if you’ve not handled this correctly.
- Links to duplicate content pages represent a waste of link juice. If the homepage has 100 natural inbound links spread across several duplicate URLs, the major search engines give each duplicate page credit separately, rather than combining the link value into a single page.
- The search engines will dilute the metrics of the webpage such as TrustRank, PageRank, authority, etc. by sharing them between multiple versions of the same page;
- Search ranking can be impacted negatively.
Note that the search engines may consider two pages duplicates even if only part of the page is duplicated. For instance, if you use the same headings, page title, web copy and Meta description tags on multiple pages, the search engine spider will interpret them as duplicate pages.
Google has confirmed that it doesn’t have a penalty for duplicate content unless the intent of the duplicate content is to be deceptive and manipulate search engine results. However, it is hard to imagine that duplicate webpages would be able to rank high on Google. So, even though the page might not be explicitly penalized, the fact that the site will not rank high in these circumstances means that the site might as well be penalized. To find out whether you have duplicate content on your site, use Copyscape, which is the best known online duplicate content checker tool.
Common Causes of Duplicate Content
Printer-friendly pages are separate pages specifically designed for printing and they are a common cause of duplicate content. Some sites usually offer two versions of the same page on their website for the convenience of their readers: an HTML one to read online and a text-only version that can be printed. The printer-friendly page has its own URL, and even though it might not have some of the bells and whistles of the HTML version, to the search engines it’s an exact duplicate of the HTML page because it is the content that really counts.
Dynamic Web Pages
A dynamic web page is generated by the server at request time and can display different content each time it is viewed. Dynamic content changes frequently, based on the environment or situation. For example, the page may change with the time of day, the user who accesses the webpage, or the type of user interaction.
Dynamic sites create pages which are similar (or very similar) in content with just different URLs. It’s like having one page with a lot of different URLs pointing to it. The Session ID is an example of a dynamic web page that causes duplicate content. A session ID is a unique number that a website’s server assigns a specific user for the duration of that user’s visit (session).
Examples of Session IDs
The session ID can be stored as a cookie, form field, or URL. Every time an Internet user visits a specific website, a new session ID is assigned and appended to each page’s URL as the user traverses the site. Closing a browser, then reopening it and visiting the site again, generates a new session ID and thus a different URL for the same page, creating what appears to be duplicate content. Even though the page itself has not changed, the varying parameters at the end of each URL cause search engines to treat the URLs as separate pages. In addition, some URLs with session IDs expire, leaving dead links, so if a search engine indexes such a link, anyone who later clicks on it will get an error page.
If your web site uses a slogan, you probably want it to be displayed on every web page throughout your site. However, repeating your company slogan in HTML text throughout your site can cause duplicate content issues because the same lines of text would be repeated on various pages throughout the site.
Duplicate Content in eCommerce Systems
Duplicate content occurs in eCommerce systems when several different URLs are generated for the same product. For example, retail sites may divide the list of items in a large product category into multiple pages. The same product can appear on a special seasonal sales page in addition to its regular listing or if you have separate categories for things like “all shoes” and “men’s shoes”, the same product may be displayed on different URLs.
“Men’s shoes” would fit into both categories and could be reached via two separate URLs, effectively creating duplicate content. Since your site features the same content on different URLs, this can be interpreted by the search engines as a deliberate attempt to boost your SEO through illegitimate means.
For instance, an e-commerce site featuring fashionwear might have its content listed in various ways. A pair of jeans might be listed under ‘summer wear’, ‘casual wear’, ‘menswear’, and so on, and be available for on-site searches with those options. A search engine only wants to list pages in its index that are unique. Some search engines may decide to combat this issue by cutting off URLs after a specific number of variable strings (e.g. ?, &, =).
For example, consider the following three illustrative URLs:
http://www.example.com/products.asp?category=summer-wear
http://www.example.com/products.asp?category=casual-wear
http://www.example.com/products.asp?category=menswear
All three of these URLs point to three different pages. But if the search engine purges the information after the first offending character, the question mark (?), all three pages now look the same:
http://www.example.com/products.asp
Now, you don’t have unique pages, and consequently, the duplicate URLs won’t be indexed. If you use dynamic content, it is imperative that you address the issues caused by dynamic content if you want your site to be indexed.
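The truncation described above can be sketched in a few lines of Python. This is an illustration of the concept only; the URLs and the truncation rule are hypothetical simplifications of search engine behavior:

```python
from urllib.parse import urlsplit, urlunsplit

def truncate_at_query(url):
    """Drop the query string, keeping only scheme, host and path."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))

# Three distinct category pages served from one dynamic script (hypothetical URLs).
urls = [
    "http://www.example.com/products.asp?category=summer-wear",
    "http://www.example.com/products.asp?category=casual-wear",
    "http://www.example.com/products.asp?category=menswear",
]

unique_pages = {truncate_at_query(u) for u in urls}
# After truncation only one URL remains, so the three pages no longer look unique.
print(unique_pages)
```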
Pagination on e-commerce systems can also cause duplicate content issues, because paginated pages can display the same title tag and meta description while having different URLs. For example, suppose your website visitors can search your products by category (e.g. men’s shoes). That category is well stocked, and a search returns 5 pages. The first few might look like this:
First page displayed in search: www.coolshoes.com/mens-shoes/black/3
Second page: www.coolshoes.com/mens-shoes/black/3?page=1
Third page: www.coolshoes.com/mens-shoes/black/3?page=2
and so on…
Each of those pages has duplicate title and meta description tags, even though they display different products.
International organizations or companies with multiple websites will repurpose the same content for various locations. For example, a large property company with a number of brokerage offices may offer the same website template to all of their brokers. Such templates are usually customized only with a different city name and local property listings that match their locality.
Similarly, a national product manufacturer may give local representatives and affiliates their “own” sites, but all of them would have the same standard template and content. The result is thousands of near-duplicate websites all promoting the same thing and, according to Google, offering nothing of value to the Internet community.
Press release sites were also heavily impacted by the Panda 4 update with sites like PRWeb, PR Newswire, Business Wire and PRLog losing 60% to 85% of their search visibility overnight. The problem is, press releases are published on multiple press release and news websites for the express purpose of being picked up and syndicated across thousands of partner sites. This makes them a prime target for the Panda update.
The big upside of syndicating your content is the exposure and increased traffic that you receive as a result. The potential downside is that your content is duplicated across the web. An RSS feed is a typical example of content syndication. RSS is a method of distributing links to new content in your website, and the recipients are people that have subscribed to your RSS feed.
In addition, you can syndicate your articles to other websites or article directories such as ezine.com. However, the very real problem here is that you’ll now have multiple URLs for the same article, which makes your site vulnerable to the Panda algorithmic penalty.
Note that having the same content on different top-level domains that represent different countries is not considered to be duplicate content that could get penalized by Google. Matt Cutts has addressed this issue in a video.
How Many Duplicate Pages Does Google Think You Have?
You can find out how many of your web pages are currently indexed, versus how many pages Google considers to be duplicates:
1. In the Google search query box, type site:yourdomain.com and then click Search.
2. On the results page, note the total number of pages shown in “Results 1 – 10 of about ###” at the top of the page. The “of about ###” number represents the approximate total number of indexed pages in the site.
3. Scroll to the bottom, click the highest page number that shows (usually 10), and navigate to the very last page of the results. Note that doing this can sometimes cause the total at the top of the page to recalculate. The count shown on the last page represents the filtered results.
4. The difference between these two numbers most likely represents the number of duplicates.
Note that Google doesn’t actually display all of the indexed pages; it omits duplicates. To see all of the indexed listings for your site, navigate to the very last results page of your [site:] query and click the option to repeat the search with the omitted results included. Note, however, that Google will only show a maximum of 1,000 listings.
You can use the free Search Engine Saturation tool available from Acxiom Digital to discover the number of indexed pages in Yahoo! and Microsoft Bing.
Identifying Duplicate Content on Your Site
To check whether you have any duplicate content issues on your site, start with the page title and Meta description tags, using the Screaming Frog SEO Spider tool. At the top, where it says enter URL to spider, enter your home page URL and click Start. In the results, click on the Page Titles tab, and under Filter, select Duplicate. This automatically organizes your pages so you can see whether there are any duplicate title tags.
If the tool has identified any duplicate tags, you’ll want to look at the address that it’s saying the duplicate title tags are on to check whether these addresses are either content pages or product or category pages. If they are, you’ll want to go back and make sure each title is unique.
You’ll want to repeat the same process for the Meta description tag. Click on Meta description, filter by duplicate, and you can see whether or not you have duplicate Meta descriptions on your product category or blog post pages. And if that’s the case, you want to go back and fix them. Again, you definitely want to check out the actual web page to check that you do have duplicate content issues.
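Outside of Screaming Frog, the underlying check is easy to reproduce. A minimal sketch, assuming you have already extracted each URL’s title and Meta description into a dictionary (the page data below is hypothetical):

```python
from collections import defaultdict

def find_duplicates(pages, field):
    """Group page URLs by the given field and return values used on 2+ pages."""
    groups = defaultdict(list)
    for url, meta in pages.items():
        groups[meta[field]].append(url)
    return {value: urls for value, urls in groups.items() if len(urls) > 1}

# Hypothetical crawl results: URL -> extracted tags.
pages = {
    "/mens-shoes": {"title": "Shoes | Example Store", "description": "Buy shoes."},
    "/womens-shoes": {"title": "Shoes | Example Store", "description": "Buy shoes."},
    "/about": {"title": "About Us | Example Store", "description": "Our story."},
}

# Any value listed here appears on more than one page and needs a rewrite.
print(find_duplicates(pages, "title"))
print(find_duplicates(pages, "description"))
```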
Once you’ve identified and fixed any duplicate content issues with your page title and Meta description tags, it’s time to check for duplicate content within the body of your pages. If you have a dynamic eCommerce site, duplicate content is a common issue on product and category pages.
Start by signing up for a premium subscription at www.copyscape.com and do a Batch Search. You need to enter your sitemap URL for Copyscape to crawl each URL in your sitemap and analyze it against the rest of the web. It will identify all cross-domain duplicate content and rank it by “risk factor.” You will be able to export the report to a CSV file and sort the columns on the basis of how you want to analyze the data.
The service is pretty affordable for small to medium sized websites. However, for larger sites, it can get a little pricey.
Tip: After Copyscape crawls your sitemap, but before you pay for the service, be sure to remove any links to external websites, or pages that you know do not have duplicate content. You don’t want to pay to have pages analyzed when those pages do not have a problem.
The SiteLiner Tool
The SiteLiner tool is the quickest way to identify and fix internal duplicate content issues on your site. If you have a site under 250 pages, the service is free. Anything over 250 pages will cost a penny per URL.
To use the tool, start by putting in your homepage URL into the box provided. Once you’ve entered in your domain name, you’ll be taken to a list of pages hosted on your site in a summary tab. You’ll want to click on the Duplicate Content tab on the left (where you’ll see your overall percentage of duplicate content).
On the duplicate content screen you’ll see the following columns:
- Match Words – This shows the number of duplicated words that are matched on this page.
- Match Percentage – The overall percentage of matched words versus the total words on the page. You’ll want this number to be very low. If a page has a match percentage of 90 percent, 90 percent of the content on that page matches other content on your site. Not good.
- Match Pages – The total number of pages that have matched duplicate content.
- Page Power – An estimate of page importance on a scale of 1-100 (with 100 being the most important.)
You’ll then be able to sort, filter and export the data.
Click on any page in the list and you’ll be shown an overlay of that page. On the right-hand side, you’ll see the matched content highlighted on the page. This is the content on the page in question that matches duplicated content elsewhere on your site.
Fixing Duplicate Content Issues
Create Unique Content
Each page on your site should have a unique Page Title tag, Meta description tag, and Meta keywords tag in the HTML code. The heading tags (labeled H#) within the content of each page should also differ from other pages’ headings. In addition, your headings should all use meaningful, non-generic words.
If for any reason you have to show a repeated sentence or paragraph throughout your site such as a company slogan, you can avoid duplicate content issues by putting the slogan into an image on most of your web pages.
Select a specific web page, such as the home page, that you want to rank for that slogan, and leave the slogan as HTML text on just that one page so that it can be indexed by the search engine crawlers. This ensures that if anyone searches for that piece of content, the search engines can at least find it on the page you selected. Then, everywhere else, display the slogan as an image: since search engines cannot read images, only users will see the slogan.
The Canonical Tag
The canonical tag was endorsed by Google, Yahoo! and Bing as a solution to the issue of duplicate content for larger, more complex sites. If your website generates multiple URLs with the same (or similar) content, you can add a canonical tag referencing the main page to the header of each of the various duplicate pages.
In the eyes of the search engines, the following four URLs are four different pages:
http://www.example.com
http://example.com
http://www.example.com/index.html
http://example.com/index.html
If all of these pages have the same or similar content on each of them, then sites with this type of duplication have what industry insiders refer to as a canonical URL problem, and it can have a significant impact on SEO success. The canonical tag was designed as a solution for issues arising from individual webpages that can be loaded from multiple URLs.
Implementation is very simple and looks like this:
<link rel="canonical" href="http://www.example.com/mypage" />
You would add this tag to the head section of all of the duplicate webpages to categorize them together. This clarifies that the URL in the tag is the canonical version.
For example, suppose you have three duplicate pages such as:
http://www.cards.com/how-to-select-birthday-cards.html
http://cards.com/how-to-select-birthday-cards.html
http://www.cards.com/how-to-select-birthday-cards.html?ref=homepage
Let’s say you choose http://www.cards.com/how-to-select-birthday-cards.html as the primary URL; then all three of these pages would get a canonical tag specifying http://www.cards.com/how-to-select-birthday-cards.html as the canonical URL. Search engines view the canonical tag as a strong signal that all of your duplicate pages should be consolidated in the search engine index, giving you a single, more powerful page in the index.
You can use robots.txt to block search engine spiders from crawling the duplicate versions of pages on your site. You can also use the Robots NoIndex meta tag to tell the search engines not to index the duplicate pages. However, note that Google does not recommend blocking spider access to duplicate content, whether with a robots.txt file or the robots meta tag.
Google Search Console: Parameter Handling and Preferred URL
If you’ve verified your site in Google Search Console, the parameter settings within the site configuration section of the tool allow you to tell Google to ignore parameters that are tacked onto the end of your URLs, since those parameters create duplicate versions of the URL. You can specify up to 15 parameters that Google should ignore when the site is being crawled and indexed.
Google will then list all of the parameters it has found in the URLs on your site, along with suggestions to ignore or not ignore specific parameters. You also have the option of confirming or rejecting the suggestions and adding parameters that are not listed. You’ll also be able to set a preferred URL – with or without the www.
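The effect of ignoring a parameter can be illustrated as URL normalization: once the ignored parameters are dropped, the variant URLs become identical. This is a conceptual sketch only, not how Googlebot works, and the parameter names are hypothetical:

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical parameters that add nothing to the page content.
IGNORED = {"sessionid", "sort", "ref"}

def normalize(url):
    """Rebuild the URL without any ignored query parameters."""
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query) if k not in IGNORED]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))

a = normalize("http://www.example.com/shoes?color=black&sessionid=XYZ")
b = normalize("http://www.example.com/shoes?color=black&ref=newsletter")
print(a)          # both variants collapse to the same address
print(a == b)
```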
If you are using a template, it is crucial to customize the template for your own particular situation to avoid creating duplicate content. This includes the headings, Title tags, Meta tags, body content, and so on. You can also adapt the content to the particular locality where the site is based.
Instead of syndicating the entire article, you may want to consider syndicating just the teasers. Note that if you choose to do this, the article teasers should be unique. In addition, even though your RSS feed may send out lots of copies of the same text, you can avoid a duplicate content problem by only sending out a unique summary of your article, rather than the entire article. In any case, it is important to keep in mind that using RSS feeds to distribute your content is a legitimate activity. It does not mean you are being spammy, deceptive or manipulative, which is what Google is really concerned about.
If you are considering publishing press releases and are concerned about the implications of the Panda update, there is nothing to worry about. You distribute a press release for reasons other than search engine ranking, including branding, public relations, investor relations, and so on.
The Google duplicate content filter does not apply to press releases. It only applies to redundant content created with the malicious intention of gaming the system and gaining high rankings in search engine results pages (SERPs).
According to Google,
“Duplicate content on a site is not grounds for action on that site unless it appears that the intent of the duplicate content is to be deceptive and manipulate search engine results.”
Thus, spreading the news about your company to as many places as possible is a legitimate marketing tool.
The 301 redirect command is one of the most widely implemented and recommended solutions for dealing with internal duplicate content issues on a website. Effectively, the command automatically forwards incoming links from one URL to another URL, and informs the search engine crawlers that a particular page no longer exists and has been permanently moved to a new location. If, for instance, you have two virtually identical web pages at two different URLs, you can 301 redirect one of them to the version that you want the search engine spiders to index.
With this command, you can remove outdated pages from your website without worrying about losing visitors who still visit those pages. In addition, with redirects you can also resolve duplicate content issues that could seriously damage your rankings on the search engine results pages (SERPs).
How to Implement a 301 Redirect Using a WordPress Plugin
1. Install the Redirection plugin.
2. In the left navigation, go to Tools > Redirection.
3. Input the Source URL and its new home.
4. After you insert the old URL and the new target destination, click on Add Redirection and your 301 redirect is set.
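Under the hood, a 301 redirect is just a lookup consulted before a page is served. A minimal sketch of the idea (the URLs are hypothetical; this is not the Redirection plugin’s actual implementation):

```python
# Map of retired URLs to their permanent replacements (hypothetical paths).
REDIRECTS = {
    "/old-shoes-page": "/mens-shoes",
    "/summer-sale-2012": "/sale",
}

def handle_request(path):
    """Return an HTTP status and target, the way a server-side redirect hook might."""
    if path in REDIRECTS:
        # 301 tells both browsers and crawlers that the move is permanent.
        return 301, REDIRECTS[path]
    return 200, path

print(handle_request("/old-shoes-page"))  # redirected
print(handle_request("/mens-shoes"))      # served normally
```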
Distinguishing Canonical Tag & 301 Redirect
Although the canonical tag is similar in implementation to the 301 redirect, there are some key differences: whereas a 301 redirect points all traffic (bots and human visitors) at the new URL, the canonical URL tag is just for search engines. This means you can still separately track visitors to the unique URL versions. A 301 redirect is a much stronger signal that multiple pages have a single, canonical source.
Pagination is the process of linking to multiple content pages through numbered links at the bottom of each web page. For example, content management systems use pagination for product catalogs that have more products than they wish to show on a single page. This type of navigation is one of the least effective solutions available for getting hundreds of content pages indexed, because it keeps the title, meta description and other meta info intact on different subpages, which creates duplicate content issues.
Basically, if you have a large inventory of products, having all of your products on a subcategory page creates a negative user experience. The ideal number is no more than 10 products per page. However, if the thumbnails to the product pages aren’t particularly image-intensive, you could have up to 20, 30, or even 50 links to the products on a subcategory page. Users generally don’t mind some vertical scrolling.
For pagination issues, Google suggests using rel="next" and rel="prev" links to indicate the relationship between component URLs. Using these tags not only avoids duplicate content issues and spamdexing, it also helps the search engine spiders understand the site structure. You can also take it a step further and change the title tags and descriptions per page.
Let’s say you have content paginated into the following URLs:
http://www.coolshoes.com/mens-shoes/black.html/
http://www.coolshoes.com/mens-shoes/black.html/?page=1
http://www.coolshoes.com/mens-shoes/black.html/?page=2
http://www.coolshoes.com/mens-shoes/black.html/?page=3
In the <head> section of the first page (http://www.coolshoes.com/mens-shoes/black.html/), add a link tag pointing to the next page in the sequence, like this:
<link rel="next" href="http://www.coolshoes.com/mens-shoes/black.html/?page=1">
Because this is the first URL in the sequence, there’s no need to add markup for rel=”prev”.
On the second and third pages, add links pointing to the previous and next URLs in the sequence. For example, you could add the following to the second page of the sequence:
<link rel="prev" href="http://www.coolshoes.com/mens-shoes/black.html/">
<link rel="next" href="http://www.coolshoes.com/mens-shoes/black.html/?page=2">
On the final page of the sequence (http://www.coolshoes.com/mens-shoes/black.html/?page=3), add a link pointing to the previous URL, like this: <link rel="prev" href="http://www.coolshoes.com/mens-shoes/black.html/?page=2">
Because this is the final URL in the sequence, there’s no need to add a rel=”next” link.
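Because the pattern above is mechanical, the tags can be generated rather than hand-written. A sketch assuming the same hypothetical URL scheme as the example, where the first page has no ?page parameter:

```python
def pagination_links(base_url, page, last_page):
    """Return the rel="prev"/rel="next" link tags for one page in a sequence.

    Page 0 is base_url itself; page n (n >= 1) is base_url + "?page=n".
    """
    def url(n):
        return base_url if n == 0 else "%s?page=%d" % (base_url, n)

    tags = []
    if page > 0:
        tags.append('<link rel="prev" href="%s">' % url(page - 1))
    if page < last_page:
        tags.append('<link rel="next" href="%s">' % url(page + 1))
    return tags

base = "http://www.coolshoes.com/mens-shoes/black.html/"
print(pagination_links(base, 0, 3))  # first page: rel="next" only
print(pagination_links(base, 3, 3))  # last page: rel="prev" only
```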