When importing Creative Commons licensed content from another site, how do I ensure that search engines won't penalize my site?
Codidact has tried importing questions from other Q&A sites like Stack Exchange. However, search engines have not liked that. Those questions and answers usually don't get indexed on Codidact, and in some cases they may even have hurt the search engine rankings of the original content posted on the site.

When legally importing large amounts of content from other sites in compliance with the license terms, how do you ensure that search engines don't penalize the site for it?
1 answer
Search engines don't want to index multiple copies of content. They prefer to index content only at its original, official URL. Search engines use shingling algorithms to detect duplicate and near-duplicate pages hosted anywhere on the internet.
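To make that concrete, here is a minimal sketch of the w-shingling idea in Python. It is illustrative only; real search engines use far more sophisticated variants, and the sample strings and window size here are arbitrary choices:

```python
def shingles(text: str, w: int = 4) -> set[tuple[str, ...]]:
    """Return the set of w-word shingles in the text."""
    words = text.lower().split()
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two shingle sets (1.0 = identical)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

original = "How do I stop search engines from penalizing imported content?"
imported = "How do I stop search engines from penalizing imported content? Thanks!"

# Near-identical pages score close to 1.0, which is how a copied post
# gets flagged as a duplicate of its original URL.
print(jaccard(shingles(original), shingles(imported)))  # ~0.88
```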
Unfortunately, that means that mass-importing content from elsewhere isn't going to be a good SEO strategy. Even when you are allowed to host content from elsewhere and comply with all the terms of the Creative Commons license, search engines are going to detect that the content is copied and refuse to index it.

If the imported content far outweighs the original content on your site, search engines may even decide that your entire site is low value and refuse to index any of it. Search engines refer to sites with lots of content taken from other sites as "scraper sites" and usually refuse to index anything from them. True scraper sites don't legally use licensed content; they outright steal it. Having the majority of the content on your site taken from elsewhere makes your site look a lot like a scraper site.
There are three technical measures that you can use when importing content to prevent search engines from penalizing your site because of it.
Use `robots.txt` to prevent search engine bots from crawling it

If you are importing far more content than you have original content, it is a good idea to prevent search engine crawlers from seeing it at all. To implement this, you will need to host the imported content in a separate section of your site, because `robots.txt` is not designed to list thousands of individual URLs. There are three ways to lay that out (a quick way to sanity-check the resulting rules follows the list):
- On its own subdomain: `https://imported.example.com/posts/1234`

  Create a separate `robots.txt` file for this subdomain at `https://imported.example.com/robots.txt` that disallows all crawling:

  ```
  User-Agent: *
  Disallow: /
  ```
- In its own subdirectory: `https://example.com/imported/posts/1234`

  Use a starts-with rule in your main `robots.txt` to prevent this subdirectory from getting crawled:

  ```
  User-Agent: *
  Disallow: /imported/
  ```
- Or with its own unique URL prefix: `https://example.com/iposts/1234`

  Use a starts-with rule in your main `robots.txt` to prevent these URLs from getting crawled:

  ```
  User-Agent: *
  Disallow: /iposts/
  ```
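Before deploying, you can sanity-check such rules with Python's standard-library `robots.txt` parser. This minimal sketch assumes the subdirectory layout from the second option above:

```python
from urllib.robotparser import RobotFileParser

# The same rules as in the subdirectory option above.
rules = """\
User-Agent: *
Disallow: /imported/
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

# Imported content is blocked from crawling...
print(rp.can_fetch("*", "https://example.com/imported/posts/1234"))  # False
# ...while original content remains crawlable.
print(rp.can_fetch("*", "https://example.com/posts/5678"))           # True
```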
Use `noindex` meta tags to tell search engines not to index it

If search engines can crawl the imported content, they appreciate being told not to index it rather than having to figure that out on their own. Using `noindex` prevents search engines from making mistakes and inappropriately treating your site as the original owner of the content.
See Block Search Indexing with noindex - Google Search Central. There are two ways to implement it (a server-side sketch follows the list):

- As a meta tag in the `<head>` section of the page source code:

  ```html
  <meta name="robots" content="noindex">
  ```

- As an HTTP header sent by the server as metadata outside the page content:

  ```
  X-Robots-Tag: noindex
  ```
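As a minimal sketch of the header approach, here is how a Python (Flask) server could attach `X-Robots-Tag` to imported posts. The `/imported/` prefix is the hypothetical layout from the `robots.txt` section above, and the route and app structure are assumptions for illustration, not Codidact's actual stack:

```python
from flask import Flask, Response

app = Flask(__name__)

@app.route("/imported/posts/<int:post_id>")
def imported_post(post_id: int) -> Response:
    # Render the imported post however your site normally would.
    resp = Response(f"Imported post {post_id}")
    # Tell crawlers not to index this copy, without touching the page body.
    resp.headers["X-Robots-Tag"] = "noindex"
    return resp
```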
The `noindex` can be removed if the content is edited enough on your site or if it accumulates enough new original answers there.
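What counts as "edited enough" is a judgment call. Here is a hypothetical policy helper showing one way a site could automate that decision; the `Post` fields and thresholds are invented for illustration, not taken from any real system:

```python
from dataclasses import dataclass

@dataclass
class Post:
    imported: bool
    local_edit_ratio: float  # fraction of the current text written on this site
    local_answers: int       # original answers written on this site after import

def should_noindex(post: Post) -> bool:
    """Keep noindex until the post has enough local, original contribution."""
    if not post.imported:
        return False
    substantially_edited = post.local_edit_ratio >= 0.5
    has_original_answers = post.local_answers >= 2
    return not (substantially_edited or has_original_answers)

print(should_noindex(Post(imported=True, local_edit_ratio=0.1, local_answers=0)))  # True
print(should_noindex(Post(imported=True, local_edit_ratio=0.7, local_answers=0)))  # False
```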
Link back to the original copy

Finally, just complying with the terms of the Creative Commons license and linking back to the original URL for the content is often enough to avoid the worst search engine penalties:

```html
<a href="https://source.example/posts/1234">original question</a>
```

True scraper sites never include links back to the original.
Using a canonical meta tag in the `<head>` that tells search engines about the original is a slightly stronger signal than a link in the page:

```html
<link rel="canonical" href="https://source.example/posts/1234">
```

However, in May 2023 Google said that using canonical isn't enough and that `noindex` should be preferred over links to the original. See Google no longer recommends canonical tags for syndicated content.