
When importing Creative Commons licensed content from another site, how can I ensure that search engines won't penalize my site?

+8
−0

Codidact has tried importing questions from other Q&A sites like Stack Exchange. However, search engines have not liked that: those questions and answers usually don't get indexed on Codidact, and in some cases they may even have hurt the search engine rankings of the original content posted on the site.

When legally importing large amounts of content from other sites, and complying with the license terms for doing so, how do you ensure that search engines don't ding the site for it?



1 answer

+4
−0

Search engines don't want to index multiple copies of content; they prefer to index it only at its original, official URL. They use shingling algorithms to detect duplicate and near-duplicate pages hosted anywhere on the internet.

Unfortunately, that means that mass importing content from elsewhere isn't going to be a good SEO strategy. Even when you are allowed to host content from elsewhere and comply with all the terms of the Creative Commons license, search engines are going to detect that this content is copied and refuse to index it.

If the amount of content that you are importing far outweighs the amount of original content on your site, search engines may even decide that your entire site is low value and refuse to index any of it. Search engines call sites whose content is largely taken from other sites "scraper sites" and usually refuse to index anything from them. True scraper sites aren't legally using licensed content; they steal it outright. But when the majority of your site's content is taken from elsewhere, it looks a lot like a scraper site regardless.

There are three technical measures that you can use when importing content to prevent search engines from penalizing your site because of it.

Use robots.txt to prevent search engine bots from crawling it

If you are importing far more content than you have original content, it is a good idea to prevent search engine crawlers from seeing the imported content at all. To implement this, you will need to host the imported content in a separate section of your site, because robots.txt is not designed to list thousands of individual URLs.

  • On its own subdomain: https://imported.example.com/posts/1234

    Create a separate robots.txt file for this subdomain at https://imported.example.com/robots.txt that disallows all crawling:

    User-Agent: *
    Disallow: /
    
  • In its own subdirectory: https://example.com/imported/posts/1234

    Use a starts-with rule in your main robots.txt to prevent this subdirectory from getting crawled:

    User-Agent: *
    Disallow: /imported/
    
  • Or with its own unique prefix: https://example.com/iposts/1234

    Use a starts-with rule in your main robots.txt to prevent URLs with this prefix from getting crawled:

    User-Agent: *
    Disallow: /iposts/
    

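To double-check that these rules actually cover the imported URLs (and nothing else), you can test them with the same parser logic crawlers use. A minimal sketch in Python, assuming the hypothetical /imported/ subdirectory layout from the second example above:

    # Checks which URLs a "Disallow: /imported/" rule blocks.
    # example.com and the paths are the hypothetical values used above.
    from urllib.robotparser import RobotFileParser

    parser = RobotFileParser("https://example.com/robots.txt")
    parser.read()  # fetch and parse the live robots.txt

    for url in (
        "https://example.com/imported/posts/1234",  # imported content, should be blocked
        "https://example.com/posts/5678",           # original content, should stay crawlable
    ):
        allowed = parser.can_fetch("*", url)
        print(url, "crawlable" if allowed else "blocked")
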
Use noindex meta tags to tell search engines not to index it

If search engines can crawl the imported content, they appreciate you telling them not to index it rather than making them figure it out on their own. Using noindex prevents search engines from making mistakes, such as treating your site as the original source of the content.

See Block Search Indexing with noindex - Google Search Central. There are two ways to implement it:

  • As a meta tag in the <head> section of the page source code
    <meta name="robots" content="noindex">
    
  • As an HTTP header sent by the server, as metadata outside the page content.
    X-Robots-Tag: noindex
    

The noindex can be removed later if the imported content is edited enough on your site or if it attracts enough new original answers.
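
If editing the <head> of imported pages is awkward, the header form can be added in one place in the serving code instead. A minimal sketch, assuming a plain WSGI Python app and the hypothetical /imported/ prefix from the robots.txt examples:

    # Adds "X-Robots-Tag: noindex" to every response served under /imported/,
    # so imported pages are excluded without touching their HTML.
    from wsgiref.simple_server import make_server

    def app(environ, start_response):
        # Stand-in for the real application that renders imported posts.
        start_response("200 OK", [("Content-Type", "text/html")])
        return [b"<html><body>imported post</body></html>"]

    def noindex_imported(wrapped_app, prefix="/imported/"):
        def middleware(environ, start_response):
            def start(status, headers, exc_info=None):
                if environ.get("PATH_INFO", "").startswith(prefix):
                    headers = headers + [("X-Robots-Tag", "noindex")]
                return start_response(status, headers, exc_info)
            return wrapped_app(environ, start)
        return middleware

    if __name__ == "__main__":
        make_server("", 8000, noindex_imported(app)).serve_forever()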

Link back to the original copy

Finally, just complying with the terms of the Creative Commons license and linking back to the original URL for the content is often enough to avoid the worst search engine penalties:

    <a href="https://source.example/posts/1234">original post</a>

True scraper sites never include links back to the original.

Using a canonical link element in the <head> to tell search engines about the original is a slightly stronger signal than a link in the page:

    <link rel="canonical" href="https://source.example/posts/1234">
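
For imported posts rendered from a template, both signals can be generated from the stored source URL. A minimal sketch (the helper name and the source.example URL are illustrative, not part of any Codidact code):

    # Builds the visible attribution link and the canonical element for an
    # imported post. The source URL is the hypothetical one used above.
    from html import escape

    def attribution_markup(source_url: str, source_title: str) -> dict:
        url = escape(source_url, quote=True)
        return {
            # Visible link back to the original, placed in the post body or footer.
            "body_footer": f'<p>Originally posted at <a href="{url}">{escape(source_title)}</a>.</p>',
            # Stronger duplicate-content signal, placed inside <head>.
            "head_link": f'<link rel="canonical" href="{url}">',
        }

    print(attribution_markup("https://source.example/posts/1234", "the original question"))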

However, in May 2023 Google said that using canonical isn't enough and that noindex should be preferred over links to the original. See Google no longer recommends canonical tags for syndicated content.


