
When importing Creative Commons licensed content from another site, how can I ensure that search engines won't penalize my site?

+8
−0

Codidact has tried importing questions from other Q&A sites like Stack Exchange. However, search engines have not liked that: those questions and answers usually don't get indexed on Codidact, and in some cases they may even have hurt the search engine rankings of the original content posted on the site.

When legally importing large amounts of content from other sites, and complying with the license terms for doing so, how do you ensure that search engines don't ding the site for it?



1 answer

+4
−0

Search engines don't want to index multiple copies of content; they prefer to index it only at its original, official URL. They use shingling algorithms to detect duplicate and near-duplicate pages hosted anywhere on the internet.

Unfortunately, that means that mass importing content from elsewhere isn't going to be a good SEO strategy. Even when you are allowed to host content from elsewhere and comply with all the terms of the Creative Commons license, search engines are going to detect that this content is copied and refuse to index it.

If the amount of content that you are importing far outweighs the amount of original content on your site, search engines may even decide that your entire site is low value and refuse to index any of it. Search engines call sites whose content is largely taken from other sites "scraper sites" and usually refuse to index anything from them. True scraper sites aren't legally using licensed content; they steal it outright. But when the majority of your site's content is taken from elsewhere, it looks a lot like a scraper site regardless.

There are three technical measures that you can use when importing content to prevent search engines from penalizing your site because of it.

Use robots.txt to prevent search engine bots from crawling it

If you are importing far more content than you have original content, it is a good idea to prevent search engine crawlers from seeing the imported content at all. To implement this, you will need to host the imported content in a separate section of your site, because robots.txt is not designed to list thousands of individual URLs.

  • On its own subdomain: https://imported.example.com/posts/1234

    Create a separate robots.txt file for this subdomain at https://imported.example.com/robots.txt that disallows all crawling:

    User-Agent: *
    Disallow: /
    
  • In its own subdirectory: https://example.com/imported/posts/1234

    Use a starts-with rule in your main robots.txt to prevent this subdirectory from getting crawled:

    User-Agent: *
    Disallow: /imported/
    
  • Or with its own unique prefix: https://example.com/iposts/1234

    Use a starts-with rule in your main robots.txt to prevent URLs with this prefix from getting crawled:

    User-Agent: *
    Disallow: /iposts/
    

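To double-check that these rules actually cover the imported URLs (and nothing else), you can test them with the same parser logic crawlers use. A minimal sketch in Python, assuming the hypothetical /imported/ subdirectory layout from the second example above:

    # Checks which URLs a "Disallow: /imported/" rule blocks.
    # example.com and the paths are the hypothetical values used above.
    from urllib.robotparser import RobotFileParser

    parser = RobotFileParser("https://example.com/robots.txt")
    parser.read()  # fetch and parse the live robots.txt

    for url in (
        "https://example.com/imported/posts/1234",  # imported content, should be blocked
        "https://example.com/posts/5678",           # original content, should stay crawlable
    ):
        allowed = parser.can_fetch("*", url)
        print(url, "crawlable" if allowed else "blocked")
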
Use noindex meta tags to tell search engines not to index it

If search engines can crawl the imported content, they appreciate you telling them not to index it rather than making them figure it out on their own. Using noindex prevents search engines from making mistakes, such as treating your site as the original source of the content.

See Block Search Indexing with noindex - Google Search Central. There are two ways to implement it:

  • As a meta tag in the <head> section of the page source code
    <meta name="robots" content="noindex">
    
  • As an HTTP header sent by the server, as metadata outside the page content.
    X-Robots-Tag: noindex
    

The noindex can be removed later if the imported content is edited enough on your site or if it attracts enough new original answers.
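
If editing the <head> of imported pages is awkward, the header form can be added in one place in the serving code instead. A minimal sketch, assuming a plain WSGI Python app and the hypothetical /imported/ prefix from the robots.txt examples:

    # Adds "X-Robots-Tag: noindex" to every response served under /imported/,
    # so imported pages are excluded without touching their HTML.
    from wsgiref.simple_server import make_server

    def app(environ, start_response):
        # Stand-in for the real application that renders imported posts.
        start_response("200 OK", [("Content-Type", "text/html")])
        return [b"<html><body>imported post</body></html>"]

    def noindex_imported(wrapped_app, prefix="/imported/"):
        def middleware(environ, start_response):
            def start(status, headers, exc_info=None):
                if environ.get("PATH_INFO", "").startswith(prefix):
                    headers = headers + [("X-Robots-Tag", "noindex")]
                return start_response(status, headers, exc_info)
            return wrapped_app(environ, start)
        return middleware

    if __name__ == "__main__":
        make_server("", 8000, noindex_imported(app)).serve_forever()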

Link back to the original copy

Finally, just complying with the terms of the Creative Commons license and linking back to the original URL for the content is often enough to avoid the worst search engine penalties:

    <a href="https://source.example/posts/1234">original post</a>

True scraper sites never include links back to the original.

Using a canonical link element in the <head> to tell search engines about the original is a slightly stronger signal than a link in the page:

    <link rel="canonical" href="https://source.example/posts/1234">
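
For imported posts rendered from a template, both signals can be generated from the stored source URL. A minimal sketch (the helper name and the source.example URL are illustrative, not part of any Codidact code):

    # Builds the visible attribution link and the canonical element for an
    # imported post. The source URL is the hypothetical one used above.
    from html import escape

    def attribution_markup(source_url: str, source_title: str) -> dict:
        url = escape(source_url, quote=True)
        return {
            # Visible link back to the original, placed in the post body or footer.
            "body_footer": f'<p>Originally posted at <a href="{url}">{escape(source_title)}</a>.</p>',
            # Stronger duplicate-content signal, placed inside <head>.
            "head_link": f'<link rel="canonical" href="{url}">',
        }

    print(attribution_markup("https://source.example/posts/1234", "the original question"))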

However, in May 2023 Google said that using canonical isn't enough and that noindex should be preferred over links to the original. See Google no longer recommends canonical tags for syndicated content.


