Search engines don't want to index multiple copies of content. They prefer to index content at its original, official URL only. Search engines use [shingling algorithms](https://nlp.stanford.edu/IR-book/html/htmledition/near-duplicates-and-shingling-1.html) to detect duplicate and near-duplicate pages hosted anywhere on the internet.

Unfortunately, that means mass importing content from elsewhere isn't going to be a good SEO strategy. Even when you are allowed to host content from elsewhere and comply with all the terms of the Creative Commons license, search engines are going to detect that the content is copied and refuse to index it.

If the amount of content you are importing far outweighs the amount of original content on your site, search engines may even decide that your entire site is low value and refuse to index any of it. Search engines refer to sites that consist largely of content taken from other sites as "scraper sites," and they usually refuse to index anything from a scraper site. True scraper sites won't be legally using licensed content; they outright steal it. Having the majority of the content on your site taken from elsewhere makes your site look a lot like a scraper site.

There are three technical measures you can use when importing content to prevent search engines from penalizing your site for it.
### Use `robots.txt` to prevent search engine bots from crawling it

If you are importing far more content than you have original content, it is a good idea to prevent search engine crawlers from seeing it at all. To implement this, you will need to host the imported content in a separate section of your site, because `robots.txt` is not designed to list thousands of individual URLs. There are three common layouts:
- On its own subdomain: `https://imported.example.com/posts/1234`

  Create a separate `robots.txt` file for this subdomain at `https://imported.example.com/robots.txt` that disallows all crawling:

  ```txt
  User-Agent: *
  Disallow: /
  ```

- In its own subdirectory: `https://example.com/imported/posts/1234`

  Use a starts-with rule in your main `robots.txt` to prevent this subdirectory from getting crawled:

  ```txt
  User-Agent: *
  Disallow: /imported/
  ```

- Or with its own unique URL prefix: `https://example.com/iposts/1234`

  Use a starts-with rule in your main `robots.txt` to prevent URLs with this prefix from getting crawled:

  ```txt
  User-Agent: *
  Disallow: /iposts/
  ```
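Whichever layout you pick, it's worth verifying that the rules actually block the imported URLs before deploying. Here is a minimal sketch using Python's standard-library `urllib.robotparser`; the URLs are hypothetical placeholders matching the subdirectory layout above:

```python
from urllib.robotparser import RobotFileParser

# The same rules you plan to serve at https://example.com/robots.txt
rules = """
User-Agent: *
Disallow: /imported/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Imported content should be blocked for all crawlers...
assert not parser.can_fetch("*", "https://example.com/imported/posts/1234")
# ...while original content remains crawlable.
assert parser.can_fetch("*", "https://example.com/original-article")
```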
### Use `noindex` to tell search engines not to index it

If search engines can crawl the imported content, they appreciate you telling them not to index it rather than making them figure it out on their own. Using `noindex` prevents search engines from making mistakes and inappropriately treating your site as the original owner of the content.

See [Block Search Indexing with noindex - Google Search Central](https://developers.google.com/search/docs/crawling-indexing/block-indexing). There are two ways to implement it:
- As a meta tag in the `<head>` section of the page source code:

  ```html
  <meta name="robots" content="noindex">
  ```

- As an HTTP header sent by the server as metadata outside the page content:

  ```http
  X-Robots-Tag: noindex
  ```
The `noindex` can be removed if the content gets edited enough on your site, or if it accumulates enough new original answers there.
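For the header variant, that conditional logic can live in the application rather than the web server. Below is a minimal sketch using Flask; the `Post` fields and the threshold are hypothetical, illustrating the idea that imported posts carry `X-Robots-Tag: noindex` until they accumulate enough original content:

```python
from dataclasses import dataclass
from flask import Flask, make_response

app = Flask(__name__)

@dataclass
class Post:
    body_html: str
    is_imported: bool           # True if mass-imported from another site
    original_answer_count: int  # answers written natively on this site

# Hypothetical threshold: once a post has this many native answers,
# the page has enough original value to be worth indexing.
MIN_ORIGINAL_ANSWERS = 2

# Stand-in for a real database lookup.
POSTS = {
    1234: Post("<p>Imported question...</p>", is_imported=True,
               original_answer_count=0),
}

def should_noindex(post: Post) -> bool:
    """Imported posts stay noindexed until they earn original content."""
    return post.is_imported and post.original_answer_count < MIN_ORIGINAL_ANSWERS

@app.route("/imported/posts/<int:post_id>")
def show_post(post_id: int):
    post = POSTS[post_id]
    response = make_response(post.body_html)
    if should_noindex(post):
        response.headers["X-Robots-Tag"] = "noindex"
    return response
```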
### Link back to the original copy

Finally, simply complying with the terms of the Creative Commons license and linking back to the original URL for the content, e.g. `<a href="https://source.example/posts/1234">`, is often enough to avoid the worst search engine penalties. True scraper sites never include links back to the original.

A canonical link tag in the `<head>` telling search engines about the original, `<link rel="canonical" href="https://source.example/posts/1234">`, is a slightly stronger signal than a link in the page body.

However, in May 2023, Google said that using canonical isn't enough, and that `noindex` should be preferred to links to the original. See [Google no longer recommends canonical tags for syndicated content](https://searchengineland.com/google-no-longer-recommends-canonical-tags-for-syndicated-content-406491).
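If imported posts are rendered from templates, both back-reference signals can come from one small helper. A minimal sketch, with an illustrative function name and placeholder URL rather than any particular framework's API:

```python
from html import escape

def attribution_markup(source_url: str, source_title: str) -> tuple[str, str]:
    """Build the two back-reference signals for an imported post:
    a canonical link tag for the <head>, and a visible attribution
    link for the page body."""
    url = escape(source_url, quote=True)
    head_tag = f'<link rel="canonical" href="{url}">'
    body_link = (f'<p>Originally posted at '
                 f'<a href="{url}">{escape(source_title)}</a>.</p>')
    return head_tag, body_link

head_tag, body_link = attribution_markup(
    "https://source.example/posts/1234", "Original question title")
print(head_tag)   # <link rel="canonical" href="https://source.example/posts/1234">
print(body_link)
```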