How to Leverage Seemingly Unstoppable Content Scraping

Someone scraped your content, and you need to take them down. I bet you will try to:

  1. File a DMCA complaint
  2. Report to Google
  3. Block their IP address
  4. Report to their web host
  5. Report to their ISP

But for how long can you do that? You may be able to do it for dozens, or even hundreds of scrapers. And what if you get thousands or more? It’s ultimately not possible to keep up.


A well-known site can have millions of scrapers. The more you rise in popularity, the more you get scraped. And you know what, you can’t stop ALL of them. No matter what technique you use, there will always be an alternative way to copy your content.

Take Moz for example. Each of their posts gets scraped hundreds, even thousands of times.
For Ann Smarty’s latest post on YouMoz, which was promoted to Moz’s main blog, I found 2,620 scrapers by April 17th, 2015 – only two months after publishing. Here is the search query:

[Image: search query for Ann Smarty’s Moz article]

If a single post is scraped 2,620 times in just a couple of months (and counting), and each takedown takes real effort, I’m afraid that removing all of Moz’s scrapers would take decades!

Is scraping a valid SEO concern?

Getting scraped means that your original content will also appear on your scrapers’ websites. Googlebot will see the same version of text across multiple websites, and your content will get labeled as “duplicate”.

We are encouraged to believe that Google and other search engines are smart enough to identify which content is original and which is scraped. Yet there is no clear indication of how they handle it.

What we truly know is that:

1. Scraped content can outrank you!

Google’s algorithm is not foolproof, and you can actually lose business to your scrapers. For example, The Verge’s 8,000-word original article was outranked by The Huffington Post’s scrape. See the reaction of The Verge’s former editor-in-chief, Joshua Topolsky:

What’s most egregious about this @HuffingtonPost scrape is its theft of our SEO on title and text. Google “death of the american arcade” – Joshua Topolsky (@joshuatopolsky) January 23, 2013

For cases like this, Google suggests submitting a Scraper Report whenever you find scraped content outranking you.

[Image: Google Scraper Report form]

If you keep track of your keyword rankings and of the top sites ranking for your targeted keywords, you can quickly spot any scrapers outranking you and file the Scraper Report right away.

2. Attribution is still required to avoid trouble

Whether we’re talking about duplicate pages on your own website or syndicated content on third-party sites, Google recommends always helping it identify the original piece using the rel="canonical" link element or link attribution.

[Image: Google’s recommendation on duplicate content]

So unless you get credit for your content from your scrapers, you might be subject to a duplicate content penalty.

However, in some cases, content is deliberately duplicated across domains in an attempt to manipulate search engine rankings or win more traffic. […] In the rare cases in which Google perceives that duplicate content may be shown with intent to manipulate our rankings and deceive our users, we’ll also make appropriate adjustments in the indexing and ranking of the sites involved.

To sum it up, you have two options available:

  1. The “do nothing” approach. Let search engines handle them. (Risky)
  2. Reduce the risk by making scrapers link to the original content. (Recommended)

But how? – you may be asking. This is precisely what I’ll be sharing with you in this article – here are some of the ways to find scrapers and leverage them for link building. Enjoy!

How to find scrapers

1. Google Search

The easiest way to find scrapers is through Google Search. Just search for "allintitle:your original post title" and it will return all the pages using your article title. Scrapers usually copy content almost blindly or use automated scraping tools, so most of them will get caught by the allintitle method.

However, there are also scrapers who completely change the title. To catch them you will need some of the other methods I’ll cover in this post.

[Image: allintitle search results showing scraped copies]
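The query itself is easy to script if you want to check many posts at once. Here is a minimal sketch (the function names are my own) of building the allintitle search URL for a post title:

```javascript
// Build an "allintitle:" query and the matching Google search URL
// for a post title. Sketch only; function names are my own.
function allintitleQuery(title) {
  return "allintitle:" + title;
}

function googleSearchUrl(title) {
  return "https://www.google.com/search?q=" + encodeURIComponent(allintitleQuery(title));
}
```

You could run this over your latest post titles and scan the results pages for unfamiliar domains.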

2. Plagiarism Checkers

Plagiarism checking tools are the next best thing you can use to find scrapers. They are quite straightforward to use and will enable you to catch the smarter scrapers that slightly changed the title or the body of your content.

Copyscape is one of the best plagiarism checking tools on the block.

[Image: Copyscape plagiarism check results]

3. FeedBurner Uncommon Uses

RSS feeds are widely used for content scraping through automated tools that instantly copy your content whenever your feed is updated.

Fortunately, if you are a FeedBurner user you can find the list of abusers for your blog under [Analyze] -> [Uncommon Uses].

[Image: FeedBurner Uncommon Uses report]

How to leverage scrapers and get links

1. Have the perfect copyright notice

“Would scrapers even care about a copyright notice?” – you may ask. And you would be right: most of them probably won’t, especially those who don’t realize how serious the consequences of breaking copyright law are.

But there are still publishers who understand copyright policy and are interested in syndicating content without going against the law. If you think about it, syndication is just the well-mannered form of content scraping. Take The Huffington Post, Examiner or Social Media Today for example.

Having a copyright notice tells them how to get permission and publish the content under proper attribution. You can place it, as most websites do, in your site footer: “Copyright © 2015 John Doe. All rights reserved.” Or you can create a dedicated page or video on your website with rules for republishing. Whatever works best for you.

2. Use excerpts in feeds instead of full content

As I mentioned earlier, RSS feed scraping is huge, and keeping the full post in your RSS feed means letting scrapers copy your entire content. If you use excerpts in the feed instead, scrapers only get a small fragment of each post. By default, WordPress excerpts are about 55 words long; if your article is 1,000 words, having 55 duplicate words won’t be a big problem.

Sounds like a better plan, doesn’t it? Then go enable excerpts in your RSS feed.

[Image: WordPress excerpt setting for feeds]
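The idea behind excerpts is simple enough to sketch. This is my own illustration of WordPress-style 55-word truncation, not WordPress’s actual code:

```javascript
// Trim a post down to a WordPress-style excerpt: the first 55 words,
// with a "[...]" marker when the post was truncated.
// My own sketch, not WordPress's actual implementation.
function makeExcerpt(content, wordLimit = 55) {
  const words = content.trim().split(/\s+/);
  if (words.length <= wordLimit) return content.trim();
  return words.slice(0, wordLimit).join(" ") + " [...]";
}
```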

3. Place signature link in feeds

Since RSS scraping is mostly done with tools, you can get credit much more easily by dynamically appending a signature link to each post in your feed.

You can add it manually or use the WordPress SEO by Yoast plugin, which gives you the option of adding a signature link before or after the feed contents.

[Image: Yoast option for adding a signature link before or after feed contents]
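Under the hood, this is just string concatenation on each feed item. A minimal sketch (the function name and markup are my own illustration, not Yoast’s actual output):

```javascript
// Append a signature link to a feed item's HTML so that automated
// scrapers republish the credit along with the content.
// Sketch only; wording and markup are my own illustration.
function addFeedSignature(itemHtml, postUrl, siteName) {
  const signature =
    '<p>This post first appeared on <a href="' + postUrl + '">' +
    siteName + "</a>.</p>";
  return itemHtml + "\n" + signature;
}
```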

4. Allow embedding shareable content by attribution

Text is not the only target for thieves. Shareable content like images, videos, or presentations can also get copied. People love them, so they want to steal them – sorry, share them!

Why not get one step ahead of your scrapers and allow them to embed your visual assets? That way there would be no excuse for not adding a link back to your website, right?

[Image: embed code offered alongside visual content]
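Offering a ready-made embed snippet makes attribution the path of least resistance. Here is one way such a generator might look (the markup and URLs are placeholders of my own):

```javascript
// Generate copy-paste embed code for an image that carries a credit
// link back to the original page. Placeholder markup; adjust to taste.
function embedCode(imageUrl, pageUrl, title) {
  return (
    '<a href="' + pageUrl + '"><img src="' + imageUrl + '" alt="' + title + '"></a>' +
    '<br><small>Image: <a href="' + pageUrl + '">' + title + "</a></small>"
  );
}
```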

5. Append a “Read more” link to manually copied text

There are still lots of scrapers that manually copy content from your website. For them you can simply append the source link to the copied text using web scripts or tools such as Tynt Copy-Paste.

Some may spot and delete your link, but rest assured that a lot of them won’t bother doing it.

[Image: Tynt Copy-Paste tool]
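If you prefer not to rely on a third-party tool, the same trick can be done with a few lines of in-page JavaScript using the browser’s copy event. This is a generic sketch of the technique, not Tynt’s actual code:

```javascript
// Append a "Read more" source link to whatever text a visitor copies.
// appendSourceLink is the pure part; the commented-out wiring below
// shows how it would hook into the browser's "copy" event.
function appendSourceLink(copiedText, pageUrl) {
  return copiedText + "\n\nRead more at: " + pageUrl;
}

// In the browser:
// document.addEventListener("copy", (event) => {
//   const selection = document.getSelection().toString();
//   event.clipboardData.setData("text/plain",
//     appendSourceLink(selection, location.href));
//   event.preventDefault();
// });
```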

6. Have internal links in your content

Those tricky scrapers that copy your content but make enough adjustments to pass any plagiarism check are hard to catch without a broad investigation. Moreover, getting credit from them is even harder. But having internal links in your content can help you a great deal in this case.

Just make sure you have internal links in your copy, and I guarantee that scrapers won’t bother to check and correct all of them.

7. Reach out to people who used your content without attribution

Surprisingly (or not) you will also discover authoritative websites scraping your content, and you will definitely want to leverage those. Sort your list of scrapers, select the most authoritative and request an attribution link. It will be worth the effort.

Whoever copies your content, getting caught is something they’re (at least somewhat) afraid of. That is why reaching out to them for links will work.

I found that the most effective approach is “cold and threatening”. Here is an outreach email template you can use:

Subject: An important question about <Scraper site name>

Hi <Scraper name>,

I was looking around the web and found that you’ve republished one of my articles (URL: <the original content URL>). Unfortunately, I couldn’t find the credit link pointing to the original one.

That is an offense, isn’t it? I’m afraid that if you don’t add the credit, a DMCA complaint might take your site down, or there could be a case against you for breaking copyright law.

To be fair, I don’t want to see you in trouble. We allow republishing of our content under proper attribution. So if you add a credit to our original content, there won’t be any problems. But you need to do this right away.

Here is the HTML code you can use to link to us:

This post first appeared on <a href="http://www.example.com/original-content/">Original content title</a> by <a href="http://www.example.com/">Original site name</a>.

Thanks a lot! I hope to get your reply soon, before we are forced to take action.

Regards,
<Original publisher name>
Owner of <Original site name>

Wrapping Up!

Scrapers are everywhere, they’re difficult to prevent, and almost impossible to stop. We are left with only one option – to help Google and other search engines identify the original content. This is the recommended way to avoid any bad consequences and even gain some benefits along the way!

Note: The opinions expressed in this article are the views of the author, and not necessarily the views of Caphyon, its staff, or its partners.

Author: Abrar Mohi Shafee

Abrar Mohi Shafee is a web enthusiast, a blogger by nature and the founder of Blogging Spell. He finds himself most engaged by blogging, inbound marketing and SEO. You can catch up with him on his Twitter or Google+.

6 thoughts on “How to Leverage Seemingly Unstoppable Content Scraping”

  1. Hello, beautiful post! I have a question:

    With your software, how do I calculate keyword competition? For example, I would like to know the number of results in google.it for:

    dentist venice
    allintitle: dentist venice
    allintext: dentist venice
    allinanchor: dentist venice

    Thanks very much!

    1. Just create a Keyword Ranking printable report, making sure you have the “Competition” column added, and you’re set 😉

      Glad you liked the post!

  2. Thanks Abrar, a pretty good overview of the measures you can take against scraping!

    You mention Copyscape, this unfortunately is a very unreliable service. They just don’t have the resources to keep up with the ever growing content ‘production’ we see on the web. But you are right, scrapers regularly change the headline. The way I solve this: just copy paste a unique sentence of your article into Google’s search box and put the phrase between double quotes.

    Also note that the double quotes are more reliable than Google’s allintitle operator. Especially when you don’t use Google.com but a specific country version: often it doesn’t return ANY results.

    Today you see not nearly as many results for Ann Smarty’s article; I think (hope) that’s because of Google’s spam detection.

    1. Hi Arne,

      Thanks for your addition. Indeed, double quotes are useful, and I’m glad to hear your opinion about them. Google seems to be good at detecting content spam and is continuously filtering it out.

  3. Hi Abrar,

    Content duplication is a dangerous thing for bloggers. We put a lot of work into creating content, but some people easily copy it, make a few changes, and rank higher than the original. All these options are very helpful for getting out of this problem. Thank you very much for sharing the information.

    1. Hi Siddaiah,

      Yeah, I’ve seen it happen in front of my eyes. I looked into ways to prevent its bad impact, and these are some of the ones I found.

      Glad you liked it. Thanks for your comment.
