Someone scraped your content, and you need to take them down. I bet you will try to:
- File a DMCA complaint
- Report to Google
- Block their IP address
- Report to their web host
- Report to their ISP
But how long can you keep that up? You may manage it for dozens, or even hundreds, of scrapers. But what if there are thousands or more? It’s ultimately impossible to keep up.
A well-known site can have millions of scrapers. The more your popularity rises, the more you get scraped. And you know what, you can’t stop ALL of them. No matter what technique you use, there will always be an alternative way to copy your content.
Take Moz for example. Each of their posts gets scraped hundreds, even thousands of times.
For Ann Smarty’s latest post on YouMoz, which got promoted to Moz’s main blog, I found 2,620 scrapers by April 17th, 2015 – only two months after publishing. Here is the search query:
If a single post is scraped 2,620 times in just a couple of months (and increasing daily), requiring months of effort to clean up, I’m afraid that taking down all Moz’s scrapers would take decades!
Is that [scraping] a valid SEO concern?
Getting scraped means that your original content will also appear on your scrapers’ websites. Googlebot will see the same version of text across multiple websites, and your content will get labeled as “duplicate”.
We are, however, encouraged to believe that Google and other search engines are smart enough to tell the original content from the scraped copies. Yet there is no clear indication of how they handle it.
What we truly know is that:
1. Scraped content can outrank you!
Google’s algorithm is not really foolproof, and you can actually lose business to your scrapers. For example, The Verge’s 8,000-word original article got outranked by The Huffington Post’s scraped copy. See the reaction of The Verge’s former editor-in-chief, Joshua Topolsky:
For cases like this, Google suggests submitting a scraper report whenever you find scraped content outranking you.
If you keep track of your keyword rankings and of the top sites ranking for your targeted keywords, you should be able to spot any scrapers outranking you and file the scraper report immediately, with very little effort.
2. Attribution is still required to avoid trouble
Whether we’re talking about duplicate pages on your own website or syndicated content on third-party sites, Google recommends always helping them identify the original piece, either with the rel="canonical" link element or through link attribution.
So unless you get credit for your content from your scrapers, you might be subject to a duplicate content penalty.
However, in some cases, content is deliberately duplicated across domains in an attempt to manipulate search engine rankings or win more traffic. […] In the rare cases in which Google perceives that duplicate content may be shown with intent to manipulate our rankings and deceive our users, we’ll also make appropriate adjustments in the indexing and ranking of the sites involved.
To sum it up, you have two options available:
- The “do nothing” approach. Let search engines handle them. (Risky)
- Reduce the risk by making scrapers link to the original content. (Recommended)
But how? – you may be asking. That is precisely what I’ll be sharing with you in this article: some of the ways to find scrapers and leverage them for link building. Enjoy!
How to find scrapers
1. Google Search
The easiest way to find scrapers is through Google Search. Just search for allintitle:"your original post title" and Google will return all the pages using your article’s title. Scrapers usually copy content almost blindly, or use automated content scraping tools, so most of them will get caught by the allintitle method.
However, there are also scrapers who completely change the title. To catch them you will need some of the other methods I’ll cover in this post.
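If you want to run this check for many posts, the allintitle query is easy to build programmatically. Here is a minimal Python sketch; it just constructs the standard Google search URL (this is not an official API, and heavy automated querying would violate Google’s terms, so treat it as a convenience for manual checks):

```python
from urllib.parse import quote_plus

def allintitle_query_url(post_title):
    """Build a Google search URL for the allintitle: operator.

    allintitle: restricts results to pages whose <title> contains
    every word of the quoted post title.
    """
    query = f'allintitle:"{post_title}"'
    return "https://www.google.com/search?q=" + quote_plus(query)

# Example: paste the resulting URL into your browser
print(allintitle_query_url("How to Leverage Scrapers for Links"))
```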
2. Plagiarism Checkers
Plagiarism checking tools are the next best thing you can use to find scrapers. They are quite straightforward to use and will enable you to catch the smarter scrapers that slightly changed the title or the body of your content.
Copyscape is one of the best plagiarism checking tools on the block.
3. FeedBurner Uncommon Uses
RSS feeds are heavily used for content scraping, through automated tools that instantly copy your content whenever your feed is updated.
Fortunately, if you are a FeedBurner user you can find the list of abusers for your blog under [Analyze] -> [Uncommon Uses].
How to leverage scrapers and get links
1. Have the perfect copyright notice
“Would scrapers even care about a copyright notice?” – you may ask. And you would be right: most of them probably won’t, especially those who don’t even realize how serious the consequences of breaking copyright law are.
But there are still publishers who understand copyright policy and are interested in syndicating content without going against the law. If you think about it, syndication is just the well-mannered form of content scraping. Take The Huffington Post, Examiner or Social Media Today for example.
Having a copyright notice will advise them on how to get permission and publish the content under proper attribution. You can place it, as most websites do, in your site footer: “Copyright © 2015 John Doe. All rights reserved.” or you can create an additional page or video on your website with rules for republishing. Whatever works best for you.
2. Use excerpt in feeds instead of full content
As I mentioned earlier, RSS feed scraping is huge, and keeping the full post in your RSS feed means allowing scrapers to copy your entire content. If you publish only excerpts in the feed, you greatly reduce potential duplicate content issues. By default, WordPress excerpts are 55 words long; if your main content is 1,000 words, 55 duplicated words won’t be a big problem.
Sounds like a better plan, doesn’t it? Then go enable excerpts in your RSS feed.
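To make the idea concrete, here is roughly what WordPress’ default excerpt trimming does, sketched in Python. The 55-word limit and the “[…]” marker are WordPress defaults (both are filterable in a real theme); the function name is mine:

```python
def trim_excerpt(content, num_words=55, more="[…]"):
    """Keep the first num_words words of a post and append a 'more'
    marker if anything was cut -- a sketch of WordPress' default
    excerpt behaviour."""
    words = content.split()
    if len(words) <= num_words:
        return content
    return " ".join(words[:num_words]) + " " + more
```

So a 1,000-word post contributes only its first 55 words to the feed, and everything after the marker stays exclusive to your site.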
3. Place signature link in feeds
Since RSS scraping is mostly done with tools, you can get credit much more easily by dynamically appending a signature link to each post in your feed.
You can add it manually or use the WordPress SEO by Yoast plugin, which gives you the option of adding a signature link before or after the feed contents.
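If you’d rather roll it yourself than use a plugin, the signature is just a string appended to each feed item before output. A hedged Python sketch (the wording, function name, and site name are placeholders, not plugin output):

```python
def add_feed_signature(item_html, post_url, post_title, site_name="Example Blog"):
    """Append an attribution line to a feed item's HTML so that
    automated scrapers republish the link back along with the post."""
    signature = (
        f'<p>The post <a href="{post_url}">{post_title}</a> '
        f'appeared first on {site_name}.</p>'
    )
    return item_html + signature
```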
4. Allow embedding shareable content with attribution
Text is not the only target for thieves. Shareable content like images, videos, or presentations can also get copied. People love them, so they want to ~~steal~~ share them!
Why not get one step ahead of your scrapers and let them embed your visual assets properly? That way, there would be no excuse for not adding a link back to your website, right?
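An embed code generator only needs to bundle the asset with a credit link, so anyone who pastes it links back automatically. A minimal Python sketch (all URLs, names, and the markup style below are illustrative):

```python
def embed_snippet(image_url, page_url, title, site_name="Example Blog"):
    """Generate copy-paste embed code for an image: the asset itself,
    wrapped in a link to its source page, plus a visible credit line."""
    return (
        f'<a href="{page_url}"><img src="{image_url}" alt="{title}"></a><br>'
        f'Courtesy of <a href="{page_url}">{site_name}</a>'
    )
```

Put the generated code in a small textarea next to the asset, the same way video platforms expose their embed codes.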
5. Append a “Read more” link to manually copied text
There are still lots of scrapers that manually copy content from your website. For them you can simply append the source link to the copied text using web scripts or tools such as Tynt Copy-Paste.
Some may spot and delete your link, but rest assured that a lot of them won’t bother doing it.
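Conceptually, what such a copy-paste script does is simple: when the copied selection is long enough to be worth attributing, it appends a source line. A Python sketch of that logic (the 40-character threshold is my own illustrative choice, not a setting of any particular tool):

```python
def with_source_link(copied_text, source_url, min_length=40):
    """Append a 'Read more' attribution line to copied text, but only
    for selections long enough to be worth attributing -- short
    snippets pass through unchanged."""
    if len(copied_text) < min_length:
        return copied_text
    return copied_text + "\n\nRead more: " + source_url
```

In practice this transformation runs in the browser, hooked into the page’s copy event.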
6. Have internal links in your content
Those tricky scrapers that copy your content but make enough adjustments to pass any plagiarism check are hard to catch without a broad investigation. Moreover, getting credit from them is even harder. But having internal links in your content can help you a great deal in this case.
Just make sure you have internal links in your copy, and I guarantee that scrapers won’t bother to check and correct all of them.
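One related safeguard worth considering (my own suggestion, not a claim about any specific tool): make those internal links absolute URLs, so they still point at your site when the HTML is copied wholesale. A quick Python sketch; a real site would use an HTML parser rather than this simple regex:

```python
import re

def absolutize_links(html, base_url):
    """Rewrite site-relative hrefs (href="/path") into absolute URLs,
    so internal links survive when the markup is scraped verbatim.
    Regex sketch only -- it ignores protocol-relative and quoting
    edge cases an HTML parser would handle."""
    return re.sub(r'href="/', f'href="{base_url}/', html)
```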
7. Reach out to people who used your content without attribution
Surprisingly (or not) you will also discover authoritative websites scraping your content, and you will definitely want to leverage those. Sort your list of scrapers, select the most authoritative and request an attribution link. It will be worth the effort.
Whoever copies your content is (at least somewhat) afraid of getting caught. That is why reaching out to them for links works.
I found that the most effective approach is “cold and threatening”. Here is an outreach email template you can use:
Subject: An important question about <Scraper site name>
Hi <Scraper name>,
I was looking around the web and found that you’ve republished one of my articles (URL: <the original content URL>). Unfortunately, I couldn’t find a credit link pointing to the original one.
That is a copyright infringement, isn’t it? I’m afraid that if you don’t place the credit, we may have to file a DMCA takedown notice to have your site taken down, or there could be a case against you for breaking copyright law.
To be fair, I don’t want to see you in trouble. We allow republishing of our content under proper attribution. So if you add a credit to our original content, there won’t be any problems. But you need to do this right away.
Here is the HTML code you can use to link to us:
This post first appeared on <a href="http://www.example.com/original-content/">Original content title</a> by <a href="http://www.example.com/">Original site name</a>.
Thanks a lot! I hope to get your reply soon, before we are forced to take action.
<Original publisher name>
Owner of <Original site name>
Scrapers are everywhere, difficult to prevent, and almost impossible to stop. We are left with only one option – helping Google and other search engines identify the original. This is the recommended way to avoid any bad consequences, and even gain some benefits along the way!
Note: The opinions expressed in this article are the views of the author, and not necessarily the views of Caphyon, its staff, or its partners.