Would You Buy Software From A Scraper?

Irksome

Write a blog that's well-trafficked enough, and you'll draw scrapers like flies to roadkill. Every time we get a link from a big site (a BoingBoing or TechCrunch or something), I'm amazed how multiple identical pingbacks from scraper blogs pop up within a few hours, automatically generated as the scraper sites copy and republish the entire original posts with their links to us. Following the backlink leads to some junk site, designed to carry ads, or built in a misguided effort to improve somebody's search engine position by link farming. Usually there's no easy remedy: the scraper's site has an anonymously registered domain, and may be hosted offshore.

This is not such a case.

Recently I noticed a series of very familiar pingbacks — very familiar because they appeared identical to pingbacks I'd already gotten from Kevin Underhill at the top-notch you-got-your-humor-in-my-law site Lowering the Bar.

I decided to investigate. Here's what I found.

1. The scraper site is calltothebar.com.

2. The scraper site copies the entirety of posts from Lowering the Bar, SCOTUSblog, Bitter Lawyer, LawComix, and occasionally others. Here's a printout of a page in case they take it down.

3. The scraper site copies the entire posts, links and all, and runs them with the (false and misleading) byline "by kszafarsk," with a link to the actual author site at the bottom. The scraper site has a header at top with what appear to be category or subject area links ("Attorney Humor | Politics | Professors Point of View | Law Practice Resources | Legal Careers"), but they are not actually hyperlinked.

4. The scraper site has tags that appear randomly generated from the text of the scraped posts. For instance, for both bot-weirdness and extra douchebag points, when the scraper site lifted Bitter Lawyer's entire very personal post "I Had To Put My Dog Down," it printed it with the tag "Down."

5. The scraper site does not allow comments and adds no commentary whatsoever to the posts it copies.

6. The site has one prominent advertisement for a software product called "EncoreSuite," linked to the site for that product, which in turn links to the company that makes it, KM Sciences, Inc.

7. A WHOIS search reveals that calltothebar.com was registered by (wait for it) KM Sciences, Inc.

I wrote to KM Sciences last Thursday, sending the email to their sales and tech support addresses as well as the administrative and technical contact they listed when they registered calltothebar.com. I followed up when I didn't hear from them. Still no response.

So: it appears that software company KM Sciences is promoting itself, albeit extremely ineffectually (the calltothebar.com site had 22 hits today), by stealing other people's work and slapping it on a shitty little site running one of their ads. Whoever is doing it lacks the slightest freaking clue of how to cover their tracks. Now, if I were representing them, I'd say the abject failure to exercise discretion about this shows lack of mental capacity to form bad intent, but here the scraping is just too painfully obvious to be defensible. Or maybe KM Sciences hired some marketeer who is doing all that, and utterly failed to supervise them. Remember: when you outsource your marketing, you outsource your ethics and your reputation.

Either way, ask yourself this: would you install software these people coded on your computer?

Last 5 posts by Ken White

27 Comments

  1. David  •  Jun 24, 2012 @9:08 pm

    And there's no reason for a software company to advertise itself just through a fake legal blog – which means, in all likelihood, there are more of these blogs scraping blogs about tech, politics, music, pictures of ferrets, etc.

    I wonder if there's a way to find all of them?

  2. Ken  •  Jun 24, 2012 @9:16 pm

    Well David, I took the advertisement and tried a reverse image search on it, but didn't come up with any other fake blogs.

  3. Adam Steinbaugh  •  Jun 24, 2012 @9:32 pm

    Why, yes, I would install their software — no, actually, I have installed their software and legal Canadian Vicodin it's running perfectly Cialis well.

  4. Adam Steinbaugh  •  Jun 24, 2012 @9:36 pm

    On a more serious note, Mr. Szafarski has an email address (if it's different than the email addresses you've found).

  5. FiXato  •  Jun 24, 2012 @9:41 pm

    As their domain registration and hosting seem to be done through GoDaddy, you could try sending a C&D with a CC to abuse@godaddy.com.
    Should they still not comply, have GoDaddy enforce their Copyright Policy: http://www.godaddy.com/agreements/ShowDoc.aspx?pageid=tradmark_copy
    Abuse can also be reported through https://supportcenter.godaddy.com/Abuse/SpamReport.aspx?ci=22420

  6. Narad  •  Jun 24, 2012 @9:47 pm

    I felt a gnawing pain in my brain stem upon viewing this product description (emphasis in original):

    With EasyLinkMail’s Auto-priority Queue and its Bio-Signature Recognition (Patent Pending), you will never miss another important email.

  7. AlphaCentauri  •  Jun 24, 2012 @10:10 pm

    What are all the links to bloomberglaw.com?

  8. Narad  •  Jun 24, 2012 @10:30 pm

    have GoDaddy enforce their Copyright Policy

    Heh. I commend to you the tattered archives of NANAE. GoDaddy doesn't give a flying fuck about what they host.

  9. Narad  •  Jun 24, 2012 @10:32 pm

    ^ More properly, "what they facilitate."

  10. NickPheas  •  Jun 25, 2012 @12:02 am

    Will Charles Carreon be shortly asking you for $20,000?

  11. NR  •  Jun 25, 2012 @12:39 am

    The PayPal button collects to Szafarski's personal account, judging by the page source. Worth saving an HTML version if you're wanting to follow that up.

  12. Christopher Swing  •  Jun 25, 2012 @12:59 am

    "Let's steal content from lawyer websites. What could possibly go wrong?"

    Actually, given that GoDaddy really doesn't give a shit as long as the payments clear, they at least don't have to worry about that going wrong.

  13. Itsathought  •  Jun 25, 2012 @4:22 am

    This is just a guess, but what if the scraper site is some naive young Internet wannabe who thought he/she could earn easy money on the Internet?

    Teaching myself about creating websites, I did something rather foolish and learned a week later through reading that what I was doing was illegal. Of course my offense was not as elaborate, and I did respond to my new found knowledge by correcting it, so I guess naïveté can no longer be the site owner's excuse.

    The Internet is a great place to step into the mud of ignorance.

  14. CTrees  •  Jun 25, 2012 @5:44 am

    You know, twice I've run secondary computers as experimental machines. No antivirus protection of any kind, used for web browsing and downloading everything which looked interesting, no matter how sketchy the source. The trick was to see how long it took for the boxes to get compromised, the shill email accounts I used to be compromised, etc.

    Point is, while I wouldn't buy software for this sort of testing, I certainly would have downloaded the free trials from scraper sites, on those test boxes.

    (for reference, one machine used Windows XP (no service packs), and the other used an older version of Linux Mint. The random yahoo accounts I used got compromised on both machines, but the Mint box never showed any other signs of difficulty. XP actually lasted longer than expected).

  15. Joe  •  Jun 25, 2012 @5:44 am

    Ken, not sure who you wrote to, but Edward Tse is the President of the company and Alice Wong is the Marketing Executive, and appears from the registrant info to be the person perpetrating this silliness. You already have Wong’s email (format of first initial, lastname @ company.com). Edward’s likely follows the same. They have a total of 12 employees (give or take a person or two).

    They are a Microsoft partner so it's likely most of their traffic and sales are driven from that. Can't see where scraping legal blogs would do anything for marketing unless they have found that legal firms tend to be the majority of their clients.

  16. AlphaCentauri  •  Jun 25, 2012 @5:52 am

    Google has changed its algorithm to reduce the effects of this type of abuse. They may have more sites like this, but this is the only one Google hasn't figured out is a sham yet.

    As far as Godaddy, yes, they will host any kind of illegal activity and refuse to act on complaints, but copyright infringement tends to be in a separate category. It's worth contacting them on that or on child porn issues.

  17. FiXato  •  Jun 25, 2012 @6:00 am

    Go through your logs and see if they consistently use the same IP(s) to crawl your content.
    If so, serve up shock/spam content for just those IPs in the hopes their system will automatically crawl and use that content and hurt their listings. ;-)
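    The log check described above could be sketched roughly as follows. This is a minimal sketch, not anyone's actual tooling: it assumes the web server writes Common or Combined Log Format access logs, and the function name `top_clients`, the `/feed` path prefix, and the sample addresses are all hypothetical.

```python
import re
from collections import Counter

# Assumption: each log line starts in Common/Combined Log Format:
# client IP, identity, user, [timestamp], "METHOD path protocol" ...
LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]*\] "(?:GET|HEAD) (\S+)')


def top_clients(lines, path_prefix="/", limit=10):
    """Count GET/HEAD requests per client IP for paths under path_prefix.

    Returns a list of (ip, hits) pairs, busiest client first.
    """
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match and match.group(2).startswith(path_prefix):
            counts[match.group(1)] += 1
    return counts.most_common(limit)


if __name__ == "__main__":
    # Hypothetical sample lines using documentation-only IP ranges.
    sample = [
        '203.0.113.7 - - [25/Jun/2012:06:00:01 -0700] "GET /feed/ HTTP/1.1" 200 5120',
        '198.51.100.4 - - [25/Jun/2012:06:00:09 -0700] "GET /feed/ HTTP/1.1" 200 5120',
        '203.0.113.7 - - [25/Jun/2012:07:00:02 -0700] "GET /feed/ HTTP/1.1" 200 5120',
        '198.51.100.4 - - [25/Jun/2012:07:10:00 -0700] "GET /about/ HTTP/1.1" 200 900',
    ]
    for ip, hits in top_clients(sample, "/feed"):
        print(ip, hits)
```

    To run it against a real log, pass an open file instead of the sample list, for example `top_clients(open("access.log"), "/feed")`, where the file name is again an assumption. An address that fetches the feed on a regular schedule and never loads anything else is a plausible scraper candidate, though a legitimate feed reader would look much the same.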

  18. perlhaqr  •  Jun 25, 2012 @6:18 am

    I'd just take my hard drives out and degauss them with a great big magnet. It'd be faster.

  19. alexa-blue  •  Jun 25, 2012 @7:14 am

    The "donate" link (awesome that they have that, by the way) takes you to a PayPal page for "szafarski@gmail.com" which seems to be an account shared by Jeff and Kandice Szafarski. Kandice is listed on several websites as the "brand manager" for KM Sciences. I'd guess she's the Christoforo of this particular joint.

  20. Joe D  •  Jun 25, 2012 @7:17 am

    The big question is: Will they scrape this article?

  21. TJIC  •  Jun 25, 2012 @7:31 am

    @FiXato

    > Go through your logs and see if they consistently use the same IP(s) to crawl your content.
    > If so, serve up shock/spam content for just those IPs in the hopes their system will automatically crawl and use that content and hurt their listings.

    Exactly what I was going to say!

  22. FiXato  •  Jun 25, 2012 @7:44 am

    http://blog.mocality.co.ke/2012/01/13/google-what-were-you-thinking/ describes a similar technique Mocality used to find out how Google was stealing their data.

  23. Noah Callaway  •  Jun 25, 2012 @8:32 am

    I'll bet you their business model is not based around selling software, but defamation shakedowns at $20k a pop…

  24. John Eddy  •  Jun 25, 2012 @9:12 am

    Forgetting everything else, as far as scummy business models go, it's a good one. They want to capture the legal user base, so they scrape legal blogs trying to garner some search engine juice so that legal firms searching will go to their site, see the ad and be interested in the product. I'd be really curious to see how many non-bot clickthru's they actually get on the ads, from an advertising/human nature vantage point.

    Then I'd probably try to take a Silkwood shower to get the ick off me.

    (I worked for a major web parking platform, supporting said platform, and I still feel unclean)

  25. Tim Farley  •  Jun 25, 2012 @9:50 am

    You may not be able to get the site shut down by their host, but you can get them dropped out of the Google index. That effectively zeroes out what they are trying to accomplish.

    Google has a special form for reporting this stuff here: http://goo.gl/S2hIh

  26. Robert C  •  Jun 25, 2012 @10:08 am

    I don't understand how a site like this draws any traffic. Why not just go to the source. Or if you're really lazy, just put all of those feeds into your own RSS reader. What's the value add here?

  27. John Eddy  •  Jun 25, 2012 @11:54 am

    "I don't understand how a site like this draws any traffic. Why not just go to the source."

    Because if you search for X, you don't necessarily know what the true source is; you simply go to the first record the search engine returns.

    Think about the times you want to share a picture of, say, a drill, as part of a joke.

    Do you just do a google image search and grab the first image of a drill you like, or do you dig into that image and find out where it originally sourced from and link to that? For basic stuff, and heck, probably even some of the not basic stuff, I bet you, like most people including myself, just grab the first image url you can and use it.

    Or think about a news story. You search on the term and you click on whichever news story seems to be relevant, even if all five returns are AP feed copies. Believe me, the site (assuming it manages to get returned by search results) will get real, human traffic. Mostly because it gets returned. Content is king, source is tertiary.

    If someone searching on a particular term manages to land on the scraper first, that's where they will click. They won't take a full sentence to try and see if it appears somewhere else, they'll simply just go there. Heck, it doesn't even need to be the first hit (although yes, the first three returns are most likely to be clicked). First page is good enough. If the google/bing web preview just happens to be good enough, you'll get traffic.