
December 11, 2005

Web Site News - Another Go-Round Versus the Spammers

by Ferdinand T Cat

The web site will be funky for a while as Bruce goes through another cycle of anti-spam tuning.

We did learn one important thing: when you're debugging the comment spam filter, you can't be signed into TypeKey. TypeKey customers bypass the spam filters. Thus, it's possible to spend several hours trying to figure out why the spam filter isn't getting called just because you're signed in as "Bruce Parrello" instead of "Bruce the Human Pet".

The problem is that we can't rely on keyword filters and link counts any more, because the spam feedback is almost completely content-free. The one advantage we have is that a false positive can be corrected manually. So, if Bruce's new algorithm catches too much in the net, it's less of a problem than catching too little. Still, we want the user to have positive reinforcement, so we want to allow as much good stuff through as we can.

The biggest problem with the built-in Movable Type spam prevention is that it treats trackbacks and comments as the same thing. They aren't. A real comment is generally full of content. Trackbacks have very little content (25 words or less), but are guaranteed to contain HTML. Furthermore, legitimate comments are generated interactively by human beings, but legitimate trackbacks are generated by machines, just like spam comments.

The keyword-based method we've been using up until now has had the severe drawback that if we wanted to prevent casino owners from flooding us with trackback spam, we also had to prevent commenters from talking about gambling.

So, Bruce's new filter uses a two-pronged approach. We use a CAPTCHA code for comments, and a content-based filtering scheme for trackbacks. In the case of the trackback, the content filtering is restricted to the blog name, the blog title, and the URLs.
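For the curious, here is a rough Python sketch of the two-pronged idea. It is not Bruce's actual code (Movable Type plugins are written in Perl), and the `Feedback` record, the `BLOCKED_TERMS` list, and the `is_junk` function are all made up for illustration. It only shows the shape of the thing: comments are judged by the CAPTCHA alone, and trackbacks are judged by their blog name, title, and URL, never by the excerpt text.

```python
from dataclasses import dataclass


@dataclass
class Feedback:
    """Hypothetical record for one piece of incoming feedback."""
    kind: str                 # "comment" or "trackback"
    blog_name: str = ""       # trackbacks only
    title: str = ""           # trackbacks only
    url: str = ""             # trackbacks only
    body: str = ""            # comment text or trackback excerpt
    captcha_ok: bool = False  # comments only


# Assumed example terms; the real list is not given in this post.
BLOCKED_TERMS = {"casino", "poker"}


def is_junk(fb: Feedback) -> bool:
    """Return True if the feedback should be junked."""
    if fb.kind == "comment":
        # Comments: the CAPTCHA is the only test, so people can
        # talk about gambling as much as they like.
        return not fb.captcha_ok
    # Trackbacks: scan only the blog name, title, and URL.
    scanned = " ".join([fb.blog_name, fb.title, fb.url]).lower()
    return any(term in scanned for term in BLOCKED_TERMS)
```

With this split, a human who passes the CAPTCHA can write "poker" in a comment without tripping anything, while a trackback from a blog calling itself "Best Casino Deals" is junked on its name alone.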

We have one advantage. Trackback spammers generally aim for old articles. This supposedly increases their chance of getting a good Google ranking out of the link, and it also makes it less likely that the blog owner will see the trackback. Our big gun, therefore, is to junk trackbacks on articles more than 30 days old. We've also been experimenting with things like counting the number of slashes in the target URL, examining relationships between the blog title and the article title, and so forth. Tonight the blog will be extremely unstable, but once we get this new framework implemented, future changes should be simple and fast.
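Here is a small Python sketch of the two simplest checks mentioned above: the 30-day age rule and the slash count. Again, this is an illustration and not the code running on the site. The function names are invented, and we assume the slash count covers only the path part of the URL. No cutoff for the slash count is shown, because that part is still an experiment.

```python
from datetime import datetime, timedelta
from urllib.parse import urlparse

# The 30-day limit comes from the post; the constant name is made up.
MAX_ENTRY_AGE = timedelta(days=30)


def entry_too_old(entry_posted: datetime, ping_received: datetime) -> bool:
    """True if the article was more than 30 days old when the ping arrived."""
    return ping_received - entry_posted > MAX_ENTRY_AGE


def count_path_slashes(url: str) -> int:
    """Count slashes in the path part of a URL (assumed reading of the rule)."""
    return urlparse(url).path.count("/")
```

A ping that arrives on December 11 for an article posted on November 1 would be junked by the age rule, while a ping for an article posted earlier the same day would go on to the other checks.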

Respectfully submitted,

Ferdinand T. Cat


At Sun 11:32 PM

Trackback Pings

» Lets Make Christmas, Christmas Again! from Oblogatory Anecdotes
Santa Clause, Christmas Trees, lights, candles, shopping at the mall, and presents have diluted what Christmas is supposed to be all about. Answering Charlie Brown’s question “What is Christmas all about?” on A Charlie Brown Christmas, Linus recites ... [Read More]

Tracked on December 11, 2005 11:50 PM

» Bonfire of the Vanities #128 from Free Money Finance
Welcome to this week's edition of the Bonfire of the Vanities. I'm so excited to be the host as this is one of my favorite blog carnivals. Why? Where else can you make fun of what people have written and [Read More]

Tracked on December 13, 2005 6:05 AM

Comments

I hope Bruce's hard work will not be in vain. Those spammers are sneaky. Free hold-em viagra poker.


Posted by: Sean Gleeson at December 12, 2005 5:13 PM
