
October 24, 2005

Blogging Hell - Pong Prevention and the Mysteries of Movable Type Spam

by Ferdinand T Cat

We are all familiar with the concept of pongs, though we may not be familiar with the term. A pong is a duplicate trackback ping. It's very easy to create these things in today's environment. When you are publishing an article that has a lot of outgoing pings, half of them will fail, and you have to save the article over and over again before all of them go through.

If the target server is slow to respond, the ping will fail with an HTTP error 500 read timeout. The blog will keep the ping in the outgoing ping box, but in almost every case, you'll find that the ping went through. The next time you save the article, a duplicate ping will be sent. We sometimes get many identical copies of carnival pings this way.

As bad as the duplicate pings can be in normal blogs, if you send a duplicate ping to a Haloscan blog, it will reply with a nasty message accusing you of being obsessive. Since the message is an error message, the duplicate ping stays in your outbox and you will soon find yourself the recipient of dozens of messages disparaging your mental state. A flurry of these after a particularly long carnival post so unnerved Bruce that he made an appointment with a psychiatrist. (Eventually, they decided that Bruce's brain is like his liver: it's not working correctly, but they don't know why.)

To avoid this problem, when you get timeout errors, you must delete the offending pings from the outgoing ping box and then re-save the article one more time so that the blog software knows that ping is gone.
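
The rule above can be sketched in a few lines of Perl. This is only an illustration and not part of Movable Type: pings_to_retry is a made-up helper, the URLs are placeholders, and the assumption that a timeout shows up as the text "read timeout" in the error message is taken from the error quoted earlier.

```perl
use strict;
use warnings;

# Hypothetical helper: takes target URL => error text pairs (empty string
# means the ping succeeded) and returns the URLs worth pinging again.
# Timeouts are dropped because the ping almost always went through anyway.
sub pings_to_retry {
    my (%result) = @_;
    my @retry = grep {
        $result{$_} ne '' && $result{$_} !~ /read timeout/i
    } keys %result;
    return sort @retry;
}

my @again = pings_to_retry(
    'http://a.example/tb' => '',
    'http://b.example/tb' => 'HTTP error: 500 read timeout',
    'http://c.example/tb' => 'HTTP error: 404 Not Found',
);
print "$_\n" for @again;    # prints only http://c.example/tb
```

Anything that timed out is treated as delivered, which is the same decision you make by hand when you delete it from the outgoing ping box.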

If you have Movable Type 3.2 and you know a little bit about programming in Perl, there is a way to modify your blog software to prevent incoming pongs. First, however, we need to discuss the strange way spam prevention works in Movable Type.

Every incoming comment or ping in Movable Type is analyzed by three SpamLookup plugins, and each one returns a number called the junk score, which is normally negative. The scores are combined, and if the result is less than 0, the comment or trackback is placed in the junk folder for disposal. It's important to discuss each one in detail because Movable Type's recommended defaults are totally wrong.

  1. Lookup checks the IP address and domain name against known spammer IDs. It also junks trackbacks that appear to be from suspicious sources. This latter test (known as Basil's Bug) almost always generates a false positive, so be sure to turn it OFF in your configuration.
  2. Links examines the links in the comment or trackback. It will moderate or junk a comment when it has too many links. Almost all spam contains at least 3 links, so you want to set this to junk a comment with 3 links and turn moderation off. This is also the module that gives people a bonus if their email address or URL has already been published, so it's possible for this one to come back with a positive score rather than a negative one.
  3. Keyword Filter looks in the entire text of the trackback or comment (including the URL, email address, and user name) and penalizes it for each junk word. My list is shown below. The numbers indicate the penalty for each item. Slashes mean that the string is a regular expression instead of a word. Inside the slashes, a dot (.) is a wildcard character; after the closing slash, an i indicates a case-insensitive match. So, for example, "cialis" must appear as a word surrounded by punctuation, and its junk score is 2. "big boobs", however, is caught no matter where it occurs, and the character between "big" and "boobs" can be anything: a space, a period, an underscore, or even a hyphen.

    cialis 2
    /casino/i 2
    phentermine
    /poker/i 4
    /you may find it/i 2
    /<h/i 2
    /update your site soon/i 3
    favourits 4
    payday loans 3
    foo 2
    /foosh/i 3
    /big.boobs/i 3

The "/<h/" pattern is particularly important. Google looks for heading tags when it does its ranking. Approximately 80% of the comment spam we get has <h1> tags in it.
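
To see how the two kinds of entries behave, here is a minimal standalone sketch in Perl. It is not the SpamLookup code: parse_rule and keyword_penalty are made-up names, and it assumes that plain words match case-sensitively, that each entry is counted once however often it appears, and that entries without a number are skipped.

```perl
use strict;
use warnings;

# Hypothetical helper: turn one list entry into [compiled regex, weight].
sub parse_rule {
    my ($line) = @_;
    # "/pattern/flags weight" is a regular expression that matches anywhere.
    if ($line =~ m{^/(.+)/([a-z]*)\s+(\d+)$}) {
        my ($pattern, $flags, $weight) = ($1, $2, $3);
        my $re = $flags =~ /i/ ? qr/$pattern/i : qr/$pattern/;
        return [$re, $weight];
    }
    # "word weight" is a literal that must stand alone as a word.
    if ($line =~ /^(.+?)\s+(\d+)$/) {
        my ($word, $weight) = ($1, $2);
        return [qr/\b\Q$word\E\b/, $weight];
    }
    return;    # no weight given: ignored in this sketch
}

# Add up the weight of every entry that matches the text.
sub keyword_penalty {
    my ($text, @lines) = @_;
    my $total = 0;
    for my $rule (map { parse_rule($_) } @lines) {
        my ($re, $weight) = @$rule;
        $total += $weight if $text =~ $re;
    }
    return $total;
}

my @rules = ('cialis 2', '/poker/i 4', '/<h/i 2', '/big.boobs/i 3');
print keyword_penalty('Play POKER now <h1>win</h1>', @rules), "\n";    # prints 6
print keyword_penalty('specialist advice', @rules), "\n";              # prints 0
```

The second call shows why the word form matters: "specialist" contains "cialis", but not as a separate word, so it escapes the penalty.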

Now here's the extremely weird part: the junk score is the average (arithmetic mean) of the non-zero scores from the three modules. So, if a comment contains "poker", it has a penalty of 4, but if the comment contains "poker" and also comes from a banned domain, its penalty drops to 2.5, because there are 2 non-zero results (4 and 1) and (4+1)/2 is 2.5. This is why the weights on the keywords are so high.
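
The arithmetic is easy to check with a few lines of Perl. This is a sketch of the averaging rule as described above, not the Movable Type source; combined_score is a made-up name, and the numbers are the penalty sizes from the example.

```perl
use strict;
use warnings;

# Hypothetical helper: average only the non-zero module scores.
sub combined_score {
    my @nonzero = grep { $_ != 0 } @_;
    return 0 unless @nonzero;
    my $sum = 0;
    $sum += $_ for @nonzero;
    return $sum / scalar @nonzero;
}

print combined_score(0, 0, 4), "\n";    # keyword hit only: prints 4
print combined_score(1, 0, 4), "\n";    # banned domain plus keyword: prints 2.5
```

A second piece of evidence against the comment makes the combined penalty smaller, which is the opposite of what you would expect from a spam filter.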

The module that controls junk filtering is \cgi-bin\mt\plugins\spamlookup\lib\spamlookup.pm. The hack below will cause a duplicate ping to be recognized as a junk trackback. This prevents the ping from appearing on your blog or causing a rebuild. More important, it still counts as a successful ping, so the sender is told he's done and there's no need to send it any more. The junk score returned below is -5, which is a heavy penalty. That enables it to overcome the weirdness of the way the weights are combined.

This code replaces the elsif clause beginning at line 148 of the un-hacked version, which is inside the subroutine link_memory.

            } elsif (UNIVERSAL::isa($obj, 'MT::TBPing')) {
                my $url = $obj->source_url;
                $url =~ s/^\s+|\s+$//gs;
                # Look for a ping with the same URL and target.
                my $t = MT::TBPing->load({ source_url => $url,
                    blog_id => $obj->blog_id,
                    tb_id => $obj->tb_id,
                    visible => 1 });
                if ($t) {
                    # We found one. Give it a junk score of -5.
                    return (-5, "Duplicate of ping " . $t->id . ".");
                } else {
                    # Not a duplicate. Check to see if this link was previously published.
                    # If so, we consider it safer than an unknown link.
                    $t = MT::TBPing->load({ source_url => $url,
                        blog_id => $obj->blog_id,
                        visible => 1});
                    if ($t) {
                        return ((int($config->{priorurl_weight}) || 1),
                            "Link was previously published (TrackBack id " . $t->id . ").");
                    }
                }
            }
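
If you want to try the logic without touching a live blog, the same decision order can be imitated on plain hashes. This is a rough sketch, not the plugin itself: find_ping is a made-up stand-in for MT::TBPing->load, score_ping and its arguments are hypothetical, and the stored ping is invented test data.

```perl
use strict;
use warnings;

# Hypothetical stand-in for MT::TBPing->load: return the first stored ping
# whose fields all equal the requested values, or nothing.
sub find_ping {
    my ($pings, %want) = @_;
    for my $ping (@$pings) {
        next if grep { $ping->{$_} ne $want{$_} } keys %want;
        return $ping;
    }
    return;
}

# Same order as the hack: exact duplicate first, then previously seen URL.
sub score_ping {
    my ($pings, $new, $prior_weight) = @_;
    (my $url = $new->{source_url}) =~ s/^\s+|\s+$//g;

    my $dup = find_ping($pings, source_url => $url,
        blog_id => $new->{blog_id}, tb_id => $new->{tb_id}, visible => 1);
    return (-5, "Duplicate of ping $dup->{id}.") if $dup;

    my $prior = find_ping($pings, source_url => $url,
        blog_id => $new->{blog_id}, visible => 1);
    return (int($prior_weight || 0) || 1,
        "Link was previously published (TrackBack id $prior->{id}).") if $prior;

    return;    # no opinion about this ping
}

my @stored = ({ id => 7, source_url => 'http://example.com/post',
                blog_id => 1, tb_id => 10, visible => 1 });
my ($score, $why) = score_ping(\@stored,
    { source_url => ' http://example.com/post ', blog_id => 1, tb_id => 10 });
print "$score: $why\n";    # prints -5: Duplicate of ping 7.
```

A ping from the same URL to a different entry on the same blog takes the second branch and earns the bonus instead of the penalty.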

TypePad uses the Movable Type software, so some of these hints and tips will apply to TypePad as well. If you have a TypePad account and spam problems and would like Bruce to help tune your system, send him an email and he will get in touch with you to work out the details.

Respectfully submitted,

Ferdinand T. Cat



Comments

OK, how do I get those big.boobs SPAM you were talking about?

Good post


Posted by: don surber at October 24, 2005 7:05 AM

I think my problem was a large number of pages with the words "cat" and "house" in the text, but that's just a guess. I can't pretend to understand what drives these lunatics.


Posted by: Ferdy at October 24, 2005 4:14 PM

ok


Posted by: bob at November 23, 2005 12:39 AM

Wow!!! Good job. Could I take some of yours triks to build my own site?i


Posted by: jammarlibre at June 10, 2008 3:11 AM
