adventures in Bayesian filtering
Prologue: As noted on my "about this blog" page, Al-Muhajabah's Movable Type Tips is intended to be a discussion of MT tips and tricks that I am making use of on my site. For this reason, although there are a lot of great ideas out there, if I'm not using something myself, I usually don't write about it.
There are a few things that I used to do but no longer do; these are noted in the appropriate entries. In most cases, I discovered that the plugin or script used too many server resources or slowed rebuilds too much because it relied on external input. I found these things useful at one time, and people on more generous hosting plans may still find them so.
However, this entry is a little different because it covers something I decided not to use before I had written anything about it. I thought that my experiences might be beneficial to others, so I've created a new category called "reviews" for it. The other things I'm no longer doing will also be moved to this category.
If you've ever used an email program that allows you to mark mail as spam, you've probably experienced Bayesian filtering. Basically, you build up a "whitelist" of messages that are definitely not spam and a "blacklist" of messages that definitely are spam. A computer program then scores each new message by how often its words have appeared in the messages on each list, and uses that score to estimate how likely the message is to be spam.
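To make the idea concrete, here is a minimal sketch of a naive Bayes filter in Python. It is a toy written for illustration only, not the code of MT-Bayesian or of any mail program; the class name, the `tokenize` helper, the labels "spam" and "ham" (not spam), and the add-one smoothing are all my own assumptions.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


class NaiveBayesFilter:
    """Toy two-class naive Bayes classifier (hypothetical, for illustration)."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.message_counts = {"spam": 0, "ham": 0}

    def train(self, text, label):
        """Record one hand-labelled message; label is 'spam' or 'ham'."""
        self.word_counts[label].update(tokenize(text))
        self.message_counts[label] += 1

    def spam_probability(self, text):
        """Return the estimated P(spam | text), between 0.0 and 1.0."""
        total = sum(self.message_counts.values())
        if total == 0:
            return 0.5  # nothing learned yet, so no opinion either way
        vocab = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        log_scores = {}
        for label in ("spam", "ham"):
            # Class prior, smoothed so that neither class ever gets probability 0.
            score = math.log((self.message_counts[label] + 1) / (total + 2))
            n_words = sum(self.word_counts[label].values())
            for word in tokenize(text):
                count = self.word_counts[label][word]
                # Add-one smoothing; the extra +1 leaves room for unseen words.
                score += math.log((count + 1) / (n_words + len(vocab) + 1))
            log_scores[label] = score
        # Turn the two log scores into a normalized probability of spam.
        diff = log_scores["ham"] - log_scores["spam"]
        diff = max(-700.0, min(700.0, diff))  # keep math.exp from overflowing
        return 1.0 / (1.0 + math.exp(diff))
```

With only a handful of training messages, the scores are driven by a few words, which is one way to see why a filter like this needs a large amount of labelled mail before it becomes trustworthy.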
Bayesian filtering is far from perfect. We all still get tons of spam in our inboxes, while important messages go to the junk mail folder. But trying to filter by content is the only rational solution when most spammer email addresses are used once and thrown away.
In the ongoing fight against comment spam, James Seng developed MT-Bayesian, which implements Bayesian filtering for comments and trackbacks. MT-Bayesian is primarily a stand-alone script, although it includes a plugin to create new template tags for displaying (or not displaying) spam comments. It actually replicates the entire MT interface, so you could theoretically use it in place of mt.cgi if you wanted to. The additional features specific to Bayesian filtering are the lists of comments and trackbacks (something that it would be very nice to have built into MT).
In the listing you see the comment author name, email address, URL, IP address, and comment text, as well as the date and time the comment was posted and the title of the entry it was posted on. Each comment is assigned a default spam probability of 50%. You can review each comment, then mark it as spam or not spam. After you click the "train" button, all comments marked as spam have their probability reset to 100% and all comments marked as not spam have their probability reset to 0%.
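The review-and-train workflow described above can be sketched roughly as follows. This is a hypothetical Python model of the behavior as I observed it, not MT-Bayesian's own code; the `Comment` class, its field names, and the `train` function are invented for illustration.

```python
from dataclasses import dataclass
from typing import Optional

DEFAULT_PROBABILITY = 0.5  # per the description above: every comment starts at 50%


@dataclass
class Comment:
    """Hypothetical stand-in for one row in the comment listing."""
    author: str
    text: str
    spam_probability: float = DEFAULT_PROBABILITY
    marked_spam: Optional[bool] = None  # None means "not reviewed yet"


def train(comments):
    """Apply the reviewer's marks: spam -> 100%, not spam -> 0%.

    Unreviewed comments keep whatever probability they already had.
    Returns the number of comments that were updated.
    """
    updated = 0
    for comment in comments:
        if comment.marked_spam is None:
            continue
        comment.spam_probability = 1.0 if comment.marked_spam else 0.0
        updated += 1
    return updated
```

In this sketch, any comment nobody has reviewed simply stays at the 50% default.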
The idea is that eventually, MT-Bayesian will start to predict whether or not a comment is spam. James notes that it would take about 1000 comments for this to really be effective.
My own experiences bear this out. I trained MT-Bayesian on more than 1400 comments posted to Al-Muhajabah's Islamic Blogs (no, I have no life). After the first few dozen, some comments would be given a 0% spam probability automatically, but this seemed to be largely by author; comments with the same email address and/or URL would become "trusted". This didn't work every time; several commenters who didn't include a lot of links in their comments remained at 50% even after I had marked dozens of their comments as not spam. It was not until I had gone through about 1200 comments that MT-Bayesian began to mark certain messages as spam. At first, it seemed to be marking everything as spam; then it would mark all new commenters as spam, along with some comments by others.
I was actually hoping to use MT-Bayesian to mark not spam but troll comments, some of which are still on my system (I had disemvoweled them). Most of these occurred at about the point in my comments listing where MT-Bayesian "woke up". The good news is that it almost always marked as spam comments by authors whose other comments had been manually marked as spam. The bad news is that it continued to mark all comments by new authors as spam. It also continued to mark some comments by otherwise trusted authors as spam, apparently seeing some similarity to the troll comments.
Which brings me to why I decided not to use MT-Bayesian after all. The first reason is that it is too quick to mark comments as spam. The new template tags are <$MTIfSpam$> and <$MTIfNotSpam$>. The <$MTIfNotSpam$> tags, which you are advised to use in your comment listing templates so that only non-spam comments are displayed, will only display comments that have a spam probability of 20% or less. By default, comments are assigned a spam probability of 50%, which means that new comments will not be displayed until you whitelist them. Most of your old comments will be hidden as well. Not good.
That was why I was going through and whitelisting all my old comments except the troll comments, so that those at least would show up properly and only the troll comments would be marked as bad. If that were all, I would eventually have persevered and gone through all 3200+ comments. But it's the way it treats new comments that finally made me give up. I didn't want new comments from trusted friends to be marked unpredictably as spam. Perhaps using Bayesian filtering to detect "troll" content isn't a good idea. In any case, it seemed like it was going to be more work than it was worth.
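The display problem comes down to simple arithmetic, which a short Python sketch can show. The 20% cutoff and the 50% default are the figures given above; the function and constant names are hypothetical and are not part of MT-Bayesian or its template tags.

```python
NOT_SPAM_THRESHOLD = 0.20   # comments are shown only at 20% spam probability or less
DEFAULT_PROBABILITY = 0.50  # what a comment gets before anyone has reviewed it


def is_displayed(spam_probability):
    """Mimic the described <$MTIfNotSpam$> rule: show only low-probability comments."""
    return spam_probability <= NOT_SPAM_THRESHOLD


def visible_comments(probabilities):
    """Return the probabilities of the comments that would still be shown."""
    return [p for p in probabilities if is_displayed(p)]
```

Because the default of 50% is above the 20% cutoff, an unreviewed comment fails the check, so under this rule it stays hidden until someone whitelists it.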
The second reason is that it took more than 1200 comments for MT-Bayesian to begin functioning like a regular Bayesian filter. A handful of sites get hundreds of comments every day, but for most of us, 1200 comments is the result of a year or more of blogging. Who wants to wait a year for MT-Bayesian to start working, or to spend hours and days training MT-Bayesian on the old comments? And who knows how long it would take to work even as well as email spam filters do? Those work because there are millions of messages to train the program on; it's not just you but every person using that mail server who is training the program.
MT-Bayesian might work if it were part of TypePad, with a large, centralized user base. But it's hard to see how it would be very helpful to the average solo MT user. Bayesian filtering for comment spam is a clever idea, but standalone MT installations are not the right place to test it.
Comments
Greetings from Malaga (Spain). Antonio :-)
Posted by: Malaga | March 27, 2005 08:14 AM