Since 2009, I have been working on a research project to learn a little bit more about spam. I have a fundamental thesis that I’m gathering data to support (so far it looks good), but the proof, as they say, is in the pudding.
Last week, I started analyzing the pudding. Let’s start with the basics – I’m not a developer, I’m a script kiddie at best. I’ve written code before – but it’s in languages you laugh at and with a level of commenting and error handling that will make you cry – and not the good kind of cry.
I’ve been collecting spam messages off and on since 2009, and have a considerable stash of spam from 2010 and everything so far from 2011. I have had the same email address since 1995 when I registered getwired.com. As a result, I get tons of spam. Tons. SpamSieve handles most of it, but misses some, and doesn’t work when my Mac is asleep. I manually process everything left over and put it in one directory on my server.
To begin the extraction process, I copied everything to a local offline file in Outlook 2010. Here’s where things get weird/cool. Did you know Access 2010 can link to an Outlook folder as a data source? I didn’t. While I can’t get everything I want (headers), I can get most of what I need, and then query it with ADO from a WSH script (go ahead and laugh now). From there, I output a text file for each message for further processing. This process is beyond duct tape, but it does what I need at the moment.
Problem is, as I was using the FileSystemObject (FSO) to output these text files, it would start up and proceed through many messages, only to blow up on some seemingly random spam message. I was able to figure out that it was the message body that made it blow up. The error was a very helpful “Invalid procedure call or argument” with a hex error code of 800a0005. So of course I immediately knew what the issue was. I’m lying. I had no idea. I beat my head against the wall for a while, and then committed the cardinal sin of VBScript programming (and this is WSH, so it’s already sinning). I added On Error Resume Next to the top of my script and moved on. That didn’t fix it, but now I could analyze the messages that didn’t hurl, and throw away the junk.
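As a contrast to a blanket On Error Resume Next, here is a rough Python sketch of catching the failure per message and keeping a list of the ones that failed, so they can be revisited later instead of silently dropped. This is not the WSH script itself: the sample messages and the write_message and export_all helpers are hypothetical, and the ASCII-only encoding is just a convenient way to force one message to fail.

```python
import os
import tempfile


def write_message(out_dir, msg_id, body):
    """Write one message body to its own text file (ASCII only, to force a failure)."""
    path = os.path.join(out_dir, f"{msg_id}.txt")
    with open(path, "w", encoding="ascii") as handle:
        handle.write(body)
    return path


def export_all(messages, out_dir):
    """Export every message and report failures instead of hiding them."""
    written, failed = [], []
    for msg_id, body in messages:
        try:
            write_message(out_dir, msg_id, body)
            written.append(msg_id)
        except UnicodeEncodeError as err:
            # Remember which message failed and why, then keep going.
            failed.append((msg_id, str(err)))
    return written, failed


# Hypothetical sample data: message 2 contains a non-ASCII character.
messages = [
    (1, "Plain old spam"),
    (2, "Sp\u00e4m with an umlaut"),
    (3, "More plain spam"),
]
written, failed = export_all(messages, tempfile.mkdtemp())
```

The point of the design is that the loop still finishes, but the failed list tells you exactly which messages to go back and debug.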
It bothered me (OCD) that the other messages weren’t getting processed, so last night I started debugging it more deeply. A Google search for the error turned up a bunch of things, most notably people hitting this issue with FSO and being told, “you’re trying to write Unicode in an ASCII file.” Sure enough, once I added True as the final argument to all of my FSO calls (which tells FSO to write the file as Unicode), everything worked. While it would be nice if Windows were smart enough to figure it out on its own, it was nice to have the issue put to bed.
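For readers who don’t live in WSH, here is a rough Python analog of the same failure and fix. It is a sketch, not the script used for this project: the message body, the file names, and the write_body helper are made up, and UTF-16 is my assumption for the closest match to what FSO writes in Unicode mode. Writing text with non-ASCII characters through an ASCII-only encoding fails, while opening the file with a Unicode encoding succeeds.

```python
import os
import tempfile


def write_body(path, body, encoding):
    """Write one message body to a text file using the given encoding."""
    with open(path, "w", encoding=encoding) as handle:
        handle.write(body)


# Hypothetical spam body containing non-ASCII characters
# (an accented e, a Cyrillic a, and an en dash).
body = "Ch\u00e9ap w\u0430tches \u2013 click here"

out_dir = tempfile.mkdtemp()
ascii_path = os.path.join(out_dir, "message_ascii.txt")
unicode_path = os.path.join(out_dir, "message_unicode.txt")

# The ASCII attempt is the analog of the original failing FSO call.
try:
    write_body(ascii_path, body, "ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# The Unicode attempt is the analog of passing True as the final argument.
write_body(unicode_path, body, "utf-16")

# Read the file back to confirm the text survived intact.
with open(unicode_path, encoding="utf-16") as handle:
    round_trip = handle.read()
```

Python at least names the problem (UnicodeEncodeError) rather than reporting a generic invalid procedure call, but the underlying mismatch is the same.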
So the irony here is that I had to switch to using Unicode everywhere because spammers (in addition to doing some pretty conniving stuff to get into your inbox and get you to click on things) are actually sending messages in Unicode. Most likely they do this not to enable localized text, but to evade spam-sniffing tools that (ahem, like mine) blow up or skip over Unicode text when it comes along.