Naive Bayes Classifier in C#

How many of you get annoyed by spam on a daily basis? To be honest, I haven't been annoyed by it in a while (although now that I've said that, I'm sure the person who keeps spamming me through my site is going to kick it into overdrive). I get tons of it, but it usually doesn't make it through the spam filter. However, I remember a time when 95% of my inbox was spam. Back then all we had at our disposal were blacklists, and to be honest, they were rather useless. Then a man named Paul Graham posted an article called A Plan for Spam. In that article, he proposed using a Bayesian filter to decide whether a message was spam and claimed that in his tests it caught 99.5% of spam with no false positives (a false positive being a legitimate message that is classified as spam). That one article (well, that and a couple of others) helped spark interest in statistical analysis as a way to fight spam, and specifically in the use of naive Bayes.

What is Naive Bayes?

Naive Bayes is the application of Bayes' theorem using naive assumptions... What the hell does that mean? Basically, given a set of features (in spam those would be words), we calculate the probability of each feature independently of all the other features (i.e. if feature x is in the set, it doesn't affect the probability of y). For instance, in spam we have two sets of data: spam messages and ham messages. Within those sets the individual words are our features. Let's say we get a new message that contains the words "buy" and "viagra". Both words would contribute their spam probabilities to the whole, but the fact that they're both in the same message wouldn't matter.

How Do We Calculate It?

I'm going to show you how a basic spam filter would work, so below is a very, very basic example of a naive Bayes classifier (and rather poorly written actually). You're definitely going to want to modify this for your purposes:

    public class NaiveBayes
    {
        public NaiveBayes()
        {
        }

        private List<string> _SetA = new List<string>();
        public List<string> SetA
        {
            get { return _SetA; }
            set { _SetA = value; }
        }

        private List<string> _SetB = new List<string>();
        public List<string> SetB
        {
            get { return _SetB; }
            set { _SetB = value; }
        }

        private int _TotalAmount=0;

        private int _TotalA = 0;
        private List<int> _SetATotals = new List<int>();
        public List<int> SetATotals
        {
            get { return _SetATotals; }
            set { _SetATotals = value; }
        }

        private int _TotalB = 0;
        private List<int> _SetBTotals = new List<int>();
        public List<int> SetBTotals
        {
            get { return _SetBTotals; }
            set { _SetBTotals = value; }
        }

        public double CalculateProbabilityOfTokens(List<string> Items)
        {
            double TotalProbability = 1.0;
            double NegativeTotalProbability = 1.0;
            for (int x = 0; x < Items.Count; ++x)
            {
                double Probability = CalculateProbabilityOfToken(Items[x]);
                TotalProbability *= Probability;
                NegativeTotalProbability *= (1 - Probability);
            }
            return TotalProbability / (TotalProbability + NegativeTotalProbability);
        }

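        // Returns the spam probability of a single token, scaled by the prior.
        // Tokens that were never seen in Set A score 0.0.
        private double CalculateProbabilityOfToken(string Item)
        {
            if (_TotalAmount == 0 || _TotalA == 0)
                return 0.0;
            double Percent = 0.0;
            for (int x = 0; x < SetA.Count; ++x)
            {
                if (SetA[x].Equals(Item))
                {
                    // SetA and SetB share the same indexes after CalculateTotals
                    if (SetATotals[x] == 0)
                    {
                        Percent = 0.0;
                        break;
                    }
                    Percent = (double)SetATotals[x] / (double)(SetATotals[x] + SetBTotals[x]);
                    break;
                }
            }
            // Prior: share of all training tokens that came from Set A
            double PriorPercent = (double)_TotalA / (double)_TotalAmount;
            return Percent * PriorPercent;
        }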

        public void CalculateTotals()
        {
            List<string> TempSetA = new List<string>();
            List<int> TempTotalsA = new List<int>();
            List<string> TempSetB = new List<string>();
            List<int> TempTotalsB = new List<int>();
            _TotalA = 0;
            _TotalB = 0;
            _TotalAmount = 0;
            for (int x = 0; x < SetA.Count; ++x)
            {
                ++_TotalAmount;
                ++_TotalA;
                bool Found=false;
                for (int y = 0; y < TempSetA.Count; ++y)
                {
                    if (TempSetA[y].Equals(SetA[x]))
                    {
                        Found = true;
                        ++TempTotalsA[y];
                        break;
                    }
                }
                if (!Found)
                {
                    TempSetA.Add(SetA[x]);
                    TempTotalsA.Add(1);
                    TempSetB.Add(SetA[x]);
                    TempTotalsB.Add(0);
                }
            }

            for (int x = 0; x < SetB.Count; ++x)
            {
                ++_TotalAmount;
                ++_TotalB;
                bool Found = false;
                for (int y = 0; y < TempSetB.Count; ++y)
                {
                    if (TempSetB[y].Equals(SetB[x]))
                    {
                        Found = true;
                        ++TempTotalsB[y];
                        break;
                    }
                }
                if (!Found)
                {
                    TempSetA.Add(SetB[x]);
                    TempTotalsA.Add(0);
                    TempSetB.Add(SetB[x]);
                    TempTotalsB.Add(1);
                }
            }
            SetA = TempSetA;
            SetB = TempSetB;
            SetATotals = TempTotalsA;
            SetBTotals = TempTotalsB;
        }
    }

The code above does a couple things. First it takes in two lists of strings (SetA being Spam and SetB being Ham). These lists would be individual words from the messages (although you can do phrases, word pairs, etc.). After those are added, you call CalculateTotals. This is simply calculating the number of times a word appears in the list and condensing the lists. After that, calling CalculateProbabilityOfTokens with a new set of strings will calculate the probability that the message falls under Set A (spam). The formula it uses to do that is the following:

P = Probability of individual word =  Times in Set A /(Times in Set A + Times in Set B)

Probability of set of words = (P1 * P2 * P3 * ... * Pn) / ( (P1 * P2 * P3 * ... * Pn) + ( (1 - P1) * (1 - P2) * (1 - P3) * ... * (1 - Pn) ) )

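That's it really. The algorithm above also takes into account the prior probability that a training word came from the spam set (total spam words divided by total words) and multiplies P by that, but to be honest it isn't needed (since it's most likely to be .5 anyway). One side effect worth knowing about: because every P is scaled by that prior, no single P can exceed it, so with a prior of .5 the combined result never rises above .5.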
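As a rough illustration of the two formulas, here is a minimal sketch that keeps the word counts in dictionaries so each lookup is a hash probe rather than a list scan. It is not the class above: the names DictionaryBayes, TrainSpam, TrainHam, WordProbability, and MessageProbability are made up for the example, the neutral .5 score for never-seen words and the .01/.99 clamp are my assumptions, and it skips the prior multiplication.

```csharp
using System;
using System.Collections.Generic;

// Illustrative sketch only; every name here is hypothetical.
public class DictionaryBayes
{
    private readonly Dictionary<string, int> _spamCounts = new Dictionary<string, int>();
    private readonly Dictionary<string, int> _hamCounts = new Dictionary<string, int>();

    public void TrainSpam(IEnumerable<string> words) { Count(_spamCounts, words); }
    public void TrainHam(IEnumerable<string> words) { Count(_hamCounts, words); }

    private static void Count(Dictionary<string, int> counts, IEnumerable<string> words)
    {
        foreach (string word in words)
        {
            int current;
            counts.TryGetValue(word, out current); // current stays 0 when the word is new
            counts[word] = current + 1;
        }
    }

    // P = times in spam / (times in spam + times in ham)
    public double WordProbability(string word)
    {
        int spam, ham;
        _spamCounts.TryGetValue(word, out spam);
        _hamCounts.TryGetValue(word, out ham);
        if (spam + ham == 0)
            return 0.5; // assumption: a never-seen word is treated as neutral
        double p = (double)spam / (spam + ham);
        // assumption: clamp so one word cannot force the result to exactly 0 or 1
        return Math.Min(0.99, Math.Max(0.01, p));
    }

    // (P1 * ... * Pn) / ((P1 * ... * Pn) + ((1 - P1) * ... * (1 - Pn)))
    public double MessageProbability(IEnumerable<string> words)
    {
        double product = 1.0;
        double inverseProduct = 1.0;
        foreach (string word in words)
        {
            double p = WordProbability(word);
            product *= p;
            inverseProduct *= 1.0 - p;
        }
        return product / (product + inverseProduct);
    }
}
```

If you train it with the spam words "buy", "viagra", "buy" and the ham words "hello", "buy", then "buy" scores 2/3, "viagra" is clamped to .99, and a message containing both comes out above .99. For very long messages both products can underflow to zero, so a real filter would probably sum logarithms instead of multiplying.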

What's the Down Side?

There are a couple of downsides to this approach:

  1. It takes a large data set of spam and non-spam messages for this to be accurate (but it does get better with time).
  2. It can be tricked by using large amounts of nonsense words.
  3. It can be a bit slow without some sort of hashing of the words (you basically have to search through every word, and there could be millions of them).
  4. There are better approaches out there now, including SVMs.

So while this approach is cool, there are some other approaches out there that are worth looking into. Anyway, I hope this helps someone out. Give it a try, leave feedback, and happy coding.

Posted by: James Craig
Posted on: 3/31/2009 at 11:09 AM
Categories: C#

Creating Negative Images using C#

This is another image processing post. In this post, I'm going to show you some basic code to create negative images. Generally speaking, a negative image isn't that useful. However, it can help show differences within the image, and it just looks cool. Basically what we're doing is a logical NOT (just flipping some bits). So if this were a 1-bit image we'd just be flipping the 1s to 0s and the 0s to 1s. However, no one uses 1-bit images; instead we're using 24/32-bit images (or 16 or 48 or 8...). So let's see what we need to do:

        public static Bitmap Negative(Bitmap Image)
        {
            // Every pixel is overwritten below, so the new bitmap only needs the same size
            Bitmap NewBitmap = new Bitmap(Image.Width, Image.Height);
            for (int x = 0; x < NewBitmap.Width; ++x)
            {
                for (int y = 0; y < NewBitmap.Height; ++y)
                {
                    Color CurrentPixel = Image.GetPixel(x, y);
                    // Invert red, green, and blue; keep the original alpha
                    Color TempValue = Color.FromArgb(CurrentPixel.A, 255 - CurrentPixel.R, 255 - CurrentPixel.G, 255 - CurrentPixel.B);
                    NewBitmap.SetPixel(x, y, TempValue);
                }
            }
            return NewBitmap;
        }

If you look at the loop, all we're doing is taking the current pixel value and subtracting it from 255 for the red, green, and blue values. For an 8-bit channel that is the same thing as flipping every bit: 255 minus the value and a bitwise NOT of the value give identical results. That's all there is to it really. Anyway, I hope this helps you out. Give it a try, leave feedback, and happy coding.
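GetPixel and SetPixel get slow on large images, so a common alternative is to work on the raw pixel bytes (for example, bytes copied out with Bitmap.LockBits). Below is a minimal sketch of just the inversion step on a plain byte array, so it doesn't depend on System.Drawing. The class name NegativeBuffer and the assumed layout (4 bytes per pixel in blue, green, red, alpha order) are mine, not something from the code above.

```csharp
using System;

// Illustrative sketch only; NegativeBuffer is a hypothetical helper.
public static class NegativeBuffer
{
    // Inverts the colour channels of a raw pixel buffer in place.
    // Assumption: 4 bytes per pixel in B, G, R, A order; alpha is left untouched.
    public static void Invert(byte[] pixels)
    {
        if (pixels == null)
            throw new ArgumentNullException("pixels");
        if (pixels.Length % 4 != 0)
            throw new ArgumentException("Expected 4 bytes per pixel.", "pixels");
        for (int i = 0; i < pixels.Length; i += 4)
        {
            pixels[i] = (byte)(255 - pixels[i]);           // blue
            pixels[i + 1] = (byte)(255 - pixels[i + 1]);   // green
            pixels[i + 2] = (byte)(pixels[i + 2] ^ 0xFF);  // red: flipping all 8 bits equals 255 - value
        }
    }
}
```

The red channel uses XOR with 0xFF on purpose, just to show that flipping the bits and subtracting from 255 land on the same value.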

Posted by: James Craig
Posted on: 3/26/2009 at 9:39 AM
Categories: C#

Setting Up Your Website to Work With Windows Live Writer

Windows Live Writer, for those of you who aren't aware of it, is an application that helps you write posts, upload images/video, and manage a blog/website. It's one of the better apps of its type out there and has quite a few add-ins that really make using it worthwhile (although you're definitely going to want to download a file uploader add-in). On top of that, WLW works with most of the blogging platforms out there (Blogger, WordPress, BlogEngine.Net, etc.).

So it sounds great, but what happens when you're creating your own blogging platform and want to add integration? You turn to an API called MetaWeblog (well, that and portions of the Blogger API, but I'll mention that in a second). MetaWeblog is an API based on XML-RPC. It was designed to allow outside apps to get the text of blog posts and modify them or create new posts. Actually that's not 100% true. We already had the Blogger API that did that, but it had a number of flaws. The MetaWeblog API was designed to add the functionality that was lacking. That also explains why we're going to use both of the APIs in order to hook up Windows Live Writer to our site.

What do I need to set up?

Anyway, MetaWeblog has a number of functions that it defines in order to accomplish what it needs to do:

  • newPost - adds a post
  • editPost - edits a post
  • getCategories - gets a list of categories
  • getPost - gets a post
  • getRecentPosts - gets a list of recent posts
  • newMediaObject - used to upload a file/object

On top of that we need a couple functions from the Blogger API:

  • deletePost - deletes a post
  • getUserInfo - gets a user's info
  • getUsersBlogs - gets a user's list of blogs (there may be more than one)

Now there are other APIs that you can add to this list (WordPress, etc.) but these are the main functions that you'll most likely use. So now that we know what we need to implement, we need to know how.

How do I set it up?

The application really only requires a couple of things:

  1. RSD
  2. WLW Manifest
  3. Handler for implementing the functions

The RSD (Really Simple Discovery) is something that I've talked about before and won't go over in great detail here. The one item that you need to add to the APIs list is a single entry:

<api name="MetaWeblog" preferred="True" apiLink="http://SERVERNAME/MetaWeblogHandler.ashx" blogID="http://SERVERNAME/"/>
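For context, that entry sits inside the apis element of the RSD file. If you'd rather generate the file than hand-write it, here is a minimal sketch using System.Xml.Linq. The class name RsdBuilder, the engine name "MyBlogEngine", and the choice to point engineLink at the site root are placeholders of mine; the handler path and attribute values follow the entry above.

```csharp
using System;
using System.Xml.Linq;

// Illustrative sketch only; RsdBuilder and "MyBlogEngine" are hypothetical.
public static class RsdBuilder
{
    // Builds a minimal RSD document whose single API entry points at the handler.
    public static XDocument Build(string siteUrl)
    {
        XNamespace rsd = "http://archipelago.phrasewise.com/rsd";
        string root = siteUrl.TrimEnd('/') + "/";
        XElement api = new XElement(rsd + "api",
            new XAttribute("name", "MetaWeblog"),
            new XAttribute("preferred", "True"),
            new XAttribute("apiLink", root + "MetaWeblogHandler.ashx"),
            new XAttribute("blogID", root));
        XElement service = new XElement(rsd + "service",
            new XElement(rsd + "engineName", "MyBlogEngine"),
            new XElement(rsd + "engineLink", root),
            new XElement(rsd + "homePageLink", root),
            new XElement(rsd + "apis", api));
        return new XDocument(
            new XDeclaration("1.0", "utf-8", null),
            new XElement(rsd + "rsd", new XAttribute("version", "1.0"), service));
    }
}
```

Calling RsdBuilder.Build("http://SERVERNAME") and saving the result gives you a file with the same api entry as the one shown above.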

That entry lets Windows Live Writer know that it can connect to our handler at MetaWeblogHandler.ashx. After we set up our RSD file, we need to set up our WLW Manifest. The WLW Manifest lets Windows Live Writer know what it can do. The manifest is simply an XML file that can be found at http://www.MYSERVERNAME.com/wlwmanifest.xml. There are a lot of different options and looking at this link will tell you what they do. But just a default file is going to look like this:

<?xml version="1.0" encoding="utf-8" ?>
<manifest xmlns="http://schemas.microsoft.com/wlw/manifest/weblog">
<options>
<clientType>Metaweblog</clientType>
</options>
</manifest>

This just lets it know to use the Metaweblog defaults, but you can go in and change it to your heart's content (although that means you'll have to add the functionality). So that's steps 1 and 2. What about step 3? To be honest, doing this yourself is a pain (mainly because you have to implement XML-RPC). Thankfully someone was nice enough to write a library for us: XML-RPC.Net. You're going to have a couple of issues compiling it since they don't include a signing key, but just add your own and you're good to go.

So once you download that library and get it set up, we have very little left to do and really it's only a couple of steps:

  1. Add the DLL as a reference within our Web App (note that I use Web Apps and not Web Sites, so I'm going to explain based on that)
  2. Copy over the interfaces for Blogger and MetaWeblog from the interfaces directory of the XML-RPC.Net library.
  3. Create a new handler
  4. Have the handler inherit from the two interfaces as well as XmlRpcService
  5. Go through and implement the functions that are listed above
  6. Add the handler to your web.config file

So basically what you'll end up with is something that will look like this:

    public class MetaWeblogHandler:XmlRpcService,IMetaWeblog,IBlogger
    {
        #region Constructor
        public MetaWeblogHandler()
        {
        }
        #endregion

        #region IMetaWeblog Members

        public object editPost(string postid, string username, string password, CookComputing.MetaWeblog.Post post, bool publish)
        {
            try
            {
                if (System.Web.Security.Membership.ValidateUser(username, password))
                {
                    //edits a post
                }
            }
            catch { throw new XmlRpcFaultException(0, "An error occurred while loading information"); }
            throw new XmlRpcFaultException(0, "User name or password is invalid");
        }

        public CategoryInfo[] getCategories(string blogid, string username, string password)
        {
            try
            {
                if (System.Web.Security.Membership.ValidateUser(username, password))
                {
                    //Gets the list of categories
                }
            }
            catch { throw new XmlRpcFaultException(0, "An error occurred while loading information"); }
            throw new XmlRpcFaultException(0, "User name or password is invalid");
        }

        public CookComputing.MetaWeblog.Post getPost(string postid, string username, string password)
        {
            try
            {
                if (System.Web.Security.Membership.ValidateUser(username, password))
                {
                    //Get a single post
                }
            }
            catch { throw new XmlRpcFaultException(0, "An error occurred while loading information"); }
            throw new XmlRpcFaultException(0, "User name or password is invalid");
        }

        public CookComputing.MetaWeblog.Post[] getRecentPosts(string blogid, string username, string password, int numberOfPosts)
        {
            try
            {
                if (System.Web.Security.Membership.ValidateUser(username, password))
                {
                    //Get a list of recent posts
                }
            }
            catch { throw new XmlRpcFaultException(0, "An error occurred while loading information"); }
            throw new XmlRpcFaultException(0, "User name or password is invalid");
        }

        public string newPost(string blogid, string username, string password, CookComputing.MetaWeblog.Post post, bool publish)
        {
            try
            {
                if (System.Web.Security.Membership.ValidateUser(username, password))
                {
                    //add a new post
                }
            }
            catch { throw new XmlRpcFaultException(0, "An error occurred while loading information"); }
            throw new XmlRpcFaultException(0, "User name or password is invalid");
        }

        public UrlData newMediaObject(string blogid, string username, string password, FileData file)
        {
            try
            {
                if (System.Web.Security.Membership.ValidateUser(username, password))
                {
                    //Add the item to your site
                }
            }
            catch { throw new XmlRpcFaultException(0, "An error occurred while loading information"); }
            throw new XmlRpcFaultException(0, "User name or password is invalid");
        }

        #endregion

        #region IBlogger Members

        public bool deletePost(string appKey, string postid, string username, string password, bool publish)
        {
            try
            {
                if (System.Web.Security.Membership.ValidateUser(username, password))
                {
                    //delete the post
                }
            }
            catch { throw new XmlRpcFaultException(0, "An error occurred while loading information"); }
            throw new XmlRpcFaultException(0, "User name or password is invalid");
        }

        public object editPost(string appKey, string postid, string username, string password, string content, bool publish)
        {
            throw new NotImplementedException();
        }

        CookComputing.Blogger.Category[] IBlogger.getCategories(string blogid, string username, string password)
        {
            throw new NotImplementedException();
        }

        public CookComputing.Blogger.Post getPost(string appKey, string postid, string username, string password)
        {
            throw new NotImplementedException();
        }

        public CookComputing.Blogger.Post[] getRecentPosts(string appKey, string blogid, string username, string password, int numberOfPosts)
        {
            throw new NotImplementedException();
        }

        public string getTemplate(string appKey, string blogid, string username, string password, string templateType)
        {
            throw new NotImplementedException();
        }

        public UserInfo getUserInfo(string appKey, string username, string password)
        {
            throw new NotImplementedException();
        }

        public BlogInfo[] getUsersBlogs(string appKey, string username, string password)
        {
            try
            {
                if (System.Web.Security.Membership.ValidateUser(username, password))
                {
                    //Return the user's blogs
                }
            }
            catch { throw new XmlRpcFaultException(0, "An error occurred while loading information"); }
            throw new XmlRpcFaultException(0, "User name or password is invalid");
        }

        public string newPost(string appKey, string blogid, string username, string password, string content, bool publish)
        {
            throw new NotImplementedException();
        }

        public bool setTemplate(string appKey, string blogid, string username, string password, string template, string templateType)
        {
            throw new NotImplementedException();
        }

        #endregion
    }

Obviously you'd flesh out the commented bits (and swap in whatever validation code your site needs). But that's all there is to it. So try setting it up, leave feedback, and happy coding.
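One small design note: every implemented method in the handler repeats the same validate, act, or fault pattern, so it may be worth pulling that into a helper. Here is a minimal sketch of the idea. HandlerGuard and FaultException are names I made up; FaultException is only a stand-in for XmlRpcFaultException so the sketch compiles without XML-RPC.Net, and the validator is passed in as a delegate instead of calling Membership directly.

```csharp
using System;

// Stand-in for XmlRpcFaultException (hypothetical, for illustration only).
public class FaultException : Exception
{
    public FaultException(string message) : base(message) { }
}

// Illustrative sketch only; HandlerGuard is a hypothetical helper.
public static class HandlerGuard
{
    // Runs action only when validate accepts the credentials,
    // mirroring the try/if/throw shape used in the handler methods.
    public static T Run<T>(Func<string, string, bool> validate, string username, string password, Func<T> action)
    {
        try
        {
            if (validate(username, password))
                return action();
        }
        catch (Exception)
        {
            // Any failure while validating or acting becomes a generic fault
            throw new FaultException("An error occurred while loading information");
        }
        throw new FaultException("User name or password is invalid");
    }
}
```

Inside the handler, a method body could then shrink to a single call that passes System.Web.Security.Membership.ValidateUser as the validator and a lambda doing the real work, with the two throws swapped back to XmlRpcFaultException.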

Posted by: James Craig
Posted on: 3/23/2009 at 11:18 AM
Categories: ASP.Net | C# | Web Design