3/31/2009

How many of you get annoyed by spam on a daily basis? To be honest, I haven't been annoyed by it in a while (although now that I've said that, I'm sure the person who keeps spamming me through my site will kick it into overdrive). I get tons of it, but it usually doesn't make it through the spam filter. However, I remember a time when 95% of my inbox was spam. Back then, all we had at our disposal were blacklists, and to be honest, they were rather useless. Then a man named Paul Graham posted an article called A Plan for Spam. In that article, he proposed using a Bayesian filter to decide whether a message was spam, and claimed that in his tests he caught 99.5% of spam with no false positives (a false positive being a legitimate message that is classified as spam). That one article (well, that and a couple of others) helped spark interest in statistical analysis of spam as a way to fight it, and specifically in the use of naive Bayes.

Naive Bayes is the application of Bayes' theorem with a naive independence assumption... What the hell does that mean? Basically, given a set of features (in spam filtering, those would be words), we calculate the probability contributed by each feature independently of all the other features (i.e., if feature x is in the set, it doesn't affect the probability of y). For instance, with spam we have two sets of data: spam messages and ham messages. Within those sets, the individual words are our features. Let's say we get a new message that contains the words "buy" and "viagra". Both words would contribute their spam probabilities to the whole, but the fact that they're both in the same message wouldn't matter.
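To make that concrete, here's a quick sketch (in Python rather than C#, just for brevity; the corpora and numbers are made up) of estimating each word's spam probability from counts alone. The point is that each word is scored on its own, with no regard for which other words appear alongside it:

```python
from collections import Counter

# Toy training data: words pulled from spam and ham messages (invented corpora)
spam_words = ["buy", "viagra", "now", "buy", "cheap"]
ham_words = ["meeting", "tomorrow", "buy", "lunch"]

spam_counts = Counter(spam_words)
ham_counts = Counter(ham_words)

def word_spam_probability(word):
    """P(spam | word) estimated naively from counts; every word is scored
    independently, ignoring every other word in the message."""
    a = spam_counts[word]
    b = ham_counts[word]
    if a + b == 0:
        return 0.5  # never seen before: no evidence either way
    return a / (a + b)

print(word_spam_probability("viagra"))  # 1.0 -- only ever seen in spam
print(word_spam_probability("buy"))     # ~0.67 -- 2 spam vs 1 ham occurrence
```

"viagra" scores 1.0 because it only ever appeared in spam, while "buy" appeared in both sets and lands near the middle; a real filter would clamp that 1.0 down a bit, as the class below does.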

I'm going to show you how a basic spam filter would work, so below is a very, very basic example of a naive Bayes classifier (and rather poorly written, actually). You're definitely going to want to modify this for your purposes:

```csharp
/*
Copyright (c) 2010 James Craig (http://www.gutgames.com)

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
THE SOFTWARE.*/

#region Usings
using System.Collections.Generic;
using Utilities.DataTypes;
using Utilities.Math;
#endregion

namespace Utilities.Classifier.NaiveBayes
{
    /// <summary>
    /// Naive Bayes classifier
    /// </summary>
    /// <typeparam name="T">The type of the individual tokens</typeparam>
    public class NaiveBayes<T>
    {
        #region Constructor

        /// <summary>
        /// Constructor
        /// </summary>
        public NaiveBayes()
        {
            SetA = new Bag<T>();
            SetB = new Bag<T>();
            Probabilities = new Dictionary<T, double>();
            Total = 0;
            TotalA = 0;
            TotalB = 0;
            ATokenWeight = 1;
            BTokenWeight = 1;
            MinCountForInclusion = 1;
            MinTokenProbability = 0.01;
            MaxTokenProbability = 0.999;
            MaxInterestingTokenCount = int.MaxValue;
        }

        #endregion

        #region Properties

        /// <summary>
        /// Set A
        /// </summary>
        public Bag<T> SetA { get; set; }

        /// <summary>
        /// Set B
        /// </summary>
        public Bag<T> SetB { get; set; }

        private double Total { get; set; }
        private double TotalA { get; set; }
        private double TotalB { get; set; }
        private Dictionary<T, double> Probabilities { get; set; }

        /// <summary>
        /// Weight to give to the probabilities in set A
        /// </summary>
        public int ATokenWeight { get; set; }

        /// <summary>
        /// Weight to give to the probabilities in set B
        /// </summary>
        public int BTokenWeight { get; set; }

        /// <summary>
        /// Minimum number of times an item must be found for it to be included in the final probability
        /// </summary>
        public int MinCountForInclusion { get; set; }

        /// <summary>
        /// Minimum token probability (if less than this amount, it becomes this amount)
        /// </summary>
        public double MinTokenProbability { get; set; }

        /// <summary>
        /// Maximum token probability (if greater than this amount, it becomes this amount)
        /// </summary>
        public double MaxTokenProbability { get; set; }

        /// <summary>
        /// After sorting, this is the maximum number of tokens that are picked to figure out the final probability
        /// </summary>
        public int MaxInterestingTokenCount { get; set; }

        #endregion

        #region Public Functions

        /// <summary>
        /// Loads a set of tokens
        /// </summary>
        /// <param name="SetATokens">Set A</param>
        /// <param name="SetBTokens">Set B</param>
        public void LoadTokens(List<T> SetATokens, List<T> SetBTokens)
        {
            foreach (T TokenA in SetATokens)
            {
                SetA.Add(TokenA);
            }
            foreach (T TokenB in SetBTokens)
            {
                SetB.Add(TokenB);
            }
            TotalA = 0;
            TotalB = 0;
            foreach (T Token in SetA)
            {
                TotalA += SetA[Token];
            }
            foreach (T Token in SetB)
            {
                TotalB += SetB[Token];
            }
            Total = TotalA + TotalB;
            Probabilities = new Dictionary<T, double>();
            foreach (T Token in SetA)
            {
                Probabilities.Add(Token, CalculateProbabilityOfToken(Token));
            }
            foreach (T Token in SetB)
            {
                if (!Probabilities.ContainsKey(Token))
                {
                    Probabilities.Add(Token, CalculateProbabilityOfToken(Token));
                }
            }
        }

        /// <summary>
        /// Calculates the probability of the list of tokens being in set A
        /// </summary>
        /// <param name="Items">List of items</param>
        /// <returns>The probability that the tokens are from set A</returns>
        public double CalculateProbabilityOfTokens(List<T> Items)
        {
            SortedList<string, double> SortedProbabilities = new SortedList<string, double>();
            for (int x = 0; x < Items.Count; ++x)
            {
                double TokenProbability = 0.5;
                if (Probabilities.ContainsKey(Items[x]))
                {
                    TokenProbability = Probabilities[Items[x]];
                }
                // Sort key: distance from 0.5, so the most "interesting" tokens come
                // first; the token and index are appended to keep keys unique
                string Difference = (0.5 - System.Math.Abs(0.5 - TokenProbability)).ToString(".0000000") + Items[x] + x;
                SortedProbabilities.Add(Difference, TokenProbability);
            }
            double TotalProbability = 1;
            double NegativeTotalProbability = 1;
            int Count = 0;
            int MaxCount = MathHelper.Min(SortedProbabilities.Count, MaxInterestingTokenCount);
            foreach (string Probability in SortedProbabilities.Keys)
            {
                double TokenProbability = SortedProbabilities[Probability];
                TotalProbability *= TokenProbability;
                NegativeTotalProbability *= (1 - TokenProbability);
                ++Count;
                if (Count >= MaxCount)
                    break;
            }
            return TotalProbability / (TotalProbability + NegativeTotalProbability);
        }

        #endregion

        #region Private Functions

        /// <summary>
        /// Calculates a single item's probability of being in set A
        /// </summary>
        /// <param name="Item">Item to calculate</param>
        /// <returns>The probability that the token is from set A</returns>
        private double CalculateProbabilityOfToken(T Item)
        {
            double Probability = 0;
            int ACount = SetA.Contains(Item) ? SetA[Item] * ATokenWeight : 0;
            int BCount = SetB.Contains(Item) ? SetB[Item] * BTokenWeight : 0;
            if (ACount + BCount >= MinCountForInclusion)
            {
                double AProbability = MathHelper.Min(1, (double)ACount / (double)TotalA);
                double BProbability = MathHelper.Min(1, (double)BCount / (double)TotalB);
                Probability = MathHelper.Max(MinTokenProbability,
                    MathHelper.Min(MaxTokenProbability, AProbability / (AProbability + BProbability)));
            }
            return Probability;
        }

        #endregion
    }
}
```

The code above does a couple of things. First, it takes in two lists of strings (SetA being spam and SetB being ham). These lists would be the individual words from the messages (although you can use phrases, word pairs, etc.). LoadTokens counts the number of times each word appears, condenses the lists into totals, and precomputes each word's probability. After that, calling CalculateProbabilityOfTokens with a new set of strings will calculate the probability that the message falls under set A (spam). The formula it uses to do that is the following:
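If you'd rather see the whole flow without the supporting Utilities classes, here's a stripped-down Python sketch of the same approach (the names mirror the C# class; the token weights, minimum count, and "interesting token" sorting are simplified away, while the 0.01/0.999 clamping and the 0.5 default for unknown words are kept):

```python
from collections import Counter

class TinyNaiveBayes:
    """Minimal sketch of the classifier above: set A is spam, set B is ham."""

    def __init__(self):
        self.set_a = Counter()
        self.set_b = Counter()
        self.probabilities = {}

    def load_tokens(self, set_a_tokens, set_b_tokens):
        # Count occurrences of each token (the Bag's job in the C# version)
        self.set_a.update(set_a_tokens)
        self.set_b.update(set_b_tokens)
        total_a = sum(self.set_a.values())
        total_b = sum(self.set_b.values())
        # Precompute each token's probability of belonging to set A
        for token in set(self.set_a) | set(self.set_b):
            a_prob = self.set_a[token] / total_a
            b_prob = self.set_b[token] / total_b
            p = a_prob / (a_prob + b_prob)
            # Clamp, like MinTokenProbability / MaxTokenProbability
            self.probabilities[token] = min(0.999, max(0.01, p))

    def calculate_probability_of_tokens(self, items):
        # Tokens never seen in training count as neutral (0.5)
        pos, neg = 1.0, 1.0
        for item in items:
            p = self.probabilities.get(item, 0.5)
            pos *= p
            neg *= 1 - p
        return pos / (pos + neg)

classifier = TinyNaiveBayes()
classifier.load_tokens(["buy", "viagra", "cheap"], ["meeting", "lunch", "buy"])
score = classifier.calculate_probability_of_tokens(["viagra", "meeting", "hello"])
print(score)  # ~0.91: leans spam, driven by "viagra"
```

With this tiny (invented) training set, "viagra" clamps to 0.999, "meeting" clamps to 0.01, and the unknown "hello" contributes a neutral 0.5, so the message scores around 0.91 overall.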

P = probability of an individual word = (times in set A) / (times in set A + times in set B)

Probability of a set of words = (P1 * P2 * P3 * ... * Pn) / ((P1 * P2 * P3 * ... * Pn) + ((1 - P1) * (1 - P2) * (1 - P3) * ... * (1 - Pn)))
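As a worked example (with invented numbers): suppose a message contains three words whose individual spam probabilities are 0.9, 0.8, and 0.4. Plugging those into the combining formula:

```python
# Individual word probabilities (invented for illustration)
p = [0.9, 0.8, 0.4]

pos = 1.0  # P1 * P2 * ... * Pn
neg = 1.0  # (1 - P1) * (1 - P2) * ... * (1 - Pn)
for pi in p:
    pos *= pi
    neg *= 1 - pi

combined = pos / (pos + neg)
print(combined)  # 0.288 / (0.288 + 0.012) = 0.96
```

Even though one word leaned toward ham (0.4), the two strong spam words dominate, which is exactly why picking out a handful of "interesting" tokens is enough to decide a message.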

That's it really. Some variants of the algorithm also factor in the prior probability that any given message is spam and multiply P by that, but to be honest it isn't needed (since it's most likely to be 0.5 anyway).

There are a couple of downsides to this approach:

- It takes a large data set of spam and non-spam messages for this to be accurate (but it does get better with time).
- It can be tricked by padding messages with large amounts of nonsense words.
- It can be a bit slow without some sort of hashing of the words (you basically have to look up each word, and the counts can total in the millions).
- There are better approaches out there now, including SVMs...

## Comments

James Craig

July 29, 2010 8:57 AM

Ok, I have a better understanding of what you're trying to do. You can use this technique to do what you want, but you would have to have two training sets. One that contains the keywords (set A) and one that does not (set B). However, you may end up with training sets that end up training the wrong thing. Plus you would have to make sure they were large enough. Coming up with those sets would be difficult at best. You may, instead, want to look into something simpler like relative frequency of the keywords.

Kasun

July 28, 2010 9:58 PM

Thanks for your kind reply James, got your point. Let me explain using SetA and SetB: it's like this, the mails which have the keywords go into SetA and the ones which don't have the keywords go into SetB (I am already getting all the keywords from a lexical database called WordNet). I need to use the probability (as you said, "This will tell you how likely the items fall into set A (0.5 being the midway point as the scale is 1.0 to 0.0)") to find which mails best match the given set of words. That is all I need to classify the mails. I think I can use your algorithm for that purpose, isn't it? Please tell me whether what I am thinking is right or wrong, and how I can do it. Is there any other way to do it? Thanks again for your kind help

James Craig

July 28, 2010 4:15 PM

I'm a bit confused about exactly what you are looking for, but I'll give a basic explanation based on what I think you're asking for. Naive Bayes is generally used for classifying something into one of two groups (you can do more than two groups, but my algorithm only does two). So your keywords would need to fall into one of two categories (spam vs. non-spam, words that start with A vs. those that start with B, etc.). You would then feed the first category to set A and the second category to set B using LoadTokens (I'm assuming you're using the updated version of the code). Once you do that, you're set to receive input (in your case, email). You take the email, break it apart into words, and then put those words into a list. Feed the list into the class's CalculateProbabilityOfTokens function. This will tell you how likely it is that the items fall into set A (0.5 being the midway point, as the scale is 1.0 to 0.0). Now it sounds like that isn't what you want. I'm unsure if you want the prob

Kasun

July 28, 2010 11:53 AM

Hi James, thank you very much for the reply. Please tell me how I can use your algorithm for my programme. Let me explain it: I have keywords (more than one) and text versions of the mail body contents. What I need to do is get the body contents in which the keywords appear most often (using probability). I need to separate them using your algorithm. Please help

James Craig

July 26, 2010 8:53 AM

You can use it; everything on my site uses either the BSD or MIT license, so you can use it for whatever you want. I would, however, suggest you use a more updated version: http://cul.codeplex.com/SourceControl/changeset/view/56586#707960 It uses a container called a Bag (basically it just counts the number of times something appears), which makes the rest of the code a bit simpler. Anyway, I don't really have the time to help you out that much. That being said, if you just search for something like "Bayesian spam filtering C#", you'll get a ton of items that will help. The other thing you'll need is an archive; searching for "Spam Archive" will give you a good number of items to use (the first one that popped up was this one: http://untroubled.org/spam/ ). And lastly, you'll need a set of non-spam emails (hopefully you don't delete every message you get like I do). Just save up every email that you get for a while.

Kasun

July 26, 2010 4:18 AM

Hi James, I am developing an e-mail filter using a Naive Bayes classifier for my final year project, so can I use your code for it? Can you please guide me on it? Your help will be most appreciated. Thank you very much