Python File Format For Email Classification With Svm-light
Solution 1:
If you make a feature out of each word, create a list of all unique words w(1)..w(n). Now feature(i) gets the value 1 if w(i) occurs in the sample you are examining, and 0 otherwise. (You could also make the value equal to the number of occurrences, so that a word which occurs multiple times gets more weight.)
Assuming the following samples:
1 My hovercraft is full of eels
2 Your account is suspended
3 This is it!
... you could extract the following dictionary:
001 My
002 hovercraft
003 is
:
:
009 suspended
010 This
011 it!
(The leading zeros are just to make the features look different than the other numbers in this exposition. Normally there should probably not be any leading zeros.)
The features for sample 1 are 001 through 006; for sample 3 they are 010, 003, and 011. The other features get the value 0. So the full representation of sample 3 would look like
3 001:0 002:0 003:1 004:0 005:0 ...
(though you don't need to specify the zero, i.e. absent, features; SVM-light's input format is sparse, so features with value zero can be skipped).
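A minimal sketch of the steps above, in case it helps. The function names (`tokenize`, `build_vocabulary`, `to_svmlight_line`) and the `use_counts` flag are made up for this illustration, and the tokenizer is just a whitespace split so that it reproduces the dictionary above. Also note an assumption about the first column: as far as I understand the SVM-light format, the first field of each line in a real training file is the class label (e.g. +1 for spam, -1 for ham) rather than a sample number, and the feature numbers must be in increasing order, so the sketch takes a label and sorts the indices.

```python
def tokenize(text):
    # Naive whitespace split; punctuation stays attached ("it!"), as in the example.
    return text.split()


def build_vocabulary(samples):
    # Map each unique word to a feature number, starting at 1,
    # in order of first appearance.
    vocab = {}
    for text in samples:
        for word in tokenize(text):
            if word not in vocab:
                vocab[word] = len(vocab) + 1
    return vocab


def to_svmlight_line(label, text, vocab, use_counts=False):
    # Count occurrences of each known word; unknown words are ignored.
    counts = {}
    for word in tokenize(text):
        index = vocab.get(word)
        if index is not None:
            counts[index] = counts.get(index, 0) + 1
    # Emit only the non-zero features, sorted by feature number.
    pairs = []
    for index in sorted(counts):
        value = counts[index] if use_counts else 1
        pairs.append(f"{index}:{value}")
    return " ".join([str(label)] + pairs)


samples = [
    "My hovercraft is full of eels",
    "Your account is suspended",
    "This is it!",
]
vocab = build_vocabulary(samples)

# Hypothetical label: -1 meaning "not spam".
print(to_svmlight_line(-1, samples[2], vocab))  # -1 3:1 10:1 11:1
```

Passing `use_counts=True` writes the number of occurrences instead of 1, which is the weighting variant mentioned at the top.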
However, given how little text each sample contains (just the subject lines), you are unlikely to get very good results. You might be better off using character bigram or trigram features instead (split each word using a sliding window, so that "trigram" yields tri, rig, igr, gra, ram).
I don't think it makes sense to try to mix tf-idf with SVM; they are different approaches to the same fundamental problem.
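Something like the following would produce those sliding-window features. This is only a sketch: `char_ngrams` and `text_ngrams` are hypothetical helper names, and lowercasing the text and keeping words shorter than the window as a single feature are my own assumptions, not something SVM-light requires.

```python
def char_ngrams(word, n=3):
    # Slide a window of width n across the word.
    # Words shorter than n are kept whole (an assumption for this sketch).
    if len(word) < n:
        return [word]
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def text_ngrams(text, n=3):
    # Lowercase, split on whitespace, and collect the n-grams of every word.
    grams = []
    for word in text.lower().split():
        grams.extend(char_ngrams(word, n))
    return grams


print(char_ngrams("trigram"))       # ['tri', 'rig', 'igr', 'gra', 'ram']
print(text_ngrams("Your account"))  # ['you', 'our', 'acc', 'cco', 'cou', 'oun', 'unt']
```

Each distinct n-gram then gets its own feature number, exactly as each distinct word did in the dictionary above.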