Linköping University
Department of Mathematics
Lars Eldén
June 2009
Matrix methods in Data Mining and Pattern Recognition
Computer assignment
Spam Classification with PLS/LGK bidiagonalization
ASSIGNMENT
Construct an algorithm in MATLAB for spam classification. The data are
collected in a matrix of features prepared from a set of 4601 e-mail
messages. Use PLS/LGK bidiagonalization.
SPECIFIC TASKS
- Write a function that implements LGK bidiagonalization.
Input: matrix X, right-hand side y, and the number of steps k.
Output: basis matrices P_k and Z_k, bidiagonal matrix B_k, approximate least squares solution x,
and the relative residual at step k.
- Randomly split the spam messages (and, separately, the non-spam messages) into
two equally sized parts, and
use one half as the training set and the other as the test set.
- Plot the percentage of correctly classified spam and non-spam as functions of the number of
LGK steps.
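
As a starting point for the first task, the LGK recursion could be sketched as below. This is a minimal sketch under stated assumptions, not a reference solution: the function name lgk_bidiag is hypothetical, the sketch returns P with k+1 columns and B of size (k+1)-by-k so that X*Z = P*B holds, and it does no reorthogonalization and no check for breakdown (a zero alpha or beta).

```matlab
function [P, Z, B, x, relres] = lgk_bidiag(X, y, k)
% LGK_BIDIAG  Sketch of k steps of LGK (Golub-Kahan) bidiagonalization.
%   Assumption: P is m-by-(k+1) and B is (k+1)-by-k lower bidiagonal,
%   so that X*Z = P*B in exact arithmetic.
[m, n] = size(X);
P = zeros(m, k+1);
Z = zeros(n, k);
B = zeros(k+1, k);

beta1 = norm(y);            % beta_1 * p_1 = y
P(:, 1) = y / beta1;
zprev = zeros(n, 1);        % z_0 = 0
beta = beta1;
for i = 1:k
    % alpha_i * z_i = X' * p_i - beta_i * z_{i-1}
    z = X' * P(:, i) - beta * zprev;
    alpha = norm(z);
    z = z / alpha;
    % beta_{i+1} * p_{i+1} = X * z_i - alpha_i * p_i
    p = X * z - alpha * P(:, i);
    beta = norm(p);
    P(:, i+1) = p / beta;
    Z(:, i) = z;
    B(i, i) = alpha;        % diagonal entry
    B(i+1, i) = beta;       % subdiagonal entry
    zprev = z;
end

% Projected least squares problem: min || B*w - beta_1*e_1 ||
rhs = [beta1; zeros(k, 1)];
w = B \ rhs;
x = Z * w;                              % approximate solution of min ||X*x - y||
relres = norm(B * w - rhs) / beta1;     % relative residual at step k
end
```

With the function saved as lgk_bidiag.m, a call such as [P, Z, B, x, relres] = lgk_bidiag(X, y, 5) returns the step-5 quantities. In exact arithmetic relres equals norm(X*x - y)/norm(y), which gives a simple check on small problems.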
DATA
The data are available at
http://www.mai.liu.se/~laeld/matrix-methods/computer-assignments/spam/. They are described in the file spam.txt.
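
The splitting and plotting tasks might be organized as in the script below. It is only a sketch under several assumptions that the handout does not state: synthetic data stand in for the real feature matrix, the variable names (A, labels and so on) are hypothetical, the classes are coded as +1 (spam) and -1 (non-spam) in the right-hand side, a test message is classified as spam when its predicted value is positive, and the data are not centered. The LGK recursion is written inline so that the solutions for all step counts come from a single run.

```matlab
% Synthetic stand-in for the real data (assumption: rows are messages,
% columns are features, labels are 1 for spam and 0 for non-spam).
rng(1);                                   % reproducible random split
nSpam = 200; nHam = 300; nFeat = 20;
A = [randn(nSpam, nFeat) + 0.5; randn(nHam, nFeat) - 0.5];
labels = [ones(nSpam, 1); zeros(nHam, 1)];

% Split each class at random into two halves.
spamIdx = find(labels == 1);  spamIdx = spamIdx(randperm(numel(spamIdx)));
hamIdx  = find(labels == 0);  hamIdx  = hamIdx(randperm(numel(hamIdx)));
hs = floor(numel(spamIdx) / 2);
hh = floor(numel(hamIdx) / 2);
trainIdx = [spamIdx(1:hs); hamIdx(1:hh)];
testIdx  = [spamIdx(hs+1:end); hamIdx(hh+1:end)];

Xtr = A(trainIdx, :);  ytr = 2 * labels(trainIdx) - 1;   % +1 spam, -1 non-spam
Xte = A(testIdx, :);   isSpam = labels(testIdx) == 1;

% Run LGK once and evaluate the classifier after every step.
kmax = 10;
n = size(Xtr, 2);
beta1 = norm(ytr);
p = ytr / beta1;  zprev = zeros(n, 1);  beta = beta1;
Z = zeros(n, kmax);  B = zeros(kmax + 1, kmax);
pctSpam = zeros(kmax, 1);  pctHam = zeros(kmax, 1);
for k = 1:kmax
    z = Xtr' * p - beta * zprev;  alpha = norm(z);  z = z / alpha;
    pnew = Xtr * z - alpha * p;   beta = norm(pnew);
    p = pnew / beta;
    Z(:, k) = z;  B(k, k) = alpha;  B(k+1, k) = beta;  zprev = z;

    % Step-k least squares solution from the projected problem.
    w = B(1:k+1, 1:k) \ [beta1; zeros(k, 1)];
    x = Z(:, 1:k) * w;

    predSpam = Xte * x > 0;                        % threshold at zero (assumption)
    pctSpam(k) = 100 * mean(predSpam(isSpam));     % spam correctly classified
    pctHam(k)  = 100 * mean(~predSpam(~isSpam));   % non-spam correctly classified
end

plot(1:kmax, pctSpam, 'o-', 1:kmax, pctHam, 's-');
xlabel('Number of LGK steps');
ylabel('Correctly classified (%)');
legend('spam', 'non-spam', 'Location', 'southeast');
```

For the real data, the synthetic block at the top would be replaced by code that loads the feature matrix and the labels in whatever layout spam.txt describes.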