Linköping University 
Department of Mathematics
Lars Eldén

June 2009


Matrix methods in Data Mining and Pattern Recognition

Computer assignment

Spam Classification with PLS/LGK bidiagonalization





ASSIGNMENT

Construct an algorithm in MATLAB for spam classification. The data are collected in a matrix of features prepared from a set of 4601 e-mail messages. Use PLS/LGK bidiagonalization.

SPECIFIC TASKS

  1. Write a function that implements LGK bidiagonalization.
    Input: Matrix X, right-hand side y, and the number of steps k.
    Output: Basis matrices P_k, Z_k, bidiagonal matrix B_k, approximate least squares solution x, and the relative residual at step k.
  2. Randomly split the spam messages into two equally sized parts, and do the same for the non-spam messages. Use one half of each as the training set and the other half as the test set.
  3. Plot the percentage of correctly classified spam and non-spam as functions of the number of LGK steps.
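
As a starting point for task 1, the following is a minimal sketch of LGK (Lanczos-Golub-Kahan) bidiagonalization applied to the least squares problem min ||X*x - y||_2. It is not a reference solution. The function name lgk_bidiag and the variable names are hypothetical, and the sketch returns P with k+1 columns and B of size (k+1)-by-k, which is one common convention and may differ from the indexing intended by P_k and B_k above. It does no reorthogonalization and does not handle breakdown (alpha or beta equal to zero).

```matlab
function [P, Z, B, x, relres] = lgk_bidiag(X, y, k)
% LGK_BIDIAG  Sketch: k steps of LGK bidiagonalization for min ||X*x - y||_2.
%   P      : m-by-(k+1) matrix, first column y/||y||
%   Z      : n-by-k matrix
%   B      : (k+1)-by-k lower bidiagonal matrix with X*Z = P*B
%   x      : approximate least squares solution after k steps
%   relres : relative residual ||y - X*x|| / ||y||
% Assumptions: y is nonzero, no breakdown occurs, no reorthogonalization.
[m, n] = size(X);
P = zeros(m, k+1);
Z = zeros(n, k);
B = zeros(k+1, k);

beta1 = norm(y);
P(:,1) = y / beta1;
beta = beta1;
zold = zeros(n, 1);

for i = 1:k
    % New right basis vector z_i and diagonal element alpha_i
    w = X' * P(:,i) - beta * zold;
    alpha = norm(w);
    Z(:,i) = w / alpha;

    % New left basis vector p_(i+1) and subdiagonal element beta_(i+1)
    w = X * Z(:,i) - alpha * P(:,i);
    beta = norm(w);
    P(:,i+1) = w / beta;

    B(i,i) = alpha;
    B(i+1,i) = beta;
    zold = Z(:,i);
end

% Solve the small projected problem min ||B*w - beta1*e_1||_2
rhs = zeros(k+1, 1);
rhs(1) = beta1;
w = B \ rhs;                      % backslash gives the least squares solution
x = Z * w;
relres = norm(rhs - B*w) / beta1; % equals ||y - X*x|| / ||y|| in exact arithmetic
end
```

A quick sanity check on a small random X and y is that norm(X*Z - P*B) is of the order of rounding errors. For classification, one possible choice (an assumption, not prescribed by the assignment) is to code spam as +1 and non-spam as -1 in y and to classify a test message by the sign of its feature row times x.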

DATA

The data are available at http://www.mai.liu.se/~laeld/matrix-methods/computer-assignments/spam/. They are described in the file spam.txt.
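
Once the data are loaded, the random split in task 2 can be sketched as below. This is only an illustration under stated assumptions: the function name split_half is hypothetical, and it assumes a vector labels with one entry per message (for example 1 for spam and 0 for non-spam). How the labels are stored in the data file is not described here, so check spam.txt.

```matlab
function [trainIdx, testIdx] = split_half(labels)
% SPLIT_HALF  Sketch: split each class at random into two halves.
%   labels   : vector with one class label per message (assumed layout)
%   trainIdx : row indices for the training set
%   testIdx  : row indices for the test set
% If a class has an odd number of members, the test half gets the extra one.
labels = labels(:);
trainIdx = [];
testIdx = [];
classes = unique(labels);
for j = 1:numel(classes)
    idx = find(labels == classes(j));   % members of this class
    idx = idx(randperm(numel(idx)));    % random order
    h = floor(numel(idx) / 2);
    trainIdx = [trainIdx; idx(1:h)];
    testIdx = [testIdx; idx(h+1:end)];
end
end
```

With the index vectors in hand, the training and test matrices are the corresponding rows of the feature matrix. For task 3, the percentage of correctly classified messages can then be computed separately for the spam rows and the non-spam rows of the test set for each number of steps k, and the two curves plotted against k.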