Linköping University
Department of Mathematics
Lars Eldén
June 2009
Matrix Methods in Data Mining and Pattern Recognition
Computer project
Classification of Forest Cover Type
ASSIGNMENT
Construct an algorithm in MATLAB for classification
of forest cover type. The data set is quite large, 581 012 observations, so the purpose of the project
is to experiment with the sizes of test set and training set to see how the performance varies with size.
Is the SVD-based method applicable? Can one use the function classregtree
from the statistics toolbox in MATLAB? Are there other algorithms that might be
better?
SPECIFIC TASKS
The tasks below are examples, it is not required that you do everything (except the random
selection that should always be done). And if you have your own ideas, go ahead and try.
- Design and algorithm that is like the vector space model: for each test vector find
the closest training vector (cosine or Euclidean distance?), and classify
according to that.
- Tune the SVD algorithm for accuracy of classification.
- Check if all forest types are equally easy or difficult to classify.
- Does it help to scale the data?
- Investigate the properties of the matrix of observations using the SVD, and
perhaps other tools.
- Most of the attributes are qualitative. One may treat them as quantitative or ignore them.
Does it make a difference?
When you divide the set in training and tests sets, make a random selection so that you can
be rather sure that you get representative sets.
DATA
The test data covtype.data are available at
http://www.mai.liu.se/~laeld/matrix-methods/computer-assignments/cover-type/.
The training and test data are described in the file covtype.info Note that these
data are quite difficult to classify (e.g., 70% correct with neural networks and 58%
with linear discriminant analysis).