Classify cover type

Linköping University
Department of Mathematics
Lars Eldén

June 2009

Matrix Methods in Data Mining and Pattern Recognition

Computer project

Classification of Forest Cover Type

ASSIGNMENT

Construct an algorithm in MATLAB for classification of forest cover type. The data set is quite large (581 012 observations), so the purpose of the project is to experiment with the sizes of the training set and the test set and see how the classification performance varies with size. Is the SVD-based method applicable? Can one use the function classregtree from the Statistics Toolbox in MATLAB? Are there other algorithms that might be better?
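
As a starting point for the SVD question, here is a minimal sketch of one possible reading of "the SVD-based method": compute a small orthonormal basis for each class from the training vectors of that class, and assign a test vector to the class whose basis gives the smallest residual. This reading is an assumption, and so are the synthetic data, the basis size k, and all variable names; nothing here uses covtype.data.

```matlab
% SVD-basis classification sketch on synthetic data (not covtype.data).
% Assumption: each COLUMN of A is one observation; labels are 1..nClasses.
rng(0);                                   % reproducible synthetic data
m = 6; nPerClass = 50; nClasses = 2;
k = 2;                                    % basis vectors per class (tuning parameter)

% Class 1 lies near span(e1,e2), class 2 near span(e3,e4), plus small noise.
B1 = [eye(2); zeros(4, 2)];
B2 = [zeros(2, 2); eye(2); zeros(2, 2)];
A  = [B1*randn(2, nPerClass), B2*randn(2, nPerClass)] ...
     + 0.05*randn(m, 2*nPerClass);
labels = [ones(1, nPerClass), 2*ones(1, nPerClass)];

% Hold out every fifth column as test data (fixed split, for brevity only).
isTest = mod(1:2*nPerClass, 5) == 0;
Atrain = A(:, ~isTest);  ltrain = labels(~isTest);
Atest  = A(:,  isTest);  ltest  = labels(isTest);

% One orthonormal basis of k left singular vectors per class.
U = cell(1, nClasses);
for c = 1:nClasses
    [Uc, ~, ~] = svd(Atrain(:, ltrain == c), 'econ');
    U{c} = Uc(:, 1:k);
end

% Residual norm of every test vector with respect to every class basis.
res = zeros(nClasses, size(Atest, 2));
for c = 1:nClasses
    R = Atest - U{c} * (U{c}' * Atest);
    res(c, :) = sqrt(sum(R.^2, 1));
end

% Classify each test vector by the smallest residual.
[~, predicted] = min(res, [], 1);
accuracy = mean(predicted == ltest);
fprintf('SVD-basis accuracy on synthetic data: %.3f\n', accuracy);
```

The synthetic classes are built to lie close to low-dimensional subspaces, which is exactly the situation where this approach works well. Whether the forest cover classes have that structure is one of the things the project is meant to find out.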

SPECIFIC TASKS

The tasks below are examples; you are not required to do everything (except the random selection, which should always be done). If you have your own ideas, go ahead and try them.

Design an algorithm that is like the vector space model: for each test vector find the closest training vector (cosine or Euclidean distance?), and classify according to that.
Tune the SVD algorithm for accuracy of classification.
Check if all forest types are equally easy or difficult to classify.
Does it help to scale the data?
Investigate the properties of the matrix of observations using the SVD, and perhaps other tools.
Most of the attributes are qualitative. One may treat them as quantitative or ignore them. Does it make a difference?
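
Several of the tasks above can be tried in one short experiment. The following is a minimal sketch, on synthetic data, of nearest-neighbour classification with Euclidean and cosine distance, with and without scaling, followed by a confusion matrix for per-class accuracy and a look at the singular values. The class centres, sample sizes, the inflated third attribute, and all variable names are assumptions made for illustration only.

```matlab
% Nearest-neighbour sketch on synthetic data (not covtype.data).
% Assumption: each ROW is one observation; labels are integers 1..nClasses.
rng(1);
nTrain = 200; nTest = 50; nClasses = 3;
centers = [0 0 0; 4 0 0; 0 4 0];            % hypothetical class centres
ytrain = randi(nClasses, nTrain, 1);
ytest  = randi(nClasses, nTest, 1);
Xtrain = centers(ytrain, :) + 0.5*randn(nTrain, 3);
Xtest  = centers(ytest,  :) + 0.5*randn(nTest, 3);
Xtrain(:, 3) = 1000*Xtrain(:, 3);           % one attribute on a much larger scale
Xtest(:, 3)  = 1000*Xtest(:, 3);

% Standardise with TRAINING statistics only (implicit expansion, R2016b+).
mu = mean(Xtrain, 1);
sigma = std(Xtrain, 0, 1);
sigma(sigma == 0) = 1;                      % guard against constant columns
Ztrain = (Xtrain - mu) ./ sigma;
Ztest  = (Xtest  - mu) ./ sigma;

% Euclidean nearest neighbour on the raw data (squared distances, nTest-by-nTrain).
D2raw = sum(Xtest.^2, 2) + sum(Xtrain.^2, 2)' - 2*(Xtest*Xtrain');
[~, idx] = min(D2raw, [], 2);
accRaw = mean(ytrain(idx) == ytest);

% Euclidean nearest neighbour on the scaled data.
D2 = sum(Ztest.^2, 2) + sum(Ztrain.^2, 2)' - 2*(Ztest*Ztrain');
[~, idx] = min(D2, [], 2);
predEuc = ytrain(idx);
accEuc = mean(predEuc == ytest);

% Cosine similarity on the scaled data: pick the largest cosine.
Ntrain = Ztrain ./ sqrt(sum(Ztrain.^2, 2));
Ntest  = Ztest  ./ sqrt(sum(Ztest.^2, 2));
[~, idx] = max(Ntest*Ntrain', [], 2);
accCos = mean(ytrain(idx) == ytest);

% Confusion matrix (rows: true class, columns: predicted) and per-class accuracy.
C = accumarray([ytest, predEuc], 1, [nClasses, nClasses]);
perClass = diag(C) ./ sum(C, 2);

% Singular values of the scaled training matrix hint at its numerical rank.
s = svd(Ztrain);

fprintf('Euclidean raw: %.3f, Euclidean scaled: %.3f, cosine scaled: %.3f\n', ...
        accRaw, accEuc, accCos);
fprintf('Per-class accuracy (scaled Euclidean): %s\n', mat2str(perClass', 3));
fprintf('Singular values: %s\n', mat2str(s', 3));
```

The full distance matrix has one entry per pair of test and training vectors, so for large subsets of the 581 012 observations the test vectors would have to be processed in blocks.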

When you divide the data set into training and test sets, make a random selection so that you can be reasonably sure that you get representative sets.
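
A minimal sketch of such a random selection uses randperm to shuffle the observation indices and then cuts the permutation in two. The training fraction and the stand-in labels are assumptions; the count of seven classes is taken from the standard Covertype data and should be checked against covtype.info.

```matlab
% Random training/test split sketch; the labels here are synthetic stand-ins.
rng(2);
n = 581012;                         % number of observations stated in the assignment
trainFraction = 0.1;                % hypothetical value; vary it in the experiments

perm = randperm(n);                 % random permutation of 1..n
nTrain = round(trainFraction * n);
trainIdx = perm(1:nTrain);
testIdx  = perm(nTrain+1:end);

% Check that the training set is representative: compare class frequencies.
y = randi(7, n, 1);                 % stand-in labels; replace with the real class column
freqAll   = accumarray(y, 1, [7 1]) / n;
freqTrain = accumarray(y(trainIdx), 1, [7 1]) / nTrain;
disp([freqAll, freqTrain]);         % columns: full data set, training set

% With a data matrix X (rows = observations) the split would be applied as
%   Xtrain = X(trainIdx, :);  ytrain = y(trainIdx);
%   Xtest  = X(testIdx,  :);  ytest  = y(testIdx);
```

Fixing the seed with rng makes an experiment repeatable; changing it gives an independent split, which is useful for judging how much the results vary between selections.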

DATA

The data set covtype.data is available at http://www.mai.liu.se/~laeld/matrix-methods/computer-assignments/cover-type/.

The training and test data are described in the file covtype.info. Note that these data are quite difficult to classify (e.g., 70% correct with neural networks and 58% with linear discriminant analysis).
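
A minimal sketch of reading the file and separating attributes from class labels is given below. The column layout is an assumption based on the standard Covertype file (10 quantitative attributes, 44 binary indicator attributes, class label in the last column) and should be verified in covtype.info. If the file is not in the current folder, the sketch falls back to random stand-in data of the same shape so that it still runs.

```matlab
% Sketch: load the data and separate attributes from class labels.
% Assumed layout (verify in covtype.info): columns 1-10 quantitative,
% columns 11-54 binary indicators, column 55 the class label.
fname = 'covtype.data';
if exist(fname, 'file')
    data = dlmread(fname, ',');     % comma-separated numeric file
else
    % Stand-in with the assumed shape, used only when the file is missing.
    rng(3);
    data = [randn(100, 10), double(rand(100, 44) > 0.9), randi(7, 100, 1)];
end

X = data(:, 1:end-1);               % attributes, one row per observation
y = data(:, end);                   % class labels
Xquant = X(:, 1:10);                % quantitative attributes only

fprintf('%d observations, %d attributes, %d classes\n', ...
        size(X, 1), size(X, 2), numel(unique(y)));
```

Running the classifiers once on X and once on Xquant is one way to approach the question of whether treating the qualitative attributes as quantitative, or ignoring them, makes a difference.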