Linköping University
Department of Mathematics
Lars Eldén

November, 2009                                                     

 

Matrix Methods in Data Mining and Pattern Recognition

Computer Assignment

Text mining and information retrieval




ASSIGNMENT

Perform textmining experiments on the Medline collection using LSI and clustering. It is assumed that Matlab toolbox TMG is available.

SPECIFIC TASKS

  1. Copy the data and code files from the URL given below. Start MATLAB, and write help tmg in MATLAB to see the parser syntax. The HELP information is also available at HELP-TMG.

  2. Prepare a term-document matrix and queries for the Medline data base using  TMG. Experiment with stemming and removal of stop  words to see that the matrix dimensions vary. In your experiments below always use stemming and removal of stop words.  Make experiments with the options min_local_freq and min_global_freq and see how the dimensions of the matrix vary. Explain the variations and why it is reasonable to use the values 1 and 2 (for instance) for the two variables, respectively. Also test a couple of the weighting schemes together with the vector space model.

    There is also a script stem available that does stemming. Unix/Linux syntax: ./stem file > file.stemmed'. If you are interested, compare its results with the built-in stemming algorithm. What about the stop list, should it also be stemmed?

  3. In your first experiments with the vector space model use only one query and compute precision and recall for a suitable sequence of values of the tolerance for distance. Plot a recall/precision diagram. When that works fine, compute and illustrate the average recall/precision over the 30 queries using  the full vector space model.  Use the Matlab function aver that is also available at the URL below.

  4. Compute the sparse singular value decomposition  of the term document matrix using svds in Matlab, and check the average recall/precision of the reduced rank models for a few ranks. Choose ranks e.g. from 100 and smaller (Note that you only need to compute the SVD once). For one of the ranks, e.g. 100, run the same experiment with the addition that you normalize the columns of the matrix before the SVD is computed.    

  5. Code the k-means algorithm and cluster the documents using k between 50 and 100. Check what are the dominant words in a few clusters. Orthonormalize the centroids and compute the coordinates of all documents in this basis. Perform query matching and compute the average recall/precision.          

DATA

Codes and data are available here:
aver.m
med.rel
stem
common_words
MED.Q.and.A
modrec.m/
The data are given in the file MED.Q.and.A. The 30  first columns in the term-document matrix are the queries, the remaning 1033 are the documents. The "correct results" of the queries are given in  MED.REL.

TMG can be obtained via the TMG project web page.