Text mining and information retrieval

ASSIGNMENT

Perform textmining experiments on the Medline collection using LSI and clustering. It is assumed that Matlab toolbox TMG is available.

SPECIFIC TASKS

Copy the data and code files from the URL given below. Start MATLAB, and write help tmg in MATLAB to see the parser syntax. The HELP information is also available at HELP-TMG.
Prepare a term-document matrix and queries for the Medline data base using TMG. Experiment with stemming and removal of stop words to see that the matrix dimensions vary. In your experiments below always use stemming and removal of stop words. Make experiments with the options min_local_freq and min_global_freq and see how the dimensions of the matrix vary. Explain the variations and why it is reasonable to use the values 1 and 2 (for instance) for the two variables, respectively. Also test a couple of the weighting schemes together with the vector space model.

There is also a script stem available that does stemming. Unix/Linux syntax: ./stem file > file.stemmed'. If you are interested, compare its results with the built-in stemming algorithm. What about the stop list, should it also be stemmed?
In your first experiments with the vector space model use only one query and compute precision and recall for a suitable sequence of values of the tolerance for distance. Plot a recall/precision diagram. When that works fine, compute and illustrate the average recall/precision over the 30 queries using the full vector space model. Use the Matlab function aver that is also available at the URL below.
Compute the sparse singular value decomposition of the term document matrix using svds in Matlab, and check the average recall/precision of the reduced rank models for a few ranks. Choose ranks e.g. from 100 and smaller (Note that you only need to compute the SVD once). For one of the ranks, e.g. 100, run the same experiment with the addition that you normalize the columns of the matrix before the SVD is computed.
Code the k-means algorithm and cluster the documents using k between 50 and 100. Check what are the dominant words in a few clusters. Orthonormalize the centroids and compute the coordinates of all documents in this basis. Perform query matching and compute the average recall/precision.

DATA

Codes and data are available here:
aver.m
med.rel
stem
common_words
MED.Q.and.A
modrec.m/
The data are given in the file MED.Q.and.A. The 30 first columns in the term-document matrix are the queries, the remaning 1033 are the documents. The "correct results" of the queries are given in MED.REL.

TMG can be obtained via the TMG project web page.