November, 2009
Prepare a term-document matrix and queries for the Medline database using TMG. Experiment with stemming and removal of stop words, and observe how the matrix dimensions vary. In all experiments below, always use stemming and removal of stop words. Experiment with the options min_local_freq and min_global_freq and see how the dimensions of the matrix vary. Explain the variations, and explain why it is reasonable to use values such as 1 and 2, respectively, for the two options. Also test a couple of the weighting schemes together with the vector space model.
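To get a feeling for why the frequency thresholds change the matrix dimensions, the following minimal Python sketch builds a tiny term-document matrix by hand. It is not TMG code: the function name build_term_document_matrix is hypothetical, tokenization is a plain whitespace split, and the interpretation of the thresholds (local frequency counted per document, global frequency counted as total occurrences over the collection) is an assumption that should be checked against the TMG documentation.

```python
from collections import Counter


def build_term_document_matrix(docs, min_local_freq=1, min_global_freq=1):
    """Return (terms, matrix), where matrix[i][j] counts terms[i] in docs[j].

    Assumed threshold semantics (verify against TMG before relying on them):
      - a term is kept in a document only if it occurs there at least
        min_local_freq times;
      - a term is kept in the dictionary only if its total count over
        the collection is at least min_global_freq.
    """
    # Local term frequencies, one Counter per document.
    local = [Counter(doc.lower().split()) for doc in docs]
    # Apply the local threshold document by document.
    local = [{t: c for t, c in counts.items() if c >= min_local_freq}
             for counts in local]
    # Accumulate global frequencies from the surviving local counts.
    global_freq = Counter()
    for counts in local:
        global_freq.update(counts)
    # Apply the global threshold; each surviving term is one matrix row.
    terms = sorted(t for t, c in global_freq.items() if c >= min_global_freq)
    matrix = [[counts.get(t, 0) for counts in local] for t in terms]
    return terms, matrix


docs = ["heart attack heart", "heart surgery", "rare attack"]
for g in (1, 2):
    terms, A = build_term_document_matrix(docs, min_global_freq=g)
    print(g, len(terms), "x", len(docs), terms)
```

Raising the global threshold from 1 to 2 removes the terms that occur only once in the whole collection. Such terms cannot link two documents to each other, which is one way to argue for a global threshold of 2.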
There is also a script, stem, that does stemming. Unix/Linux syntax: ./stem file > file.stemmed. If you are interested, compare its results with those of the built-in stemming algorithm. Should the stop list also be stemmed?
Compute the sparse singular value decomposition of the term-document matrix using svds in Matlab, and check the average recall/precision of the reduced-rank models for a few ranks. Choose ranks of, e.g., 100 and smaller. Note that you only need to compute the SVD once, for the largest rank, since the lower-rank models are obtained by keeping only the leading singular triplets. For one of the ranks, e.g. 100, run the same experiment again, but normalize the columns of the matrix before the SVD is computed.
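One possible way to evaluate a model is sketched below in Python, for illustration only. It assumes that query matching has already produced a cosine score for every document, that a document is returned when its score exceeds a tolerance tol, and that the set of relevant documents per query is known. The function names and the convention that an empty result has precision 1 are assumptions, not part of TMG or Matlab.

```python
def precision_recall(scores, relevant, tol):
    """Precision and recall for one query.

    scores:   dict mapping document id -> cosine between query and document
    relevant: set of document ids judged relevant for the query
    tol:      documents with cosine > tol are returned
    """
    returned = {doc for doc, s in scores.items() if s > tol}
    hits = len(returned & relevant)
    # Convention (an assumption): an empty result set has precision 1.
    precision = hits / len(returned) if returned else 1.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def average_precision_recall(all_scores, all_relevant, tol):
    """Average precision and recall over all queries at one tolerance."""
    pairs = [precision_recall(s, r, tol)
             for s, r in zip(all_scores, all_relevant)]
    n = len(pairs)
    return (sum(p for p, _ in pairs) / n,
            sum(r for _, r in pairs) / n)


# Two toy queries with hand-made cosine scores.
all_scores = [{1: 0.9, 2: 0.5, 3: 0.1}, {1: 0.8, 2: 0.2}]
all_relevant = [{1, 3}, {1}]
print(average_precision_recall(all_scores, all_relevant, tol=0.4))
```

Repeating the computation for a range of tolerances gives one point per tolerance on an average recall/precision curve, which can then be compared between ranks.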
Code the k-means algorithm and cluster the documents using k between 50 and 100. Check which words are dominant in a few of the clusters. Orthonormalize the centroids and compute the coordinates of all documents in this basis. Perform query matching and compute the average recall/precision.
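As a starting point, the steps can be sketched in plain Python as follows. This is a minimal illustration rather than a reference solution: it uses dense lists instead of sparse matrices, Euclidean distance, random initial centroids and a fixed number of iterations, all of which are assumptions you may want to change. The function names kmeans, orthonormalize and coordinates are hypothetical.

```python
import math
import random


def kmeans(points, k, iterations=20, seed=0):
    """Basic k-means; returns (centroids, labels)."""
    rng = random.Random(seed)
    # Initialize with k distinct data points chosen at random.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its closest centroid.
        for i, p in enumerate(points):
            labels[i] = min(
                range(k),
                key=lambda j: sum((a - b) ** 2
                                  for a, b in zip(p, centroids[j])))
        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its old centroid
                centroids[j] = [sum(col) / len(members)
                                for col in zip(*members)]
    return centroids, labels


def orthonormalize(vectors, eps=1e-12):
    """Modified Gram-Schmidt; (nearly) dependent vectors are dropped."""
    basis = []
    for v in vectors:
        w = list(v)
        for q in basis:
            proj = sum(a * b for a, b in zip(w, q))
            w = [a - proj * b for a, b in zip(w, q)]
        norm = math.sqrt(sum(a * a for a in w))
        if norm > eps:
            basis.append([a / norm for a in w])
    return basis


def coordinates(point, basis):
    """Coordinates of the projection of point onto an orthonormal basis."""
    return [sum(a * b for a, b in zip(point, q)) for q in basis]


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, labels = kmeans(points, k=2)
print(labels)
```

In Matlab, the orthonormalization would more naturally be done with a thin QR decomposition of the centroid matrix, e.g. [Q,R] = qr(C,0), after which the coordinates of the documents are given by Q'*A.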
TMG can be obtained via the TMG project web page.