Linköping University 
Department of Mathematics
Lars Eldén

April 2007



Matrix Methods in Data Mining and Pattern Recognition

Computer assignment

Summarization of a text: extraction of key words and key sentences 






ASSIGNMENT

Compute the key words and key sentences of a newspaper text using the saliency score method. Construct term-sentence matrices using the text parser tmg.
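
For orientation, the saliency scores can be sketched as follows. This is a minimal illustration, not part of tmg or of the course material: it assumes the scores are the leading left and right singular vectors of the term-sentence matrix (terms as rows, sentences as columns), and it computes them by simple alternating iteration. The function names `saliency_scores` and `top_indices` and the toy matrix are hypothetical.

```python
import math

def saliency_scores(A, iterations=200):
    """Return (u, v): term scores u and sentence scores v for matrix A.

    A is a list of rows (terms x sentences) with nonnegative entries.
    Assumption: the scores satisfy u ~ A v and v ~ A^T u, i.e. they are
    the leading singular vectors. A must not be the zero matrix.
    """
    m, n = len(A), len(A[0])
    v = [1.0 / math.sqrt(n)] * n
    u = [0.0] * m
    for _ in range(iterations):
        # Term scores: proportional to A v, scaled to unit length.
        u = [sum(A[i][j] * v[j] for j in range(n)) for i in range(m)]
        norm_u = math.sqrt(sum(x * x for x in u))
        u = [x / norm_u for x in u]
        # Sentence scores: proportional to A^T u, scaled to unit length.
        v = [sum(A[i][j] * u[i] for i in range(m)) for j in range(n)]
        norm_v = math.sqrt(sum(x * x for x in v))
        v = [x / norm_v for x in v]
    return u, v

def top_indices(scores, k):
    """Indices of the k largest scores, highest first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

# Hypothetical toy data: 4 terms (rows) and 3 sentences (columns).
toy_terms = ["search", "engine", "market", "share"]
toy_matrix = [
    [1, 1, 0],
    [1, 1, 0],
    [0, 1, 1],
    [0, 0, 1],
]
toy_u, toy_v = saliency_scores(toy_matrix)
key_terms = [toy_terms[i] for i in top_indices(toy_u, 2)]
key_sentence = top_indices(toy_v, 1)[0]
```

In the actual assignment the matrix comes from tmg, and a library SVD routine would normally replace the hand-written iteration.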

SPECIFIC TASKS

  1. Compute key words and key sentences for the article BBC-google.txt. Perform experiments with stemming using both the built-in tmg stemmer and the script stem. Also experiment with removing stop words.

  2. tmg has an option for normalizing the term-sentence matrix. Compare your results with and without this option.

  3. Download newspaper articles of your choice, and perform similar experiments.

  4. In Chapter 13 of the book, experiments were performed without normalizing the term-sentence matrix. Compute the key words and key sentences of the file pager-detex.stemmed using a normalized matrix, and compare with the results in the book. Which results are more representative?
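
To make the normalization in tasks 2 and 4 concrete, here is a minimal sketch. It assumes that normalizing means scaling each column (sentence) of the term-sentence matrix to unit Euclidean length; the exact behaviour of the tmg option should be checked in its documentation. The function name `normalize_columns` and the toy matrix are hypothetical.

```python
import math

def normalize_columns(A):
    """Scale every column of A (list of rows) to unit Euclidean length.

    Assumption: this mirrors what is meant by normalizing the
    term-sentence matrix. All-zero columns are left as zeros.
    """
    m, n = len(A), len(A[0])
    norms = [math.sqrt(sum(A[i][j] ** 2 for i in range(m))) for j in range(n)]
    return [
        [A[i][j] / norms[j] if norms[j] > 0 else 0.0 for j in range(n)]
        for i in range(m)
    ]

# Hypothetical toy data: the first sentence repeats two terms and so
# has a larger column norm than the others before normalization.
raw_matrix = [
    [2, 1, 0],
    [2, 1, 0],
    [0, 1, 1],
    [0, 0, 1],
]
normalized_matrix = normalize_columns(raw_matrix)
column_norms = [
    math.sqrt(sum(row[j] ** 2 for row in normalized_matrix))
    for j in range(len(raw_matrix[0]))
]
```

Without normalization, long sentences with many term occurrences tend to receive larger weight simply because their columns are larger; after this scaling every sentence enters the computation with the same length, which is the effect the comparison is meant to expose.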

DATA

Data and the stemming script are available here:
BBC-google.txt
common_words
common_words.stemmed
pager-detex.stemmed
stem

TMG can be obtained via the TMG project web page.