TMG - Text to Matrix Generator

  TMG parses a text collection and generates the term-document matrix.

  A = TMG(FILENAME) returns the term-document matrix that corresponds
  to the text collection contained in the files of directory (or file)
  FILENAME. Within each file, documents must be separated by a blank
  line (or by another delimiter defined in the OPTIONS argument).

  [A, DICTIONARY] = TMG(FILENAME) also returns the dictionary for the
  collection.

  [A, DICTIONARY, GLOBAL_WEIGHTS, NORMALIZATION_FACTORS] = TMG(FILENAME)
  also returns the vector of global weights for the dictionary and the
  normalization factor for each document, if such a factor is used. If
  normalization is not used, TMG returns a vector of all ones.

  [A, DICTIONARY, GLOBAL_WEIGHTS, NORMALIZATION_FACTORS, WORDS_PER_DOC]
  = TMG(FILENAME) also returns per-document statistics, i.e. the number
  of terms in each document.

  [A, DICTIONARY, GLOBAL_WEIGHTS, NORMALIZATION_FACTORS, WORDS_PER_DOC,
  TITLES, FILES] = TMG(FILENAME) also returns in FILES the filenames
  contained in directory (or file) FILENAME, and in TITLES a cell array
  that contains a declaratory title for each document, as well as the
  document's first line.

  [A, DICTIONARY, GLOBAL_WEIGHTS, NORMALIZATION_FACTORS, WORDS_PER_DOC,
  TITLES, FILES, UPDATE_STRUCT] = TMG(FILENAME) finally returns a
  structure that keeps the information needed to update (or downdate)
  the collection.

  TMG(FILENAME, OPTIONS) defines optional parameters:
    - OPTIONS.delimiter: The delimiter between documents within the
      same file. Possible values are 'emptyline' (default),
      'none_delimiter' (treats each file as a single document) or any
      other string.
    - OPTIONS.line_delimiter: Indicates whether the delimiter occupies
      a whole line of text (1, default) or not.
    - OPTIONS.stoplist: The filename of the stoplist, i.e. a list of
      common words that are excluded from indexing (default: no
      stoplist is used).
    - OPTIONS.stemming: Indicates whether the stemming algorithm is
      used (1) or not (0, default).
    - OPTIONS.min_length: The minimum length of a term (default 3).
    - OPTIONS.max_length: The maximum length of a term (default 30).
    - OPTIONS.min_local_freq: The minimum local frequency of a term
      (default 1).
    - OPTIONS.max_local_freq: The maximum local frequency of a term
      (default inf).
    - OPTIONS.min_global_freq: The minimum global frequency of a term
      (default 1).
    - OPTIONS.max_global_freq: The maximum global frequency of a term
      (default inf).
    - OPTIONS.local_weight: The local term weighting function
      (default 't'). Possible values (see [1, 2]):
        't': Term Frequency
        'b': Binary
        'l': Logarithmic
        'a': Alternate Log
        'n': Augmented Normalized Term Frequency
    - OPTIONS.global_weight: The global term weighting function
      (default 'x'). Possible values (see [1, 2]):
        'x': None
        'e': Entropy
        'f': Inverse Document Frequency (IDF)
        'g': GfIdf
        'n': Normal
        'p': Probabilistic Inverse
    - OPTIONS.normalization: Indicates whether the document vectors
      are normalized (default 'x'). Possible values:
        'x': None
        'c': Cosine
    - OPTIONS.dsp: Indicates whether results are displayed in the
      command window (1, default) or not (0).

  REFERENCES:
  [1] M. Berry and M. Browne, Understanding Search Engines:
      Mathematical Modeling and Text Retrieval, Philadelphia, PA:
      Society for Industrial and Applied Mathematics, 1999.
  [2] T. Kolda, Limited-Memory Matrix Methods with Applications,
      Tech. Report CS-TR-3806, 1997.

  Copyright 2004 Dimitrios Zeimpekis, Efstratios Gallopoulos
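
  EXAMPLE (illustrative sketch, not part of TMG):
  The weighting options above can be hard to picture without numbers. The following sketch does not call TMG. It applies, by hand, what the combination local_weight = 't', global_weight = 'f' and normalization = 'c' is commonly understood to mean, using a small made-up count matrix. The IDF formula log2(n / df) is an assumption taken from the usual textbook definition, and TMG's exact formula may differ. The variable names F, L, g, W and A are hypothetical.

```matlab
% Toy raw-count term-document matrix: rows are terms, columns are
% documents. The counts are made up for illustration only.
F = [2 0 1;
     0 1 1;
     1 1 0;
     0 0 3];

n = size(F, 2);                  % number of documents

% Local weight 't' (term frequency): keep the raw counts unchanged.
L = F;

% Global weight 'f' (IDF). ASSUMPTION: idf = log2(n / df), where df is
% the number of documents containing the term. TMG may define it
% differently.
df = sum(F > 0, 2);              % document frequency of each term
g  = log2(n ./ df);              % one global weight per term

% Scale each term row by its global weight.
W = diag(g) * L;

% Normalization 'c' (cosine): scale each document column to unit 2-norm.
col_norms = sqrt(sum(W .^ 2, 1));
col_norms(col_norms == 0) = 1;   % leave all-zero documents unchanged
A = W * diag(1 ./ col_norms);

disp(A);
```

  With TMG on the path, the corresponding request would set OPTIONS.local_weight = 't', OPTIONS.global_weight = 'f' and OPTIONS.normalization = 'c' before calling TMG(FILENAME, OPTIONS). Whether the result matches this sketch entry for entry depends on TMG's own definitions of these weights.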