TMG - Text to Matrix Generator
    TMG parses a text collection and generates the term - document matrix.
    A = TMG(FILENAME) returns the term - document matrix, that corresponds
    to the text collection contained in files of directory (or file) FILENAME.
    Each document must be separeted by a blank line (or another delimiter
    that is defined by OPTIONS argument) in each file.
    [A, DICTIONARY] = TMG(FILENAME) returns also the dictionary for the
    collection, while [A, DICTIONARY, GLOBAL_WEIGHTS, NORMALIZED_FACTORS]
    = TMG(FILENAME) returns the vectors of global weights for the dictionary
    and the normalization factor for each document in case such a factor is used.
    If normalization is not used TMG returns a vector of all ones.
    [A, DICTIONARY, GLOBAL_WEIGHTS, NORMALIZATION_FACTORS, WORDS_PER_DOC] =
    TMG(FILENAME) returns statistics for each document, i.e. the number of
    terms for each document.
    [A, DICTIONARY, GLOBAL_WEIGHTS, NORMALIZATION_FACTORS, WORDS_PER_DOC,
    TITLES, FILES] = TMG(FILENAME) returns in FILES the filenames contained
    in directory (or file) FILENAME and a cell array (TITLES) that containes a
    declaratory title for each document, as well as the document's first line.
    Finally [A, DICTIONARY, GLOBAL_WEIGHTS, NORMALIZATION_FACTORS, WORDS_PER_DOC,
    TITLES, FILES, UPDATE_STRUCT] = TMG(FILENAME) returns a structure that
    keeps the essential information for the collection' s update (or downdate).

    TMG(FILENAME, OPTIONS) defines optional parameters:
        - OPTIONS.delimiter: The delimiter between documents within the
          same file. Possible values are 'emptyline' (default), 'none_delimiter'
          (treats each file as a single document) or any other string
        - OPTIONS.line_delimiter: Defines if the delimiter takes a whole
          line of text (default, 1) or not.
        - OPTIONS.stoplist: The filename for the stoplist, i.e. a list of
          common words that we don't use for the indexing (default no
          stoplist used)
        - OPTIONS.stemming: Indicates if the stemming algorithm is used (1) or
          not (0 - default)
        - OPTIONS.min_length: The minimum length for a term (default 3)
        - OPTIONS.max_length: The maximum length for a term (default 30)
        - OPTIONS.min_local_freq: The minimum local frequency for a term
          (default 1)
        - OPTIONS.max_local_freq: The maximum local frequency for a term
          (default inf)
        - OPTIONS.min_global_freq: The minimum global frequency for a term
          (default 1)
        - OPTIONS.max_global_freq: The maximum global frequency for a term
          (default inf)
        - OPTIONS.local_weight: The local term weighting function (default
          't'). Possible values (see [1, 2]):
                't': Term Frequency
                'b': Binary
                'l': Logarithmic
                'a': Alternate Log
                'n': Augmented Normalized Term Frequency
        - OPTIONS.global_weight: The global term weighting function (default
          'x'). Possible values (see [1, 2]):
                'x': None
                'e': Entropy
                'f': Inverse Document Frequency (IDF)
                'g': GfIdf
                'n': Normal
                'p': Probabilistic Inverse
        - OPTIONS.normalization: Indicates if we normalize the document vectors
          (default 'x'). Possible values:
                'x': None
                'c': Cosine
        - OPTIONS.dsp: Displays results (default 1) or not (0) to the
          command window

    REFERENCES:
    [1] M.Berry and M.Browne, Understanding Search Engines, Mathematical Modeling
    and Text Retrieval, Philadelphia, PA: Society for Industrial and
    Applied Mathematics, 1999.
    [2] T.Kolda, Limited-Memory Matrix Methods with Applications,
    Tech.Report CS-TR-3806, 1997.

  Copyright 2004 Dimitrios Zeimpekis, Efstratios Gallopoulos