In this example of text analysis we consider a text processing application inspired by [Albright2006].
We used the Medlars data set, which is a collection of 1033 medical abstracts. We performed factorization on the term-by-document matrix, which has shape 4765 (terms) x 1033 (documents). The original number of terms is 16017; the reduced number is a result of text preprocessing, namely removing stop words, words that are too short, words that appear 2 times or less in the corpus, and words that appear 50 times or more.
The Medlars data set of medical abstracts used in this example is not included in the datasets and needs to be downloaded. Download links are listed in the datasets. Download the compressed version of the document text. To run the example, the extracted Medlars data set must exist in the Medlars directory under datasets.
Example of medical abstract:
autolysis of bacillus subtilis by glucose depletion . in cultures in minimal medium, rapid lysis of cells of bacillus subtilis was observed as soon as the carbon source, e.g. glucose, had been completely consumed . the cells died and ultraviolet-absorbing material was excreted in the medium . the results suggest that the cells lyse because of the presence of autolytic enzymes . in the presence of glucose the damage to the cell wall caused by these enzymes is repaired immediately .
Because of the nature of the analysis, the resulting data matrix is very sparse. We therefore use scipy.sparse matrix formats in the factorization, which lowers space consumption. With Standard NMF - Divergence, the fitted factorization model is sparse as well: according to the [Hoyer2004] measure of sparseness, the basis matrix has a sparseness of 0.641 and the mixture matrix 0.863.
This sparseness measure quantifies how much of a vector's energy is packed into only a few components. The sparseness of a vector is a real number in [0, 1]; a sparser vector has a value closer to 1. The measure is 1 iff the vector contains a single nonzero component, and it is 0 iff all components are equal. The sparseness of a matrix is the mean sparseness of its column vectors.
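The measure described above can be sketched in a few lines. This is a minimal illustration of the [Hoyer2004] formula, sparseness(x) = (sqrt(n) - ||x||_1 / ||x||_2) / (sqrt(n) - 1), and not the library's own implementation; the function names `sparseness` and `matrix_sparseness` are hypothetical, and the matrix is assumed to be given as a list of column vectors.

```python
import math


def sparseness(x):
    """Hoyer (2004) sparseness of a vector with at least two components."""
    n = len(x)
    l1 = sum(abs(v) for v in x)            # L1 norm
    l2 = math.sqrt(sum(v * v for v in x))  # L2 norm
    if l2 == 0:
        raise ValueError("sparseness is undefined for the zero vector")
    return (math.sqrt(n) - l1 / l2) / (math.sqrt(n) - 1)


def matrix_sparseness(columns):
    """Mean sparseness over a non-empty list of column vectors."""
    values = [sparseness(col) for col in columns]
    return sum(values) / len(values)
```

A vector with a single nonzero component, such as `[0, 0, 3, 0]`, scores 1, and a vector with all components equal, such as `[2, 2, 2, 2]`, scores 0, matching the two boundary cases stated in the text.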
The configuration of this example is a sparse data matrix with the Standard NMF - Divergence factorization method, using the Random Vcol algorithm for initialization and rank 15 (the number of hidden topics).
Because of the nonnegativity constraints, NMF has impressive benefits in terms of the interpretation of its factors. In text processing applications, the factorization rank can be considered the number of hidden topics present in the document collection. The basis matrix becomes a term-by-topic matrix whose columns are the basis vectors. A similar interpretation holds for the other factor, the mixture matrix. The mixture matrix is a topic-by-document matrix with sparse nonnegative columns. Element j of column 1 of the mixture matrix measures the strength to which topic j appears in document 1.
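To make this interpretation concrete, the following sketch reads topics off the two factors. It assumes dense NumPy arrays (a scipy.sparse factor would first need `.toarray()`); `top_terms`, `dominant_topic`, and the `idx2term` dictionary are hypothetical names used for illustration, not part of the example module.

```python
import numpy as np


def top_terms(W, idx2term, n_top=5):
    """For each topic (column of the basis matrix W), list the n_top
    terms with the largest weights, strongest first."""
    W = np.asarray(W)
    topics = []
    for k in range(W.shape[1]):
        order = np.argsort(W[:, k])[::-1][:n_top]  # descending by weight
        topics.append([idx2term[int(i)] for i in order])
    return topics


def dominant_topic(H):
    """For each document (column of the mixture matrix H), return the
    index of the topic with the largest coefficient."""
    return np.argmax(np.asarray(H), axis=0)
```

Each inner list returned by `top_terms` can be read as a rough label for one hidden topic, while `dominant_topic` assigns every document to the topic that appears in it most strongly.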
To run the example simply type:
or call the module’s function:
    import nimfa.examples
    nimfa.examples.documents.run()
This example uses the matplotlib library to produce a visual interpretation of the NMF basis vectors on the Medlars data set.
Perform NMF - Divergence factorization on the sparse Medlars data matrix.
Return basis and mixture matrices of the fitted factorization model.
Parameters: V (scipy.sparse.csr_matrix) – The Medlars data matrix.
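For readers who want to see what a divergence-based factorization does internally, here is a minimal NumPy sketch of the classical multiplicative updates that minimize the generalized Kullback-Leibler divergence. It is not the library's implementation: it works on a dense array, uses plain random initialization instead of Random Vcol, and the names `nmf_divergence` and `kl_divergence` are hypothetical.

```python
import numpy as np


def kl_divergence(V, WH, eps=1e-12):
    """Generalized Kullback-Leibler divergence D(V || WH)."""
    return float(np.sum(V * np.log((V + eps) / (WH + eps)) - V + WH))


def nmf_divergence(V, rank, n_iter=50, seed=0, eps=1e-12):
    """Factorize a dense nonnegative matrix V (terms x documents) into
    W (terms x rank) and H (rank x documents) by multiplicative updates."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, rank)) + eps  # random init (assumption, not Random Vcol)
    H = rng.random((rank, m)) + eps
    for _ in range(n_iter):
        R = V / (W @ H + eps)                            # elementwise ratio
        H *= (W.T @ R) / (W.sum(axis=0)[:, None] + eps)  # update mixture matrix
        R = V / (W @ H + eps)
        W *= (R @ H.T) / (H.sum(axis=1)[None, :] + eps)  # update basis matrix
    return W, H
```

Because the updates only multiply by nonnegative ratios, both factors stay nonnegative throughout, which is what makes the resulting basis vectors interpretable as topics.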
Plot the interpretation of NMF basis vectors on Medlars data set.
Preprocess Medlars data matrix. Remove stop words, digits, too short words, words that appear 2 times or less in the corpus and words that appear 50 times or more.
Return preprocessed term-by-document sparse matrix in CSR format. Returned matrix’s shape is 4765 (terms) x 1033 (documents). The sparse data matrix is converted to CSR format for fast arithmetic and matrix vector operations. Return updated index-to-term and term-to-index translators.
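The term filtering described above can be sketched on tokenized documents as follows. This is an illustration under stated assumptions, not the module's code: the stop word list is a tiny hypothetical placeholder, the minimum word length of 3 is assumed because the source does not state it, and "digits" is interpreted as any token containing a digit.

```python
from collections import Counter

# Hypothetical, tiny stop word list for illustration only.
STOP_WORDS = {"the", "of", "in", "and", "was", "to", "by"}


def select_terms(docs, min_len=3, min_count=3, max_count=49,
                 stop_words=STOP_WORDS):
    """Return the sorted vocabulary kept after filtering.

    docs is a list of token lists. A term is kept if it occurs more than
    2 and fewer than 50 times in the corpus, is not a stop word, has at
    least min_len characters (assumed threshold), and contains no digit.
    """
    counts = Counter(tok for doc in docs for tok in doc)
    return sorted(
        term for term, c in counts.items()
        if min_count <= c <= max_count
        and len(term) >= min_len
        and not any(ch.isdigit() for ch in term)
        and term not in stop_words
    )
```

The rows of the term-by-document matrix that correspond to the surviving terms are then kept, and the index translators are rebuilt to match the reduced vocabulary.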
Read medical abstracts data from Medlars data set.
Construct a term-by-document matrix. This matrix is sparse, therefore scipy.sparse format is used. For construction LIL sparse format is used, which is an efficient structure for constructing sparse matrices incrementally.
Return the Medlars sparse data matrix in LIL format, term-to-index dict translator and index-to-term dict translator.
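The incremental construction described above might look like the sketch below, which fills a `scipy.sparse.lil_matrix` with term counts. The function name `build_term_document_matrix` and the list-of-token-lists input are assumptions made for illustration; reading and tokenizing the Medlars files is left out.

```python
from scipy.sparse import lil_matrix


def build_term_document_matrix(docs):
    """Build a term-by-document count matrix from a list of token lists.

    Returns (V, term2idx, idx2term) with V in LIL format.
    """
    # Assign each distinct term the next free row index.
    term2idx = {}
    for doc in docs:
        for tok in doc:
            term2idx.setdefault(tok, len(term2idx))
    idx2term = {i: t for t, i in term2idx.items()}

    # LIL supports cheap element-wise assignment while the matrix is built.
    V = lil_matrix((len(term2idx), len(docs)))
    for j, doc in enumerate(docs):
        for tok in doc:
            V[term2idx[tok], j] += 1
    return V, term2idx, idx2term
```

Once construction is finished, `V.tocsr()` converts the matrix to CSR format, which is the format the preprocessing step returns for fast arithmetic and matrix vector operations.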
Run NMF - Divergence on the Medlars data set.