In this example of text analysis we consider the text processing application inspired by [Albright2006].
We used the Medlars data set, which is a collection of 1033 medical abstracts. For example we performed factorization on termbydocument matrix by constructing a matrix of shape 4765 (terms) x 1033 (documents). Original number of terms is 16017, the reduced number is a result of text preprocessing, namely removing stop words, too short words, words that appear 2 times or less in the corpus and words that appear 50 times or more.
Note
Medlars data set of medical abstracts used in this example is not included in the datasets and need to be downloaded. Download links are listed in the datasets. Download compressed version of document text. To run the example, the extracted Medlars data set must exist in the Medlars directory under datasets.
Example of medical abstract:
autolysis of bacillus subtilis by glucose depletion .
in cultures in minimal medium, rapid lysis of cells of bacillus
subtilis was observed as soon as the carbon source, e.g. glucose, had
been completely consumed . the cells died and ultravioletabsorbing
material was excreted in the medium . the results suggest that the cells
lyse because of the presence of autolytic enzymes . in the presence of
glucose the damage to the cell wall caused by these enzymes is repaired
immediately .
Because of the nature of analysis, the resulting data matrix is very sparse. Therefore we use scipy.sparse matrix formats in factorization. This results in lower space consumption. Using, Standard NMF  Divergence, fitted factorization model is sparse as well, according to [Hoyer2004] measure of sparseness, the basis matrix has sparseness of 0.641 and the mixture matrix 0.863.
Note
This sparseness measure quantifies how much energy of a vector is packed into only few components. The sparseness of a vector is a real number in [0, 1]. Sparser vector has value closer to 1. The measure is 1 iff vector contains single nonzero component and the measure is equal to 0 iff all components are equal. Sparseness of a matrix is the mean sparseness of its column vectors.
The configuration of this example is sparse data matrix with Standard NMF  Divergence factorization method using Random Vcol algorithm for initialization and rank 15 (the number of hidden topics).
Because of nonnegativity constraints, NMF has impressive benefits in terms of interpretation of its factors. In text processing applications, factorization rank can be considered the number of hidden topics present in the document collection. The basis matrix becomes a termbytopic matrix whose columns are the basis vectors. Similar interpretation holds for the other factor, mixture matrix. Mixture matrix is a topicbydocument matrix with sparse nonnegative columns. Element j of column 1 of mixture matrix measures the strength to which topic j appears in document 1.
To run the example simply type:
python documents.py
or call the module’s function:
import nimfa.examples
nimfa.examples.documents.run()
Note
This example uses matplotlib library for producing visual interpretation of NMF basis vectors on Medlars data set.
Perform NMF  Divergence factorization on the sparse Medlars data matrix.
Return basis and mixture matrices of the fitted factorization model.
Parameters:  V (scipy.sparse.csr_matrix) – The Medlars data matrix. 

Plot the interpretation of NMF basis vectors on Medlars data set.
Parameters: 


Preprocess Medlars data matrix. Remove stop words, digits, too short words, words that appear 2 times or less in the corpus and words that appear 50 times or more.
Return preprocessed termbydocument sparse matrix in CSR format. Returned matrix’s shape is 4765 (terms) x 1033 (documents). The sparse data matrix is converted to CSR format for fast arithmetic and matrix vector operations. Return updated indextoterm and termtoindex translators.
Parameters: 


Read medical abstracts data from Medlars data set.
Construct a termbydocument matrix. This matrix is sparse, therefore scipy.sparse format is used. For construction LIL sparse format is used, which is an efficient structure for constructing sparse matrices incrementally.
Return the Medlars sparse data matrix in LIL format, termtoindex dict translator and indextoterm dict translator.
Run NMF  Divergence on the Medlars data set.