nimfa - A Python Library for Nonnegative Matrix Factorization Techniques

Nimfa is an open-source Python library that provides a unified interface to nonnegative matrix factorization algorithms. It includes implementations of state-of-the-art factorization methods, initialization approaches, and quality scoring. Both dense and sparse matrix representations are supported.

A sample script that applies Nimfa to medulloblastoma gene expression data is given below. It uses alternating least squares nonnegative matrix factorization with the projected gradient method for subproblems [Lin2007] and the Random Vcol initialization algorithm [Albright2006]. The object returned by nimfa.mf_run is a fitted factorization model through which the user can access matrix factors and estimate quality measures.

import nimfa

V = nimfa.examples.medulloblastoma.read(normalize = True)

fctr = nimfa.mf(V, seed = 'random_vcol', method = 'lsnmf', rank = 40, max_iter = 65)
fctr_res = nimfa.mf_run(fctr)

print 'Rss: %5.4f' % fctr_res.fit.rss()
print 'Evar: %5.4f' % fctr_res.fit.evar()
print 'K-L divergence: %5.4f' % fctr_res.distance(metric = 'kl')
print 'Sparseness, W: %5.4f, H: %5.4f' % fctr_res.fit.sparseness()

Running this script produces the following output; slight differences in the reported scores across runs can be attributed to the randomness of the Random Vcol initialization method:

Rss: 0.2668
Evar: 0.9997
K-L divergence: 38.8744
Sparseness, W: 0.7297, H: 0.8796

Further real-world examples and demonstrations of the library's capabilities are provided in the Examples section and in the sample scripts on this page.

Content

Matrix Factorization Methods

  • BD - Bayesian nonnegative matrix factorization Gibbs sampler [Schmidt2009]
  • BMF - Binary matrix factorization [Zhang2007]
  • ICM - Iterated conditional modes nonnegative matrix factorization [Schmidt2009]
  • LFNMF - Fisher nonnegative matrix factorization for learning local features [Wang2004], [Li2001]
  • LSNMF - Alternating nonnegative least squares matrix factorization using projected gradient method for subproblems [Lin2007]
  • NMF - Standard nonnegative matrix factorization with Euclidean / Kullback-Leibler update equations and Frobenius / divergence / connectivity cost functions [Lee2001], [Brunet2004]
  • NSNMF - Nonsmooth nonnegative matrix factorization [Montano2006]
  • PMF - Probabilistic nonnegative matrix factorization [Laurberg2008], [Hansen2008]
  • PSMF - Probabilistic sparse matrix factorization [Dueck2005], [Dueck2004], [Srebro2001], [Li2007]
  • SNMF - Sparse nonnegative matrix factorization based on alternating nonnegativity constrained least squares [Park2007]
  • SNMNMF - Sparse network-regularized multiple nonnegative matrix factorization [Zhang2011]
  • PMFCC - Penalized matrix factorization for constrained clustering [FWang2008]
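
Each algorithm is selected by passing its lowercase name as the method argument of nimfa.mf, as the Usage examples below show for NMF, LSNMF, ICM and BMF. The following is a minimal sketch assuming the same naming convention holds for SNMF; algorithm specific options are left at their defaults:

# Import nimfa library entry point for factorization.
import nimfa

import numpy as np

# Random target matrix for illustration.
V = np.random.rand(30, 15)

# Select Sparse NMF by its lowercase name; only general parameters are set.
fctr = nimfa.mf(V, method = "snmf", seed = "random_vcol", rank = 4, max_iter = 20)
fctr_res = nimfa.mf_run(fctr)

# Print sparseness (Hoyer, 2004) of the basis and mixture matrix.
print "Sparseness, W: %5.4f, H: %5.4f" % fctr_res.fit.sparseness()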

Initialization Methods

  • Random
  • Fixed
  • NNDSVD - Nonnegative double singular value decomposition [Boutsidis2007]
  • Random C [Albright2006]
  • Random Vcol [Albright2006]
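
The initialization algorithm is selected through the seed parameter of nimfa.mf. A minimal sketch follows that runs the same factorization with two different seeding strategies, using only seed names that appear elsewhere on this page:

# Import nimfa library entry point for factorization.
import nimfa

import numpy as np
V = np.random.rand(20, 10)

# Run the same factorization with two different seeding strategies.
for seeding in ("random_c", "random_vcol"):
    fctr = nimfa.mf(V, seed = seeding, method = "nmf", rank = 3, max_iter = 20)
    fctr_res = nimfa.mf_run(fctr)
    print "%s: Euclidean distance: %5.3e" % (seeding, fctr_res.distance(metric = "euclidean"))
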
Quality Measures

  • Distance
  • Residuals
  • Connectivity matrix
  • Consensus matrix
  • Entropy of the fitted NMF model [Park2007]
  • Dominant basis components computation
  • Explained variance
  • Feature score computation, representing the specificity of each feature to the basis vectors [Park2007]
  • Computation of the most basis-specific features for each basis vector [Park2007]
  • Purity [Park2007]
  • Residual sum of squares (rank estimation) [Hutchins2008], [Frigyesi2008]
  • Sparseness [Hoyer2004]
  • Cophenetic correlation coefficient of consensus matrix (rank estimation) [Brunet2004]
  • Dispersion [Park2007]
  • Factorization rank estimation
  • Quality measures specific to the selected matrix factorization method
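
Most of these measures are accessed through the fitted model's summary(), as the Usage examples below demonstrate. For factorization rank estimation, a common recipe [Brunet2004] is to factorize at several candidate ranks and compare the cophenetic correlation coefficients; a sketch using only calls demonstrated elsewhere on this page:

# Import nimfa library entry point for factorization.
import nimfa

import numpy as np
V = np.random.rand(30, 15)

# Compare cophenetic correlation across candidate ranks; a sharp drop in the
# coefficient suggests that the preceding rank is appropriate [Brunet2004].
for rank in (2, 3, 4, 5):
    fctr = nimfa.mf(V, method = "bmf", rank = rank, max_iter = 10, n_run = 3, track_factor = True)
    fctr_res = nimfa.mf_run(fctr)
    print "Rank: %d, cophenetic: %8.3f" % (rank, fctr_res.summary()['cophenetic'])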

Utils

  • Tracking of the fitted factorization model across multiple runs
  • Tracking of residuals across multiple factorizations / runs

Installation

Source

No special installation procedure is required. However, the nimfa library makes extensive use of the SciPy and NumPy libraries for fast and convenient dense and sparse matrix manipulation and some linear algebra operations. There are no additional prerequisites.

Download the source code (zipped archive) from the GitHub repository.

Unzip the archive. To build and install, run:

python setup.py install

Documentation

To build the documentation, use Sphinx 1.0 or newer. Sphinx is available from the Sphinx home page, and the nimfa documentation sources are available in the GitHub repository. Before building the documentation, please install the nimfa library.

Documentation can be built by issuing:

sphinx-build -b html -E <source dir [docs/source]> <build dir [html]>

from the nimfa root directory. After completion, the documentation resides in the html directory and should closely resemble the documentation you are currently reading.

Note

The nimfa library documentation is contained in the docs/source directory and the library scripts are in the nimfa directory.

Configuration

Methods are configured through:

  1. runtime specific options (e.g. tracking the fitted model across multiple runs, tracking residuals across iterations);
  2. algorithm specific options (e.g. prior information with PSMF, type of update equations with NMF, initial value for noise variance with ICM).

For details and descriptions of algorithm specific options, see the documentation of the particular algorithm. For details on runtime specific options and an explanation of the general model parameters, see mf_run.
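
Both kinds of options are passed as keyword arguments to the same nimfa.mf call. The sketch below combines runtime specific options (n_run, track_error) with Standard NMF specific options (update, objective); all of these parameters appear in the Usage examples that follow:

# Import nimfa library entry point for factorization.
import nimfa

import numpy as np
V = np.random.rand(30, 15)

# Runtime specific options (n_run, track_error) combined with
# Standard NMF specific options (update, objective).
fctr = nimfa.mf(V, method = "nmf", rank = 3, max_iter = 30, n_run = 2,
                track_error = True, update = 'divergence', objective = 'div')
fctr_res = nimfa.mf_run(fctr)

# Print the distance according to the Kullback-Leibler divergence.
print "K-L divergence: %5.3e" % fctr_res.distance(metric = "kl")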

Usage

The following are basic usage examples that employ the different implemented factorization algorithms. For examples with real-world applications, see the Examples section and the methods’ documentation.

Note

Consider these examples as Hello World examples for the nimfa library.

Standard NMF - Divergence on a scipy.sparse matrix with matrix factor estimation.

# Import nimfa library entry point for factorization
import nimfa

# Construct a sparse matrix in CSR format, which will be our input for factorization
from scipy.sparse import csr_matrix
from numpy import array, dot
V = csr_matrix((array([1,2,3,4,5,6]), array([0,2,2,0,1,2]), array([0,2,3,6])), shape=(3,3))

# Print this tiny matrix in dense format
print V.todense()

# Run Standard NMF rank 4 algorithm
# Update equations and cost function are Standard NMF specific parameters (among others).
# If not specified, the Euclidean update and Frobenius cost function are used.
# We don't specify the initialization method; algorithm specific or random initialization will be used.
# In the Standard NMF case, random initialization is used by default.
# The returned object is a fitted factorization model through which the user can access quality and performance measures.
# The fctr_res's attribute `fit` contains all the attributes of the factorization.
fctr = nimfa.mf(V, method = "nmf", max_iter = 30, rank = 4, update = 'divergence', objective = 'div')
fctr_res = nimfa.mf_run(fctr)

# Basis matrix. It is sparse, as input V was sparse as well.
W = fctr_res.basis()
print "Basis matrix"
print W.todense()

# Mixture matrix. We print this tiny matrix in dense format.
H = fctr_res.coef()
print "Coef"
print H.todense()

# Print the distance according to the Kullback-Leibler divergence. By default, the Euclidean metric is used.
print "Distance Kullback-Leibler: %5.3e" % fctr_res.distance(metric = "kl")

# Compute generic set of measures to evaluate the quality of the factorization
sm = fctr_res.summary()
# Print sparseness (Hoyer, 2004) of basis and mixture matrix
print "Sparseness Basis: %5.3f  Mixture: %5.3f" % (sm['sparseness'][0], sm['sparseness'][1])
# Print actual number of iterations performed
print "Iterations: %d" % sm['n_iter']

# Print estimate of target matrix V
print "Estimate"
print dot(W.todense(), H.todense())

LSNMF on a numpy dense matrix with quality and performance measures.

# Import nimfa library entry point for factorization
import nimfa

# Here we will work with numpy matrix
import numpy as np
V = np.matrix([[1,2,3],[4,5,6],[6,7,8]])

# Print this tiny matrix 
print V

# Run LSNMF rank 3 algorithm
# We don't specify any algorithm specific parameters. Defaults will be used.
# We don't specify the initialization method; algorithm specific or random initialization will be used.
# In the LSNMF case, random initialization is used by default.
# The returned object is a fitted factorization model through which the user can access quality and performance measures.
# The fctr_res's attribute `fit` contains all the attributes of the factorization.
fctr = nimfa.mf(V, method = "lsnmf", max_iter = 10, rank = 3)
fctr_res = nimfa.mf_run(fctr)

# Basis matrix.
W = fctr_res.basis()
print "Basis matrix"
print W

# Mixture matrix. 
H = fctr_res.coef()
print "Coef"
print H

# Print the distance according to the Kullback-Leibler divergence. By default, the Euclidean metric is used.
print "Distance Kullback-Leibler: %5.3e" % fctr_res.distance(metric = "kl")

# Compute generic set of measures to evaluate the quality of the factorization
sm = fctr_res.summary()
# Print residual sum of squares (Hutchins, 2008). Can be used for estimating optimal factorization rank.
print "Rss: %8.3f" % sm['rss']
# Print explained variance.
print "Evar: %8.3f" % sm['evar']
# Print actual number of iterations performed
print "Iterations: %d" % sm['n_iter']

# Print estimate of target matrix V 
print "Estimate"
print np.dot(W, H)

LSNMF with Random Vcol initialization and error tracking.

# Import nimfa library entry point for factorization
import nimfa

# Here we will work with numpy matrix
import numpy as np
V = np.matrix([[1,2,3],[4,5,6],[6,7,8]])

# Print this tiny matrix 
print V

# Run LSNMF rank 3 algorithm
# We don't specify any algorithm specific parameters. Defaults will be used.
# We specify the Random Vcol initialization algorithm.
# We enable tracking the error from each iteration of the factorization; by default, only the final value of the objective function is retained.
# Perform initialization.
fctr = nimfa.mf(V, seed = "random_vcol", method = "lsnmf", max_iter = 10, rank = 3, track_error = True)

# The returned object is a fitted factorization model through which the user can access quality and performance measures.
# The fctr_res's attribute `fit` contains all the attributes of the factorization.
fctr_res = nimfa.mf_run(fctr)

# Basis matrix.
W = fctr_res.basis()
print "Basis matrix"
print W

# Mixture matrix. 
H = fctr_res.coef()
print "Coef"
print H

# Error tracking. 
print "Error tracking"
# A list of objective function values from each iteration of the factorization is printed.
# If error tracking is enabled and the user specifies multiple runs of the factorization, get_error(run = n) returns the list of objective values from the n-th run.
# fctr_res.fit.tracker is an instance of Mf_track -- isinstance(fctr_res.fit.tracker, nimfa.models.mf_track.Mf_track)
print fctr_res.fit.tracker.get_error()

# Compute generic set of measures to evaluate the quality of the factorization
sm = fctr_res.summary()
# Print residual sum of squares (Hutchins, 2008). Can be used for estimating optimal factorization rank.
print "Rss: %8.3f" % sm['rss']
# Print explained variance.
print "Evar: %8.3f" % sm['evar']
# Print actual number of iterations performed
print "Iterations: %d" % sm['n_iter']

ICM with Random C initialization and a passed callback initialization function.

# Import nimfa library entry point for factorization
import nimfa

# Here we will work with numpy matrix
import numpy as np
V = np.matrix([[1,2,3],[4,5,6],[6,7,8]])

# Print this tiny matrix 
print V


# This will be our callback_init function called prior to factorization.
# We will only print the initialized matrix factors.
def init_info(model):
    print "Initialized basis matrix\n", model.basis()
    print "Initialized  mixture matrix\n", model.coef() 

# Run ICM rank 3 algorithm
# We don't specify any algorithm specific parameters. Defaults will be used.
# We specify the Random C initialization algorithm.
# We specify the callback_init parameter by passing the init_info function.
# This function is called after initialization and prior to factorization in each run.
fctr = nimfa.mf(V, seed = "random_c", method = "icm", max_iter = 10, rank = 3, callback_init = init_info)

# The returned object is a fitted factorization model through which the user can access quality and performance measures.
# The fctr_res's attribute `fit` contains all the attributes of the factorization.
fctr_res = nimfa.mf_run(fctr)

# Basis matrix.
W = fctr_res.basis()
print "Resulting basis matrix"
print W

# Mixture matrix. 
H = fctr_res.coef()
print "Resulting mixture matrix"
print H

# Compute generic set of measures to evaluate the quality of the factorization
sm = fctr_res.summary()
# Print residual sum of squares (Hutchins, 2008). Can be used for estimating optimal factorization rank.
print "Rss: %8.3e" % sm['rss']
# Print explained variance.
print "Evar: %8.3e" % sm['evar']
# Print actual number of iterations performed
print "Iterations: %d" % sm['n_iter']
# Print distance according to Kullback-Leibler divergence
print "KL divergence: %5.3e" % sm['kl']
# Print distance according to Euclidean metric
print "Euclidean distance: %5.3e" % sm['euclidean'] 

BMF with default parameters, multiple runs and factor tracking for consensus matrix computation.

# Import nimfa library entry point for factorization.
import nimfa

# Here we will work with numpy matrix.
import numpy as np
V = np.random.random((23, 200))

# Run BMF.
# We don't specify any algorithm parameters or initialization method. Defaults will be used.
# Factorization will be run 3 times (n_run) and factors will be tracked for computing 
# cophenetic correlation. Note increased time and space complexity.
fctr = nimfa.mf(V, method = "bmf", max_iter = 10, rank = 30, n_run = 3, track_factor = True)
fctr_res = nimfa.mf_run(fctr)

# Print the loss function according to Kullback-Leibler divergence. 
print "Distance Kullback-Leibler: %5.3e" % fctr_res.distance(metric = "kl")

# Compute generic set of measures to evaluate the quality of the factorization.
sm = fctr_res.summary()
# Print residual sum of squares.
print "Rss: %8.3f" % sm['rss']
# Print explained variance.
print "Evar: %8.3f" % sm['evar']
# Print actual number of iterations performed.
print "Iterations: %d" % sm['n_iter']
# Print cophenetic correlation. Can be used for rank estimation.
print "cophenetic: %8.3f" % sm['cophenetic']

Standard NMF - Euclidean update equations and fixed initialization (passed matrix factors).

# Import nimfa library entry point for factorization.
import nimfa

# Here we will work with numpy matrix.
import numpy as np

# Generate random target matrix.
V = np.random.rand(30, 20)

# Generate random matrix factors which we will pass as fixed factors to Nimfa.
init_W = np.random.rand(30, 4)
init_H = np.random.rand(4, 20)
# By passing these factors we implicitly choose rank = 4.

# Run NMF.
# We don't specify any algorithm parameters. Defaults will be used.
# We specify fixed initialization method and pass matrix factors.
fctr = nimfa.mf(V, method = "nmf", seed = "fixed", W = init_W, H = init_H, rank = 4)
fctr_res = nimfa.mf_run(fctr)

# Print the loss function (Euclidean distance between target matrix and its estimate). 
print "Euclidean distance: %5.3e" % fctr_res.distance(metric = "euclidean")

# It should print 'fixed'.
print fctr_res.seeding

# By default, max 30 iterations are performed.
print fctr_res.n_iter

References

[Schmidt2009] Mikkel N. Schmidt, Ole Winther, and Lars K. Hansen. Bayesian non-negative matrix factorization. In Proceedings of the 9th International Conference on Independent Component Analysis and Signal Separation, pages 540-547, Paraty, Brazil, 2009.
[Zhang2007] Zhongyuan Zhang, Tao Li, Chris H. Q. Ding and Xiangsun Zhang. Binary matrix factorization with applications. In Proceedings of the 7th IEEE International Conference on Data Mining, pages 391-400, Omaha, USA, 2007.
[Wang2004] Yuan Wang, Yunde Jia, Changbo Hu and Matthew Turk. Fisher non-negative matrix factorization for learning local features. In Proceedings of the 6th Asian Conference on Computer Vision, pages 27-30, Jeju, Korea, 2004.
[Li2001] Stan Z. Li, Xinwen Hou, Hongjiang Zhang and Qian S. Cheng. Learning spatially localized, parts-based representation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 207-212, Kauai, USA, 2001.
[Lin2007] Chih-Jen Lin. Projected gradient methods for nonnegative matrix factorization. Neural Computation, 19(10): 2756-2779, 2007.
[Lee2001] Daniel D. Lee and H. Sebastian Seung. Algorithms for non-negative matrix factorization. In Proceedings of Neural Information Processing Systems, pages 556-562, Vancouver, Canada, 2001.
[Lee1999] Daniel D. Lee and H. Sebastian Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 401(6755): 788-791, 1999.
[Brunet2004] Jean-P. Brunet, Pablo Tamayo, Todd R. Golub and Jill P. Mesirov. Metagenes and molecular pattern discovery using matrix factorization. Proceedings of the National Academy of Sciences of the USA, 101(12): 4164-4169, 2004.
[Montano2006] Alberto Pascual-Montano, J. M. Carazo, Kieko Kochi, Dietrich Lehmann and Roberto D. Pascual-Marqui. Nonsmooth nonnegative matrix factorization (nsNMF). IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(3): 403-415, 2006.
[Laurberg2008] Hans Laurberg, Mads G. Christensen, Mark D. Plumbley, Lars K. Hansen and Soren H. Jensen. Theorems on positive data: on the uniqueness of NMF. Computational Intelligence and Neuroscience, doi: 10.1155/2008/764206, 2008.
[Hansen2008] Lars K. Hansen. Generalization in high-dimensional factor models. Web: http://www.stanford.edu/group/mmds/slides2008/hansen.pdf, 2008.
[Dueck2005] Delbert Dueck, Quaid D. Morris and Brendan J. Frey. Multi-way clustering of microarray data using probabilistic sparse matrix factorization. Bioinformatics, 21(Suppl 1): 144-151, 2005.
[Dueck2004] Delbert Dueck and Brendan J. Frey. Probabilistic sparse matrix factorization. Technical Report PSI-2004-23, Probabilistic and Statistical Inference Group, University of Toronto, 2004.
[Srebro2001] Nathan Srebro and Tommi Jaakkola. Sparse matrix factorization of gene expression data. Artificial Intelligence Laboratory, Massachusetts Institute of Technology, 2001.
[Li2007] Huan Li, Yu Sun and Ming Zhan. The discovery of transcriptional modules by a two-stage matrix decomposition approach. Bioinformatics, 23(4): 473-479, 2007.
[Park2007] Hyunsoo Kim and Haesun Park. Sparse non-negative matrix factorizations via alternating non-negativity-constrained least squares for microarray data analysis. Bioinformatics, 23(12): 1495-1502, 2007.
[Zhang2011] Shihua Zhang, Qingjiao Li and Xianghong J. Zhou. A novel computational framework for simultaneous integration of multiple types of genomic data to identify microRNA-gene regulatory modules. Bioinformatics, 27(13): i401-i409, 2011.
[Boutsidis2007] Christos Boutsidis and Efstratios Gallopoulos. SVD-based initialization: A head start for nonnegative matrix factorization. Pattern Recognition, 41(4): 1350-1362, 2008.
[Albright2006] Russell Albright, Carl D. Meyer and Amy N. Langville. Algorithms, initializations, and convergence for the nonnegative matrix factorization. NCSU Technical Report Math 81706, NC State University, Raleigh, USA, 2006.
[Hoyer2004] Patrik O. Hoyer. Non-negative matrix factorization with sparseness constraints. Journal of Machine Learning Research, 5: 1457-1469, 2004.
[Hutchins2008] Lucie N. Hutchins, Sean P. Murphy, Priyam Singh and Joel H. Graber. Position-dependent motif characterization using non-negative matrix factorization. Bioinformatics, 24(23): 2684-2690, 2008.
[Frigyesi2008] Attila Frigyesi and Mattias Hoglund. Non-negative matrix factorization for the analysis of complex gene expression data: identification of clinically relevant tumor subtypes. Cancer Informatics, 6: 275-292, 2008.
[FWang2008] Fei Wang, Tao Li and Changshui Zhang. Semi-supervised clustering via matrix factorization. In Proceedings of the SIAM International Conference on Data Mining, pages 1-12, 2008.
[Schietgat2010] Leander Schietgat, Celine Vens, Jan Struyf, Hendrik Blockeel, Dragi Kocev and Saso Dzeroski. Predicting gene function using hierarchical multi-label decision tree ensembles. BMC Bioinformatics, 11(2), 2010.
[Schachtner2008] R. Schachtner, D. Lutter, P. Knollmueller, A. M. Tome, F. J. Theis, G. Schmitz, M. Stetter, P. Gomez Vilda and E. W. Lang. Knowledge-based gene expression classification via matrix factorization. Bioinformatics, 24(15): 1688-1697, 2008.

Acknowledgements

We would like to acknowledge support for this project from the Google Summer of Code 2011 program and from the Slovenian Research Agency grants P2-0209, J2-9699, and L2-1112.

Nimfa - A Python Library for Nonnegative Matrix Factorization Techniques was developed by Marinka Zitnik as part of the Google Summer of Code 2011 program, under the supervision of the Orange organization and mentor Prof. Blaz Zupan, PhD.


Citation

@article{ZitnikZ12,
  author    = {Marinka Zitnik and Blaz Zupan},
  title     = {NIMFA: A Python Library for Nonnegative Matrix Factorization},
  journal   = {Journal of Machine Learning Research},
  volume    = {13},
  year      = {2012},
  pages     = {849-853},
}

Disclaimer

This software and data are provided as-is, and there are no guarantees that they fit your purposes or that they are bug-free. Use them at your own risk!

License

nimfa - A Python Library for Nonnegative Matrix Factorization Techniques Copyright (C) 2011-2012 Marinka Zitnik and Blaz Zupan

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.