nimfa - A Python Library for Nonnegative Matrix Factorization Techniques

Nimfa is an open-source Python library that provides a unified interface to nonnegative matrix factorization algorithms. It includes implementations of state-of-the-art factorization methods, initialization approaches, and quality scoring. Both dense and sparse matrix representation are supported.

The sample script using Nimfa on medulloblastoma gene expression data is given below. It uses alternating least squares nonnegative matrix factorization with projected gradient method for subproblems [Lin2007] and Random Vcol [Albright2006] initialization algorithm. An object returned by nimfa.mf_run is fitted factorization model through which user can access matrix factors and estimate quality measures.

import nimfa

V = nimfa.examples.medulloblastoma.read(normalize=True)

fctr = nimfa.mf(V, seed='random_vcol', method='lsnmf', rank=40, max_iter=65)
fctr_res = nimfa.mf_run(fctr)

print 'Rss: %5.4f' % fctr_res.fit.rss()
print 'Evar: %5.4f' % fctr_res.fit.evar()
print 'K-L divergence: %5.4f' % fctr_res.distance(metric='kl')
print 'Sparseness, W: %5.4f, H: %5.4f' % fctr_res.fit.sparseness()

Running this script produces the following output, where slight differences in reported scores across different runs can be attributed to randomness of the Random Vcol initialization method:

Rss: 0.2668
Evar: 0.9997
K-L divergence: 38.8744
Sparseness, W: 0.7297, H: 0.8796

Further real-world examples and demonstrations of library capabilities are provided in the Examples section and in sample scripts on this page.

Content

Matrix Factorization Methods

  • BD - Bayesian nonnegative matrix factorization Gibbs sampler [Schmidt2009]
  • BMF - Binary matrix factorization [Zhang2007]
  • ICM - Iterated conditional modes nonnegative matrix factorization [Schmidt2009]
  • LFNMF - Fisher nonnegative matrix factorization for learning local features [Wang2004], [Li2001]
  • LSNMF - Alternating nonnegative least squares matrix factorization using projected gradient method for subproblems [Lin2007]
  • NMF - Standard nonnegative matrix factorization with Euclidean / Kullback-Leibler update equations and Frobenius / divergence / connectivity cost functions [Lee2001], [Brunet2004]
  • NSNMF - Nonsmooth nonnegative matrix factorization [Montano2006]
  • PMF - Probabilistic nonnegative matrix factorization [Laurberg2008], [Hansen2008]
  • PSMF - Probabilistic sparse matrix factorization [Dueck2005], [Dueck2004], [Srebro2001], [Li2007]
  • SNMF - Sparse nonnegative matrix factorization based on alternating nonnegativity constrained least squares [Park2007]
  • SNMNMF - Sparse network-regularized multiple nonnegative matrix factorization [Zhang2011]
  • PMFCC - Penalized Matrix Factorization for Constrained Clustering [FWang2008]

Initialization Methods

Quality Measures

  • Distance
  • Residuals
  • Connectivity matrix
  • Consensus matrix
  • Entropy of the fitted NMF model [Park2007]
  • Dominant basis components computation
  • Explained variance
  • Feature score computation representing its specificity to basis vectors [Park2007]
  • Computation of most basis specific features for basis vectors [Park2007]
  • Purity [Park2007]
  • Residual sum of squares (rank estimation) [Hutchins2008], [Frigyesi2008]
  • Sparseness [Hoyer2004]
  • Cophenetic correlation coefficient of consensus matrix (rank estimation) [Brunet2004]
  • Dispersion [Park2007]
  • Factorization rank estimation
  • Selected matrix factorization method specific

Utils

  • Fitted factorization model tracker across multiple runs
  • Residuals tracker across multiple factorizations / runs

Installation

Source

No special installation procedure is specified. However, the nimfa library makes extensive use of SciPy and NumPy libraries for fast and convenient dense and sparse matrix manipulation and some linear algebra operations. There are not any additional prerequisites.

Download source code (zipped archive) from Github repository.

Unzip the archive. To build and install run:

python setup.py install

Documentation

For building the documentation use Sphinx 1.0 or newer. Sphinx is available at Sphinx home page and nimfa library documentation sources are available at Github repository. Before building documentation, please install the nimfa library.

Documentation can be built by issuing:

sphinx-build -b html -E <source dir [docs/source]>  <build dir [html]>

from the nimfa root directory. After completion, the documentation should reside in the html directory. The built documentation should resemble closely the one you are currently reading.

Note

The nimfa library documentation is contained in docs source directory and and scripts are in nimfa directory.

Configuration

Methods configuration goes through:

  1. runtime specific options (e. g. tracking fitted model across multiple runs, tracking residuals across iterations, etc.);
  2. algorithm specific options (e. g. prior information with PSMF, type of update equations with NMF, initial value for noise variance with ICM, etc.).

For details and descriptions on algorithm specific options see specific algorithm documentation. For details on runtime specific options and explanation of the general model parameters see mf_run.

Usage

Following are basic usage examples that employ different implemented factorization algorithms. For examples with real world applications see Examples section and methods’ documentation.

Note

Consider these examples as Hello World examples for the nimfa library.

Standard NMF - Divergence on scipy.sparse matrix with matrix factors estimation.

import nimfa

# Construct sparse matrix in CSR format
from scipy.sparse import csr_matrix
from scipy import array
from numpy import dot

V = csr_matrix((array([1, 2, 3, 4, 5, 6]), array([0, 2, 2, 0, 1, 2]), array([0, 2, 3, 6])), shape=(3, 3))
print V.todense()

# Standard NMF rank 4 algorithm
# Returned object is fitted factorization model
fctr = nimfa.mf(V, method="nmf", max_iter=30, rank=4, update='divergence', objective='div')
# The fctr_res's attribute `fit` contains factorization method attributes
fctr_res = nimfa.mf_run(fctr)

# Basis matrix.
W = fctr_res.basis()
print "Basis matrix"
print W.todense()

# Mixture matrix.
H = fctr_res.coef()
print "Coef"
print H.todense()

# Kullback-Leibler divergence value
print "Distance Kullback-Leibler: %5.3e" % fctr_res.distance(metric="kl")

# Generic measures of factorization quality
sm = fctr_res.summary()
# Sparseness (Hoyer, 2004) of basis and mixture matrix
print "Sparseness Basis: %5.3f  Mixture: %5.3f" % (sm['sparseness'][0], sm['sparseness'][1])
# Actual number of iterations performed
print "Iterations: %d" % sm['n_iter']

# Estimate of target matrix V
print "Estimate"
print dot(W.todense(), H.todense())

LSNMF on numpy dense matrix with quality and performance measures.

import nimfa

# Here we will work with numpy matrix
import numpy as np

V = np.matrix([[1, 2, 3], [4, 5, 6], [6, 7, 8]])
print V

# LSNMF rank 3 algorithm
fctr = nimfa.mf(V, method="lsnmf", max_iter=10, rank=3)
fctr_res = nimfa.mf_run(fctr)

# Basis matrix.
W = fctr_res.basis()
print "Basis matrix"
print W

# Mixture matrix.
H = fctr_res.coef()
print "Coef"
print H

# Kullback-Leibler divergence. By
print "Distance Kullback-Leibler: %5.3e" % fctr_res.distance(metric="kl")

sm = fctr_res.summary()
# Residual sum of squares - can be used for factorization rank selection
print "Rss: %8.3f" % sm['rss']
# Explained variance.
print "Evar: %8.3f" % sm['evar']
# Actual number of iterations performed
print "Iterations: %d" % sm['n_iter']

# Estimate of target matrix V
print "Estimate"
print np.dot(W, H)

LSNMF with Random VCol initialization and error tracking.

import nimfa
import numpy as np

V = np.matrix([[1, 2, 3], [4, 5, 6], [6, 7, 8]])
print V

# LSNMF rank 3 algorithm
fctr = nimfa.mf(V, seed="random_vcol", method="lsnmf", max_iter=10, rank=3, track_error=True)
fctr_res = nimfa.mf_run(fctr)

# Basis matrix.
W = fctr_res.basis()
print "Basis matrix"
print W

# Mixture matrix.
H = fctr_res.coef()
print "Coef"
print H

# Error tracking.
print "Error tracking"
# A list of objective function values for each iteration in factorization is printed.
# If error tracking is enabled and user specifies multiple runs of the factorization, get_error(run = n) return a list of objective values from n-th run.
# fctr_res.fit.tracker is an instance of Mf_track --
# isinstance(fctr_res.fit.tracker, nimfa.models.mf_track.Mf_track)
print fctr_res.fit.tracker.get_error()

# Generic set of measures to evaluate factorization quality
sm = fctr_res.summary()
print "Rss: %8.3f" % sm['rss']
print "Evar: %8.3f" % sm['evar']
print "Iterations: %d" % sm['n_iter']

ICM with Random C initialization and passed callback initialization function.

import nimfa

import numpy as np

V = np.matrix([[1, 2, 3], [4, 5, 6], [6, 7, 8]])
print V

# Initialization callback function
def init_info(model):
    print "Initialized basis matrix\n", model.basis()
    print "Initialized  mixture matrix\n", model.coef()

# ICM rank 3 algorithm
# We specify callback_init parameter by passing a init_info function
# Callback is called after initialization and prior to factorization in each run.
fctr = nimfa.mf(V, seed="random_c", method="icm", max_iter=10, rank=3, callback_init=init_info)
fctr_res = nimfa.mf_run(fctr)

# Basis matrix.
W = fctr_res.basis()
print "Resulting basis matrix"
print W

# Mixture matrix.
H = fctr_res.coef()
print "Resulting mixture matrix"
print H

sm = fctr_res.summary()
print "Rss: %8.3e" % sm['rss']
print "Evar: %8.3e" % sm['evar']
print "Iterations: %d" % sm['n_iter']
print "KL divergence: %5.3e" % sm['kl']
print "Euclidean distance: %5.3e" % sm['euclidean']

BMF with default parameters, multiple runs and factor tracking for consensus matrix computation.

import nimfa

import numpy as np

V = np.random.rand(23, 200)

# Run BMF.
# Factorization will be run 3 times (n_run) and factors will be tracked for computing
# cophenetic correlation. Note increased time and space complexity.
fctr = nimfa.mf(V, method="bmf", max_iter=10, rank=30, n_run=3, track_factor=True)
fctr_res = nimfa.mf_run(fctr)

print "Distance Kullback-Leibler: %5.3e" % fctr_res.distance(metric="kl")

sm = fctr_res.summary()
print "Rss: %8.3f" % sm['rss']
print "Evar: %8.3f" % sm['evar']
print "Iterations: %d" % sm['n_iter']
print "cophenetic: %8.3f" % sm['cophenetic']

Standard NMF - Euclidean update equations and fixed initialization (passed matrix factors).

References

[Schmidt2009](1, 2) Mikkel N. Schmidt, Ole Winther, and Lars K. Hansen. Bayesian non-negative matrix factorization. In Proceedings of the 9th International Conference on Independent Component Analysis and Signal Separation, pages 540-547, Paraty, Brazil, 2009.
[Zhang2007]Zhongyuan Zhang, Tao Li, Chris H. Q. Ding and Xiangsun Zhang. Binary Matrix Factorization with applications. In Proceedings of 7th IEEE International Conference on Data Mining, pages 391-400, Omaha, USA, 2007.
[Wang2004]Yuan Wang, Yunde Jia, Changbo Hu and Matthew Turk. Fisher non-negative matrix factorization for learning local features. In Proceedings of the 6th Asian Conference on Computer Vision, pages 27-30, Jeju, Korea, 2004.
[Li2001]Stan Z. Li, Xinwen Huo, Hongjiang Zhang and Qian S. Cheng. Learning spatially localized, parts-based representation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 207-212, Kauai, USA, 2001.
[Lin2007](1, 2) Chin J. Lin. Projected gradient methods for nonnegative matrix factorization. Neural Computation, 19(10): 2756-2779, 2007.
[Lee2001]Daniel D. Lee and H. Sebastian Seung. Algorithms for non-negative matrix factorization. In Proceedings of the Neural Information Processing Systems, pages 556-562, Vancouver, Canada, 2001.
[Lee1999]Daniel D. Lee and H. Sebastian Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 401(6755): 788-791, 1999.
[Brunet2004](1, 2) Jean-P. Brunet, Pablo Tamayo, Todd R. Golub and Jill P. Mesirov. Metagenes and molecular pattern discovery using matrix factorization. In Proceedings of the National Academy of Sciences of the USA, 101(12): 4164-4169, 2004.
[Montano2006]Alberto Pascual-Montano, J. M. Carazo, Kieko Kochi, Dietrich Lehmann and Roberto D. Pascual-Marqui. Nonsmooth nonnegative matrix factorization (nsnmf). In IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(3): 403-415, 2006.
[Laurberg2008]Hans Laurberg, Mads G. Christensen, Mark D. Plumbley, Lars K. Hansen and Soren H. Jensen. Theorems on positive data: on the uniqueness of NMF. Computational Intelligence and Neuroscience, doi: 10.1155/2008/764206, 2008.
[Hansen2008]Lars K. Hansen. Generalization in high-dimensional factor models. Web: http://www.stanford.edu/group/mmds/slides2008/hansen.pdf, 2008.
[Dueck2005]Delbert Dueck, Quaid D. Morris and Brendan J. Frey. Multi-way clustering of microarray data using probabilistic sparse matrix factorization. Bioinformatics, 21(Suppl 1): 144-151, 2005.
[Dueck2004]Delbert Dueck and Brendan J. Frey. Probabilistic sparse matrix factorization. University of Toronto Technical Report PSI-2004-23, Probabilistic and Statistical Inference Group, University of Toronto, 2004.
[Srebro2001]Nathan Srebro and Tommi Jaakkola. Sparse matrix Factorization of gene expression data. Artificial Intelligence Laboratory, Massachusetts Institute of Technology, 2001.
[Li2007]Huan Li, Yu Sun and Ming Zhan. The discovery of transcriptional modules by a two-stage matrix decomposition approach. Bioinformatics, 23(4): 473-479, 2007.
[Park2007](1, 2, 3, 4, 5, 6) Hyuonsoo Kim and Haesun Park. Sparse non-negative matrix factorizations via alternating non-negativity-constrained least squares for microarray data analysis. Bioinformatics, 23(12): 1495-1502, 2007.
[Zhang2011]Shihua Zhang, Qingjiao Li and Xianghong J. Zhou. A novel computational framework for simultaneous integration of multiple types of genomic data to identify microRNA-gene regulatory modules. Bioinformatics, 27(13): 401-409, 2011.
[Boutsidis2007]Christos Boutsidis and Efstratios Gallopoulos. SVD-based initialization: A head start for nonnegative matrix factorization. Pattern Recognition, 41(4): 1350-1362, 2008.
[Albright2006](1, 2, 3) Russell Albright, Carl D. Meyer and Amy N. Langville. Algorithms, initializations, and convergence for the nonnegative matrix factorization. NCSU Technical Report Math 81706, NC State University, Releigh, USA, 2006.
[Hoyer2004]Patrik O. Hoyer. Non-negative matrix factorization with sparseness constraints. Journal of Machine Learning Research, 5: 1457-1469, 2004.
[Hutchins2008]Lucie N. Hutchins, Sean P. Murphy, Priyam Singh and Joel H. Graber. Position-dependent motif characterization using non-negative matrix factorization. Bioinformatics, 24(23): 2684-2690, 2008.
[Frigyesi2008]Attila Frigyesi and Mattias Hoglund. Non-negative matrix factorization for the analysis of complex gene expression data: identification of clinically relevant tumor subtypes. Cancer Informatics, 6: 275-292, 2008.
[FWang2008]Fei Wang, Tao Li, Changshui Zhang. Semi-Supervised Clustering via Matrix Factorization. SDM 2008, 1-12, 2008.
[Schietgat2010]Leander Schietgat, Celine Vens, Jan Struyf, Hendrik Blockeel, Dragi Kocev and Saso Dzeroski. Predicting gene function using hierarchical multi-label decision tree ensembles. BMC Bioinformatics, 11(2), 2010.
[Schachtner2008]
  1. Schachtner, D. Lutter, P. Knollmueller, A. M. Tome, F. J. Theis, G. Schmitz, M. Stetter, P. Gomez Vilda and E. W. Lang. Knowledge-based gene expression classification via matrix factorization. Bioinformatics, 24(15): 1688-1697, 2008.

Acknowledgements

We would like to acknowledge support for this project from the Google Summer of Code 2011 program and from the Slovenian Research Agency grants P2-0209, J2-9699, and L2-1112.

The nimfa - A Python Library for Nonnegative Matrix Factorization Techniques was part of the Google Summer of Code 2011 (Orange). It is authored by Marinka Zitnik, advised by Blaz Zupan, Ph.D..

_images/orange-logo.png _images/google-logo.png

Citation

@article{ZitnikZ12,
  author    = {Marinka Zitnik and Blaz Zupan},
  title     = {NIMFA: A Python Library for Nonnegative Matrix Factorization},
  journal   = {Journal of Machine Learning Research},
  volume    = {13},
  year      = {2012},
  pages     = {849-853},
}

Disclaimer

This software and data is provided as-is, and there are no guarantees that it fits your purposes or that it is bug-free. Use it at your own risk!

License

nimfa - A Python Library for Nonnegative Matrix Factorization Techniques Copyright (C) 2011-2014 Marinka Zitnik and Blaz Zupan

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.