Snmnmf (methods.factorization.snmnmf)

Sparse Network-Regularized Multiple Nonnegative Matrix Factorization (SNMNMF) [Zhang2011].

It is a semi-supervised learning method with constraints (e.g. in comodule identification, any variables linked in A or B are more likely to be placed in the same comodule) that improve relevance and narrow down the search space.

The advantage of this method is that it integrates multiple matrices for multiple types of variables (standard NMF methods can be applied only to a target matrix containing one type of variable) together with prior knowledge (e.g. a network representing relationships among variables).

The objective function in [Zhang2011] has three components:
  1. first component models miRNA and gene expression profiles;
  2. second component models gene-gene network interactions;
  3. third component models predicted miRNA-gene interactions.

The inputs for the SNMNMF are:
  1. two sets of expression profiles (represented by the matrices V and V1 of shape s x m, s x n, respectively) for miRNA and genes measured on the same set of samples;
  2. (PRIOR KNOWLEDGE) a gene-gene interaction network (represented by the matrix A of shape n x n), including protein-protein interactions and DNA-protein interactions; the network is presented in the form of the adjacency matrix of gene network;
  3. (PRIOR KNOWLEDGE) a list of predicted miRNA-gene regulatory interactions (represented by the matrix B of shape m x n) based on sequence data; the network is presented in the form of the adjacency matrix of a bipartite miRNA-gene network.

Network-regularized constraints are used to enforce “must-link” constraints and to ensure that genes with known interactions have similar coefficient profiles.

Gene and miRNA expression matrices are simultaneously factored into a common basis matrix (W) and two coefficient matrices (H and H1). Additional knowledge is incorporated into this framework with network-regularized constraints. The imposed sparsity constraints yield an easily interpretable solution. In [Zhang2011] the decomposed matrix components are used to provide information about miRNA-gene regulatory comodules. The authors identified the comodules based on shared components (a column in basis matrix W) with significant association values in the corresponding rows of the coefficient matrices, H1 and H2 in their notation.
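The shapes involved in this shared-basis factorization can be sketched with plain NumPy. This is only a shape illustration with random nonnegative factors, not a fit; the sizes s, m, n and k are arbitrary values chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
s, m, n, k = 40, 100, 200, 10   # samples, miRNAs, genes, rank (illustrative values)

W = rng.random((s, k))    # common basis matrix, shared by both data sets
H = rng.random((k, m))    # coefficient matrix for the miRNA profiles
H1 = rng.random((k, n))   # coefficient matrix for the gene profiles

V_hat = W @ H     # estimate of V, shape (s, m)
V1_hat = W @ H1   # estimate of V1, shape (s, n)
```

Column j of W together with row j of H and row j of H1 describes one shared component.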

In SNMNMF a strategy suggested by Kim and Park (2007) is adopted to make the coefficient matrices sparse.

Note

In [Zhang2011] the H1 and H2 notation corresponds to H and H1 here, respectively.

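import numpy as np
import scipy.sparse as sp

import nimfa

# Two expression matrices measured on the same 40 samples.
V = np.random.rand(40, 100)   # miRNA profiles (s x m)
V1 = np.random.rand(40, 200)  # gene profiles (s x n)
# A and B are empty sparse matrices here, i.e. no prior knowledge is supplied.
snmnmf = nimfa.Snmnmf(V=V, V1=V1, seed="random_c", rank=10, max_iter=12,
                      A=sp.csr_matrix((V1.shape[1], V1.shape[1])),
                      B=sp.csr_matrix((V.shape[1], V1.shape[1])), gamma=0.01,
                      gamma_1=0.01, lamb=0.01, lamb_1=0.01)
# Calling the model runs the factorization and returns the fitted model.
snmnmf_fit = snmnmf()
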
class nimfa.methods.factorization.snmnmf.Snmnmf(V, V1, seed=None, W=None, H=None, H1=None, rank=30, max_iter=30, min_residuals=1e-05, test_conv=None, n_run=1, callback=None, callback_init=None, track_factor=False, track_error=False, A=None, B=None, gamma=0.01, gamma_1=0.01, lamb=0.01, lamb_1=0.01, **options)

Bases: nimfa.models.nmf_mm.Nmf_mm

Parameters:
  • V (Instance of the scipy.sparse sparse matrices types, numpy.ndarray, numpy.matrix or tuple of instances of the latter classes.) – The target matrix to estimate.
  • V1 (Instance of the scipy.sparse sparse matrices types, numpy.ndarray, numpy.matrix or tuple of instances of the latter classes.) – The target matrix to estimate. Used by algorithms that consider more than one target matrix.
  • seed (str naming the method or methods.seeding.nndsvd.Nndsvd or None) – Specify the method to seed the computation of a factorization. If specified, W and H must be None. If neither a seeding method nor an initial fixed factorization is specified, random initialization is used.
  • W (scipy.sparse or numpy.ndarray or numpy.matrix or None) – Specify initial factorization of basis matrix W. Default is None. When specified, seed must be None.
  • H (Instance of the scipy.sparse sparse matrices types, numpy.ndarray, numpy.matrix, tuple of instances of the latter classes or None) – Specify initial factorization of mixture matrix H. Default is None. When specified, seed must be None.
  • rank (int) – The factorization rank to achieve. Default is 30.
  • n_run (int) – It specifies the number of runs of the algorithm. Default is 1. If multiple runs are performed, fitted factorization model with the lowest objective function value is retained.
  • callback (function) – Pass a callback function that is called after each run when performing multiple runs. This is useful if one wants to save summary measures or process the result before it gets discarded. The callback function is called with only one argument models.mf_fit.Mf_fit that contains the fitted model. Default is None.
  • callback_init (function) – Pass a callback function that is called after each initialization of the matrix factors. In case of multiple runs the function is called before each run (more precisely after initialization and before the factorization of each run). In case of single run, the passed callback function is called after the only initialization of the matrix factors. This is useful if one wants to obtain the initialized matrix factors for further analysis or additional info about initialized factorization model. The callback function is called with only one argument models.mf_fit.Mf_fit that (among others) contains also initialized matrix factors. Default is None.
  • track_factor (bool) – When track_factor is specified, the fitted factorization model is tracked during multiple runs of the algorithm. This option is taken into account only when multiple runs are executed (n_run > 1). From each run of the factorization all matrix factors are retained, which can be very space consuming. If space is a concern, setting a callback function with callback is advised; it is executed after each run. Tracking is useful for computing some quality or performance measures (e.g. cophenetic correlation, consensus matrix, dispersion). By default the fitted model is not tracked.
  • track_error (bool) – Tracking the residuals error. Only the residuals from each iteration of the factorization are retained. Error tracking is not space consuming. By default residuals are not tracked and only the final residuals are saved. It can be used for plotting the trajectory of the residuals.
  • A (scipy.sparse of format csr, csc, coo, bsr, dok, lil, dia or numpy.matrix) – Adjacency matrix of gene-gene interaction network (dimension: V1.shape[1] x V1.shape[1]).
  • B (scipy.sparse of format csr, csc, coo, bsr, dok, lil, dia or numpy.matrix) – Adjacency matrix of a bipartite miRNA-gene network, predicted miRNA-target interactions (dimension: V.shape[1] x V1.shape[1]).
  • gamma (float) – Limit the growth of the basis matrix (W). Default is 0.01.
  • gamma_1 (float) – Encourage sparsity of the mixture (coefficient) matrices (H and H1). Default is 0.01.
  • lamb (float) – Weight for the must-link constraints defined in A. Default is 0.01.
  • lamb_1 (float) – Weight for the must-link constraints defined in B. Default is 0.01.

Stopping criterion

Factorization terminates if any of the specified criteria is satisfied.

Parameters:
  • max_iter (int) – Maximum number of factorization iterations. Note that the number of iterations depends on the speed of method convergence. Default is 30.
  • min_residuals (float) – Minimal required improvement of the residuals from the previous iteration. They are computed between the target matrix and its MF estimate using the objective function associated with the MF algorithm. Default is 1e-05.
  • test_conv (int) – It indicates how often convergence test is done. By default convergence is tested each iteration.
basis()

Return the matrix of basis vectors.

coef(idx)

Return the matrix of mixture coefficients.

Parameters:idx (str with values ‘coef’ or ‘coef1’ (int value of 0 or 1, respectively)) – Name of the coefficient matrix.
connectivity(H=None, idx=None)

Compute the connectivity matrix for the samples based on their mixture coefficients.

The connectivity matrix C is a symmetric matrix which shows the shared membership of the samples: entry C_ij is 1 iff sample i and sample j belong to the same cluster, 0 otherwise. Sample assignment is determined by its largest metagene expression value.

Return connectivity matrix.

Parameters:idx (None or str with values ‘coef’ or ‘coef1’ (int value of 0 or 1, respectively)) – Used in the multiple NMF model. In factorizations following standard NMF model or nonsmooth NMF model idx is always None.
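The connectivity rule described above can be sketched in NumPy as follows. This is a standalone illustration, not nimfa's implementation; the function name and the small matrix H are made up for the example.

```python
import numpy as np

def connectivity(H):
    """Return a 0/1 matrix whose entry (i, j) is 1 when samples i and j
    (columns of H) have their largest coefficient in the same row."""
    labels = np.argmax(np.asarray(H), axis=0)   # cluster index of each sample
    return (labels[:, None] == labels[None, :]).astype(int)

# Three samples, two components: samples 0 and 1 share component 0.
H = np.array([[0.9, 0.8, 0.1],
              [0.1, 0.2, 0.7]])
C = connectivity(H)
```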
consensus(idx=None)

Compute consensus matrix as the mean connectivity matrix across multiple runs of the factorization. It has been proposed by [Brunet2004] to help visualize and measure the stability of the clusters obtained by NMF.

Tracking of matrix factors across multiple runs must be enabled for computing consensus matrix. For results of a single NMF run, the consensus matrix reduces to the connectivity matrix.

Parameters:idx (None or str with values ‘coef’ or ‘coef1’ (int value of 0 or 1, respectively)) – Used in the multiple NMF model. In factorizations following standard NMF model or nonsmooth NMF model idx is always None.
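As a rough sketch of the averaging step, assuming the coefficient matrices from each run have been kept, the consensus matrix is the element-wise mean of the per-run connectivity matrices. The helper name and the two example matrices are hypothetical.

```python
import numpy as np

def consensus(coef_matrices):
    """Mean connectivity matrix over the coefficient matrices of several runs."""
    conn = []
    for H in coef_matrices:
        labels = np.argmax(np.asarray(H), axis=0)
        conn.append((labels[:, None] == labels[None, :]).astype(float))
    return np.mean(conn, axis=0)

# Two runs that disagree on the cluster of the middle sample.
H_run1 = np.array([[0.9, 0.8, 0.1],
                   [0.1, 0.2, 0.7]])
H_run2 = np.array([[0.9, 0.3, 0.1],
                   [0.1, 0.7, 0.9]])
C_bar = consensus([H_run1, H_run2])
```

Entries of 0.5 mark sample pairs that were clustered together in only one of the two runs.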
coph_cor(idx=None)

Compute cophenetic correlation coefficient of consensus matrix, generally obtained from multiple NMF runs.

The cophenetic correlation coefficient is a measure which indicates the dispersion of the consensus matrix and is based on the average of connectivity matrices. It measures the stability of the clusters obtained from NMF. It is computed as the Pearson correlation of two distance matrices: the first is the distance between samples induced by the consensus matrix; the second is the distance between samples induced by the linkage used in the reordering of the consensus matrix [Brunet2004].

Return real number. In a perfect consensus matrix, cophenetic correlation equals 1. When the entries in consensus matrix are scattered between 0 and 1, the cophenetic correlation is < 1. We observe how this coefficient changes as factorization rank increases. We select the first rank, where the magnitude of the cophenetic correlation coefficient begins to fall [Brunet2004].

Parameters:idx (None or str with values ‘coef’ or ‘coef1’ (int value of 0 or 1, respectively)) – Used in the multiple NMF model. In factorizations following standard NMF model or nonsmooth NMF model idx is always None.
dim(idx=None)

Return triple containing the dimension of the target matrix and matrix factorization rank.

Parameters:idx (None or str with values ‘coef’ or ‘coef1’ (int value of 0 or 1, respectively)) – Used in the multiple NMF model. In factorizations following standard NMF model or nonsmooth NMF model idx is always None.
dispersion(idx=None)

Compute dispersion coefficient of consensus matrix.

Dispersion coefficient [Park2007] measures the reproducibility of clusters obtained from multiple NMF runs.

Return a real value in [0,1]. Dispersion is 1 for a perfect consensus matrix and approaches 0 for a scattered consensus matrix.

Parameters:idx (None or str with values ‘coef’ or ‘coef1’ (int value of 0 or 1, respectively)) – Used in the multiple NMF model. In standard NMF model or nonsmooth NMF model idx is always None.
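A minimal sketch of the dispersion coefficient, assuming the commonly cited form from [Park2007] (the mean of 4 (C_ij - 1/2)^2 over all entries of the consensus matrix C). The function name is hypothetical and nimfa's implementation may differ in detail.

```python
import numpy as np

def dispersion(C):
    """Dispersion of a square consensus matrix C with entries in [0, 1]."""
    C = np.asarray(C, dtype=float)
    # Each term is 1 when C_ij is exactly 0 or 1, and 0 when C_ij is 0.5.
    return float(np.sum(4.0 * (C - 0.5) ** 2) / C.size)

perfect = np.array([[1.0, 0.0],
                    [0.0, 1.0]])
scattered = np.full((2, 2), 0.5)
```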
distance(metric='euclidean', idx=None)

Return the loss function value.

Parameters:
  • metric (str with values ‘euclidean’ or ‘kl’) – Specify distance metric to be used. Possible are Euclidean and Kullback-Leibler (KL) divergence. Strictly, KL is not a metric.
  • idx (str with values ‘coef’ or ‘coef1’ (int value of 0 or 1, respectively)) – Name of the coefficient matrix.
entropy(membership=None, idx=None)

Compute the entropy of the NMF model given a priori known groups of samples [Park2007].

The entropy is a measure of the performance of a clustering method in recovering classes defined by a list of a priori known (true) class labels.

Return the real number. The smaller the entropy, the better the clustering performance.

Parameters:
  • membership (list) – Specify known class membership for each sample.
  • idx (None or str with values ‘coef’ or ‘coef1’ (int value of 0 or 1, respectively)) – Used in the multiple NMF model. In factorizations following standard NMF model or nonsmooth NMF model idx is always None.
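The entropy measure can be sketched as below, assuming the definition usually attributed to [Park2007]: E = -1 / (n log2 q) * sum over clusters k and classes j of n_kj log2(n_kj / n_k), where q is the number of classes. The function takes cluster assignments explicitly, which is a simplification made for this example; the name is hypothetical.

```python
import numpy as np

def clustering_entropy(clusters, membership):
    """Entropy of a clustering against known class labels (needs >= 2 classes)."""
    clusters = np.asarray(clusters)
    membership = np.asarray(membership)
    classes = np.unique(membership)
    total = 0.0
    for k in np.unique(clusters):
        in_k = membership[clusters == k]          # true labels inside cluster k
        for j in classes:
            n_kj = np.sum(in_k == j)
            if n_kj > 0:                          # 0 * log(0) is treated as 0
                total += n_kj * np.log2(n_kj / in_k.size)
    return float(-total / (membership.size * np.log2(classes.size)))
```

A clustering that reproduces the classes exactly gives 0, while clusters that mix two classes evenly give 1.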
estimate_rank(rank_range=[30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50], n_run=10, idx=0, what='all')

Choosing factorization parameters carefully is vital for success of a factorization. However, the most critical parameter is factorization rank. This method tries different values for ranks, performs factorizations, computes some quality measures of the results and chooses the best value according to [Brunet2004] and [Hutchins2008].

Note

The process of rank estimation can be lengthy.

Note

Matrix factors are tracked during rank estimation. This is needed for computing cophenetic correlation coefficient.

Return a dict (keys are values of rank from range, values are `dict`s of measures) of quality measures for each value in rank’s range. This can be passed to the visualization model, from which estimated rank can be established.

Parameters:
  • rank_range (list or tuple like range of int) – Range of factorization ranks to try. Default is range(30, 51).
  • n_run (int) – The number of runs to be performed for each value in range. Default is 10.
  • what (list or tuple like of str) –

    Specify quality measures of the results computed for each rank. By default, a summary of the fitted factorization model is computed. Alternatively, the user can supply a list of strings matching some of the following quality measures:

    • sparseness
    • rss
    • evar
    • residuals
    • connectivity
    • dispersion
    • cophenetic
    • consensus
    • euclidean
    • kl
  • idx (str or int) – Name of the coefficient matrix. Used only in the multiple NMF model. Default is 0 (first coefficient matrix).
evar(idx=None)

Compute the explained variance of the NMF estimate of the target matrix.

This measure can be used for comparing the ability of models to accurately reproduce the original target matrix. Some methods specifically aim at minimizing the RSS and maximizing the explained variance while others do not, which one should note when using this measure.

Parameters:idx (None or str with values ‘coef’ or ‘coef1’ (int value of 0 or 1, respectively)) – Used in the multiple NMF model. In factorizations following standard NMF model or nonsmooth NMF model idx is always None.
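A minimal sketch of explained variance, assuming it is defined as 1 - RSS / sum(V_ij^2) with RSS the residual sum of squares of the estimate W H. The helper names are hypothetical, and the normalization is an assumption rather than a statement about nimfa's code.

```python
import numpy as np

def rss(V, W, H):
    """Residual sum of squares between V and its estimate W H."""
    R = np.asarray(V, dtype=float) - np.asarray(W) @ np.asarray(H)
    return float(np.sum(R ** 2))

def evar(V, W, H):
    """Explained variance: 1 minus RSS relative to the energy of V."""
    V = np.asarray(V, dtype=float)
    return 1.0 - rss(V, W, H) / float(np.sum(V ** 2))

rng = np.random.default_rng(0)
W = rng.random((6, 2))
H = rng.random((2, 5))
V = W @ H   # an exactly factorizable target
```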
factorize()

Compute matrix factorization.

Return fitted factorization model.

fitted(idx)

Compute the estimated target matrix according to the multiple NMF model.

Parameters:idx (str with values ‘coef’ or ‘coef1’ (int value of 0 or 1, respectively)) – Name of the coefficient matrix.
is_satisfied(p_obj, c_obj, iter)

Compute the satisfiability of the stopping criteria based on stopping parameters and objective function value.

Return logical value denoting factorization continuation.

Parameters:
  • p_obj (float) – Objective function value from previous iteration.
  • c_obj (float) – Current objective function value.
  • iter (int) – Current iteration number.
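The interaction of max_iter, min_residuals and test_conv can be sketched as a plain function. This is an assumed reading of the stopping parameters documented above, not nimfa's actual is_satisfied(); the function name is hypothetical.

```python
def should_continue(p_obj, c_obj, iteration, max_iter=30,
                    min_residuals=1e-5, test_conv=None):
    """Return True if the factorization should run another iteration."""
    # Stop once the iteration budget is used up.
    if max_iter is not None and iteration >= max_iter:
        return False
    # Skip the convergence test on iterations that are not multiples of test_conv.
    if test_conv and iteration % test_conv != 0:
        return True
    # Stop when the objective improved by less than min_residuals.
    if iteration > 0 and min_residuals is not None and p_obj - c_obj < min_residuals:
        return False
    return True
```

With test_conv=2, for example, the improvement check is skipped on odd iterations, so a stalled objective does not stop the run there.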
objective()

Compute the three-component objective function as defined in [Zhang2011].
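A rough NumPy sketch of an objective with this three-component structure follows, using the notation of this page (V, V1, W, H, H1, A, B). The network terms enter with a negative sign because they are rewarded. The exact form of the two penalty terms (squared Frobenius norm of W, squared L1 norms of the coefficient columns) is an assumption in the spirit of the Kim and Park regularization; nimfa's objective() may differ in detail.

```python
import numpy as np

def snmnmf_objective(V, V1, W, H, H1, A, B,
                     gamma=0.01, gamma_1=0.01, lamb=0.01, lamb_1=0.01):
    """Illustrative SNMNMF-style objective (not nimfa's implementation)."""
    # Component 1: reconstruction error of both expression matrices.
    fit = np.sum((V - W @ H) ** 2) + np.sum((V1 - W @ H1) ** 2)
    # Component 2: gene-gene network term, Tr(H1 A H1^T).
    gene_net = np.trace(H1 @ A @ H1.T)
    # Component 3: predicted miRNA-gene interaction term, Tr(H B H1^T).
    mirna_gene = np.trace(H @ B @ H1.T)
    # Assumed penalties: growth of W and sparsity of the coefficient columns.
    w_pen = np.sum(W ** 2)
    h_pen = (np.sum(np.sum(np.abs(H), axis=0) ** 2)
             + np.sum(np.sum(np.abs(H1), axis=0) ** 2))
    return (fit - lamb * gene_net - lamb_1 * mirna_gene
            + gamma * w_pen + gamma_1 * h_pen)
```

The sketch expects dense NumPy arrays; sparse A or B would need to be converted first (for example with .toarray()).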

predict(what='samples', prob=False, idx=None)

Compute the dominant basis components. The dominant basis component is computed as the row index for which the entry is the maximum within the column.

If prob is not specified, a list is returned which contains the computed index for each sample (feature). Otherwise a tuple is returned where the first element is a list as specified before and the second element is a list of associated probabilities, i.e. the relative contribution of the maximum entry within each column.

Parameters:
  • what (str) – Specify target for dominant basis components computation. Two values are possible, ‘samples’ or ‘features’. When what=’samples’ is specified, dominant basis component for each sample is determined based on its associated entries in the mixture coefficient matrix (H). When what=’features’ computation is performed on the transposed basis matrix (W.T).
  • prob (bool equivalent) – Specify dominant basis components probability inclusion.
  • idx (None or str with values ‘coef’ or ‘coef1’ (int value of 0 or 1, respectively)) – Used in the multiple NMF model. In factorizations following standard NMF model or nonsmooth NMF model idx is always None.
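The dominant-component rule can be sketched in NumPy for the what='samples' case. The "probability" is taken here as the maximum entry divided by the column sum, which is one reading of "relative contribution" and should be treated as an assumption; the function name is hypothetical.

```python
import numpy as np

def dominant_components(H, prob=False):
    """Row index of the maximum entry in each column of H, optionally with
    the share of the column total contributed by that entry."""
    H = np.asarray(H, dtype=float)
    idx = np.argmax(H, axis=0)
    if not prob:
        return idx.tolist()
    share = H.max(axis=0) / H.sum(axis=0)
    return idx.tolist(), share.tolist()

H = np.array([[0.9, 0.2],
              [0.1, 0.6]])
labels, shares = dominant_components(H, prob=True)
```

For what='features' the same rule would be applied to the transposed basis matrix (W.T).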
purity(membership=None, idx=None)

Compute the purity given a priori known groups of samples [Park2007].

The purity is a measure of the performance of a clustering method in recovering classes defined by a list of a priori known (true) class labels.

Return the real number in [0,1]. The larger the purity, the better the clustering performance.

Parameters:
  • membership (list) – Specify known class membership for each sample.
  • idx (None or str with values ‘coef’ or ‘coef1’ (int value of 0 or 1, respectively)) – Used in the multiple NMF model. In factorizations following standard NMF model or nonsmooth NMF model idx is always None.
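Purity can be sketched as the fraction of samples that belong to the majority class of their cluster, which is the usual definition associated with [Park2007]. As with the other sketches, cluster assignments are passed in explicitly and the function name is hypothetical.

```python
import numpy as np

def clustering_purity(clusters, membership):
    """Fraction of samples falling in the majority class of their cluster."""
    clusters = np.asarray(clusters)
    membership = np.asarray(membership)
    hits = 0
    for k in np.unique(clusters):
        # Count how often each true label occurs inside cluster k.
        _, counts = np.unique(membership[clusters == k], return_counts=True)
        hits += counts.max()
    return float(hits / membership.size)
```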
residuals(idx)

Return residuals matrix between the target matrix and its multiple NMF estimate.

Parameters:idx (str with values ‘coef’ or ‘coef1’ (int value of 0 or 1, respectively)) – Name of the coefficient matrix.
rss(idx=None)

Compute Residual Sum of Squares (RSS) between NMF estimate and target matrix [Hutchins2008].

This measure can be used to estimate optimal factorization rank. [Hutchins2008] suggested to choose the first value where the RSS curve presents an inflection point. [Frigyesi2008] suggested to use the smallest value at which the decrease in the RSS is lower than the decrease of the RSS obtained from random data.

RSS tells us how much of the variation in the dependent variables our model did not explain.

Return real value.

Parameters:idx (None or str with values ‘coef’ or ‘coef1’ (int value of 0 or 1, respectively)) – Used in the multiple NMF model. In factorizations following standard NMF model or nonsmooth NMF model idx is always None.
score_features(idx=None)

Score features in terms of their specificity to the basis vectors [Park2007].

A row vector of the basis matrix (W) indicates contributions of a feature to the r (i.e. columns of W) latent components. It might be informative to investigate features that have strong component-specific membership values to the latent components.

Return array with feature scores. Feature scores are real-valued from interval [0,1]. Higher value indicates greater feature specificity.

Parameters:idx (None or str with values ‘coef’ or ‘coef1’ (int value of 0 or 1, respectively)) – Used in the multiple NMF model. In standard NMF model or nonsmooth NMF model idx is always None.
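A sketch of an entropy-based feature score, assuming the form commonly attributed to [Park2007]: score(i) = 1 + (1 / log2 k) * sum_q p(i, q) log2 p(i, q), where p(i, q) is the share of row i of W that falls on component q. The function name is hypothetical; the sketch needs at least two components and rows with a positive sum.

```python
import numpy as np

def score_features(W):
    """Specificity score in [0, 1] for each row (feature) of the basis matrix W."""
    W = np.asarray(W, dtype=float)
    k = W.shape[1]
    P = W / W.sum(axis=1, keepdims=True)        # row-wise shares p(i, q)
    with np.errstate(divide="ignore", invalid="ignore"):
        plogp = np.where(P > 0, P * np.log2(P), 0.0)   # treat 0 * log(0) as 0
    return 1.0 + plogp.sum(axis=1) / np.log2(k)

W = np.array([[1.0, 0.0],    # loads on one component only
              [1.0, 1.0]])   # spread evenly over both components
scores = score_features(W)
```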
select_features(idx=None)

Compute the most basis-specific features for each basis vector [Park2007].

[Park2007] scoring schema and feature selection method is used. The features are first scored using the score_features(). Then only the features that fulfill both the following criteria are retained:

  1. score greater than u + 3s, where u and s are the median and the median absolute deviation (MAD) of the scores, resp.,
  2. the maximum contribution to a basis component (i.e. the maximal value in the corresponding row of the basis matrix (W)) is larger than the median of all contributions (i.e. of all elements of basis matrix (W)).

Return a boolean array indicating whether features were selected.

Parameters:idx (None or str with values ‘coef’ or ‘coef1’ (int value of 0 or 1, respectively)) – Used in the multiple NMF model. In standard NMF model or nonsmooth NMF model idx is always None.
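The two selection criteria can be sketched in NumPy as follows. Feature scores are passed in as an argument to keep the example standalone; the function name and the toy data are hypothetical.

```python
import numpy as np

def select_features(W, scores):
    """Boolean mask of features passing both selection criteria."""
    W = np.asarray(W, dtype=float)
    scores = np.asarray(scores, dtype=float)
    u = np.median(scores)                     # median of the scores
    s = np.median(np.abs(scores - u))         # median absolute deviation
    high_score = scores > u + 3.0 * s         # criterion 1
    strong = W.max(axis=1) > np.median(W)     # criterion 2
    return high_score & strong

W = np.array([[1.0, 1.0], [1.0, 1.0], [1.0, 1.0], [1.0, 1.0], [5.0, 0.0]])
scores = np.array([0.1, 0.1, 0.1, 0.1, 0.9])
mask = select_features(W, scores)
```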
sparseness(idx=None)

Compute sparseness of matrix (basis vectors matrix, mixture coefficients) [Hoyer2004].

Sparseness of a vector quantifies how much energy is packed into its components. The sparseness of a vector is a real number in [0, 1], where sparser vector has value closer to 1. Sparseness is 1 iff the vector contains a single nonzero component and is equal to 0 iff all components of the vector are equal.

Sparseness of a matrix is mean sparseness of its column vectors.

Return tuple that contains sparseness of the basis and mixture coefficients matrices.

Parameters:idx (None or str with values ‘coef’ or ‘coef1’ (int value of 0 or 1, respectively)) – Used in the multiple NMF model. In standard NMF model or nonsmooth NMF model idx is always None.
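The vector sparseness of [Hoyer2004] is (sqrt(n) - ||x||_1 / ||x||_2) / (sqrt(n) - 1), and the matrix value is the mean over columns, as stated above. A minimal NumPy sketch follows; the function names are hypothetical, and the formula is undefined for zero vectors or vectors of length 1.

```python
import numpy as np

def vector_sparseness(x):
    """Hoyer sparseness of a nonzero vector with more than one component."""
    x = np.asarray(x, dtype=float).ravel()
    n = x.size
    l1 = np.abs(x).sum()
    l2 = np.sqrt(np.sum(x ** 2))
    return float((np.sqrt(n) - l1 / l2) / (np.sqrt(n) - 1.0))

def matrix_sparseness(X):
    """Mean sparseness of the column vectors of X."""
    X = np.asarray(X, dtype=float)
    return float(np.mean([vector_sparseness(X[:, j]) for j in range(X.shape[1])]))
```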
target(idx)

Return the target matrix to estimate.

Parameters:idx (str with values ‘coef’ or ‘coef1’ (int value of 0 or 1, respectively)) – Name of the coefficient matrix.
update(iter)

Update basis and mixture matrix.
