Lsnmf (methods.factorization.lsnmf)

Alternating Nonnegative Least Squares Matrix Factorization Using Projected Gradient (bound constrained optimization) method for each subproblem (LSNMF) [Lin2007].

It converges faster than the popular multiplicative update approach.

The algorithm relies on efficiently solving bound constrained subproblems, which are solved using the projected gradient method. Each subproblem contains some (m) independent nonnegative least squares problems. Treating them together rather than solving them separately is better for two reasons: the problems are closely related and share the same constant matrices, and all operations are matrix based, which saves computational time.

The main task in each subproblem iteration is to find a step size alpha that satisfies the sufficient decrease condition of the bound constrained problem. In alternating least squares, each subproblem involves an optimization procedure and therefore requires a stopping condition. A common way to check whether the current solution is close to a stationary point is a condition on the norm of the projected gradient [Lin2007].
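As an illustration of the step size search described above, the following is a minimal NumPy sketch of a projected gradient solver for one subproblem (updating H with W fixed). It is a simplified variant and not nimfa's implementation: the function name, the sigma parameter and the cap of 20 backtracking steps are assumptions made for this example, and the step size is reset to 1 at every iteration.

```python
import numpy as np

def nnls_projected_gradient(V, W, H0, beta=0.1, sigma=0.01, max_iter=50):
    """Approximately minimize 0.5 * ||V - W H||_F^2 subject to H >= 0."""
    WtV = W.T @ V   # constant matrices shared by all columns of H
    WtW = W.T @ W
    H = H0.copy()
    for _ in range(max_iter):
        grad = WtW @ H - WtV
        alpha = 1.0
        for _ in range(20):  # backtracking cap (assumption of this sketch)
            H_new = np.maximum(H - alpha * grad, 0.0)  # project onto H >= 0
            d = H_new - H
            # Sufficient decrease f(H_new) - f(H) <= sigma * <grad, d>,
            # rewritten using the exact quadratic expansion of f.
            if (1 - sigma) * np.sum(grad * d) + 0.5 * np.sum(d * (WtW @ d)) <= 0:
                break
            alpha *= beta  # a smaller beta shrinks the step more aggressively
        H = H_new
    return H
```

All columns of H are updated together through matrix products, which is the reason given above for not solving the nonnegative least squares problems separately.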

# Example call of LSNMF with algorithm specific parameters set
# V is the nonnegative target matrix to factorize, defined beforehand
import nimfa

fctr = nimfa.mf(V,
              seed = "random_vcol",
              rank = 10,
              method = "lsnmf",
              max_iter = 12,
              initialize_only = True,
              sub_iter = 10,
              inner_sub_iter = 10,
              beta = 0.1)
fctr_res = nimfa.mf_run(fctr)
class nimfa.methods.factorization.lsnmf.Lsnmf(**params)

Bases: nimfa.models.nmf_std.Nmf_std

For detailed explanation of the general model parameters see mf_run.

If :param:`min_residuals` of the underlying model is not specified, it defaults to 1e-5. In LSNMF :param:`min_residuals` is used as an upper bound on the quotient of the projected gradient norm and the initial gradient norm (initial gradient of basis and mixture matrix). It serves as the tolerance of the stopping condition.
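A minimal sketch of such a stopping test is shown below, assuming the usual definition of the projected gradient for the constraint X >= 0 (gradient entries are kept where X is positive or where the gradient is negative). The helper names and the way the two norms are combined are assumptions for illustration, not nimfa's code.

```python
import numpy as np

def projected_gradient_norm(grad, X):
    """Frobenius norm of the projected gradient for the constraint X >= 0."""
    # Entries at the bound (X == 0) with nonnegative gradient are dropped,
    # because moving against the gradient there would leave the feasible set.
    mask = (grad < 0) | (X > 0)
    return np.linalg.norm(grad[mask])

def is_converged(grad_W, W, grad_H, H, init_grad_norm, min_residuals=1e-5):
    """True when the projected gradient norm is small relative to the initial gradient."""
    proj_norm = np.sqrt(projected_gradient_norm(grad_W, W) ** 2
                        + projected_gradient_norm(grad_H, H) ** 2)
    return proj_norm <= min_residuals * init_grad_norm
```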

The following are algorithm specific model options which can be passed with values as keyword arguments.

Parameters:
  • sub_iter (int) – Maximum number of subproblem iterations. Default value is 10.
  • inner_sub_iter (int) – Number of inner iterations when solving subproblems. Default value is 10.
  • beta (float) – The rate of reducing the step size to satisfy the sufficient decrease condition when solving subproblems. A smaller beta reduces the step size more aggressively, but may make the step size too small. Default value is 0.1.
basis()

Return the matrix of basis vectors.

coef(idx=None)

Return the matrix of mixture coefficients.

Parameters:
  • idx (None) – Used in the multiple NMF model. In standard NMF :param:`idx` is always None.
connectivity(H=None, idx=None)

Compute the connectivity matrix for the samples based on their mixture coefficients.

The connectivity matrix C is a symmetric matrix which shows the shared membership of the samples: entry C_ij is 1 iff sample i and sample j belong to the same cluster, 0 otherwise. Sample assignment is determined by its largest metagene expression value.

Return connectivity matrix.

Parameters:
  • idx (None or str with values ‘coef’ or ‘coef1’ (int value of 0 or 1, respectively)) – Used in the multiple NMF model. In factorizations following standard NMF model or nonsmooth NMF model :param:`idx` is always None.
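The definition above can be sketched in a few lines of NumPy. This is an illustrative stand-alone function (its name is an assumption), taking a mixture matrix H whose columns correspond to samples; it is not nimfa's implementation.

```python
import numpy as np

def connectivity_matrix(H):
    """C[i, j] is 1 if samples i and j share the same dominant row of H, else 0."""
    labels = np.argmax(H, axis=0)  # cluster of each sample = row with largest value
    return (labels[:, None] == labels[None, :]).astype(float)
```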
consensus(idx=None)

Compute consensus matrix as the mean connectivity matrix across multiple runs of the factorization. It has been proposed by [Brunet2004] to help visualize and measure the stability of the clusters obtained by NMF.

Tracking of matrix factors across multiple runs must be enabled for computing consensus matrix. For results of a single NMF run, the consensus matrix reduces to the connectivity matrix.

Parameters:
  • idx (None or str with values ‘coef’ or ‘coef1’ (int value of 0 or 1, respectively)) – Used in the multiple NMF model. In factorizations following standard NMF model or nonsmooth NMF model :param:`idx` is always None.
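As a rough illustration, the consensus matrix can be computed from the mixture matrices of several runs as below. The function name and the list-of-matrices input are assumptions of this sketch; nimfa obtains the factors through its own tracking mechanism.

```python
import numpy as np

def consensus_matrix(H_runs):
    """Mean connectivity matrix over the mixture matrices of several runs."""
    conn = []
    for H in H_runs:
        labels = np.argmax(H, axis=0)  # dominant basis component per sample
        conn.append((labels[:, None] == labels[None, :]).astype(float))
    return np.mean(conn, axis=0)
```

With a single run in H_runs the result is the connectivity matrix of that run, as stated above.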
coph_cor(idx=None)

Compute cophenetic correlation coefficient of consensus matrix, generally obtained from multiple NMF runs.

The cophenetic correlation coefficient is a measure that indicates the dispersion of the consensus matrix and is based on the average of connectivity matrices. It measures the stability of the clusters obtained from NMF. It is computed as the Pearson correlation of two distance matrices: the first is the distance between samples induced by the consensus matrix; the second is the distance between samples induced by the linkage used in the reordering of the consensus matrix [Brunet2004].

Return a real number. In a perfect consensus matrix, cophenetic correlation equals 1. When the entries in the consensus matrix are scattered between 0 and 1, the cophenetic correlation is < 1. To estimate the rank, observe how this coefficient changes as the factorization rank increases and select the first rank at which the magnitude of the cophenetic correlation coefficient begins to fall [Brunet2004].

Parameters:
  • idx (None or str with values ‘coef’ or ‘coef1’ (int value of 0 or 1, respectively)) – Used in the multiple NMF model. In factorizations following standard NMF model or nonsmooth NMF model :param:`idx` is always None.
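The computation described above can be sketched with SciPy's hierarchical clustering routines. The function name, the use of 1 - consensus as the distance and the choice of average linkage are assumptions of this example rather than a statement about nimfa's internals.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import squareform

def cophenetic_correlation(consensus):
    """Pearson correlation between consensus-induced and linkage-induced distances."""
    # Distance between samples induced by the consensus matrix (condensed form).
    dist = squareform(1.0 - consensus, checks=False)
    # Hierarchical clustering of the samples; average linkage is assumed here.
    Z = linkage(dist, method="average")
    corr, _ = cophenet(Z, dist)
    return corr
```

Note that the coefficient is undefined when all pairwise distances are equal, for example when every sample falls into one cluster.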
dim(idx=None)

Return triple containing the dimension of the target matrix and matrix factorization rank.

Parameters:
  • idx (None or str with values ‘coef’ or ‘coef1’ (int value of 0 or 1, respectively)) – Used in the multiple NMF model. In factorizations following standard NMF model or nonsmooth NMF model :param:`idx` is always None.
dispersion(idx=None)

Compute the dispersion coefficient of consensus matrix, generally obtained from multiple NMF runs.

The dispersion coefficient is based on the average of connectivity matrices [Park2007]. It measures the reproducibility of the clusters obtained from multiple NMF runs.

Return a real value in [0,1]. Dispersion is 1 only for a perfect consensus matrix, in which all entries are 0 or 1. A perfect consensus matrix is obtained only when all the connectivity matrices are the same, meaning that the algorithm gave the same clusters at each run.

Parameters:
  • idx (None or str with values ‘coef’ or ‘coef1’ (int value of 0 or 1, respectively)) – Used in the multiple NMF model. In factorizations following standard NMF model or nonsmooth NMF model :param:`idx` is always None.
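A small sketch of the dispersion coefficient follows, assuming the commonly used form in which each entry C_ij of the n x n consensus matrix contributes 4 (C_ij - 1/2)^2 and the sum is divided by n^2. The function name is hypothetical.

```python
import numpy as np

def dispersion_coefficient(consensus):
    """Average of 4 * (C_ij - 0.5)^2 over all entries of the consensus matrix."""
    n = consensus.shape[0]
    return np.sum(4.0 * (consensus - 0.5) ** 2) / (n * n)
```

Entries equal to 0 or 1 each contribute 1, while entries equal to 0.5 contribute 0, which matches the range and the perfect-consensus case described above.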
distance(metric='euclidean', idx=None)

Return the loss function value.

Parameters:
  • metric (str with values ‘euclidean’ or ‘kl’) – Specify the distance metric to be used. Possible values are Euclidean distance and Kullback-Leibler (KL) divergence. Strictly, KL is not a metric.
  • idx (None) – Used in the multiple NMF model. In standard NMF :param:`idx` is always None.
entropy(membership=None, idx=None)

Compute the entropy of the NMF model given a priori known groups of samples [Park2007].

The entropy is a measure of the performance of a clustering method in recovering classes defined by a list of a priori known true class labels.

Return a real number. The smaller the entropy, the better the clustering performance.

Parameters:
  • membership (list) – Specify known class membership for each sample.
  • idx (None or str with values ‘coef’ or ‘coef1’ (int value of 0 or 1, respectively)) – Used in the multiple NMF model. In factorizations following standard NMF model or nonsmooth NMF model :param:`idx` is always None.
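For illustration, the entropy can be sketched as below, assuming the definition -1 / (n log2 k) * sum over clusters q and classes j of n_qj * log2(n_qj / n_q), where n is the number of samples, k the number of known classes, n_q the size of cluster q and n_qj the number of samples of class j in cluster q. The function name and the explicit cluster_labels argument are assumptions of this example; at least two classes are required.

```python
import numpy as np

def clustering_entropy(cluster_labels, membership):
    """Entropy of a clustering against known class labels (0 is best)."""
    cluster_labels = np.asarray(cluster_labels)
    membership = np.asarray(membership)
    n = membership.size
    classes = np.unique(membership)
    total = 0.0
    for q in np.unique(cluster_labels):
        in_q = membership[cluster_labels == q]  # true classes of samples in cluster q
        for j in classes:
            n_qj = np.sum(in_q == j)
            if n_qj > 0:  # empty class/cluster pairs contribute nothing
                total += n_qj * np.log2(n_qj / in_q.size)
    return -total / (n * np.log2(classes.size))
```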
estimate_rank(range=xrange(30, 51), n_run=10, idx=0, what='all')

Choosing factorization parameters carefully is vital for success of a factorization. However, the most critical parameter is factorization rank. This method tries different values for ranks, performs factorizations, computes some quality measures of the results and chooses the best value according to [Brunet2004] and [Hutchins2008].

Note

The process of rank estimation can be lengthy.

Note

Matrix factors are tracked during rank estimation. This is needed for computing cophenetic correlation coefficient.

Return a dict (keys are values of rank from range, values are `dict`s of measures) of quality measures for each value in rank’s range. This can be passed to the visualization model, from which estimated rank can be established.

Parameters:
  • range (list or tuple like range of int) – Range of factorization ranks to try. Default is xrange(30, 51).
  • n_run (int) – The number of runs to be performed for each value in range. Default is 10.
  • what (list or tuple like of str) –

    Specify quality measures of the results computed for each rank. By default, a summary of the fitted factorization model is computed. Alternatively, the user can supply a list of strings matching some of the following quality measures:

    • sparseness
    • rss
    • evar
    • residuals
    • connectivity
    • dispersion
    • cophenetic
    • consensus
    • euclidean
    • kl
  • idx (str or int) – Name of the (coefficient) matrix. Used only in the multiple NMF model. Default is 0 (first coefficient matrix).
evar(idx=None)

Compute the explained variance of the NMF estimate of the target matrix.

This measure can be used to compare how accurately different models reproduce the original target matrix. Some methods specifically aim at minimizing the RSS and maximizing the explained variance while others do not, which should be kept in mind when using this measure.

Parameters:
  • idx (None or str with values ‘coef’ or ‘coef1’ (int value of 0 or 1, respectively)) – Used in the multiple NMF model. In factorizations following standard NMF model or nonsmooth NMF model :param:`idx` is always None.
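One common way to define the explained variance of an estimate is 1 - RSS / sum(V_ij^2). The sketch below assumes that definition; the function name and the explicit V, W, H arguments are illustrative only.

```python
import numpy as np

def explained_variance(V, W, H):
    """1 - RSS / sum(V**2) for the estimate W @ H of the target matrix V."""
    rss = np.sum((V - W @ H) ** 2)  # residual sum of squares
    return 1.0 - rss / np.sum(V ** 2)
```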
factorize()

Compute matrix factorization.

Return fitted factorization model.

fitted(idx=None)

Compute the estimated target matrix according to the NMF algorithm model.

Parameters:
  • idx (None) – Used in the multiple NMF model. In standard NMF :param:`idx` is always None.
is_satisfied(c_obj, iter)

Compute the satisfiability of the stopping criteria based on stopping parameters and objective function value.

Return logical value denoting factorization continuation.

Parameters:
  • c_obj (float) – Current objective function value.
  • iter (int) – Current iteration number.
objective()

Compute projected gradients norm.

predict(what='samples', prob=False, idx=None)

Compute the dominant basis components. The dominant basis component is computed as the row index for which the entry is the maximum within the column.

If :param:`prob` is not specified, a list is returned that contains the computed index for each sample (feature). Otherwise a tuple is returned whose first element is the list described above and whose second element is a list of associated probabilities, that is, the relative contribution of the maximum entry within each column.

Parameters:
  • what (str) – Specify target for dominant basis components computation. Two values are possible, ‘samples’ or ‘features’. When what=’samples’ is specified, dominant basis component for each sample is determined based on its associated entries in the mixture coefficient matrix (H). When what=’features’ computation is performed on the transposed basis matrix (W.T).
  • prob (bool equivalent) – Specify dominant basis components probability inclusion.
  • idx (None or str with values ‘coef’ or ‘coef1’ (int value of 0 or 1, respectively)) – Used in the multiple NMF model. In factorizations following standard NMF model or nonsmooth NMF model :param:`idx` is always None.
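The what=’samples’ case can be sketched as follows. The function name is hypothetical, and the probability is taken to be the maximum entry divided by the column sum, which is an assumed reading of "relative contribution".

```python
import numpy as np

def predict_samples(H, prob=False):
    """Dominant basis component (row index of the column maximum) for each sample."""
    idx = np.argmax(H, axis=0)
    if not prob:
        return idx.tolist()
    # Relative contribution of the maximum entry within each column.
    p = H.max(axis=0) / H.sum(axis=0)
    return idx.tolist(), p.tolist()
```

For what=’features’ the same function could be applied to the transposed basis matrix W.T.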
purity(membership=None, idx=None)

Compute the purity given a priori known groups of samples [Park2007].

The purity is a measure of the performance of a clustering method in recovering classes defined by a list of a priori known true class labels.

Return a real number in [0,1]. The larger the purity, the better the clustering performance.

Parameters:
  • membership (list) – Specify known class membership for each sample.
  • idx (None or str with values ‘coef’ or ‘coef1’ (int value of 0 or 1, respectively)) – Used in the multiple NMF model. In factorizations following standard NMF model or nonsmooth NMF model :param:`idx` is always None.
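A minimal sketch of purity follows, assuming the definition (1 / n) * sum over clusters of the size of the largest class within the cluster. The function name and the explicit cluster_labels argument are assumptions of this example.

```python
import numpy as np

def clustering_purity(cluster_labels, membership):
    """Fraction of samples that belong to the majority class of their cluster."""
    cluster_labels = np.asarray(cluster_labels)
    membership = np.asarray(membership)
    total = 0
    for q in np.unique(cluster_labels):
        in_q = membership[cluster_labels == q]
        _, counts = np.unique(in_q, return_counts=True)
        total += counts.max()  # size of the dominant class in cluster q
    return total / membership.size
```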
residuals(idx=None)

Return residuals matrix between the target matrix and its NMF estimate.

Parameters:
  • idx (None) – Used in the multiple NMF model. In standard NMF :param:`idx` is always None.
rss(idx=None)

Compute Residual Sum of Squares (RSS) between NMF estimate and target matrix [Hutchins2008].

This measure can be used to estimate the optimal factorization rank. [Hutchins2008] suggested choosing the first value where the RSS curve presents an inflection point. [Frigyesi2008] suggested using the smallest value at which the decrease in the RSS is lower than the decrease of the RSS obtained from random data.

RSS tells us how much of the variation in the dependent variables our model did not explain.

Return real value.

Parameters:
  • idx (None or str with values ‘coef’ or ‘coef1’ (int value of 0 or 1, respectively)) – Used in the multiple NMF model. In factorizations following standard NMF model or nonsmooth NMF model :param:`idx` is always None.
run()

Run the specified MF algorithm.

score_features(idx=None)

Compute the score for each feature that represents its specificity to one of the basis vectors [Park2007].

A row vector of the basis matrix (W) indicates the contributions of a gene to the r (i.e. columns of W) biological pathways or processes. As genes can participate in more than one biological process, it is beneficial to investigate genes that have relatively large coefficient in each biological process.

Return a list containing the score of each feature. The feature scores are real values in [0,1]. The higher the feature score, the more basis-specific the corresponding feature.

Parameters:
  • idx (None or str with values ‘coef’ or ‘coef1’ (int value of 0 or 1, respectively)) – Used in the multiple NMF model. In factorizations following standard NMF model or nonsmooth NMF model :param:`idx` is always None.
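A sketch of an entropy-based feature score is given below, assuming the form 1 + (1 / log2 k) * sum_q p(i, q) * log2 p(i, q), where k is the number of basis vectors and p(i, q) is the share of feature i's row total that falls on basis vector q. The function name is hypothetical, and every row of W is assumed to have a nonzero sum.

```python
import numpy as np

def feature_scores(W):
    """Score in [0, 1] per feature (row of W); 1 means fully basis-specific."""
    k = W.shape[1]
    p = W / W.sum(axis=1, keepdims=True)  # row-wise relative contributions
    # Treat 0 * log2(0) as 0 by substituting 1 inside the logarithm.
    plogp = p * np.log2(np.where(p > 0, p, 1.0))
    return 1.0 + plogp.sum(axis=1) / np.log2(k)
```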
select_features(idx=None)

Compute the most basis-specific features for each basis vector [Park2007].

The [Park2007] scoring schema and feature selection method is used. The features are first scored using score_features(). Then only the features that fulfill both of the following criteria are retained:

  1. the score is greater than u + 3s, where u and s are the median and the median absolute deviation (MAD) of the scores, respectively,
  2. the maximum contribution to a basis component (i.e. the maximal value in the corresponding row of the basis matrix (W)) is larger than the median of all contributions (i.e. of all elements of the basis matrix (W)).

Return list of retained features’ indices.

Parameters:
  • idx (None or str with values ‘coef’ or ‘coef1’ (int value of 0 or 1, respectively)) – Used in the multiple NMF model. In factorizations following standard NMF model or nonsmooth NMF model :param:`idx` is always None.
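The two retention criteria can be sketched as below. The function name and the separate scores argument (precomputed feature scores, one per row of W) are assumptions made to keep the example self-contained.

```python
import numpy as np

def select_basis_specific(W, scores):
    """Indices of features passing both the score and the contribution criterion."""
    scores = np.asarray(scores, dtype=float)
    u = np.median(scores)                 # median of the scores
    s = np.median(np.abs(scores - u))     # median absolute deviation (MAD)
    high_score = scores > u + 3 * s
    # Largest entry in each row of W compared with the median of all entries.
    high_contrib = W.max(axis=1) > np.median(W)
    return np.flatnonzero(high_score & high_contrib).tolist()
```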
set_params()

Set algorithm specific model options.

sparseness(idx=None)

Compute sparseness of matrix (basis vectors matrix, mixture coefficients) [Hoyer2004]. This sparseness measure quantifies how much of the energy of a vector is packed into only a few components. The sparseness of a vector is a real number in [0, 1]. A sparser vector has a value closer to 1. The measure is 1 iff the vector contains a single nonzero component, and it is 0 iff all components are equal.

Sparseness of a matrix is the mean sparseness of its column vectors.

Return tuple that contains sparseness of the basis and mixture coefficients matrices.

Parameters:
  • idx (None or str with values ‘coef’ or ‘coef1’ (int value of 0 or 1, respectively)) – Used in the multiple NMF model. In factorizations following standard NMF model or nonsmooth NMF model :param:`idx` is always None.
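The sparseness measure of [Hoyer2004] for a vector x with n components is (sqrt(n) - ||x||_1 / ||x||_2) / (sqrt(n) - 1). The sketch below follows the description above by averaging over columns; the function names are hypothetical, and vectors are assumed to be nonzero with more than one component.

```python
import numpy as np

def vector_sparseness(x):
    """Hoyer sparseness of a nonzero vector with at least two components."""
    x = np.asarray(x, dtype=float)
    n = x.size
    l1 = np.abs(x).sum()
    l2 = np.sqrt(np.sum(x ** 2))
    return (np.sqrt(n) - l1 / l2) / (np.sqrt(n) - 1.0)

def matrix_sparseness(M):
    """Mean sparseness of the column vectors of M."""
    return float(np.mean([vector_sparseness(M[:, j]) for j in range(M.shape[1])]))
```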
target(idx=None)

Return the target matrix to estimate.

Parameters:
  • idx (None) – Used in the multiple NMF model. In standard NMF :param:`idx` is always None.
update()

Update basis and mixture matrix.
