This script contains techniques for strain analysis.
Estimate the robustness of measurements by analyzing mutants that occur multiple times in the data set.
For each ORF, save the mean distance over all pairs of its replicates, together with the list of between-replicate distances, to a file named distance_reps.csv in the directory :param:`res_path`.
Considering the rows of :param:`dnp` as vectors, compute the Euclidean distance matrix between each pair of vectors.
Save the computed distances to a file named :param:`out_name` and return the distance matrix.
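A minimal sketch of this computation, assuming the profiles are rows of a NumPy array; the `distance_matrix` helper and its optional `out_name` handling are illustrative, not the module's actual signature.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def distance_matrix(dnp, out_name=None):
    # pdist returns the condensed (upper-triangular) pairwise distances;
    # squareform expands them into the full symmetric matrix.
    dist = squareform(pdist(dnp, metric="euclidean"))
    if out_name is not None:
        np.savetxt(out_name, dist, delimiter=",")
    return dist

d = distance_matrix(np.array([[0.0, 0.0], [3.0, 4.0]]))
# d[0, 1] is the Euclidean distance between the two rows: 5.0
```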
Correct the outlier decision function: the more cells a strain contains, the greater the increase in its decision score.
Apply metric MDS to obtain a low-dimensional representation of the data in which the distances closely respect the distances in the original high-dimensional space.
The method constructs a distance matrix using the Euclidean distance and runs MDS. The dissimilarities are embedded in two dimensions (n_components = 2). The final coordinates are plotted and saved to a file named mds_scatterplot.pdf in the directory :param:`out_dir`. Colors in the plot denote the classes (clusters) passed in :param:`c_labels`.
Return the positions of the observations in the embedding space.
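A sketch of the embedding step using scikit-learn's MDS on a precomputed Euclidean distance matrix; the coloring by :param:`c_labels` and the PDF output are omitted, and the toy data shapes are illustrative.

```python
import numpy as np
from sklearn.manifold import MDS
from sklearn.metrics import euclidean_distances

rng = np.random.RandomState(0)
X = rng.rand(20, 6)                 # 20 observations, 6 features
D = euclidean_distances(X)          # distances in the original space

# Embed the dissimilarities in two dimensions (n_components=2).
mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
pos = mds.fit_transform(D)          # positions in the embedding space
```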
Apply PCA decomposition to the data to reduce the dimensionality.
Decompose a multivariate data set into a set of successive orthogonal components that explain a maximum amount of variance.
In addition, use probabilistic PCA to provide a probabilistic interpretation of the PCA that can give a likelihood of the data based on the amount of variance it explains.
Return the transformed data of shape [n_samples, n_components] and save a 3D PCA projection to :param:`out_name`.
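A sketch of the decomposition with scikit-learn, assuming three components; the 3D scatter plot saved to :param:`out_name` is omitted. The `score` call uses the probabilistic PCA model mentioned above.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.rand(50, 10)                      # 50 samples, 10 features

pca = PCA(n_components=3)
X_r = pca.fit_transform(X)                # shape (n_samples, n_components)

ratios = pca.explained_variance_ratio_    # fraction of variance per component
ll = pca.score(X)                         # average log-likelihood under probabilistic PCA
```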
Apply the Isomap algorithm to seek a lower-dimensional embedding which maintains geodesic distances between all points. Isometric mapping (Isomap) is a nonlinear dimensionality reduction method.
Manifold learning is an approach to nonlinear dimensionality reduction. The idea is that the dimensionality of the data set is only artificially high. Manifold learning attempts to generalize linear dimensionality reduction methods such as PCA to be sensitive to nonlinear structure in the data.
Return transformed data of shape [n_observations, n_components].
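A sketch of the embedding with scikit-learn's Isomap; the data shape and `n_neighbors` value are illustrative choices.

```python
import numpy as np
from sklearn.manifold import Isomap

rng = np.random.RandomState(0)
X = rng.rand(40, 8)                 # 40 observations, 8 features

# n_neighbors controls the neighborhood graph used to approximate
# geodesic distances between points.
iso = Isomap(n_neighbors=5, n_components=2)
X_r = iso.fit_transform(X)          # shape (n_observations, n_components)
```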
Novelty detection using variational inference for the GMM (Gaussian Mixture Model).
A file named novelty_detection_GMM.csv is saved to the directory :param:`out_dir`; it contains strain identifiers and predictions (the predicted class for each observation and the predicted posterior probability of the observation for each Gaussian state in the model) for all non-WT observations.
If :param:`save_visualization` is set, a plot named novelty_detection_GMM.pdf visualizing the training and test observations with learned labels is saved to the directory :param:`out_dir`. A low-dimensional (2D) representation is obtained by MDS.
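One way to sketch this step is with scikit-learn's BayesianGaussianMixture, which fits a GMM by variational inference; the WT/mutant arrays and `n_components` below are illustrative, not the module's actual data or settings.

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.RandomState(0)
wt = rng.normal(0.0, 1.0, size=(100, 4))        # training: WT profiles
mutants = rng.normal(3.0, 1.0, size=(10, 4))    # test: non-WT profiles

gmm = BayesianGaussianMixture(n_components=3, random_state=0)
gmm.fit(wt)                                     # variational inference on WT data

classes = gmm.predict(mutants)                  # predicted Gaussian state per observation
posteriors = gmm.predict_proba(mutants)         # posterior per state; rows sum to 1
```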
See also
function sa.methods.detect_novelties_SVM().
Novelty detection decides whether a new observation belongs to the same distribution as existing observations or should be considered different. The training data are profiles of WT strains (preprocessed, that is, with outliers removed and features standardized). We are interested in detecting anomalies in mutant strains.
Identification of strains with a non-WT phenotype. In addition to clustering WT and non-WT strains and analyzing cluster memberships, one can perform novelty detection to address whether a new observation (non-WT strain) is so different from the training set (WT strains) that we can doubt it is regular (e.g., that it comes from a different distribution).
A one-class SVM is used: a semi-supervised algorithm that learns a decision function for novelty detection.
A file named novelty_detection_SVM.csv is saved to the directory :param:`out_dir`; it contains strain identifiers and predictions (the predicted class for each observation and the observation's distance to the separating hyperplane) for all non-WT observations. The predicted classes are +1 (regular novel observations) and -1 (abnormal novel observations). A higher distance to the separating hyperplane indicates greater confidence in the prediction.
If :param:`save_visualization` is set, a plot named novelty_detection_SVM.pdf visualizing the training and test observations with learned labels is saved to the directory :param:`out_dir`. A low-dimensional (2D) representation is obtained by MDS.
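A sketch of the one-class SVM step with scikit-learn; the `nu` and `gamma` values and the toy WT/mutant arrays are illustrative assumptions.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)
wt = rng.normal(0.0, 1.0, size=(200, 4))                  # training: WT profiles
mutants = np.vstack([rng.normal(0.0, 1.0, size=(5, 4)),   # regular novelties
                     rng.normal(6.0, 1.0, size=(5, 4))])  # abnormal novelties

clf = OneClassSVM(nu=0.1, gamma="scale")
clf.fit(wt)

labels = clf.predict(mutants)            # +1 regular, -1 abnormal
dist = clf.decision_function(mutants)    # signed distance to the hyperplane
```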
Note
If the visualization option is enabled, the MDS optimization can noticeably increase the running time.
See also
function sa.methods.detect_novelties_GMM().
Detect outliers by fitting an elliptic envelope and normalizing the decision function by the number of cells in each strain.
The goal is to separate a core of regular observations from some polluting ones and decide whether a new observation belongs to the same distribution as existing observations (it is an inlier) or should be considered different (it is an outlier).
The decision function score is additionally normalized by the number of cells per strain that were used to compute the feature values of each observation.
Return the outlyingness of the observations in :param:`dnp`, the decision function according to the fitted elliptic envelope model, and the reduced data set without outliers as the tuple (reduced_plate, reduced_dnp).
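A hedged sketch with scikit-learn's EllipticEnvelope; dividing the decision score by a per-strain cell count is one plausible reading of the normalization described above, and `cell_counts` is a hypothetical array.

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope

rng = np.random.RandomState(0)
dnp = np.vstack([rng.normal(0.0, 1.0, size=(95, 3)),
                 rng.normal(8.0, 1.0, size=(5, 3))])   # 5 polluting rows
cell_counts = rng.randint(50, 500, size=100)           # hypothetical cells per strain

env = EllipticEnvelope(contamination=0.05, random_state=0)
out_labels = env.fit_predict(dnp)                      # +1 inlier, -1 outlier
scores = env.decision_function(dnp) / cell_counts      # normalized decision scores

reduced_dnp = dnp[out_labels == 1]                     # data set without outliers
```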
Wrapper approach to unsupervised feature subset selection.
The idea is to cluster the data as best we can in each candidate feature subspace and select the most “interesting” subspace with the minimum number of features. Each candidate subspace is evaluated by assessing resulting clusters and feature subset using chosen feature selection criterion. This process is repeated until the best feature subset with its corresponding clusters is found.
A file named fss_subsets_scores.csv is saved to the directory :param:`out_dir`; it contains a description of the candidate feature subspaces (the names of the features used in each) and the achieved mean silhouette coefficients.
A file named fss_best_subset.csv is saved to the directory :param:`out_dir`; it contains a description of the feature subspace with the highest silhouette coefficient and its score.
A file named fss_best_clustering.csv is saved to the directory :param:`out_dir`; it contains strain identifiers and cluster memberships for all observations.
Return the predictions for all observations in :param:`dnp` (the cluster each observation belongs to) under the best evaluated feature subspace, the score of the best candidate space, and the names of its features.
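An illustrative sketch of the wrapper loop, assuming k-means as the clustering step and the mean silhouette coefficient as the selection criterion (both named in this section); exhaustive enumeration of subsets is only feasible for a handful of features, and the data and feature names below are made up.

```python
import numpy as np
from itertools import combinations
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.RandomState(0)
X = rng.rand(60, 4)
X[:30, 0] += 3.0                  # feature 0 carries the cluster structure
names = ["f0", "f1", "f2", "f3"]

best = (None, -1.0, None)         # (labels, score, feature names)
for size in range(1, len(names) + 1):
    for subset in combinations(range(len(names)), size):
        sub = X[:, subset]
        # Cluster in the candidate subspace, then score the result.
        labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(sub)
        score = silhouette_score(sub, labels)
        if score > best[1]:
            best = (labels, score, [names[i] for i in subset])

labels, score, features = best    # best clustering, its score, its features
```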
Apply k-Means clustering to the data.
For each number of clusters, compute the mean silhouette coefficient of all observations. Return the clustering with the highest mean silhouette coefficient.
A mean silhouette coefficient close to 1 means the observations are appropriately clustered; a coefficient close to -1 means they have been assigned to the wrong clusters.
Return the predictions for all observations in the data (the closest cluster each observation belongs to), the mean silhouette coefficient of the best clustering, and the data transformed to cluster-distance space, a matrix of shape [n_observations, k]. In the new space, each dimension is the distance to one of the cluster centers.
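A sketch of the selection loop with scikit-learn: cluster for several values of k and keep the clustering with the highest mean silhouette coefficient. The three synthetic blobs are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.RandomState(0)
X = np.vstack([rng.normal((0, 0), 0.3, size=(30, 2)),
               rng.normal((4, 4), 0.3, size=(30, 2)),
               rng.normal((0, 4), 0.3, size=(30, 2))])

best_k, best_score, best_km = 0, -1.0, None
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    score = silhouette_score(X, km.labels_)
    if score > best_score:
        best_k, best_score, best_km = k, score, km

predictions = best_km.labels_
distances = best_km.transform(X)   # cluster-distance space, shape (n, best_k)
# with three well-separated blobs the silhouette criterion selects best_k == 3
```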
Normalize observations individually to unit L2 norm.
Parameters:
    dnp (numpy.array) – Computational profiles of strains.
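A sketch of the per-observation normalization, assuming scikit-learn's normalize helper.

```python
import numpy as np
from sklearn.preprocessing import normalize

dnp = np.array([[3.0, 4.0],
                [1.0, 0.0]])
dnp_n = normalize(dnp, norm="l2")   # each row rescaled to unit L2 norm
# the first row [3, 4] becomes [0.6, 0.8]
```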
Determine intersection between outliers and cluster mapping (percentage of outliers in each cluster).
Parameters:
    out_labels (numpy.array) – Outlier labelling (1 = inlier, -1 = outlier).
    c_labels (numpy.array) – Cluster indices of observations.
Compute the silhouette score for each observation in :param:`X`.
Parameters:
    c_idx (list) – Cluster indices for all observations in X.
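A sketch of the per-observation silhouette computation, assuming scikit-learn's silhouette_samples; the four-point data set is illustrative.

```python
import numpy as np
from sklearn.metrics import silhouette_samples

X = np.array([[0.0], [0.1], [5.0], [5.1]])
c_idx = np.array([0, 0, 1, 1])      # cluster index per observation
s = silhouette_samples(X, c_idx)    # one silhouette coefficient per observation
# every value is close to 1 for this clean two-cluster split
```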