methods (sa.methods)

This script contains techniques for strain analysis.

sa.methods.analyze_repl(input_file, plates, dnp, res_path, keys=['RT', '37'], save_dist_mat=False)[source]

Estimate the robustness of measurements by analyzing mutants that occur multiple times in the data set.

For each ORF, save the mean distance between all ORF pairs and the list of between-replicate distances to a file named distance_reps.csv in the directory :param:`res_path`.

Parameters:
  • input_file (str) – File path to the specification of multi-occurring mutants.
  • plates (list) – Plates data as returned from utilities.read.
  • dnp (numpy.array) – Preprocessed computational profiles.
  • res_path (str) – Full path to the directory where results are to be saved.
  • keys (list) – Extensions of the plate names for TS (temperature-sensitive) mutants. By default, these are [“RT”, “37”].
  • save_dist_mat (bool) – Indicator whether to plot a distance heatmap for each mutant that occurs multiple times.
sa.methods.compute_distance(dnp, title, out_name, labels, labelsize=6, rot='vertical')[source]

Considering the rows of :param:`dnp` as vectors, compute the Euclidean distance matrix between each pair of vectors.

Save the computed distances to a file named :param:`out_name` and return the distance matrix.
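The pairwise computation can be illustrated with a minimal NumPy sketch. The helper name euclidean_distance_matrix and the toy array are hypothetical; plotting and file output are left out.

```python
import numpy as np

def euclidean_distance_matrix(profiles):
    """Return the symmetric matrix of pairwise Euclidean distances between rows."""
    profiles = np.asarray(profiles, dtype=float)
    # Broadcasting: diff[i, j] is the vector profiles[i] - profiles[j].
    diff = profiles[:, None, :] - profiles[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

# Toy stand-in for dnp: three observations with two features each.
dnp = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
dist = euclidean_distance_matrix(dnp)
print(dist)
```

The result is symmetric with a zero diagonal, so entry [i, j] is the distance between rows i and j.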

sa.methods.correct_outlier_dec(X, base)[source]

Correction of the outlier decision function. The more cells, the greater the increase in the decision value.

sa.methods.decompose_MDS(dnp, c_labels=None, out_dir=None, save_coordinates=True)[source]

Apply metric MDS to obtain a low-dimensional representation of the data in which the distances closely approximate the distances in the original high-dimensional space.

The method constructs a distance matrix using Euclidean distance and runs MDS. The distances are embedded in 2 dimensions (n_components = 2). The final coordinates are plotted and saved to the file mds_scatterplot.pdf in the :param:`out_dir` directory. Colors in the plot denote the classes (clusters) passed in :param:`c_labels`.

Parameters:
  • dnp (numpy.array) – Preprocessed computational profiles of strains.
  • c_labels (numpy.array) – Cluster indices of observations.
  • out_dir (str) – Full path to output directory where plot will be saved.
  • save_coordinates (bool) – Indicator whether to plot and save the final coordinates. By default, it is set to True.

Returns the positions of the observations in the embedding space.
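As a rough illustration of the embedding step, the sketch below assumes scikit-learn's MDS as a stand-in for whatever implementation the function uses; the random data replaces real profiles, and plotting is omitted.

```python
import numpy as np
from sklearn.manifold import MDS

rng = np.random.RandomState(0)
dnp = rng.normal(size=(30, 8))  # toy stand-in for preprocessed profiles

# With default settings, MDS builds Euclidean dissimilarities from the rows itself.
mds = MDS(n_components=2, random_state=0)
coords = mds.fit_transform(dnp)  # one 2D position per observation
print(coords.shape)
```

Each row of coords can then be drawn in a scatter plot and colored by its entry in c_labels.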

sa.methods.decompose_PCA(dnp, plate, n_components, title, out_name)[source]

Apply PCA decomposition to the data to reduce the dimensionality.

Decompose a multivariate data set into a set of successive orthogonal components that explain a maximum amount of variance.

In addition, use probabilistic PCA to provide a probabilistic interpretation of PCA that gives a likelihood of the data based on the amount of variance it explains.

Return the transformed data of shape [n_samples, n_components] and save a 3D PCA projection to :param:`out_name`.
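A minimal sketch of the decomposition, assuming scikit-learn's PCA (whose score method reports the average log-likelihood under the probabilistic PCA model). The random data is a placeholder and the 3D plot is omitted.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
dnp = rng.normal(size=(40, 10))  # toy stand-in for preprocessed profiles

pca = PCA(n_components=3)
transformed = pca.fit_transform(dnp)      # shape [n_samples, n_components]
explained = pca.explained_variance_ratio_  # variance share per component
avg_loglik = pca.score(dnp)                # average log-likelihood of the samples
print(transformed.shape, explained, avg_loglik)
```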

sa.methods.decompose_manifold(dnp, n_components)[source]

Apply the Isomap algorithm to seek a lower-dimensional embedding that maintains geodesic distances between all points. Isometric mapping (Isomap) is a nonlinear dimensionality reduction method.

Manifold learning is an approach to nonlinear dimensionality reduction. The idea is that the dimensionality of the data set is only artificially high. Manifold learning is an attempt to generalize linear dimensionality reduction such as PCA to be sensitive to nonlinear structure in the data.

Parameters:
  • dnp (numpy.array) – Preprocessed computational profiles of strains.
  • n_components – Number of coordinates for the manifold.

Return transformed data of shape [n_observations, n_components].
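The following sketch assumes scikit-learn's Isomap; the swiss-roll data and the n_neighbors value are illustrative choices, not settings taken from this module.

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import Isomap

# A 3D point cloud lying on a rolled-up 2D sheet, used here instead of real profiles.
dnp, _ = make_swiss_roll(n_samples=100, random_state=0)

# Geodesic distances are approximated along a k-nearest-neighbor graph.
iso = Isomap(n_neighbors=10, n_components=2)
embedding = iso.fit_transform(dnp)  # shape [n_observations, n_components]
print(embedding.shape)
```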

sa.methods.detect_novelties_GMM(dnp_wt, dnp_mt, plates_wt, plates_mt, out_dir, save_visualization=True, wt_name=['YOR202W'])[source]

Novelty detection using variational inference for the GMM (Gaussian Mixture Model).

Parameters:
  • dnp_wt (numpy.array) – Preprocessed computational profiles of WT strains.
  • dnp_mt (numpy.array) – Preprocessed computational profiles of MT strains.
  • plates_wt (list) – Plates data of WT strains as returned from utilities.read.
  • plates_mt (list) – Plates data of MT strains as returned from utilities.read.
  • out_dir (str) – Full path to output directory where novelty detection results will be saved.
  • save_visualization (bool) – Indicator whether to visualize training and test observations with labels in a low-dimensional representation. By default, it is set to True.
  • wt_name (list) – Names of the wild-type ORFs.

A file named novelty_detection_GMM.csv is saved to the directory :param:`out_dir`. It contains strain identifiers and predictions (the predicted class for each observation and the predicted posterior probability of the observation for each Gaussian state in the model) for all non-WT observations.

If :param:`save_visualization` is set, a plot named novelty_detection_GMM.pdf, which visualizes training and test observations with learned labels, is also saved to :param:`out_dir`. The low-dimensional (2D) representation is obtained by MDS.
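One way to obtain such predictions is sketched below, assuming scikit-learn's BayesianGaussianMixture as the variational GMM. The number of components, the toy data and the fit-on-WT, predict-on-MT split are assumptions for illustration.

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.RandomState(0)
dnp_wt = rng.normal(size=(80, 4))            # toy WT profiles (training)
dnp_mt = rng.normal(loc=3.0, size=(20, 4))   # toy MT profiles (test)

gmm = BayesianGaussianMixture(n_components=3, max_iter=500, random_state=0)
gmm.fit(dnp_wt)

classes = gmm.predict(dnp_mt)            # most probable Gaussian state per observation
posteriors = gmm.predict_proba(dnp_mt)   # posterior probability of each state
print(classes[:5], posteriors[:2])
```

Each row of posteriors sums to 1, so it can be written next to the strain identifier as one column per Gaussian state.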

sa.methods.detect_novelties_SVM(dnp_wt, dnp_mt, plates_wt, plates_mt, out_dir, save_visualization=True)[source]

Novelty detection decides whether a new observation belongs to the same distribution as existing observations or should be considered different. The training data are profiles of WT strains (preprocessed, that is, with outliers removed and features standardized). We are interested in detecting anomalies in mutant strains.

The goal is to identify strains with a non-WT phenotype. In addition to clustering WT and non-WT strains and analyzing cluster memberships, one can perform novelty detection to decide whether a new observation (a non-WT strain) is so different from the training set (WT strains) that we can doubt it is regular (e.g. it comes from a different distribution).

A one-class SVM is used, a semi-supervised algorithm that learns a decision function for novelty detection.

Parameters:
  • dnp_wt (numpy.array) – Preprocessed computational profiles of WT strains.
  • dnp_mt (numpy.array) – Preprocessed computational profiles of MT strains.
  • plates_wt (list) – Plates data of WT strains as returned from utilities.read.
  • plates_mt (list) – Plates data of MT strains as returned from utilities.read.
  • out_dir (str) – Full path to output directory where novelty detection results will be saved.
  • save_visualization (bool) – Indicator whether to visualize training and test observations with labels in a low-dimensional representation. By default, it is set to True.

A file named novelty_detection_SVM.csv is saved to the directory :param:`out_dir`. It contains strain identifiers and predictions (the predicted class for each observation and the distance of the observation to the separating hyperplane) for all non-WT observations. Predicted classes are +1 (regular novel observations) and -1 (abnormal novel observations). A greater distance to the separating hyperplane indicates greater confidence in the prediction.

If :param:`save_visualization` is set, a plot named novelty_detection_SVM.pdf, which visualizes training and test observations with learned labels, is also saved to :param:`out_dir`. The low-dimensional (2D) representation is obtained by MDS.

Note

If the visualization option is enabled, the optimization process in MDS can increase the running time.

See also

sa.methods.detect_novelties_GMM()
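The train-on-WT, predict-on-MT pattern can be sketched with scikit-learn's OneClassSVM. The kernel, nu and gamma settings and the toy data are assumptions, not values taken from this module.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)
dnp_wt = rng.normal(size=(100, 5))           # toy WT profiles (training)
dnp_mt = rng.normal(loc=6.0, size=(20, 5))   # toy MT profiles, far from the WT cloud

svm = OneClassSVM(kernel="rbf", nu=0.1, gamma="scale")
svm.fit(dnp_wt)

pred = svm.predict(dnp_mt)              # +1 = regular, -1 = abnormal
scores = svm.decision_function(dnp_mt)  # signed distance to the separating hyperplane
print(pred, scores)
```

The sign of each score matches the predicted class, and its magnitude serves as the confidence of that prediction.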

sa.methods.detect_outliers(dnp, plate, out_name=None, save=True)[source]

Detect outliers by fitting an elliptic envelope and normalizing the decision function by the number of cells in each strain.

The goal is to separate a core of regular observations from some polluting ones and decide whether a new observation belongs to the same distribution as existing observations (it is an inlier) or should be considered different (it is an outlier).

The number of cells per strain used for computing the feature values of an observation additionally normalizes the decision function score.

Parameters:
  • dnp (numpy.array) – Preprocessed computational profiles.
  • plate (list) – Plates data as returned from utilities.read.
  • out_name (str) – The full name of the file where outliers will be saved.
  • save (bool) – Indicator whether to save outliers’ identification to files. True by default.

Return the outlyingness of the observations in :param:`dnp`, the decision function according to the fitted elliptic envelope model, and the reduced data set without outliers as a tuple (reduced_plate, reduced_dnp).
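The envelope fit can be sketched with scikit-learn's EllipticEnvelope. The contamination value and the planted outliers are illustrative, and the normalization by cell count is deliberately left out because its exact form is not described here.

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope

rng = np.random.RandomState(0)
dnp = rng.normal(size=(100, 3))  # toy stand-in for preprocessed profiles
dnp[:5] += 8.0                   # plant five obvious outliers in the first rows

env = EllipticEnvelope(contamination=0.1, random_state=0)
labels = env.fit(dnp).predict(dnp)     # +1 = inlier, -1 = outlier
decision = env.decision_function(dnp)  # negative values lie outside the envelope

reduced_dnp = dnp[labels == 1]         # data set with the outliers removed
print(labels[:10], reduced_dnp.shape)
```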

sa.methods.fss_wrapper(dnp, plates, attr_names, out_dir)[source]

Wrapper approach to unsupervised feature subset selection.

The idea is to cluster the data as well as possible in each candidate feature subspace and select the most “interesting” subspace with the minimum number of features. Each candidate subspace is evaluated by assessing the resulting clusters and feature subset using the chosen feature selection criterion. This process is repeated until the best feature subset with its corresponding clusters is found.

The wrapper approach divides the task into three components:
  1. feature search: sequential forward (greedy)
  2. clustering algorithm: k-means with inertia scoring
  3. feature subset evaluation: silhouette coefficient
Parameters:
  • dnp (numpy.array) – Preprocessed computational profiles of strains.
  • plates (list) – Plates data as returned from utilities.read.
  • attr_names (list) – Names of features corresponding to columns in :param:`dnp`.
  • out_dir (str) – Full path to output directory where feature subspace descriptions and cluster memberships will be saved.

A file named fss_subsets_scores.csv is saved to the directory :param:`out_dir`. It contains a description of the feature subspaces (the names of the features used in each candidate space) and the mean silhouette coefficient achieved.

A file named fss_best_subset.csv is saved to the directory :param:`out_dir`. It contains a description of the feature subspace with the highest silhouette coefficient and its score.

A file named fss_best_clustering.csv is saved to the directory :param:`out_dir`. It contains strain identifiers and their cluster memberships for all observations.

Returned are predictions for all observations in :param:`dnp` (the cluster each observation belongs to) as specified by the best evaluated feature subspace, score of best candidate feature space and the names of the features from the best candidate space.
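The three components can be combined as in the hedged sketch below, which assumes scikit-learn's KMeans and silhouette_score. The helper forward_select, the fixed number of clusters and the stopping rule (stop when no added feature improves the score) are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def forward_select(X, n_clusters=2, random_state=0):
    """Greedy forward search scored by the mean silhouette of a k-means clustering."""
    selected, best_score, best_labels = [], -1.0, None
    remaining = list(range(X.shape[1]))
    while remaining:
        trials = []
        for f in remaining:
            cols = selected + [f]
            km = KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state)
            labels = km.fit_predict(X[:, cols])
            trials.append((silhouette_score(X[:, cols], labels), f, labels))
        score, f, labels = max(trials, key=lambda t: t[0])
        if score <= best_score:
            break  # no candidate feature improves the criterion
        selected.append(f)
        remaining.remove(f)
        best_score, best_labels = score, labels
    return selected, best_score, best_labels

rng = np.random.RandomState(0)
informative = np.concatenate([rng.normal(-5, 1, 30), rng.normal(5, 1, 30)])
noise = rng.normal(size=(60, 3))
dnp = np.column_stack([informative, noise])  # feature 0 carries the cluster structure

selected, best_score, best_labels = forward_select(dnp)
print(selected, best_score)
```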

sa.methods.k_means(dnp, plates, k_range, out_name_silhouette=None, out_dir_predictions=None, out_name_predictions=None, save_silhouette=True, save_predictions=True, wt_name=['YOR202W'])[source]

Apply k-Means clustering to the data.

Parameters:
  • dnp (numpy.array) – Preprocessed computational profiles of strains.
  • plates (list) – Plates data as returned from utilities.read.
  • k_range (iterable) – The range used for testing the number of clusters.
  • out_name_silhouette (str) – Fully qualified name of file where silhouette plot will be saved.
  • out_dir_predictions (str) – Full path to output directory where files with clustering predictions will be stored.
  • out_name_predictions (str) – Identifier of data used in the name of the clustering predictions.
  • save_silhouette (bool) – Indicator whether to draw the silhouette plot and save it to file :param:`out_name_silhouette`. By default, it is set to True.
  • save_predictions (bool) – Indicator whether to save predictions (cluster membership for each observation) for each number of clusters in :param:`k_range`.
  • wt_name (list) – Names of the wild-type ORFs.

For each number of clusters compute the mean silhouette coefficient of all observations. Return clustering results for clustering with highest mean silhouette coefficient.

A mean silhouette coefficient close to 1 means the observations are appropriately clustered; a coefficient close to -1 means the observations have been assigned to the wrong clusters.

Returned are the predictions for all observations in the data (the closest cluster each observation belongs to), the mean silhouette coefficient of the best clustering and a matrix of the data transformed to cluster-distance space of shape [n_observations, k]. In the new space, each dimension is the distance to one cluster center.
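Selecting the number of clusters by mean silhouette can be sketched as follows, assuming scikit-learn's KMeans. The toy blobs and the candidate range are illustrative; plotting and file output are omitted.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.RandomState(0)
centers = np.array([[-6.0, -6.0], [0.0, 6.0], [6.0, -6.0]])
# Three well-separated groups of 30 observations each.
dnp = np.vstack([c + 0.5 * rng.normal(size=(30, 2)) for c in centers])

best_k, best_score, best_model = None, -1.0, None
for k in range(2, 6):  # candidate numbers of clusters
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(dnp)
    score = silhouette_score(dnp, model.labels_)
    if score > best_score:
        best_k, best_score, best_model = k, score, model

predictions = best_model.labels_       # cluster index per observation
distances = best_model.transform(dnp)  # shape [n_observations, best_k]
print(best_k, best_score)
```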

sa.methods.normalize(dnp)[source]

Normalize observations individually to unit L2 norm.

Parameters:
  • dnp (numpy.array) – Computational profiles of strains.
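In NumPy terms, scaling each observation to unit L2 norm amounts to the sketch below. The toy array is hypothetical, and all-zero rows would need separate handling to avoid division by zero.

```python
import numpy as np

dnp = np.array([[3.0, 4.0], [1.0, 0.0], [2.0, 2.0]])  # toy profiles

# Divide every row by its own Euclidean length.
row_norms = np.linalg.norm(dnp, axis=1, keepdims=True)
normalized = dnp / row_norms
print(normalized)
```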
sa.methods.outlier2cluster(out_labels, c_labels)[source]

Determine intersection between outliers and cluster mapping (percentage of outliers in each cluster).

Parameters:
  • out_labels (numpy.array) – Outlier labelling (1 = inlier, -1 = outlier).
  • c_labels (numpy.array) – Cluster indices of observations.
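A small sketch of the mapping, assuming the percentage is taken relative to the size of each cluster (the share of a cluster's members that are outliers). The label arrays are toy values.

```python
import numpy as np

out_labels = np.array([1, -1, 1, 1, -1, -1])  # 1 = inlier, -1 = outlier
c_labels = np.array([0, 0, 0, 1, 1, 1])       # cluster index per observation

# For every cluster, the percentage of its members flagged as outliers.
pct = {
    int(c): 100.0 * np.mean(out_labels[c_labels == c] == -1)
    for c in np.unique(c_labels)
}
print(pct)
```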

sa.methods.silhouette(X, c_idx, K)[source]

Compute the silhouette score for each observation in :param:`X`.

Parameters:
  • c_idx (list) – Cluster indices for all observations in X.
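Per-observation silhouette values can be obtained as in this sketch, which assumes scikit-learn's silhouette_samples as a stand-in for the function's own computation. The data and labels are toy values.

```python
import numpy as np
from sklearn.metrics import silhouette_samples, silhouette_score

X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
c_idx = [0, 0, 0, 1, 1, 1]  # cluster index per observation

scores = silhouette_samples(X, c_idx)    # one value in [-1, 1] per observation
mean_score = silhouette_score(X, c_idx)  # the mean of the per-observation values
print(scores, mean_score)
```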
sa.methods.standardize(dnp)[source]

Standardize features by removing the mean and scaling to unit variance.

Parameters:
  • dnp (numpy.array) – Computational profiles of strains.
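Feature-wise standardization corresponds to the NumPy sketch below. The toy data is hypothetical, the population standard deviation is assumed, and constant features would need separate handling to avoid division by zero.

```python
import numpy as np

rng = np.random.RandomState(0)
dnp = rng.normal(loc=5.0, scale=3.0, size=(50, 4))  # toy profiles

# Center every feature (column) and scale it to unit variance.
standardized = (dnp - dnp.mean(axis=0)) / dnp.std(axis=0)
print(standardized.mean(axis=0), standardized.std(axis=0))
```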
