Data analysis of a yeast screen based on features extracted from the segmentation of cells and their compartments. Feature classes include cell area and shape, intensity and texture information, plus additionally calculated features.
This package provides outlier detection and removal of WT strains recognized as outliers, feature standardization, observation normalization, analysis of replicated yeast mutant strains, a method for identifying strains with a non-WT phenotype and assessing significance (novelty detection by one-class SVM and GMM), decomposition and embedding techniques (MDS, PCA, manifold embedding), clustering, unsupervised feature subset selection via a wrapper approach with silhouette index evaluation, various utilities for filtering, splitting and combining features from different plates and collections, an option to save the data in Orange format, and various plotting options.
No special installation procedure is specified. However, the package makes extensive use of the SciPy and NumPy libraries for fast and convenient matrix manipulation and some linear algebra operations. In addition, it uses the scikit-learn package for some machine learning algorithms and Matplotlib for plotting. There are no other prerequisites.
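Since no installer is specified, it can help to confirm that the listed libraries are importable before running the examples. The following is a minimal sketch using only the standard library; the helper name `missing_modules` is hypothetical and not part of the package.

```python
import importlib.util

# Import names of the libraries listed above.
# Note that scikit-learn is imported under the name "sklearn".
REQUIRED = ["numpy", "scipy", "sklearn", "matplotlib"]


def missing_modules(names):
    """Return the names that cannot be found by the current interpreter."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules(REQUIRED)
if missing:
    print("Missing dependencies: " + ", ".join(missing))
else:
    print("All listed dependencies are importable.")
```

The check only locates the modules; it does not import them or verify their versions.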
Download the source code (zipped archive) from the Bitbucket repository. The repository is private, so you need an invitation.
Let us look at an example of analysing a yeast (S. cerevisiae) vacuole screen. Wherever <output omitted> appears, the package prints useful information about the progress of optimization procedures, whether the algorithms ran successfully, and the results of the analysis. Each usage example below is accompanied by the sample output files produced by running it. Data sets stored in Orange format can be used for further investigation and interactive visualization in Orange.
Single plate analysis of WT strains:
>>> import sa
>>> meta, plates = sa.utilities.read(dir_path = "deletion_vacuole_avgPerStrain/")
<output omitted>
>>> for fidx in xrange(len(plates)):
...     sa.analysis.strains_1p_WT(meta[fidx], plates[fidx], res_path = "tmp/")
<output omitted>
Same results can be achieved by running:
> python single_plateWT.py "deletion_vacuole_avgPerStrain/" "tmp/"
For each plate strains_1p_WT produces the following files in res_path directory:
Here we are interested in whether clustering of WT profiles from a single plate reveals any interesting patterns. To get an idea of how well separated the resulting clusters are, one can make a silhouette plot. A silhouette plot displays a measure of how close each point in one cluster is to points in the neighbouring clusters, on the scale [-1, 1].
less than 0.3. Low silhouette values indicate that observations lie close to observations from other clusters. The clusters obtained are not very distinct and do not appear to be well separated. This is expected, as we have removed the WT outliers and prefer a homogeneous WT data set.
Add-plate4_2012-04-26_clustering_2.csv (Clustering in 2 clusters corresponding to the above silhouette plot. Predictions are done for all WT strains from plate Add-plate4. This can help answer the question of how different the WT strains are. A better clustering indicates distinct groups of WT strains.)
Add-plate4_2012-04-26_clustering_3.csv (Clustering in 3 clusters corresponding to the above silhouette plot.)
Add-plate4_2012-04-26_clustering_4.csv (Clustering in 4 clusters corresponding to the above silhouette plot.)
Add-plate4_2012-04-26_clustering_5.csv (Clustering in 5 clusters corresponding to the above silhouette plot.)
Add-plate4_2012-04-26_mean_well.pdf (Matrix plot of wells as placed on the plate. The entries are mean observation distances from other observations on the same plate.)
Add-plate4_2012-04-26_orange.tab (Profiles of MT and WT strains from plate after standardization and outlier removal in Orange format.)
Add-plate4_2012-04-26_prep.csv (Profiles of MT and WT strains from plate after standardization and outlier removal in CSV format.)
Add-plate4_2012-04-26_outliers.csv (Detected WT outliers using elliptic envelope method and correction regarding strain cell density.)
Add-plate4_2012-04-26_pca_3d.pdf (3D PCA projection. PCA is mostly used as a tool in exploratory data analysis, for building predictive models and for dimensionality reduction. Transforming and plotting the observations in principal component space allows us to separate the observations according to the variation of their profiles. Useful for detecting outliers. Observations are projected from a higher-dimensional space onto a lower-dimensional one such that the variance of the projection along each component is maximized.)
Add-plate4_2012-04-26_cytoplasm_texture_differencevariance_chnl_3_0_plot_hist_with_norm_fit.pdf (Attribute histogram with fitted normal PDF is produced for each attribute.)
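The silhouette values behind the silhouette plots mentioned above can be computed by hand. The following is a minimal sketch in plain Python, not the package's own implementation; the function names `euclidean` and `silhouette_values`, the use of Euclidean distance and the toy data are all assumptions made for illustration.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equally long sequences of numbers."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def silhouette_values(points, labels):
    """Return s(i) = (b - a) / max(a, b) for every point.

    a is the mean distance to the other points of the same cluster,
    b is the smallest mean distance to the points of any other cluster.
    At least two clusters are required.
    """
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)

    values = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            values.append(0.0)  # common convention for singleton clusters
            continue
        a = sum(euclidean(points[i], points[j]) for j in own) / len(own)
        b = min(
            sum(euclidean(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        values.append((b - a) / denom if denom > 0 else 0.0)
    return values


# Toy data: two tight, well-separated groups of two points each.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = silhouette_values(points, [0, 0, 1, 1])  # labels follow the groups
bad = silhouette_values(points, [0, 1, 0, 1])   # labels cut across the groups
print(good)
print(bad)
```

Values close to 1 indicate a point that sits well inside its own cluster, while values near 0 or below indicate a point that is as close, or closer, to another cluster.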
Combined plate analysis of WT strains from one collection:
>>> import sa
>>> meta, plates = sa.utilities.read(dir_path = "TS_vacuole_avgPerStrain/")
<output omitted>
>>> sa.analysis.strains_Np_WT(meta, plates, res_path = "tmp/")
<output omitted>
Same results can be achieved by running:
> python combined_plates_single_coll_WT.py "TS_vacuole_avgPerStrain/" "tmp/"
Function strains_Np_WT produces the following files in the res_path directory. All plates from the collection are combined prior to analysis. Then a procedure similar to the single plate analysis is run:
Combined plate analysis of WT strains from multiple collections:
>>> import sa
>>> meta1, plates1 = sa.utilities.read(dir_path = "TS_vacuole_avgPerStrain/")
<output omitted>
>>> meta2, plates2 = sa.utilities.read(dir_path = "SG_vacuole_avgPerStrain/")
<output omitted>
>>> print "No. plates in coll 1: %d" % len(plates1)
No. plates in coll 1: 8
>>> print "No. observations in coll 1: %d" % sum(len(data) for data in plates1)
No. observations in coll 1: 2746
>>> print "No. plates in coll 2: %d" % len(plates2)
No. plates in coll 2: 2
>>> print "No. observations in coll 2: %d" % sum(len(data) for data in plates2)
No. observations in coll 2: 613
>>> plates1.extend(plates2)
>>> meta1.extend(meta2)
>>> print "No. plates in joined coll: %d" % len(plates1)
No. plates in joined coll: 10
>>> sa.analysis.strains_Np_WT(meta1, plates1, res_path = "tmp/")
<output omitted>
Same results can be achieved by running:
> python combined_plates_multi_coll_WT.py "TS_vacuole_avgPerStrain/" "SG_vacuole_avgPerStrain/" "tmp/"
The last parameter is taken to be the path to the directory where results will be saved. It is preceded by the paths to the collections of plates.
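The argument convention described above (one or more collection paths followed by a results path) can be expressed with the standard `argparse` module. This is a hypothetical sketch of how such a script might read its arguments; the source of combined_plates_multi_coll_WT.py is not shown here, so the function name `parse_args` and the attribute names `collections` and `res_path` are assumptions.

```python
import argparse


def parse_args(argv):
    """Split the arguments into collection paths and a trailing results path."""
    parser = argparse.ArgumentParser(
        description="Combined plate analysis of WT strains (argument sketch)."
    )
    # One or more directories, each holding the plates of one collection.
    parser.add_argument("collections", nargs="+",
                        help="paths to collections of plates")
    # The last positional argument is the output directory.
    parser.add_argument("res_path",
                        help="directory where results will be saved")
    return parser.parse_args(argv)


args = parse_args(["TS_vacuole_avgPerStrain/", "SG_vacuole_avgPerStrain/", "tmp/"])
print(args.collections)
print(args.res_path)
```

With this layout, any number of collections can be passed, and the final argument is always interpreted as the results directory.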
Plates from collections are combined and processed jointly. Function strains_Np_WT produces the following files in the res_path directory:
Mutant strains that occur multiple times in the data set (replications):
>>> import sa
>>> data_del = sa.utilities.read(dir_path = "deletion_vacuole_avgPerStrain/")
<output omitted>
>>> data_ts = sa.utilities.read(dir_path = "TS_vacuole_avgPerStrain/")
<output omitted>
>>> data_sg = sa.utilities.read(dir_path = "SG_vacuole_avgPerStrain/")
<output omitted>
>>> sa.analysis.strains_repl(data_del, data_ts, data_sg,
... repeats_path = "replicates/mutante-duplikati.csv", res_path = "tmp/", repeats_keys = ["RT", "37"])
<output omitted>
Same results can be achieved by running:
> python replicated_strains.py "deletion_vacuole_avgPerStrain/" "TS_vacuole_avgPerStrain/" "SG_vacuole_avgPerStrain/" "replicates/mutante-duplikati.csv" "tmp/"
Function strains_repl produces the following files in directory res_path after executing the above code snippet:
Distance of strains from different collections:
>>> import sa
>>> data_del = sa.utilities.read(dir_path = "deletion_vacuole_avgPerStrain/")
<output omitted>
>>> data_ts = sa.utilities.read(dir_path = "TS_vacuole_avgPerStrain/")
<output omitted>
>>> data_sg = sa.utilities.read(dir_path = "SG_vacuole_avgPerStrain/")
<output omitted>
>>> sa.analysis.strains_coll(data_del, data_ts, data_sg, res_path = "tmp/")
<output omitted>
Same results can be achieved by running:
> python coll_distance.py "deletion_vacuole_avgPerStrain/" "TS_vacuole_avgPerStrain/" "SG_vacuole_avgPerStrain/" "tmp/"
Function strains_coll preprocesses plates from each collection (standardizing features and removing recognized outliers) and computes distances between strains from the same and different collections (see documentation for further details). The following files are saved to directory res_path:
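The two steps described for strains_coll, feature standardization and distance computation within and between collections, can be illustrated in plain Python. This is a minimal sketch under stated assumptions and not the package's implementation: the helper names `standardize` and `mean_pairwise_distance`, the z-score standardization, the Euclidean metric and the toy profiles are all hypothetical, and outlier removal is left out.

```python
import itertools
import math
from statistics import mean, pstdev


def standardize(profiles):
    """Z-score every feature (column) across the given profiles."""
    columns = list(zip(*profiles))
    means = [mean(col) for col in columns]
    # A constant feature has zero spread; divide by 1.0 instead to avoid an error.
    stds = [pstdev(col) or 1.0 for col in columns]
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in profiles]


def euclidean(a, b):
    """Euclidean distance between two profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean_pairwise_distance(group_a, group_b=None):
    """Mean distance within one group, or between two groups if group_b is given."""
    if group_b is None:
        pairs = itertools.combinations(group_a, 2)  # each unordered pair once
    else:
        pairs = itertools.product(group_a, group_b)  # every cross-group pair
    return mean(euclidean(a, b) for a, b in pairs)


# Toy profiles for two hypothetical collections (already on a common scale).
coll_a = [[0.0, 0.0], [0.0, 2.0]]
coll_b = [[10.0, 0.0], [10.0, 2.0]]
within_a = mean_pairwise_distance(coll_a)
between = mean_pairwise_distance(coll_a, coll_b)
print(within_a, between)
```

Comparing the within-collection and between-collection means gives a rough indication of whether strains from different collections are systematically further apart than strains from the same collection.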
Mutant strains that significantly differ from WT strains (identification of strains with non-WT phenotype):
>>> # Which mutant strains differ so much from strains with the WT phenotype
>>> # that they do not belong to the same distribution (novelty detection)?
>>> import sa
>>> data_del = sa.utilities.read(dir_path = "deletion_vacuole_avgPerStrain/")
<output omitted>
>>> data_ts = sa.utilities.read(dir_path = "TS_vacuole_avgPerStrain/")
<output omitted>
>>> data_sg = sa.utilities.read(dir_path = "SG_vacuole_avgPerStrain/")
<output omitted>
>>> sa.analysis.strains_Np_novelty_MT(data_del, data_ts, data_sg, res_path = "tmp/")
<output omitted>
Same results can be achieved by running:
> python novelty_SVM_MT.py "deletion_vacuole_avgPerStrain/" "TS_vacuole_avgPerStrain/" "SG_vacuole_avgPerStrain/" "tmp/"
Function strains_Np_novelty_MT outputs the following files to res_path:
Alternatively:
>>> #which mutant strains have profiles that significantly differ from WT strains'
>>> #profiles
>>> import sa
>>> data_del = sa.utilities.read(dir_path = "deletion_vacuole_avgPerStrain/")
<output omitted>
>>> data_ts = sa.utilities.read(dir_path = "TS_vacuole_avgPerStrain/")
<output omitted>
>>> data_sg = sa.utilities.read(dir_path = "SG_vacuole_avgPerStrain/")
<output omitted>
>>> sa.analysis.strains_Np_MT(data_del, data_ts, data_sg,
... repeats_path = "replicates/mutante-duplikati.csv", res_path = "tmp/", repeats_keys = ["RT", "37"], standardize = True)
<output omitted>
Same results can be achieved by running:
> python novelty_dist_MT.py "deletion_vacuole_avgPerStrain/" "TS_vacuole_avgPerStrain/" "SG_vacuole_avgPerStrain/" "replicates/mutante-duplikati.csv" "tmp/"
Function strains_Np_MT outputs the following files to res_path:
- hist_mean_MT-WT_distances.pdf (Histogram of mean distances of mutant strains to WT strains.)
- hist_MT-WT_WT-WT_distances.pdf (Overlaid (i) histogram of mean distances of WT strains to other WT strains and (ii) histogram of mean distances of mutant strains to WT strains.)
- hist_signif_YBR131W_100_100.pdf (A histogram of this kind is plotted for each ORF.)
- hist_WT-WT_distances.pdf (Histogram of distances between all profiles of WT strains after outlier removal and standardization.)
- MT-WT_distance.csv (Mean distance of each mutant strain from the collection of WT profiles after outlier removal and standardization. If replications exist, their distances are stored in the meta column. Strains are sorted by their mean distance to WT profiles in descending order.)
- MT-WT_distance_prep.csv (Standardized profiles of mutants as specified by ranking in MT-WT_distance.csv.)
- WT-WT_distance.csv (Mean distance of each WT profile from the collection of WT profiles after outlier removal and standardization. WT strains are sorted by distance in descending order.)
- WT_MT_by_distance_prep.csv (Standardized profiles of mutants and WT strains ordered by distance.)
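The ranking stored in MT-WT_distance.csv and its comparison against WT-WT distances can be illustrated with a small sketch. It is not the package's code: the function names, the Euclidean metric, the toy profiles and in particular the empirical p-value used as a significance measure are assumptions, since the actual significance assessment is not described here.

```python
import math
from statistics import mean


def euclidean(a, b):
    """Euclidean distance between two profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def rank_mutants(mutants, wt_profiles):
    """Return (name, mean distance to WT) pairs, largest distance first."""
    scored = [
        (name, mean(euclidean(profile, wt) for wt in wt_profiles))
        for name, profile in mutants.items()
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


def wt_reference_distances(wt_profiles):
    """Mean distance of every WT profile to all other WT profiles."""
    return [
        mean(euclidean(wt, other)
             for j, other in enumerate(wt_profiles) if j != i)
        for i, wt in enumerate(wt_profiles)
    ]


def empirical_p_value(distance, reference):
    """Share of reference distances at least as large as `distance`.

    The +1 terms keep the estimate away from exactly zero (assumed convention).
    """
    exceed = sum(1 for d in reference if d >= distance)
    return (exceed + 1) / (len(reference) + 1)


# Toy data: four WT profiles and two hypothetical mutants.
wt = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
mutants = {"mutant_near": [0.5, 0.5], "mutant_far": [10.0, 10.0]}

ranking = rank_mutants(mutants, wt)
reference = wt_reference_distances(wt)
for name, dist in ranking:
    print(name, round(dist, 3), empirical_p_value(dist, reference))
```

A mutant whose mean distance to WT exceeds every WT-WT reference distance receives the smallest p-value this scheme can produce, which depends on the number of WT profiles available.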
If the standardize parameter is set to False, the mean distances (MT-WT and WT-WT) are computed before standardization:
Feature subset selection (FSS) and clustering:
>>> import sa
>>> data_del = sa.utilities.read(dir_path = "deletion_vacuole_avgPerStrain/")
<output omitted>
>>> data_ts = sa.utilities.read(dir_path = "TS_vacuole_avgPerStrain/")
<output omitted>
>>> data_sg = sa.utilities.read(dir_path = "SG_vacuole_avgPerStrain/")
<output omitted>
>>> sa.analysis.fss(data_del, data_ts, data_sg, res_path = "tmp/")
<output omitted>
Same results can be achieved by running:
> python fss.py "deletion_vacuole_avgPerStrain/" "TS_vacuole_avgPerStrain/" "SG_vacuole_avgPerStrain/" "tmp/"
Function fss outputs the following files to res_path:
If FSS results are already available:
>>> import sa
>>> data_del = sa.utilities.read(dir_path = "deletion_vacuole_avgPerStrain/")
<output omitted>
>>> data_ts = sa.utilities.read(dir_path = "TS_vacuole_avgPerStrain/")
<output omitted>
>>> data_sg = sa.utilities.read(dir_path = "SG_vacuole_avgPerStrain/")
<output omitted>
>>> sa.analysis.fss_post_cluster(data_del, data_ts, data_sg,
... fss_subset_path = "res/fss_best_subset.csv",
... fss_cluster_path = "res/fss_best_clustering.csv",
... res_path = "tmp/")
<output omitted>
Same results can be achieved by running:
> python post_fss.py "deletion_vacuole_avgPerStrain/" "TS_vacuole_avgPerStrain/" "SG_vacuole_avgPerStrain/" "res/fss_best_subset.csv" "res/fss_best_clustering.csv" "tmp/"
Function fss_post_cluster saves the following files to res_path:
This software and data are provided as-is, and there are no guarantees that they fit your purposes or that they are bug-free. Use them at your own risk!