Data Analysis of Yeast Screens at CCBR: Documentation

Data analysis of yeast screens based on features extracted from the segmentation of cells and their compartments. Feature classes include cell area and shape, intensity and texture descriptors, and additional derived features.

This package provides:

  • outlier detection and removal of WT strains recognized as outliers,
  • feature standardization and observation normalization,
  • analysis of replicated yeast mutant strains,
  • a method for identifying strains with a non-WT phenotype and assessing significance (novelty detection by one-class SVM and GMM),
  • decomposition and embedding techniques (MDS, PCA, manifold embedding),
  • clustering,
  • unsupervised feature subset selection via a wrapper approach with silhouette index evaluation,
  • various utilities for filtering, splitting and combining features from different plates and collections,
  • an option to save the data in Orange format,
  • various plotting options.
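
For orientation, the novelty-detection step can be pictured as follows. This is a minimal, self-contained sketch on synthetic data using scikit-learn, not the package's actual implementation, and all variable names are illustrative; a GMM-based assessment would follow the same pattern, fitting a mixture to the WT profiles and thresholding the likelihood of each mutant profile.

    import numpy as np
    from sklearn.svm import OneClassSVM

    # Synthetic stand-ins: rows are strain profiles, columns are features.
    rng = np.random.RandomState(0)
    wt_profiles = rng.normal(size=(200, 10))        # wild-type reference profiles
    mt_profiles = rng.normal(size=(50, 10)) + 1.5   # mutant profiles to assess

    # Fit a one-class SVM on WT profiles only, so it learns the "normal" region.
    clf = OneClassSVM(nu=0.05, kernel="rbf")
    clf.fit(wt_profiles)

    # Mutants predicted as -1 fall outside the WT region, i.e. are candidates
    # for a non-WT phenotype.
    labels = clf.predict(mt_profiles)
    novel = np.where(labels == -1)[0]
    print("candidate non-WT strains: %d of %d" % (len(novel), len(mt_profiles)))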

Installation

No special installation procedure is required. However, the package makes extensive use of the NumPy and SciPy libraries for fast and convenient matrix manipulation and some linear algebra operations. In addition, it uses the scikit-learn package for machine learning algorithms and Matplotlib for plotting. There are no other prerequisites.
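
If these libraries are not already available, they can usually be installed with pip (no specific versions are pinned in this documentation):

    > pip install numpy scipy scikit-learn matplotlib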

Download the source code (zipped archive) from the BitBucket repository. The repository is private, so you need an invitation to access it.

Quickstart

Let us walk through an example of analysing a yeast S. cerevisiae vacuole screen. Wherever <output omitted> appears, informative messages are printed to the screen about the progress of the optimization procedures, the success of the executed algorithms and the results of the analysis. Each usage example below is accompanied by sample output files produced by running it. The data sets stored in Orange format can be used for further investigation and interactive visualization in Orange.
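
For example, a data set saved by the package in Orange format can be loaded back for inspection roughly as follows; the file name below is hypothetical, substitute one of the files written to your res_path:

    >>> import Orange
    >>> data = Orange.data.Table("tmp/results.tab")  # hypothetical file name
    >>> len(data), len(data.domain.attributes)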

  • Single plate analysis of WT strains:

    >>> import sa
    >>> meta, plates = sa.utilities.read(dir_path = "deletion_vacuole_avgPerStrain/")
    <output omitted>
    >>> for fidx in xrange(len(plates)):
    ...    sa.analysis.strains_1p_WT(meta[fidx], plates[fidx], res_path = "tmp/")
    <output omitted>
    

    Same results can be achieved by running:

    > python single_plateWT.py "deletion_vacuole_avgPerStrain/" "tmp/"

    For each plate, strains_1p_WT produces the following files in the res_path directory:

  • Combined plate analysis of WT strains from multiple collections:

    >>> import sa
    >>> meta1, plates1 = sa.utilities.read(dir_path = "TS_vacuole_avgPerStrain/")
    <output omitted>
    >>> meta2, plates2 = sa.utilities.read(dir_path = "SG_vacuole_avgPerStrain/")
    <output omitted>
    >>> print "No. plates in coll 1: % d" % len(plates1)
    No. plates in coll 1:  8
    >>> print "No. observations in coll 1: %d" % sum(len(data) for data in plates1)
    No. observations in coll 1: 2746
    >>> print "No. plates in coll 2: %d" % len(plates2)
    No. plates in coll 2: 2
    >>> print "No. observations in coll 2: %d" % sum(len(data) for data in plates2)
    No. observations in coll 2: 613
    >>> plates1.extend(plates2)
    >>> meta1.extend(meta2)
    >>> print "No. plates in joined coll: %d" % len(plates1)
    No. plates in joined coll: 10
    >>> sa.analysis.strains_Np_WT(meta1, plates1, res_path = "tmp/")
    <output omitted>
    

    Same results can be achieved by running:

    > python combined_plates_multi_coll_WT.py "TS_vacuole_avgPerStrain/" "SG_vacuole_avgPerStrain/" "tmp/"

    The last parameter is taken to be the path to the directory where the results will be saved; it is preceded by the paths to the collections of plates. A sketch of an equivalent driver script is given below.

    Plates from the collections are combined and processed jointly. The function strains_Np_WT produces a set of result files in the res_path directory.
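
    For orientation, an equivalent driver script could be written along these lines; this is only a sketch assembled from the calls shown above, not necessarily the actual content of combined_plates_multi_coll_WT.py:

        import sys
        import sa

        # All arguments except the last name collection directories; the last
        # one is the directory where the results are saved.
        coll_dirs, res_path = sys.argv[1:-1], sys.argv[-1]

        meta, plates = [], []
        for dir_path in coll_dirs:
            m, p = sa.utilities.read(dir_path=dir_path)
            meta.extend(m)
            plates.extend(p)

        sa.analysis.strains_Np_WT(meta, plates, res_path=res_path)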

  • Mutant strains that occur multiple times in the data set (replicates):

    >>> import sa
    >>> data_del = sa.utilities.read(dir_path = "deletion_vacuole_avgPerStrain/")
    <output omitted>
    >>> data_ts = sa.utilities.read(dir_path = "TS_vacuole_avgPerStrain/")
    <output omitted>
    >>> data_sg = sa.utilities.read(dir_path = "SG_vacuole_avgPerStrain/")
    <output omitted>
    >>> sa.analysis.strains_repl(data_del, data_ts, data_sg,
    ... repeats_path = "replicates/mutante-duplikati.csv", res_path = "tmp/", repeats_keys = ["RT", "37"])
    <output omitted>
    

    Same results can be achieved by running:

    > python replicated_strains.py "deletion_vacuole_avgPerStrain/" "TS_vacuole_avgPerStrain/" "SG_vacuole_avgPerStrain/" "replicates/mutante-duplikati.csv" "tmp/"

    After executing the above code snippet, the function strains_repl produces the following files in the res_path directory:

  • Distance of strains from different collections:

    >>> import sa
    >>> data_del = sa.utilities.read(dir_path = "deletion_vacuole_avgPerStrain/")
    <output omitted>
    >>> data_ts = sa.utilities.read(dir_path = "TS_vacuole_avgPerStrain/")
    <output omitted>
    >>> data_sg = sa.utilities.read(dir_path = "SG_vacuole_avgPerStrain/")
    <output omitted>
    >>> sa.analysis.strains_coll(data_del, data_ts, data_sg, res_path = "tmp/")
    <output omitted>
    

    Same results can be achieved by running:

    > python coll_distance.py "deletion_vacuole_avgPerStrain/" "TS_vacuole_avgPerStrain/" "SG_vacuole_avgPerStrain/" "tmp/"

    The function strains_coll preprocesses the plates of each collection (standardizing features and removing recognized outliers) and computes distances between strains from the same and from different collections; the resulting files are saved to the res_path directory (see the documentation for further details).
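
    The general idea can be sketched as follows; the arrays are synthetic and the computation is a simplified stand-in for what strains_coll actually does:

        import numpy as np
        from scipy.spatial.distance import cdist

        rng = np.random.RandomState(0)
        coll_a = rng.normal(size=(300, 20))   # standardized profiles, collection A
        coll_b = rng.normal(size=(120, 20))   # standardized profiles, collection B

        # Euclidean distances between strains of the same and of different collections.
        within_a = cdist(coll_a, coll_a)
        between = cdist(coll_a, coll_b)

        # Ignore the zero diagonal when averaging within-collection distances.
        mean_within = within_a[~np.eye(len(coll_a), dtype=bool)].mean()
        mean_between = between.mean()
        print("mean within A: %.2f   mean A-B: %.2f" % (mean_within, mean_between))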

Alternatively:

>>> #which mutant strains have profiles that significantly differ from WT strains'
>>> #profiles
>>> import sa
>>> data_del = sa.utilities.read(dir_path = "deletion_vacuole_avgPerStrain/")
<output omitted>
>>> data_ts = sa.utilities.read(dir_path = "TS_vacuole_avgPerStrain/")
<output omitted>
>>> data_sg = sa.utilities.read(dir_path = "SG_vacuole_avgPerStrain/")
<output omitted>
>>> sa.analysis.strains_Np_MT(data_del, data_ts, data_sg,
... repeats_path = "replicates/mutante-duplikati.csv", res_path = "tmp/", repeats_keys = ["RT", "37"], standardize = True)
<output omitted>

Same results can be achieved by running:

> python novelty_dist_MT.py "deletion_vacuole_avgPerStrain/" "TS_vacuole_avgPerStrain/" "SG_vacuole_avgPerStrain/" "replicates/mutante-duplikati.csv" "tmp/"

Function strains_Np_MT outputs the following files to res_path:

  • hist_mean_MT-WT_distances.pdf (Histogram of mean distances of mutant strains to WT strains.)
  • hist_MT-WT_WT-WT_distances.pdf (Overlaid histograms of (i) mean distances of WT strains to other WT strains and (ii) mean distances of mutant strains to WT strains.)
  • hist_signif_YBR131W_100_100.pdf (A histogram of this kind is plotted for each ORF.)
  • hist_WT-WT_distances.pdf (Histogram of distances between all profiles of WT strains after outlier removal and standardization.)
  • MT-WT_distance.csv (Mean distance of each mutant strain from the collection of WT profiles after outlier removal and standardization. If replicates exist, their distances are stored in the meta column. Strains are ordered by their mean distance to the WT profiles in descending order.)
  • MT-WT_distance_prep.csv (Standardized profiles of the mutants, ordered according to the ranking in MT-WT_distance.csv.)
  • WT-WT_distance.csv (Mean distance of each WT profile from the collection of WT profiles after outlier removal and standardization. WT strains are ordered by distance in descending order.)
  • WT_MT_by_distance_prep.csv (Standardized profiles of mutants and WT strains ordered by distance.)

By setting the standardize parameter to False, the mean distances (MT-WT and WT-WT) are computed before standardization.
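
In spirit, the ranking of mutant strains by their mean distance to the WT profiles, and the effect of the standardize flag, can be sketched like this; the data are synthetic, and the real function additionally handles outlier removal, replicates and the CSV/plot output listed above:

    import numpy as np
    from scipy.spatial.distance import cdist

    rng = np.random.RandomState(1)
    wt = rng.normal(size=(150, 20))           # WT profiles after outlier removal
    mt = rng.normal(size=(400, 20)) + 0.3     # mutant strain profiles

    standardize = True
    if standardize:
        # Purely for illustration, both sets are z-scored with the WT statistics
        # before distances are measured.
        mu, sd = wt.mean(axis=0), wt.std(axis=0)
        wt, mt = (wt - mu) / sd, (mt - mu) / sd

    # Mean Euclidean distance of every mutant profile to the WT collection,
    # ranked in descending order (the most deviant strains come first).
    mean_dist = cdist(mt, wt).mean(axis=1)
    ranking = np.argsort(mean_dist)[::-1]
    print("most deviant mutant index: %d (mean distance %.2f)"
          % (ranking[0], mean_dist[ranking[0]]))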

  • Feature subset selection (FSS) and clustering (a generic sketch of the wrapper approach is given after this list):

    >>> import sa
    >>> data_del = sa.utilities.read(dir_path = "deletion_vacuole_avgPerStrain/")
    <output omitted>
    >>> data_ts = sa.utilities.read(dir_path = "TS_vacuole_avgPerStrain/")
    <output omitted>
    >>> data_sg = sa.utilities.read(dir_path = "SG_vacuole_avgPerStrain/")
    <output omitted>
    >>> sa.analysis.fss(data_del, data_ts, data_sg, res_path = "tmp/")
    <output omitted>
    

    Same results can be achieved by running:

    > python fss.py "deletion_vacuole_avgPerStrain/" "TS_vacuole_avgPerStrain/" "SG_vacuole_avgPerStrain/" "tmp/"

    Function fss outputs the following files to res_path:

  • If FSS results are already available:

    >>> import sa
    >>> data_del = sa.utilities.read(dir_path = "deletion_vacuole_avgPerStrain/")
    <output omitted>
    >>> data_ts = sa.utilities.read(dir_path = "TS_vacuole_avgPerStrain/")
    <output omitted>
    >>> data_sg = sa.utilities.read(dir_path = "SG_vacuole_avgPerStrain/")
    <output omitted>
    >>> sa.analysis.fss_post_cluster(data_del, data_ts, data_sg,
    ...                             fss_subset_path = "res/fss_best_subset.csv",
    ...                             fss_cluster_path = "res/fss_best_clustering.csv",
    ...                             res_path = "tmp/")
    <output omitted>
    

    Same results can be achieved by running:

    > python post_fss.py "deletion_vacuole_avgPerStrain/" "TS_vacuole_avgPerStrain/" "SG_vacuole_avgPerStrain/" "res/fss_best_subset.csv" "res/fss_best_clustering.csv" "tmp/"

    The function fss_post_cluster saves its result files to the res_path directory.
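
Finally, the wrapper-style unsupervised feature subset selection with silhouette index evaluation mentioned in the overview can be pictured as a greedy forward-selection loop of the following kind; this is a generic sketch with scikit-learn on synthetic data, not the code of fss itself:

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score

    rng = np.random.RandomState(0)
    X = rng.normal(size=(500, 15))      # illustrative strain-by-feature matrix

    selected, remaining, best_overall = [], list(range(X.shape[1])), -1.0
    while remaining:
        scores = []
        for f in remaining:
            cols = selected + [f]
            labels = KMeans(n_clusters=5, n_init=10,
                            random_state=0).fit_predict(X[:, cols])
            scores.append(silhouette_score(X[:, cols], labels))
        best = int(np.argmax(scores))
        if scores[best] <= best_overall:
            break                        # no feature improves the silhouette index
        best_overall = scores[best]
        selected.append(remaining.pop(best))

    print("selected features: %s (silhouette %.3f)" % (selected, best_overall))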

Disclaimer

This software and data are provided as-is; there is no guarantee that they fit your purposes or that they are bug-free. Use them at your own risk!
