As background reading for this example, we recommend [Schietgat2010] and [Schachtner2008]. The former studies decision-tree-based models for predicting multiple gene functions; the latter uses unsupervised matrix factorization techniques to extract marker genes from gene expression profiles for classification into diagnostic categories.
This example from functional genomics deals with predicting gene functions. The gene function prediction task has two main characteristics:
 a single gene can have multiple functions,
 the functions are organized in a hierarchy, specifically one structured as a rooted tree, the MIPS Functional Catalogue (FunCat). A gene related to some function is automatically related to all of that function's ancestors. The data set used in this example originates from S. cerevisiae and is annotated with the MIPS Functional Catalogue.
This problem setting is known as hierarchical multilabel classification (HMC).
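The ancestor property of the hierarchy can be illustrated with a minimal sketch. It assumes, purely for illustration, that class codes are written as slash-separated paths from the root (for example `01/01/03`); the function name `with_ancestors` and this encoding are hypothetical and not part of the example module.

```python
def with_ancestors(labels, sep="/"):
    """Return the given class codes together with all of their ancestors.

    Assumption: a class code is a path from the root of the tree, with
    levels joined by `sep` (e.g. "01/01/03" is a child of "01/01").
    """
    closed = set()
    for code in labels:
        parts = code.split(sep)
        # Every prefix of a code names the class itself or one of its ancestors.
        for depth in range(1, len(parts) + 1):
            closed.add(sep.join(parts[:depth]))
    return closed


direct = {"01/01/03", "10/03"}
print(sorted(with_ancestors(direct)))
# ['01', '01/01', '01/01/03', '10', '10/03']
```

The classes in `direct` are direct memberships; the classes added by the closure are the indirect ones implied by the tree.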
Note
The S. cerevisiae FunCat annotated data set used in this example is not included in the datasets. To perform the gene function prediction experiments, first download the data set; in particular, the D1 (FC) seq data set must be available for the example to run. Download links are listed in the datasets. To run the example, uncompress the data and place it in the corresponding data directory: the extracted data set must exist in the S_cerevisiae_FC directory under datasets. Once the data is installed, you are ready to run the experiments.
The gene function prediction task proceeds as follows.
 Reading the S. cerevisiae sequence data, i.e. the train, validation and test sets. Reading the meta data, attributes' labels and class labels. Weights are used to distinguish direct and indirect class memberships of genes in gene function classes according to the FunCat annotations.
 Preprocessing, i.e. normalizing the data matrix of the test data and the data matrix of the joined train and validation data.
 Factorization of the train data matrix. We used the SNMF/L factorization algorithm for the train data.
 Factorization of the test data matrix. We used the SNMF/L factorization algorithm for the test data.
 Application of rules for class assignments. Three rules can be used: average correlation and maximal correlation, as in [Schachtner2008], and threshold maximal correlation. All class assignment rules are generalized to meet the hierarchy constraint imposed by the rooted tree structure of the MIPS Functional Catalogue.
 Precision-recall (PR) evaluation measures.
To run the example simply type:
python gene_func_prediction.py
or call the module’s function:
import nimfa.examples
nimfa.examples.gene_func_prediction.run()
Note
This example uses the matplotlib library to produce visualizations.
Apply rules for class assignments. In [Schachtner2008] two rules are proposed, average correlation and maximal correlation. Both rules are implemented here and can be selected through the :param:`method` parameter. In addition, the threshold maximal correlation rule is available. The class assignment rules are generalized to multilabel classification incorporating hierarchy constraints.
Although any method based on similarity measures can be used, we estimate correlation coefficients. Let w be the gene profile of the test basis matrix for which we want to predict gene functions. For each class C, an index set A is created that contains all indices m for which the m-th profile of the train basis matrix has label C. Index set B contains all remaining indices. The average correlation coefficient between w and the profiles indexed by A is then computed, and likewise the average correlation coefficient between w and the profiles indexed by B. Finally, w is assigned label C if the former average is greater than the latter.
Note
The rule described above assigns the class label according to the average correlation of the test vector with all vectors belonging to one or the other index set. A minor modification of this rule assigns the class label according to the maximal correlation occurring between the test vector and the members of each index set.
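The average and maximal correlation rules for a single class C can be sketched as follows. This is a plain-Python illustration under stated assumptions, not the module's implementation: the names `pearson` and `assign_label`, the list-based inputs, and the choice to raise an error when either index set is empty are all hypothetical. The hierarchy constraint and the threshold variant are not covered.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    # Assumption: a constant profile is treated as uncorrelated (0.0).
    return cov / (norm_x * norm_y) if norm_x > 0 and norm_y > 0 else 0.0


def assign_label(w, train_profiles, has_label, method="average"):
    """Decide whether test profile w is assigned class C.

    train_profiles[m] is the m-th train profile and has_label[m] tells
    whether it carries label C (index set A) or not (index set B).
    """
    corr_a = [pearson(w, p) for p, c in zip(train_profiles, has_label) if c]
    corr_b = [pearson(w, p) for p, c in zip(train_profiles, has_label) if not c]
    if not corr_a or not corr_b:
        raise ValueError("both index sets must be non-empty")
    if method == "average":
        score_a = sum(corr_a) / len(corr_a)
        score_b = sum(corr_b) / len(corr_b)
    elif method == "maximal":
        score_a, score_b = max(corr_a), max(corr_b)
    else:
        raise ValueError("unknown method: %r" % method)
    return score_a > score_b


train = [[1, 2, 3], [2, 4, 6.5], [3, 2, 1], [6, 4, 1]]
labels = [True, True, False, False]
print(assign_label([1, 3, 5], train, labels, method="average"))
print(assign_label([5, 3, 1], train, labels, method="maximal"))
```

In a multilabel setting the same decision would be repeated for every class, each with its own index sets A and B.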
Note
As noted before, the main problem of this example is the hierarchical multilabel classification (HMC) setting. We therefore generalized the concepts from articles describing the use of factorization for binary classification problems to multilabel classification. Additionally, we use weights for class memberships to incorporate the hierarchical structure of the MIPS Functional Catalogue.
Return mapping of gene functions to genes.
Parameters: 


Return type:  dict 
Estimate correlation coefficients between profiles of train basis matrix and profiles of test basis matrix.
Return the estimated correlation coefficients of the features (variables).
Parameters: 


Return type:  numpy.matrix 
Perform factorization on S. cerevisiae FunCat annotated sequence data set (D1 FC seq).
Return the factorized data, that is, the matrix factors resulting from factorization (basis and mixture matrix).
Parameters:  data (tuple) – Transformed data set containing attributes’ values, class information and possibly additional meta information. 

Report the performance with precision-recall (PR) based evaluation measures.
Besides PR, ROC-based evaluations have also been used to evaluate gene function prediction approaches. PR-based evaluation better suits the characteristics of the common HMC task, in which many classes are infrequent, with only a small number of genes having a particular function. That is, for most classes the number of negative instances far exceeds the number of positive instances. It is therefore often more important to recognize the positive instances than to correctly predict the negative ones (i.e. that a gene does not have a particular function). ROC curves may be less suited to the task because they reward a learner for correctly predicting negative instances.
Return PR evaluation measures.
Parameters: 


Return type:  tuple 
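To make the role of negative instances concrete, the following minimal sketch computes precision and recall over all (gene, class) pairs. It assumes micro-averaging and label sets stored per gene in dictionaries; the function name `precision_recall`, the averaging scheme and the input layout are illustrative assumptions and may differ from what the example module computes.

```python
def precision_recall(true_labels, predicted_labels):
    """Micro-averaged precision and recall over all (gene, class) pairs.

    Both arguments map a gene identifier to a set of class codes.
    Assumption: every evaluated gene appears in true_labels.
    """
    tp = fp = fn = 0
    for gene, truth in true_labels.items():
        predicted = predicted_labels.get(gene, set())
        tp += len(truth & predicted)   # correctly predicted functions
        fp += len(predicted - truth)   # predicted but not annotated
        fn += len(truth - predicted)   # annotated but not predicted
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


truth = {"g1": {"01", "01/01"}, "g2": {"10"}}
predicted = {"g1": {"01"}, "g2": {"10", "10/03"}}
print(precision_recall(truth, predicted))
```

Neither quantity involves true negatives, so the many classes a gene does not belong to cannot inflate the score.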
Preprocess the S. cerevisiae FunCat annotated sequence data set (D1 FC seq). The preprocessing step includes data normalization.
Return the preprocessed data.
Parameters:  data (tuple) – Transformed data set containing attributes’ values, class information and possibly additional meta information. 
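The text does not state which normalization is applied. As one plausible option, the sketch below rescales all entries of a data matrix to the interval [0, 1] with min-max scaling, which also yields the non-negative input that NMF methods require. The function name `min_max_normalize` and the nested-list representation are assumptions made for illustration.

```python
def min_max_normalize(matrix):
    """Rescale all entries of a matrix (list of rows) to [0, 1]."""
    values = [v for row in matrix for v in row]
    lowest, highest = min(values), max(values)
    if highest == lowest:
        # A constant matrix carries no contrast; map every entry to 0.
        return [[0.0 for _ in row] for row in matrix]
    span = highest - lowest
    return [[(v - lowest) / span for v in row] for row in matrix]


print(min_max_normalize([[-2.0, 0.0], [2.0, 6.0]]))
# [[0.0, 0.25], [0.5, 1.0]]
```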

Read the S. cerevisiae FunCat annotated sequence data set (D1 FC seq).
Return attributes' values and class information of the test data set and of the joined train and validation data set. Additional mapping functions are returned that map attributes' names and classes' names to indices.
Run the gene function prediction example on the S. cerevisiae sequence data set (D1 FC seq).
Read data in the ARFF format and transform it into a matrix suitable for the factorization process. For each feature, update direct and indirect class information by exploiting properties of the Functional Catalogue hierarchy.
Return attributes' values and class information. If :param:`include_meta` is specified, additional mapping functions are provided that map indices to attributes' names and indices to classes' names.
Parameters: 

