Assessing the functional structure of genomic data
Curtis Huttenhower and Olga G. Troyanskaya. Assessing the functional structure of genomic data. Bioinformatics, ISMB 2008 Special Issue.
The availability of genome-scale data has enabled an abundance of novel analysis techniques for investigating a variety of systems-level biological relationships. As thousands of such datasets become available, they provide an opportunity to study high-level associations between cellular pathways and processes. They also allow the exploration of shared functional enrichments between diverse biological datasets, and they can direct experimenters to areas with low data coverage or a high probability of new discoveries.
We analyze the functional structure of S. cerevisiae datasets from over 950 publications in the context of over 140 biological processes. This includes a coverage analysis of biological processes given current high-throughput data, a data-driven map of associations between processes, and a measure of similar functional activity between genome-scale datasets. This uncovers subtle gene expression similarities in three otherwise disparate microarray datasets due to a shared strain background. We also provide several means of predicting areas of yeast biology likely to benefit from additional high-throughput experimental screens.
Supplemental Materials
- Yeast gene function predictions. Predictions of S. cerevisiae genes involved in each of the processes of interest, calculated from the association scores below. Three cutoffs are given for each process: the median, third quartile, and upper outer fence of the scores of genes annotated to the process. Only the genes passing the strictest prediction cutoff (upper outer fence) are listed here, and genes already annotated to the processes are starred.
- Gene/process association scores. Individual scores providing the ratio of connectivity for each yeast gene to each process of interest. Each ratio's numerator is the gene's average probability of functional relationship to a gene in the process, and the denominator is the gene's global average probability of functional relationship. Negative signs indicate genes already annotated to the process. These scores are not normalized by each process's cohesiveness, since that normalization would not influence function prediction (in which only intra-process comparisons are made).
- Supplemental Table 1. Predicted functional association strengths between biological processes of interest. These processes were chosen from the Gene Ontology using the method of Myers et al. 2006.
- Supplemental Table 1'. A renormalized version of Supplemental Table 1 providing (hopefully) more interpretable weights.
- Supplemental Table 2. Normalized functional activity scores for each biological process within each analyzed dataset. Scores represent the weight given to each dataset within the analyzed biological areas by a Bayesian integration system, with higher weights indicating increased confidence in (and likely activity of) a particular process.
- Supplemental Table 3. Association of each biological process of interest with the ~1,500 uncharacterized genes of the yeast genome. Each score represents the ratio of the average predicted probability of functional relationship between the uncharacterized genes and the set of genes known to participate in each biological area, normalized by that process's cohesiveness.
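
The gene/process association scores and the three prediction cutoffs described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes a hypothetical symmetric matrix `prob` of predicted probabilities of functional relationship between genes, excludes each gene's pairing with itself from both averages (an assumption), and takes the upper outer fence to be the standard Tukey definition, Q3 + 3 × IQR, which the text above does not spell out. The function names are invented for illustration, and each process is assumed to have at least two annotated genes.

```python
import numpy as np


def _process_mask(n_genes, process_genes):
    """Boolean mask marking the genes annotated to the process."""
    mask = np.zeros(n_genes, dtype=bool)
    mask[list(process_genes)] = True
    return mask


def association_scores(prob, process_genes):
    """Ratio of each gene's average probability of functional relationship
    to genes in the process over its average to all genes.

    prob: (n x n) symmetric matrix of pairwise probabilities (hypothetical input).
    process_genes: indices of genes annotated to the process.
    """
    prob = np.asarray(prob, dtype=float)
    n_genes = prob.shape[0]
    in_process = _process_mask(n_genes, process_genes)
    # Assumption: a gene's relationship to itself is left out of both averages.
    not_self = ~np.eye(n_genes, dtype=bool)
    to_process = not_self & in_process[np.newaxis, :]
    numerator = (prob * to_process).sum(axis=1) / to_process.sum(axis=1)
    denominator = (prob * not_self).sum(axis=1) / not_self.sum(axis=1)
    return numerator / denominator


def mark_annotated(scores, process_genes):
    """Negate the scores of genes already annotated to the process."""
    scores = np.asarray(scores, dtype=float)
    in_process = _process_mask(scores.shape[0], process_genes)
    return np.where(in_process, -scores, scores)


def prediction_cutoffs(scores, process_genes):
    """Median, third quartile, and upper outer fence (assumed Q3 + 3 * IQR)
    of the unsigned scores of genes annotated to the process."""
    scores = np.asarray(scores, dtype=float)
    annotated = scores[_process_mask(scores.shape[0], process_genes)]
    q1, median, q3 = np.percentile(annotated, [25, 50, 75])
    return {
        "median": median,
        "third_quartile": q3,
        "upper_outer_fence": q3 + 3.0 * (q3 - q1),
    }
```

Under these assumptions, an unannotated gene would be predicted for a process when its score exceeds the chosen cutoff, with the upper outer fence being the strictest of the three.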