Dataset collection, normalization, and complete listing (referenced in section 2.1)
Microarray data was collected from a variety of sources to create our compendium, including NCBI's Gene Expression Omnibus, EBI's ArrayExpress, the Stanford Microarray Database, and several other publication and laboratory web pages. These data from 81 publications, totaling 2394 array hybridizations, were broken down into their smallest logical groupings of conditions. For example, the stress response dataset from (Gasch et al., 2000) originally consisted of 142 hybridizations corresponding to several different types of induced stress and growth phases. We have separated this dataset in a manner similar to the authors' analyses, resulting in 21 logical datasets such as "hydrogen peroxide exposure," "osmotic shock," and "heat shock from 25° to 37°."
In order to make valid comparisons between the datasets collected, all data was normalized in a similar manner. First, suspect values were removed (i.e. missing values were inserted) in all data based on the information available in the original publication where possible, or in a manner appropriate to the microarray platform used. After identifying missing values, any genes present in less than 50% of the conditions in a dataset were removed from that dataset. Remaining missing values were imputed using the KNN impute algorithm with K=10 using Euclidean distance to identify nearest neighbors. After the imputation process, technical replicates were averaged together, resulting in data files of complete matrices with one entry per gene appearing in the dataset.
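The filtering and imputation steps above can be sketched roughly as follows. This is a minimal illustration on made-up values, not the pipeline actually used: the gene names, the helper names (`filter_genes`, `euclidean`, `knn_impute`), the unweighted neighbor mean, and the scaling of the distance by the number of shared conditions are all assumptions, and the published KNN impute algorithm may differ in these details.

```python
import math

def filter_genes(data, min_present=0.5):
    """Drop genes observed in fewer than `min_present` of the conditions."""
    return {gene: row for gene, row in data.items()
            if sum(v is not None for v in row) / len(row) >= min_present}

def euclidean(a, b):
    """Euclidean distance over conditions observed in both rows (None = missing).

    Assumption: the sum of squares is divided by the number of shared
    conditions so that rows with different overlaps stay comparable.
    """
    pairs = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not pairs:
        return math.inf
    return math.sqrt(sum((x - y) ** 2 for x, y in pairs) / len(pairs))

def knn_impute(data, k=10):
    """Fill each missing value with the mean of its k nearest neighbors' values."""
    imputed = {}
    for gene, row in data.items():
        new_row = list(row)
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate neighbors must have an observed value in condition j.
            candidates = sorted(
                (euclidean(row, other), other[j])
                for name, other in data.items()
                if name != gene and other[j] is not None
            )
            neighbors = [v for d, v in candidates[:k] if d < math.inf]
            if neighbors:  # otherwise the value stays missing
                new_row[j] = sum(neighbors) / len(neighbors)
        imputed[gene] = new_row
    return imputed

# Hypothetical toy dataset: 5 genes x 4 conditions, None marks a missing value.
toy = {
    "geneA": [1.0, 2.0, None, 4.0],
    "geneB": [1.0, 2.0, 3.0, 4.0],
    "geneC": [1.1, 2.1, 3.2, 4.1],
    "geneD": [-2.0, 0.5, -1.0, 0.0],
    "geneE": [None, None, None, 0.3],  # present in 25% of conditions, so removed
}
kept = filter_genes(toy)
complete = knn_impute(kept, k=2)  # K=10 in the text; k=2 only because the toy set is tiny
```

In this toy case `geneA` borrows its missing value from `geneB` and `geneC`, its two closest neighbors, while `geneE` is removed before imputation.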
Most of the data collected falls into two main categories: dual-color competitive hybridization data and single-channel data. Dual-color data was typically found in log ratio format or was transformed into this format. Single-channel data was typically from Affymetrix platforms and was log transformed as a final step in normalization. Other types of data were transformed into a format as close as possible to these sources.
The full list of publications and datasets collected for our functional analysis and search engine is available as a tab-delimited text file and in html:
Examples of the Fisher z-transform, standard normalization process (referenced in section 2.1)
The distribution of Pearson correlations between all pairs of genes within each dataset varies greatly depending on the number of conditions in each dataset, the process targeted, and the array platform used. In order to ensure comparable measures of correlation from one dataset to the next, we employ the Fisher z-transformation as described in the main text. This produces distributions of pair-wise correlations that are approximately normal for all datasets.
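As a rough illustration of the transformation, the sketch below applies the Fisher z-transform, z = atanh(r) = 0.5 ln((1 + r)/(1 - r)), to a handful of made-up correlations and then standardizes them. It assumes that the "standard normalization" step means subtracting the mean and dividing by the standard deviation within a dataset; the function names and the clipping constant are illustrative choices, not taken from the main text.

```python
import math
import statistics

def fisher_z(r, eps=1e-12):
    """Fisher z-transform of a Pearson correlation r."""
    # Clip so that r = +/-1 does not send atanh to infinity.
    r = max(-1.0 + eps, min(1.0 - eps, r))
    return math.atanh(r)

def standardize(values):
    """Shift and scale values to mean 0 and standard deviation 1."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]

# Hypothetical pairwise correlations from a single dataset.
correlations = [-0.2, 0.0, 0.1, 0.35, 0.6, 0.9]
z_scores = standardize([fisher_z(r) for r in correlations])
```

After this step the scores from every dataset are on the same scale, which is what makes them comparable from one dataset to the next.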
|Histograms of correlations between all gene pairs|
|DeRisi et al., 1997||Primig et al., 2000|
Functional Coverage Analysis (referenced in sections 2.2 and 4.1, and Figure 1 caption)
The full table of functional coverage consists of a matrix containing pseudo p-values (based on the z-test for significance) for each combination of dataset and GO biological process examined. These files are available as a tab-delimited text file of p-values, a (very large) image, and a hierarchically clustered version compatible with JavaTreeView for browsing. Note that a p-value of ~10^-10 corresponds to the Bonferroni-corrected p-value of 10^-4, which was used for significance testing in Figure 1.
In addition to the z-test results, we have also calculated significance based on the non-parametric two-sample Kolmogorov-Smirnov test. The results of the KS-test show significance for generally the same GO term/dataset pairings; however, several more pairs are also found to be significant. This is because the KS-test can judge two distributions to be significantly different when their shapes differ sufficiently, even if their means are very similar. As the z-test is based on differences in means, it would not consider such distributions to be significantly different. The full results of the KS-test are available here in a tab-delimited text file:
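The difference between the two tests can be seen on synthetic data. The sketch below builds two samples with the same mean but different spread, then compares a generic two-sample z-test on the means with an asymptotic two-sample KS p-value. Both are textbook formulations written for illustration and are not necessarily the exact procedures used for the tables above; the sample values and function names are assumptions.

```python
import math
import statistics

def z_test_p(a, b):
    """Two-sided p-value for a difference in means (normal approximation)."""
    se = math.sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    z = (statistics.fmean(a) - statistics.fmean(b)) / se
    return math.erfc(abs(z) / math.sqrt(2))

def ks_statistic(a, b):
    """Largest absolute gap between the two empirical distribution functions."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d

def ks_p(a, b, terms=100):
    """Asymptotic two-sided KS p-value from the Kolmogorov distribution."""
    lam = ks_statistic(a, b) * math.sqrt(len(a) * len(b) / (len(a) + len(b)))
    if lam < 0.2:  # the series converges slowly here; the p-value is ~1
        return 1.0
    p = 2 * sum((-1) ** (j - 1) * math.exp(-2 * j * j * lam * lam)
                for j in range(1, terms + 1))
    return min(1.0, max(0.0, p))

# Synthetic samples: same mean (0), very different spread.
n = 200
narrow = [-1 + 2 * i / (n - 1) for i in range(n)]
wide = [-3 + 6 * i / (n - 1) for i in range(n)]

p_z = z_test_p(narrow, wide)   # means agree, so the z-test sees no difference
p_ks = ks_p(narrow, wide)      # shapes differ, so the KS-test does
```

Here the z-test p-value is close to 1 while the KS p-value is very small, which is the situation in which the KS-test reports additional significant pairs.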
Supplemental Table 1
Our analysis of the functional coverage of existing gene expression microarray data for S. cerevisiae characterizes both which biological processes are represented in each dataset and which biological processes are represented in existing data as a whole. This table shows a selection of processes that are significant in many datasets (left column), significant in some, but not many datasets (center column), and significant in very few datasets (right column). Of those processes under-represented in the compendium, there are three major explanations identified: (A) non-transcriptionally regulated processes, (B) processes not occurring in many common laboratory strains, and (C) specific processes not yet targeted by existing gene expression microarray data. Biological processes in the latter category may be areas that warrant further investigation.
|Highly-represented (significant in >15 datasets)||Moderately-represented (significant in <15 but >3 datasets)||Under-represented (significant in <3 datasets)|
|tricarboxylic acid cycle||response to oxidative stress||MAPKKK cascade (A)|
|DNA repair||amino acid transport||protein kinase cascade (A)|
|glycolysis||exocytosis||Ras protein signal transduction (A)|
|phosphate metabolism||vesicle fusion||mating type switching (B)|
|chromosome segregation||meiotic recombination||invasive growth (B)|
|DNA replication||arginine biosynthesis||pseudohyphal growth (B)|
|electron transport||steroid metabolism||response to salt stress (C)|
|ubiquitin-dependent protein catabolism||alcohol metabolism||heme biosynthesis (C)|
|ribosome assembly||double-strand break repair||mitochondrial genome maintenance (C)|
|amino acid metabolism||filamentous growth||telomerase-dependent telomere maintenance (C)|
Benefits of SVD-based signal balancing (referenced in section 2.3.1)
We have quantitatively found that our application of SVD to microarray data for the purpose of signal balancing performs much more accurately than the traditional use of SVD for noise reduction. In the traditional use of SVD, low singular values and their corresponding singular vectors are removed from the decomposed matrices (UΣV^T), then the matrices are multiplied back together to reconstruct a version of the original data matrix (X). Often, enough singular values are retained to account for some percentage of the variation of the original data. However, in our analysis we find that performance generally degrades when using this traditional application of SVD. Rather, by calculating correlations within the left singular vectors (U) we perform our analysis in a space where the more dominant patterns are dampened and the less dominant patterns are magnified. Note that this process is related to some applications of SVD to microarray data, such as the work by Alter et al. which found that dominant eigengenes are sometimes highly correlated with noise.
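The two uses of SVD contrasted above can be sketched with NumPy on a small random matrix. This is an illustration under stated assumptions rather than the analysis code: genes are taken to be the rows of X (so each gene corresponds to a row of U), the fraction of squared singular values stands in for "percentage of variation", and the matrix, the `reconstruct` helper, and the variable names are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 5))  # synthetic data: 6 genes (rows) x 5 conditions

# Thin SVD: X = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(X, full_matrices=False)

def reconstruct(U, s, Vt, variance_fraction):
    """Traditional use: keep the top singular values reaching the given fraction."""
    explained = np.cumsum(s**2) / np.sum(s**2)
    r = int(np.searchsorted(explained, variance_fraction)) + 1
    r = min(r, len(s))
    return (U[:, :r] * s[:r]) @ Vt[:r, :], r

# Traditional noise reduction: correlate genes in the reconstructed data.
X90, r90 = reconstruct(U, s, Vt, 0.90)
gene_corr_traditional = np.corrcoef(X90)

# Signal balancing: correlate genes using rows of U directly, so every
# singular direction contributes without being weighted by its singular value.
gene_corr_balanced = np.corrcoef(U)
```

Because the rows of U are not scaled by the singular values, the dominant patterns no longer outweigh the minor ones in the gene-gene correlations, which is the dampening and magnifying effect described above.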
The following figure compares our use of SVD for signal balancing with retaining 50% and 90% of data variance and reconstructing the original data matrix. The type of analysis is the same as described in the main text.
Positive vs. negative correlation performance (referenced in section 2.3.2)
In the main text we discuss our choice to limit the influence of negative correlations on our method by disregarding z-scores less than 1 standard deviation away from the mean. We have found that negative correlations tend not to be functionally informative in many cases. As an example of this effect we have examined the precision-recall plot of positive correlations across all microarray data and negative correlations across all data. The following graph was created using the GRIFn system (Myers et al., 2006). Several reference datasets are included for comparison:
An interactive SVG of this analysis is also available (requires the Adobe SVG plugin and a compatible browser to view).
Discussion of precision-recall curves and average precision (referenced in section 2.4)
Precision-recall curves were created by traversing the ordered list of results for each method for each GO term examined and calculating (precision, recall) pairs at each step. Precision is calculated as the ratio of true positive (TP) predictions to the sum of TP and false positive (FP) predictions. Recall is measured as the number of TPs recovered for individual GO terms, or as the proportion of TPs to the total number of possible TPs (TP + FN [false negatives]) for results averaged over multiple GO terms.
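The traversal described above can be sketched as follows. The function name and the gene identifiers are hypothetical, and recall is reported here in its proportional form, TP / (TP + FN).

```python
def precision_recall_points(ranked_genes, term_genes):
    """Walk the ranked list and record (precision, recall) after each result."""
    term_genes = set(term_genes)
    points = []
    tp = 0
    for n_seen, gene in enumerate(ranked_genes, start=1):
        if gene in term_genes:
            tp += 1
        fp = n_seen - tp            # results so far that are not annotated to the term
        fn = len(term_genes) - tp   # annotated genes not yet recovered
        precision = tp / (tp + fp)
        recall = tp / (tp + fn) if term_genes else 0.0
        points.append((precision, recall))
    return points

# Hypothetical ranked result list and GO term annotation.
points = precision_recall_points(["g1", "g2", "g3", "g4"], {"g1", "g3"})
```

Plotting the recorded pairs in order gives the precision-recall curve for that GO term.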
Average precision was used as a summary statistic for comparing the performance of different methods in a more straightforward way. This measure is commonly used in search domains, such as document retrieval. Average precision is calculated for a GO term as:
$$AP = \frac{1}{k} \sum_{i=1}^{k} \frac{i}{\mathrm{rank}_i}$$
where the GO term contains k genes, and rank_i is the rank of the i-th gene annotated to the term in the ordered list of results. Note that if all genes annotated to a GO term appear as the first k genes in the ordered result list, the average precision will be 1. Also note that this measure is a quantized version of the area under the precision-recall curve.
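A minimal sketch of this summary statistic is given below. It assumes the standard average-precision definition, (1/k) times the sum over annotated genes of i / rank_i, which is consistent with the properties noted above; the function name and gene identifiers are hypothetical, and annotated genes that never appear in the result list are assumed to contribute zero.

```python
def average_precision(ranked_genes, term_genes):
    """Average precision of a ranked list with respect to one GO term."""
    term_genes = set(term_genes)
    k = len(term_genes)
    hits = 0      # i: how many annotated genes have been seen so far
    total = 0.0
    for rank, gene in enumerate(ranked_genes, start=1):
        if gene in term_genes:
            hits += 1
            total += hits / rank  # i / rank_i
    return total / k if k else 0.0

# All annotated genes ranked first gives a perfect score.
perfect = average_precision(["a", "b", "c", "d"], {"a", "b"})
# Annotated genes at ranks 2 and 4: (1/2 + 2/4) / 2
mixed = average_precision(["x", "a", "y", "b"], {"a", "b"})
```

The first call illustrates the perfect-ranking case mentioned above, and the second shows how lower placements reduce the score.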
Performance Results of SPELL on 126 Diverse GO Terms (referenced in sections 2.4, 4.3, and Figure 3 caption)
In addition to the summary comparison available in Figure 3 of the main text, here are individual results for all 126 GO terms analyzed.
ARP8 predictions and verification (referenced in section 4.4.1 and Figure 6 caption)
SPELL predicts that the un-annotated gene ARP8 is involved in the following 13 biological processes, which break down into 3 main classes:
Supplemental Table 2
|Predicted GO term for Arp8||Class|
|mitotic cell cycle||Cell Cycle|
|regulation of progression through cell cycle||Cell Cycle|
|cell division||Cell Cycle|
|asexual reproduction||Cell Cycle|
|transcription from RNA polymerase II promoter||Transcription|
|negative regulation of transcription||Transcription|
|positive regulation of transcription||Transcription|
|cytoskeleton organization and biogenesis||Morphology|
|response to osmotic stress||Other|
Cell volume was determined using the Z2 automated cell counter (Beckman Coulter, Fullerton, California, United States). Culture was diluted into Isoton II buffer for the measurement. Cell morphology was determined using a 40x objective on a Zeiss Axioskop (Germany). The entire field of view is shown for both wild-type yeast and the arp8 deletion, allowing for direct comparison of the images in Figure 6.