About the website

Dataset collection, normalization, and complete listing (referenced in section 2.1)

Microarray data was collected from a variety of sources to create our compendium, including NCBI's Gene Expression Omnibus, EBI's ArrayExpress, the Stanford Microarray Database, and several other publication and laboratory web pages. These data, from 81 publications and totaling 2394 array hybridizations, were broken down into their smallest logical groupings of conditions. For example, the stress response dataset from Gasch et al. (2000) originally consisted of 142 hybridizations corresponding to several different types of induced stress and growth phases. We have separated this dataset in a manner similar to the authors' analyses, resulting in 21 logical datasets such as "hydrogen peroxide exposure," "osmotic shock," and "heat shock from 25° to 37°."

In order to make valid comparisons between the datasets collected, all data was normalized in a similar manner. First, suspect values were removed (i.e., replaced with missing values) in all data, based on the information available in the original publication where possible, or otherwise in a manner appropriate to the microarray platform used. After identifying missing values, any genes present in fewer than 50% of the conditions in a dataset were removed from that dataset. Remaining missing values were imputed using the KNN impute algorithm with K=10, using Euclidean distance to identify nearest neighbors. After the imputation process, technical replicates were averaged together, resulting in data files of complete matrices with one entry per gene appearing in the dataset.
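The filtering, imputation, and averaging steps can be sketched in Python with NumPy. This is an illustrative sketch only, not the code used to build the compendium: the function names are hypothetical, missing values are assumed to be encoded as NaN, distances are assumed to be root-mean-square differences over the conditions two genes share, and technical replicates are assumed to be rows that share a gene identifier.

```python
import numpy as np

def knn_impute(data, k=10):
    """Fill each NaN with the mean of that condition over the k nearest genes."""
    filled = data.copy()
    for i in np.where(np.isnan(data).any(axis=1))[0]:
        missing = np.isnan(data[i])
        neighbors = []
        for j in range(data.shape[0]):
            if j == i:
                continue
            # Assumption: compare only conditions measured in both genes.
            shared = ~missing & ~np.isnan(data[j])
            if shared.any():
                diff = data[i, shared] - data[j, shared]
                neighbors.append((float(np.sqrt(np.mean(diff ** 2))), j))
        neighbors.sort()  # closest genes first
        for col in np.where(missing)[0]:
            values = [data[j, col] for _, j in neighbors
                      if not np.isnan(data[j, col])][:k]
            if values:
                filled[i, col] = np.mean(values)
    return filled

def normalize(data, gene_ids, min_present=0.5, k=10):
    """Filter sparse genes, impute remaining gaps, then average replicate rows."""
    data = np.asarray(data, dtype=float)
    ids = np.asarray(gene_ids)
    # Drop genes measured in fewer than min_present of the conditions.
    keep = np.mean(~np.isnan(data), axis=1) >= min_present
    data, ids = knn_impute(data[keep], k), ids[keep]
    # One row per gene: average rows that share an identifier.
    genes = sorted(set(ids.tolist()))
    averaged = np.vstack([data[ids == g].mean(axis=0) for g in genes])
    return genes, averaged
```

Calling `normalize(matrix, ids)` returns the sorted gene identifiers and a complete matrix with one row per gene. Imputing before averaging follows the order given in the text.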

Most of the data collected falls into two main categories: dual-color competitive hybridization data and single-channel data. Dual-color data was typically found in log ratio format or was transformed into this format. Single-channel data was typically from Affymetrix platforms and was log transformed as a final step in normalization. Other types of data were transformed into a format as close as possible to these sources.

The full list of publications and datasets collected for our functional analysis and search engine is available as a tab-delimited text file and in HTML:

Examples of the Fisher z-transform, standard normalization process (referenced in section 2.1)

The distribution of Pearson correlations between all pairs of genes within each dataset varies greatly depending on the number of conditions in each dataset, the process targeted, and the array platform used. In order to ensure comparable measures of correlation from one dataset to the next, we employ the Fisher z-transformation as described in the main text. This produces distributions of pair-wise correlations that are approximately normal for all datasets.
[Figure: Histograms of correlations between all gene pairs. Panels: DeRisi et al., 1997; Primig et al., 2000.]
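As a rough illustration of the transform, the Python sketch below applies the Fisher z-transformation, z = atanh(r), to a list of Pearson correlations and then standardizes the result to zero mean and unit variance. The standardization step is an assumption about what "standard normalization" in the heading refers to; the function names and the clipping constant `eps` are hypothetical.

```python
import math
import statistics

def fisher_z(r, eps=1e-7):
    """Fisher z-transform: 0.5 * ln((1 + r) / (1 - r)), i.e. atanh(r)."""
    # Clip so that r = +/-1 does not produce an infinite value.
    r = max(-1 + eps, min(1 - eps, r))
    return math.atanh(r)

def standardize(values):
    """Shift and scale values to mean 0 and standard deviation 1."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]

correlations = [-0.9, -0.2, 0.1, 0.4, 0.8]
z_scores = standardize([fisher_z(r) for r in correlations])
```

After this step, scores from datasets with very different correlation distributions are expressed on the same scale.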

Functional Coverage Analysis (referenced in sections 2.2 and 4.1, and Figure 1 caption)

The full table of functional coverage consists of a matrix containing pseudo p-values (based on the z-test for significance) for each combination of dataset and GO biological process examined. These files are available as a tab-delimited text file of p-values, a (very large) image, and a hierarchically clustered version compatible with JavaTreeView for browsing. Note that a p-value of ~10^-10 corresponds to the Bonferroni-corrected p-value of 10^-4, which was used for significance testing in Figure 1.

In addition to the z-test results, we have also calculated significance based on the non-parametric two-sample Kolmogorov-Smirnov (KS) test. The KS-test finds generally the same GO term/dataset pairings to be significant, along with several additional pairs. This is because the KS-test can judge two distributions to be significantly different when their shapes differ sufficiently, even if their means are very similar. As the z-test is based on differences in means, it would not consider such distributions to be significantly different. The full results of the KS-test are available here in a tab-delimited text file:
KS results
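The difference between the two tests can be illustrated with a small Python sketch using only the standard library. The two samples below are constructed, not taken from the compendium: they share a mean of zero but differ in spread. The function names are hypothetical, and the KS p-value uses the usual asymptotic series, which is only an approximation for finite samples.

```python
import math

def z_test_pvalue(a, b):
    """Two-sided p-value for a difference in means (normal approximation)."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    var_a = sum((x - mean_a) ** 2 for x in a) / (len(a) - 1)
    var_b = sum((x - mean_b) ** 2 for x in b) / (len(b) - 1)
    z = (mean_a - mean_b) / math.sqrt(var_a / len(a) + var_b / len(b))
    return math.erfc(abs(z) / math.sqrt(2))

def ks_statistic(a, b):
    """Largest gap between the two empirical cumulative distributions."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d

def ks_pvalue(d, n1, n2, terms=100):
    """Asymptotic two-sample KS p-value (approximation)."""
    lam = math.sqrt(n1 * n2 / (n1 + n2)) * d
    if lam < 0.3:  # the series converges poorly here; the p-value is ~1
        return 1.0
    total = sum((-1) ** (k - 1) * math.exp(-2 * k * k * lam * lam)
                for k in range(1, terms + 1))
    return max(0.0, min(1.0, 2 * total))

n = 200
narrow = [-1 + 2 * i / (n - 1) for i in range(n)]  # spread over [-1, 1]
wide = [-3 + 6 * i / (n - 1) for i in range(n)]    # spread over [-3, 3]

p_z = z_test_pvalue(narrow, wide)
d = ks_statistic(narrow, wide)
p_ks = ks_pvalue(d, n, n)
```

Here the z-test sees no difference because the means coincide, while the KS statistic is large because the cumulative distributions diverge away from the center.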

Supplemental Table 1
Our analysis of the functional coverage of existing gene expression microarray data for S. cerevisiae characterizes both which biological processes are represented in each dataset and which biological processes are represented in existing data as a whole. This table shows a selection of processes that are significant in many datasets (left column), significant in some, but not many, datasets (center column), and significant in very few datasets (right column). For the processes under-represented in the compendium, we identify three major explanations: (A) non-transcriptionally regulated processes, (B) processes not occurring in many common laboratory strains, and (C) specific processes not yet targeted by existing gene expression microarray data. Biological processes in the last category may be areas that warrant further investigation.

Highly-represented (significant in >15 datasets) | Moderately-represented (significant in <15 but >3 datasets) | Under-represented (significant in <3 datasets)
tricarboxylic acid cycle | response to oxidative stress | MAPKKK cascade (A)
DNA repair | amino acid transport | protein kinase cascade (A)
glycolysis | exocytosis | Ras protein signal transduction (A)
phosphate metabolism | vesicle fusion | mating type switching (B)
chromosome segregation | meiotic recombination | invasive growth (B)
DNA replication | arginine biosynthesis | pseudohyphal growth (B)
electron transport | steroid metabolism | response to salt stress (C)
ubiquitin-dependent protein catabolism | alcohol metabolism | heme biosynthesis (C)
ribosome assembly | double-strand break repair | mitochondrial genome maintenance (C)
amino acid metabolism | filamentous growth | telomerase-dependent telomere maintenance (C)

Benefits of SVD-based signal balancing (referenced in section 2.3.1)

We have found quantitatively that our application of SVD to microarray data for the purpose of signal balancing performs much more accurately than the traditional use of SVD for noise reduction. In the traditional use of SVD, low singular values and their corresponding singular vectors are removed from the decomposed matrices (UΣV^T), and the matrices are then multiplied back together to reconstruct a version of the original data matrix (X). Often, enough singular values are retained to account for some chosen percentage of the variation in the original data. However, in our analysis we find that performance generally degrades when using this traditional application of SVD. Instead, by calculating correlations within the left singular vectors (U), we perform our analysis in a space where the more dominant patterns are dampened and the less dominant patterns are magnified. Note that this process is related to some applications of SVD to microarray data, such as the work by Alter et al., which found that dominant eigengenes are sometimes highly correlated with noise.
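The two uses of SVD can be contrasted in a short NumPy sketch. It is illustrative only: the function names are hypothetical, genes are assumed to be rows of X, and dropping components with near-zero singular values before correlating is an assumption added here for numerical stability.

```python
import numpy as np

def svd_signal_balance(x, tol=1e-10):
    """Gene-gene Pearson correlations computed on rows of U rather than X."""
    u, s, vt = np.linalg.svd(x, full_matrices=False)
    # Assumption: ignore components whose singular value is numerically zero.
    u = u[:, s > tol * s[0]]
    # U carries no singular-value weighting, so dominant and minor
    # patterns contribute on an equal footing.
    return np.corrcoef(u)

def svd_denoise(x, variance_kept=0.9):
    """Traditional use: keep the top components and rebuild X."""
    u, s, vt = np.linalg.svd(x, full_matrices=False)
    explained = np.cumsum(s ** 2) / np.sum(s ** 2)
    # Smallest number of components reaching the requested variance.
    r = int(np.searchsorted(explained, variance_kept)) + 1
    return (u[:, :r] * s[:r]) @ vt[:r]
```

`np.corrcoef(svd_denoise(x, 0.5))` and `np.corrcoef(svd_denoise(x, 0.9))` would give correlations on the reconstructed matrices retaining 50% and 90% of the variance, for comparison with `svd_signal_balance(x)`.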

The following figure compares our use of SVD for signal balancing with retaining 50% and 90% of data variance and reconstructing the original data matrix. The type of analysis is the same as described in the main text.

Positive vs. negative correlation performance (referenced in section 2.3.2)

In the main text we discuss our choice to limit the influence of negative correlations on our method by disregarding z-scores less than 1 standard deviation away from the mean. We have found that negative correlations tend not to be functionally informative in many cases. As an example of this effect, we have examined the precision-recall plot of positive correlations across all microarray data and negative correlations across all data. The following graph was created using the GRIFn system (Myers et al., 2006). Several reference datasets are included for comparison:

An interactive SVG of this analysis is also available (requires the Adobe SVG plugin and a compatible browser to view).

Discussion of precision-recall curves and average precision (referenced in section 2.4)

Precision-recall curves were created by traversing the ordered list of results for each method and each GO term examined, calculating a (precision, recall) pair at each step. Precision is calculated as the ratio of true positive (TP) predictions to the sum of TP and false positive (FP) predictions. Recall is measured as the number of TPs recovered for individual GO terms, or as the proportion of TPs to the total number of possible TPs (TP + FN [false negatives]) for results averaged over multiple GO terms.
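The traversal can be sketched in a few lines of Python. The function and gene names are hypothetical, and recall is reported as a proportion, which assumes every annotated gene appears somewhere in the ranked list.

```python
def precision_recall_points(ranked_genes, annotated):
    """Return a (precision, recall) pair for each position in the ranking."""
    annotated = set(annotated)
    true_positives = 0
    points = []
    for n, gene in enumerate(ranked_genes, start=1):
        if gene in annotated:
            true_positives += 1
        precision = true_positives / n            # TP / (TP + FP)
        recall = true_positives / len(annotated)  # TP / (TP + FN)
        points.append((precision, recall))
    return points

points = precision_recall_points(["g1", "x", "g2", "y"], {"g1", "g2"})
```

For a single GO term, multiplying recall by the number of annotated genes recovers the raw count of TPs described above.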

Average precision was used as a summary statistic to compare the performance of different methods more directly. This measure is commonly used in search domains, such as document retrieval. Average precision is calculated for a GO term as:

$$ AP = \frac{1}{k} \sum_{i=1}^{k} \frac{i}{\mathrm{rank}_i} $$

where the GO term contains k genes, and rank_i is the rank placement of the i-th gene annotated to the term in the ordered list of results. Note that if all genes annotated to a GO term appear as the first k genes in the ordered result list, the average precision will be 1. Also note that this measure is a quantized version of the area under the precision-recall curve.
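A minimal Python sketch of this statistic follows, assuming the standard document-retrieval definition: the mean, over the k annotated genes, of i divided by the rank of the i-th annotated gene. The function name is hypothetical, and annotated genes missing from the ranked list are assumed to contribute zero.

```python
def average_precision(ranked_genes, annotated):
    """Mean of i / rank_i over the k genes annotated to a GO term."""
    annotated = set(annotated)
    # Ranks (1-based) at which annotated genes appear, in increasing order.
    ranks = [rank for rank, gene in enumerate(ranked_genes, start=1)
             if gene in annotated]
    k = len(annotated)
    return sum(i / rank for i, rank in enumerate(ranks, start=1)) / k

perfect = average_precision(["g1", "g2", "x"], {"g1", "g2"})
mixed = average_precision(["x", "g1", "y", "g2"], {"g1", "g2"})
```

In the second call the annotated genes sit at ranks 2 and 4, giving (1/2 + 2/4) / 2.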

Performance Results of SPELL on 126 Diverse GO Terms (referenced in sections 2.4, 4.3, and Figure 3 caption)

In addition to the summary comparison available in Figure 3 of the main text, here are individual results for all 126 GO terms analyzed.

GO ID | Term Name
GO:0000074 | regulation of progression through cell cycle
GO:0000160 | two-component signal transduction system (phosphor
GO:0000278 | mitotic cell cycle
GO:0000279 | M phase
GO:0000902 | cellular morphogenesis
GO:0001510 | RNA methylation
GO:0005975 | carbohydrate metabolism
GO:0006056 | mannoprotein metabolism
GO:0006066 | alcohol metabolism
GO:0006081 | aldehyde metabolism
GO:0006082 | organic acid metabolism
GO:0006112 | energy reserve metabolism
GO:0006118 | electron transport
GO:0006260 | DNA replication
GO:0006308 | DNA catabolism
GO:0006310 | DNA recombination
GO:0006323 | DNA packaging
GO:0006352 | transcription initiation
GO:0006353 | transcription termination
GO:0006354 | RNA elongation
GO:0006360 | transcription from RNA polymerase I promoter
GO:0006366 | transcription from RNA polymerase II promoter
GO:0006383 | transcription from RNA polymerase III promoter
GO:0006399 | tRNA metabolism
GO:0006401 | RNA catabolism
GO:0006417 | regulation of protein biosynthesis
GO:0006457 | protein folding
GO:0006461 | protein complex assembly
GO:0006473 | protein amino acid acetylation
GO:0006476 | protein amino acid deacetylation
GO:0006512 | ubiquitin cycle
GO:0006519 | amino acid and derivative metabolism
GO:0006629 | lipid metabolism
GO:0006725 | aromatic compound metabolism
GO:0006730 | one-carbon compound metabolism
GO:0006766 | vitamin metabolism
GO:0006790 | sulfur metabolism
GO:0006793 | phosphorus metabolism
GO:0006800 | oxygen and reactive oxygen species metabolism
GO:0006807 | nitrogen compound metabolism
GO:0006811 | ion transport
GO:0006818 | hydrogen transport
GO:0006839 | mitochondrial transport
GO:0006869 | lipid transport
GO:0006913 | nucleocytoplasmic transport
GO:0006944 | membrane fusion
GO:0006970 | response to osmotic stress
GO:0006974 | response to DNA damage stimulus
GO:0006986 | response to unfolded protein
GO:0006997 | nuclear organization and biogenesis
GO:0007005 | mitochondrion organization and biogenesis
GO:0007010 | cytoskeleton organization and biogenesis
GO:0007031 | peroxisome organization and biogenesis
GO:0007033 | vacuole organization and biogenesis
GO:0007034 | vacuolar transport
GO:0007046 | ribosome biogenesis
GO:0007047 | cell wall organization and biogenesis
GO:0007059 | chromosome segregation
GO:0007155 | cell adhesion
GO:0007166 | cell surface receptor linked signal transduction
GO:0007243 | protein kinase cascade
GO:0007264 | small GTPase mediated signal transduction
GO:0007530 | sex determination
GO:0008213 | protein amino acid alkylation
GO:0008219 | cell death
GO:0008298 | intracellular mRNA localization
GO:0008380 | RNA splicing
GO:0008643 | carbohydrate transport
GO:0009100 | glycoprotein metabolism
GO:0009116 | nucleoside metabolism
GO:0009117 | nucleotide metabolism
GO:0009266 | response to temperature stimulus
GO:0009308 | amine metabolism
GO:0009415 | response to water
GO:0010035 | response to inorganic substance
GO:0015837 | amine transport
GO:0015849 | organic acid transport
GO:0015893 | drug transport
GO:0015931 | nucleobase, nucleoside, nucleotide and nucleic aci
GO:0016071 | mRNA metabolism
GO:0016072 | rRNA metabolism
GO:0016192 | vesicle-mediated transport
GO:0016458 | gene silencing
GO:0016481 | negative regulation of transcription
GO:0016485 | protein processing
GO:0018193 | peptidyl-amino acid modification
GO:0019236 | response to pheromone
GO:0019748 | secondary metabolism
GO:0019932 | second-messenger-mediated signaling
GO:0019953 | sexual reproduction
GO:0019954 | asexual reproduction
GO:0030261 | chromosome condensation
GO:0030447 | filamentous growth
GO:0030705 | cytoskeleton-dependent intracellular transport
GO:0031023 | microtubule organizing center organization and bio
GO:0031123 | RNA 3'-end processing
GO:0040029 | regulation of gene expression, epigenetic
GO:0042157 | lipoprotein metabolism
GO:0042594 | response to starvation
GO:0043094 | metabolic compound salvage
GO:0043284 | biopolymer biosynthesis
GO:0045184 | establishment of protein localization
GO:0045185 | maintenance of protein localization
GO:0045333 | cellular respiration
GO:0045454 | cell redox homeostasis
GO:0045941 | positive regulation of transcription
GO:0046483 | heterocycle metabolism
GO:0048284 | organelle fusion
GO:0048308 | organelle inheritance
GO:0050790 | regulation of enzyme activity
GO:0050801 | ion homeostasis
GO:0051052 | regulation of DNA metabolism
GO:0051169 | nuclear transport
GO:0051186 | cofactor metabolism
GO:0051236 | establishment of RNA localization
GO:0051301 | cell division
GO:0051321 | meiotic cell cycle

ARP8 predictions and verification (referenced in section 4.4.1 and Figure 6 caption)

SPELL predicts that the unannotated gene ARP8 is involved in the following 13 biological processes, which break down into 3 main classes:

Supplemental Table 2
Predicted GO term for Arp8 | Class
mitotic cell cycle | Cell Cycle
interphase | Cell Cycle
regulation of progression through cell cycle | Cell Cycle
cell division | Cell Cycle
asexual reproduction | Cell Cycle
transcription from RNA polymerase II promoter | Transcription
negative regulation of transcription | Transcription
positive regulation of transcription | Transcription
transcription initiation | Transcription
mRNA metabolism | Transcription
cellular morphogenesis | Morphology
cytoskeleton organization and biogenesis | Morphology
response to osmotic stress | Other

Cell volume was determined using the Z2 automated cell counter (Beckman Coulter, Fullerton, California, United States). Culture was diluted into Isoton II buffer for the measurement. Cell morphology was determined using a 40x objective on a Zeiss Axioskop (Germany). The entire field of view is shown for both wild-type yeast and the arp8 deletion, allowing for direct comparison of the images in Figure 6.