Functional Discovery via a Compendium of Expression Profiles



Hughes, T., M. Marton, A. Jones, C. J. Roberts, R. Stoughton, C. Armour, H. Bennett, E. Coffey, H. Dai, Y. He, M. J. Kidd, A. King, M. Meyer, David L. Slade, P. Lum, Sergey B. Stepaniants, D. D. Shoemaker, D. Gachotte, K. Chakraburtty, Julian Simon, M. Bard and S. Friend. “Functional Discovery via a Compendium of Expression Profiles.” Cell 102 (2000): 109-126. [Paper Link]


In what seems like an obvious statement in 2021, this 2000 paper demonstrates that it is possible to infer the function of a gene product from its co-expression profile.

The idea is simple:

  1. Build a database of reference expression profiles that indicate function
  2. Observe the expression profile of a strain with "an uncharacterized mutation"
  3. Refer to the expression profile database to ascertain the uncharacterized mutation's function
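The three steps above can be sketched as a simple lookup. This is a minimal illustration, not the authors' pipeline: it assumes profiles are lists of log expression ratios over the same genes, and it uses Pearson correlation as the similarity measure. The names `pearson`, `rank_reference_profiles`, `ref_A`/`ref_B`/`ref_C` and all the numbers are made up for the example.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def rank_reference_profiles(query, compendium):
    """Return (name, correlation) pairs, most similar reference first.

    query:      log expression ratios for the uncharacterised mutant
    compendium: dict mapping a reference name to its profile
    """
    scored = [(name, pearson(query, profile)) for name, profile in compendium.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Step 1: a toy "compendium" of three reference profiles over five genes.
compendium = {
    "ref_A": [1.0, -1.0, 0.5, -0.5, 0.0],
    "ref_B": [0.0, 0.3, 1.0, -1.0, -0.4],
    "ref_C": [-1.0, 1.0, -0.5, 0.5, 0.2],
}
# Step 2: the profile observed for the uncharacterised mutant.
query = [0.1, 0.2, 0.9, -1.1, -0.3]
# Step 3: look it up.
ranking = rank_reference_profiles(query, compendium)
```

Here `ranking[0]` is the reference whose profile looks most like the query, which is the candidate for "same function". In practice you would also want to ask whether that best match is significant at all, which is where the error model comes in.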

People who've run these types of analyses before understand that things are this simple and clear-cut only a small fraction of the time. It is very cool when it does indeed work.

The authors purport in their introduction to demonstrate their method by collecting a whopping 300 full-genome expression profiles of S. cerevisiae using cDNA microarrays. The expression profiles correspond to mutations in characterised genes, mutations in uncharacterised open reading frames, and treatments with drugs with known targets. The article demonstrates that these expression profiles predict phenotypes which can then be experimentally verified.

Experimental Set-Up

Wow, these folks did a lot of work. They generated

  • 276 deletion mutants
  • 11 tetracycline-regulatable alleles of essential genes
  • 13 compounds with known molecular targets / well-characterised effects

They also doubled up 151 of the mutants to prove things were reproducible. Mutants were chosen so as to reflect various functions. The top 4 classes were

  1. "Cellular Organization" (136)
  2. "Unclassified Proteins" (69)
  3. "Cell Growth, Division, DNA Synthesis" (67)
  4. "Metabolism" (57)

Error Modelling

Without any preprocessing, "nearly all" of the mutations resulted in at least a 2x change in abundance for one or more transcripts. This would be strange if we were expecting a noiseless system considering (1) they were all grown in a single condition and (2) the mutations were random deletions. It so happens, however, that in each of the 63 negative control experiments there was also at least one gene with a 2x change in abundance or greater.

The authors identified, from the 63 negative controls, a set of particularly "noisy" genes, many of which are tied to nutrition or stress. This sort of noise underscores for me some of the limitations of expression profiling.

Using an error model of their own devising, however, they're able to dismiss all the changes in their negative controls as not statistically significant.

What is this error model? Well, it's detailed in the Supplementary Data archive in the file ErrModlv2.html.

While detailed, if I'm totally honest, it'd take me quite some time to understand the minutiae of this error model, but I think the most relevant high-level bit is:

Genes that exhibit a variance in the control experiments larger than the mean variance for their abundance are assigned proportionately larger standard errors. Genes that exhibit a variance smaller than the mean are conservatively assigned the mean variance.


The distribution of error magnitudes with respect to the model intensity-dependent variance was observed to be closely normal, so significance values ultimately are derived using the Gaussian error function of the ratio of observed expression ratio to model error.

The high-level summary is that the model accounts for both measurement error (e.g. instrument error / variation in reading the fluorescence intensity) and biological variation (e.g. sensitive genes that respond to things like nutrition, or genes which cycle/turn over very quickly).
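To make the two quoted ideas concrete, here is a rough sketch of how I read them. It is not the authors' error model: it ignores the intensity dependence entirely and uses a single global mean variance as the floor, and it treats "proportionately larger standard errors" as simply using the gene's own control variance. The function names `floored_sigmas` and `two_sided_p` and the toy numbers are mine.

```python
import math
from statistics import mean, variance

def floored_sigmas(controls):
    """Per-gene standard errors estimated from control experiments.

    controls: list of control experiments, each a list of per-gene log ratios.
    Genes noisier than average keep their own (larger) variance; quieter
    genes are conservatively raised to the mean variance.
    """
    per_gene = list(zip(*controls))                   # transpose: one tuple per gene
    variances = [variance(values) for values in per_gene]  # sample variance
    floor = mean(variances)                           # simplification: one global floor
    return [math.sqrt(max(v, floor)) for v in variances]

def two_sided_p(log_ratio, sigma):
    """Two-sided Gaussian p-value for an observed log ratio given its model error."""
    z = abs(log_ratio) / sigma
    return math.erfc(z / math.sqrt(2.0))

# Toy controls: gene 0 never moves, gene 1 is "noisy".
controls = [[0.0, -1.0], [0.0, 0.0], [0.0, 1.0]]
sigmas = floored_sigmas(controls)
p_noisy_gene = two_sided_p(1.0, sigmas[1])   # a change of 1.0 on the noisy gene
```

The point of the floor is that a gene which happened to look quiet in a handful of controls does not get an unrealistically small error bar, while a known-noisy gene needs a much larger change before it counts as significant.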

Once the error model is accounted for, about 50% of mutations had more than 5 genes with significant expression changes.

Transcriptional Landmarks / Gene Profiles

Using 2D hierarchical clustering, you can identify similar profiles as well as co-regulated genes (clustering along one axis compares whole profiles with one another; clustering along the other compares genes with one another).
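A bare-bones sketch of two-way clustering, to show what "2D" means here: cluster the rows (experiments), cluster the columns (genes), then reorder the matrix by both. I'm assuming average linkage on a 1 - Pearson distance, which may not be what the authors used, and the helper names (`leaf_order`, `cluster_two_ways`) and data are invented. A real analysis would use a library implementation rather than this naive loop.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation; undefined (division by zero) for constant vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def leaf_order(rows):
    """Average-linkage agglomerative clustering; returns row indices in leaf order."""
    dist = [[1.0 - pearson(a, b) for b in rows] for a in rows]
    clusters = [[i] for i in range(len(rows))]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                pairs = len(clusters[i]) * len(clusters[j])
                d = sum(dist[a][b] for a in clusters[i] for b in clusters[j]) / pairs
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        # Concatenating keeps each merged cluster contiguous in the final order.
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0]

def cluster_two_ways(matrix):
    """matrix: list of experiments (rows), each a list of per-gene values."""
    row_order = leaf_order(matrix)
    col_order = leaf_order([list(col) for col in zip(*matrix)])
    reordered = [[matrix[r][c] for c in col_order] for r in row_order]
    return row_order, col_order, reordered

# Toy data: experiments 0 and 2 have nearly the same profile.
matrix = [
    [1.0, -1.0, 0.5, -0.5, 0.0],
    [-1.0, 1.0, -0.5, 0.5, 0.2],
    [1.1, -0.9, 0.4, -0.6, 0.1],
    [0.0, 0.3, 1.0, -1.0, -0.4],
]
row_order, col_order, reordered = cluster_two_ways(matrix)
```

After reordering, experiments 0 and 2 sit next to each other in `row_order`, which is the toy version of two deletions in the same complex landing side by side in the clustered heat map.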

An example of how informative gene profiles usually are: deletions in CUP5 and VMA8, both of which belong to the same H+-ATPase complex, have pretty much the same gene profiles.

Identification of Cellular Function

Since gene profiles are so tied to their functional profiles, one can use the gene profiles from deletions in genes of known function to infer the function of ORFs with unknown function.

They go on to identify several interesting respiratory genes using this method that I’ll gloss over here as I’m far more interested in the methodology than yeast molecular biology.


The authors cite the generation of the "transcript compendium" and all the associated mutants as being a challenge, especially since they only tested yeast under one condition.
