Automated Assessment of High Throughput Hypotheses on Gene Regulatory Mechanisms Involved in the Gastrin Response

Introduction

Systems Biology is an integrated approach to building and simulating biological models from a variety of data sources in order to generate and validate new research hypotheses. In its most basic form, a hypothesis proposes a biological relationship between two biological components, such as a protein and a gene. Such binary relation hypotheses can be validated experimentally and serve as building blocks for assembly into larger models. Today’s high throughput experiments produce vast amounts of data that allow large numbers of binary relationships to be asserted. We have devised a general approach to convert experimental data semi-automatically into individual research hypotheses. The formalised hypotheses in our example specify possible interactions between two components that are part of a general transcriptional regulatory network.

Authors

Sushil Tripathi (1), Aravind Venkatesan (1), Mikel Egaña Aranguren (2), Zahra Zavareh (1), Konika Chawla (1), Vladimir Mironov (1), Liv Thommesen (1), Torunn Bruland (1), Martin Kuiper (1), and Astrid Lægreid (1)

(1) Norwegian University of Science and Technology, NTNU, Trondheim, Norway; (2) Universidad Politécnica de Madrid, Spain

Biological system

Our biological system is an in vitro cell culture treated with gastrin, a stomach peptide hormone with pleiotropic effects. Gene expression time profiles were recorded with transcriptome microarrays at 11 time points covering a 14 h gastrin response. We found ~3000 genes with significantly changed mRNA levels (Bruland et al., 2011) and named these ‘target genes’ (TG).

Results

We built a partially automated pipeline that performs reasoning over all possible relationship hypotheses derived from our microarray results. These hypotheses concern binary interactions between putative protein regulators (with mRNA levels serving as their proxy) and target genes:

Protein X – (up/downregulates) – TG Y

Hypotheses are initially formulated by considering every protein encoded by an expressed gene in our model cell line (~10,000) as a candidate regulator of every other expressed gene, resulting in roughly 100 million potential binary interactions. A reasoning process then ‘upgrades’ or ‘downgrades’ individual hypotheses by assigning scores. Cumulative scores make it possible to rank hypotheses by supporting evidence and to prioritise them for subsequent experimental validation. Scores were derived from several sources: biological background information, including Gene Ontology and other available functional annotation of the putative regulators; gene expression dynamics (genes that do not respond to a stimulus are unlikely target candidates); and timeliness of expression (for a transcription factor (TF) whose activity is regulated through its own gene expression, new protein synthesis must be underway before it can affect the regulation of its target genes). An automated query pipeline searched for supporting information both in our own Datamart of experimental findings and public annotations of transcriptional regulation, and through federated queries against distributed SPARQL endpoints. Several examples of high-scoring TF-TG pairs illustrate the power of our approach.
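The enumerate-score-rank idea can be illustrated with a minimal Python sketch. It is not the pipeline used in this work: the gene names, the score weights, the onset times and all function names are hypothetical, and the three rules only mirror the evidence types named above (functional annotation, expression dynamics, timeliness). With n expressed genes, enumeration yields n × (n − 1) ordered pairs, which for n ≈ 10,000 gives the roughly 100 million hypotheses mentioned in the text.

```python
from dataclasses import dataclass, field
from itertools import permutations


@dataclass
class Hypothesis:
    regulator: str      # protein, with its mRNA level serving as proxy
    target: str         # candidate target gene (TG)
    score: float = 0.0  # cumulative score over all evidence rules
    evidence: list = field(default_factory=list)

    def add(self, label, weight):
        """Upgrade (weight > 0) or downgrade (weight < 0) the hypothesis."""
        self.score += weight
        self.evidence.append(label)


def enumerate_hypotheses(expressed_genes):
    """Pair every expressed gene with every other expressed gene."""
    return [Hypothesis(r, t) for r, t in permutations(expressed_genes, 2)]


def score(h, tf_annotated, responsive, onset_h):
    """Apply three illustrative rules; the weights are assumptions."""
    # Background annotation: regulator is annotated as a transcription factor.
    if h.regulator in tf_annotated:
        h.add("regulator annotated as TF", 2.0)
    # Expression dynamics: non-responding genes are unlikely targets.
    if h.target not in responsive:
        h.add("target does not respond", -3.0)
    # Timeliness: the regulator should change before its target does.
    r_on, t_on = onset_h.get(h.regulator), onset_h.get(h.target)
    if r_on is not None and t_on is not None:
        if r_on < t_on:
            h.add("regulator changes before target", 1.0)
        else:
            h.add("regulator does not precede target", -1.0)
    return h


def rank(expressed_genes, tf_annotated, responsive, onset_h):
    """Score all hypotheses and sort them by descending cumulative score."""
    scored = [score(h, tf_annotated, responsive, onset_h)
              for h in enumerate_hypotheses(expressed_genes)]
    return sorted(scored, key=lambda h: h.score, reverse=True)


# Toy input: three hypothetical genes, onset times in hours.
genes = ["geneA", "geneB", "geneC"]
ranked = rank(
    genes,
    tf_annotated={"geneA"},
    responsive={"geneA", "geneB"},
    onset_h={"geneA": 1.0, "geneB": 4.0},
)
top = ranked[0]
```

In this toy input the hypothesis "geneA regulates geneB" ranks first, because it collects positive evidence from both the annotation rule and the timeliness rule. Keeping the evidence labels next to the cumulative score lets a curator see why a hypothesis was upgraded or downgraded before committing to experimental validation.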

References

Bruland T, Flatberg A, Andersen E, Misund K, Fjeldbo CS, Thommesen L, Lægreid A. Exploring signal-induced cellular regulatory subnetworks by use of partial least square regression (PLSR) multidimensional analysis of gene expression time series data. Manuscript.