Pathogen Genome Cluster Computing


Genomics Pipelines

Flooded parallel pipelines form queues

A genomics pipeline is a sequential arrangement of bioinformatics programs that are "glued" together using a scripting language, usually Perl. The "glue" moves individual orthologous sequence files automatically through one or more stages of the pipeline, so that whole genomes can be swept for phylogenetic signals. Combined with a cluster, the approach becomes powerful: large groups of orthologues can be processed in parallel across the machine, giving high throughput. That throughput makes very rigorous phylogenomic analyses feasible.
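As a rough illustration of the "glue" idea, the sketch below passes each orthologue through a fixed sequence of stages and processes many orthologues concurrently. It is written in Python rather than the Perl used on this site, and everything in it is an assumption made for illustration: the stage functions `strip_gaps` and `to_upper` are hypothetical stand-ins for real bioinformatics programs, and a thread pool stands in for the cluster's job scheduling.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for external bioinformatics programs.
def strip_gaps(seq):
    return seq.replace("-", "")

def to_upper(seq):
    return seq.upper()

# The order of this list defines the order of the pipeline stages.
STAGES = [strip_gaps, to_upper]

def run_pipeline(seq, stages=STAGES):
    """The 'glue': pass one orthologue through every stage in order."""
    for stage in stages:
        seq = stage(seq)
    return seq

def run_all(orthologues, workers=4):
    """Process many orthologues concurrently; keys map to finished outputs."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(run_pipeline, orthologues.values())
        return dict(zip(orthologues, results))
```

Calling `run_all` with a dictionary of orthologue names and sequences returns a dictionary of processed sequences under the same names. On a real cluster the thread pool would be replaced by whatever mechanism distributes jobs across nodes.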

 

"People have become the tools of their tools."
Henry David Thoreau

An example of a Perl, BioPerl and UNIX pipeline for trypanosomatid genomics

"Database" image: artist's impression of T. cruzi infected human tissue. Credit: Castro Sila, Memorias

 

 

In the above pipeline, files are obtained from remote servers and initially processed in stages 1 and 2. Individual orthologous sequence files are then shunted sequentially through stages 3 to 6, with stages 5 and 6 repeated for each separate analysis. Finally, all individual outputs from stage 6 are processed and entered into a database at stage 7.
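The control flow of stages 3 to 7 might be sketched as follows, again in Python rather than Perl. This is only an outline under stated assumptions: the stage bodies are placeholders, the two analyses (`length` and `gc_count`) are hypothetical, and SQLite and the `results` table layout are illustrative choices, not the database actually used here.

```python
import sqlite3

def stage3_4(seq):
    # Placeholder for the per-file preparation stages (3 and 4).
    return seq.strip().upper()

# Hypothetical analyses; stages 5 and 6 are rerun once for each entry.
ANALYSES = {
    "length": len,
    "gc_count": lambda s: s.count("G") + s.count("C"),
}

def stage5_6(prepared, analysis):
    # Placeholder for stages 5 and 6: run one analysis, return its output.
    return ANALYSES[analysis](prepared)

def process(orthologues, db):
    outputs = []
    for name, seq in orthologues.items():
        prepared = stage3_4(seq)
        # Stages 5 and 6 are repeated for each separate analysis.
        for analysis in ANALYSES:
            outputs.append((name, analysis, stage5_6(prepared, analysis)))
    # Stage 7: enter all collected outputs into the database.
    db.execute(
        "CREATE TABLE IF NOT EXISTS results "
        "(orthologue TEXT, analysis TEXT, value REAL)"
    )
    db.executemany("INSERT INTO results VALUES (?, ?, ?)", outputs)
    db.commit()
```

For a quick trial, `process` can be called with a small dictionary of sequences and a connection from `sqlite3.connect(":memory:")`; each orthologue then contributes one row per analysis.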

A script refers to a core analytical or processing program written in Perl that is not "glue", i.e. it does more than simply shunt files around. Numerous functions were obtained from BioPerl modules. Several levels of sequence alignment verification are not shown. The genomes described above were recently published (Science 309, entire issue).

London School of Hygiene and Tropical Medicine, Keppel Street, London WC1E 7HT, UK | Tel: +44 (0) 20 7636 8636

Comments and enquiries Last updated 28th July, 2005 MWG.