Pathogen Genome Cluster Computing


Statistics and Computer Architecture

The cluster at LSHTM is simply a stack of 64-bit 2 GHz processors, each with 1 GB of RAM. It is designed around floating-point number crunching, which is light on RAM but processor intensive. Traditional PC architecture uses 32-bit processors, most likely the kind in the machine you are using to view this web site. A software inventory of the machine is available here.

64-bit processors are hugely advantageous for floating point-based calculations and are excellent for processor-intensive, single-threaded jobs such as maximum likelihood.

32-bit processors are fast at integer-based calculations, which benefits certain microsimulations to some extent.

When deciding on any processor, RAM and motherboard combination, rigorous benchmarking must be performed to ensure that cost versus throughput is optimised for the type of calculations of interest. The SPEC benchmarks are a good starting point.
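As a rough illustration of the principle (and nothing more; it is not a substitute for SPEC), the Python sketch below times a floating point-heavy loop against an integer-heavy loop, the kind of quick sanity check that can be run on candidate hardware before committing to a purchase. The loop sizes and repeat counts are arbitrary.

    # A quick, illustrative timing comparison (not a substitute for SPEC or
    # other rigorous benchmarks): a floating point-heavy loop versus an
    # integer-heavy loop, to get a feel for relative throughput on a given
    # processor.  Loop sizes and repeat counts are arbitrary.
    import timeit

    def float_work(n=100_000):
        # repeated floating-point multiply/add, as in likelihood calculations
        acc = 0.0
        x = 1.0000001
        for _ in range(n):
            acc += x * x
        return acc

    def int_work(n=100_000):
        # repeated integer add/modulo, closer to some microsimulation bookkeeping
        acc = 0
        for i in range(n):
            acc = (acc + i * 7) % 1_000_003
        return acc

    print("float loop:", timeit.timeit(float_work, number=50), "seconds")
    print("int loop:  ", timeit.timeit(int_work, number=50), "seconds")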

In summary, number crunching is not a one-size-fits-all system. This may seem confusing because supercomputer jargon is loaded with "teraflops", where one teraflop is one trillion calculations per second. Teraflops are in fact benchmarked as floating-point calculations, but even then performance still depends on the type of calculations under investigation.

Maximum likelihood

 

 

"The research rat of the future allows experimentation without manipulation of the real world. This is the cutting edge of modeling technology."
John Spencer

 

Microsimulation of a triatomine bug, the vector of Chagas disease. Host behaviour is modelled through the inverse of the gamma distribution, i.e. a (random) probability is mapped onto a variate.
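A minimal sketch of that sampling step, assuming SciPy is available; the shape and scale values below are placeholders rather than the parameters used in the actual model.

    # Inverse-transform sampling from a gamma distribution: a uniform random
    # probability u is mapped onto a gamma variate via the inverse CDF
    # (percent-point function).  Shape and scale are hypothetical values.
    import random
    from scipy.stats import gamma  # assumed dependency

    SHAPE, SCALE = 2.0, 1.5        # placeholder host-behaviour parameters

    def draw_variate(rng=random):
        u = rng.random()                         # uniform probability in [0, 1)
        return gamma.ppf(u, SHAPE, scale=SCALE)  # inverse CDF: probability -> variate

    print([draw_variate() for _ in range(5)])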

Maximum likelihood searches a probability landscape for the solution with the highest (maximum) likelihood under a predefined model. The approach is widespread in phylogenetics, where it is used to search among different tree structures, and this in turn is the bricks and mortar of phylogenomic analysis.

Phylogenetics seeks to model either DNA or amino acid point mutations, although we have applied it to modelling the presence or absence of genes from DNA-DNA microarray data. Phylogenetic nucleotide models centre on recovering information lost due to reversion mutations, particularly at "neutral" sites. Elaborations on this model incorporate skews in purine/pyrimidine mutation rates, base composition and site rate heterogeneity. Parameter-rich models are processor hungry in an approach that is already notoriously processor hungry.
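To make the computational cost concrete, here is a toy version of the likelihood calculation that sits at the bottom of these analyses: the Jukes-Cantor model (the simplest nucleotide model) applied to a pair of aligned sequences, with the maximum likelihood distance found by a coarse grid search. The sequences are invented; real analyses use the richer models described above and also search over whole tree topologies, which is where the processor time goes.

    # A toy version of the model-based likelihood calculation: the Jukes-Cantor
    # model for a pair of aligned sequences, with the maximum likelihood
    # distance found by a coarse grid search.  Richer models add transition/
    # transversion bias, unequal base frequencies and gamma-distributed site
    # rates, and real analyses search over whole tree topologies as well.
    import math

    seq_a = "ACGTACGTACGTACGTACGT"   # invented sequences
    seq_b = "ACGTACGAACGTTCGTACGA"

    def jc_log_likelihood(d, s1, s2):
        """Log-likelihood of a divergence of d substitutions/site under Jukes-Cantor."""
        p_same = 0.25 + 0.75 * math.exp(-4.0 * d / 3.0)   # site unchanged
        p_diff = 0.25 - 0.25 * math.exp(-4.0 * d / 3.0)   # site changed to one other base
        return sum(math.log(p_same if a == b else p_diff) for a, b in zip(s1, s2))

    # Coarse grid search for the distance with the highest likelihood.
    grid = [i / 1000.0 for i in range(1, 2000)]
    d_hat = max(grid, key=lambda d: jc_log_likelihood(d, seq_a, seq_b))
    print(f"maximum likelihood distance ~ {d_hat:.3f} substitutions per site")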

Combining maximum likelihood-based models across entire genomes results in highly computationally demanding calculations.

Bayesian approaches use the same model-based framework to reconstruct a phylogenetic tree but obtain the solution via simulation, notably Markov chain Monte Carlo (MCMC).
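A minimal Metropolis-Hastings sketch of that idea, estimating the same toy Jukes-Cantor distance from a summary of the alignment (the counts are invented and the prior is simply flat); packages such as MrBayes run the MCMC over tree topologies, branch lengths and model parameters simultaneously.

    # Minimal Metropolis-Hastings sketch: the same Jukes-Cantor distance, but
    # estimated by simulating from its posterior (flat prior on d > 0) rather
    # than by direct maximisation.  Alignment summary counts are invented.
    import math
    import random

    N_SITES, N_DIFFS = 20, 3          # toy alignment: 3 mismatches in 20 sites

    def log_posterior(d):
        """Flat prior on d > 0, so the posterior is proportional to the likelihood."""
        if d <= 0:
            return float("-inf")
        p_same = 0.25 + 0.75 * math.exp(-4.0 * d / 3.0)
        p_diff = 0.25 - 0.25 * math.exp(-4.0 * d / 3.0)
        return (N_SITES - N_DIFFS) * math.log(p_same) + N_DIFFS * math.log(p_diff)

    random.seed(1)
    d, samples = 0.1, []
    for step in range(20_000):
        d_new = d + random.gauss(0.0, 0.05)             # symmetric random-walk proposal
        log_ratio = log_posterior(d_new) - log_posterior(d)
        if random.random() < math.exp(min(0.0, log_ratio)):
            d = d_new                                   # accept the proposed distance
        if step >= 5_000:
            samples.append(d)                           # keep post-burn-in samples

    print(f"posterior mean distance ~ {sum(samples) / len(samples):.3f}")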

 

Stochastic Approaches

Stochastic simulations, or microsimulations, which model individual behaviours, are very RAM intensive, and RAM is often the limiting factor of this approach. When the population size of the microsimulation exceeds the RAM of the machine, the calculation freezes (from bitter experience).
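A crude back-of-envelope check of the kind that can save a frozen machine: estimate the memory footprint of one simulated host, multiply by the population size, and compare the total against the RAM available before launching. The field list, byte counts and safety factor below are hypothetical.

    # Back-of-envelope check that a microsimulation population will fit in RAM
    # before launching it.  The per-host field list, byte counts and safety
    # factor are hypothetical; adjust them to match the real agent record.
    BYTES_PER_HOST = (
        8        # age
        + 8      # infection status / time of infection
        + 8      # location index
        + 8      # next event time (e.g. drawn from the gamma model above)
    )

    def estimated_ram_gb(population_size, overhead_factor=2.0):
        """Crude estimate: raw record size times a safety factor for language overhead."""
        return population_size * BYTES_PER_HOST * overhead_factor / 1024**3

    for pop in (1_000_000, 10_000_000, 100_000_000):
        print(f"{pop:>11,d} hosts ~ {estimated_ram_gb(pop):6.2f} GB")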

A computer designed around simulation usually contains a smaller number of processors, each with a huge RAM capacity.