Bioinformatics
Bioinformatics
and computational biology involve the use of techniques including applied
mathematics, informatics, statistics, computer science, chemistry and
biochemistry to solve biological problems usually on the molecular level. Research
in computational biology often overlaps
with systems biology. Major research efforts in the field include sequence
alignment, gene finding, genome assembly, protein structure alignment, protein
structure prediction, prediction of gene expression and protein-protein
interactions, and the modeling of evolution.
The
terms bioinformatics and computational biology are often used interchangeably. However, bioinformatics
more properly refers to the creation and advancement of algorithms,
computational and statistical techniques, and theory to solve formal and
practical problems posed by or inspired from the management and analysis of
biological data. Computational biology, on the other hand, refers to
hypothesis-driven investigation of a specific biological problem using
computers, carried out with experimental and simulated data, with the primary goal of discovery and the
advancement of biological knowledge. A similar distinction is made by the National
Institutes of Health in their working definitions of Bioinformatics and
Computational Biology, where it is further emphasized that there is a tight coupling of developments and knowledge
between the more hypothesis-driven
research in computational biology and technique-driven research in
bioinformatics. Computational biology also includes lesser known but equally
important subdisciplines such as computational
biochemistry and computational biophysics.
A common
thread in projects in bioinformatics and computational biology is the use of
mathematical tools to extract useful information from data produced by
high-throughput biological techniques such as genome sequencing. A
representative problem in bioinformatics is the assembly of high-quality genome
sequences from fragmentary "shotgun" DNA sequencing. Other common
problems include the study of gene regulation using data from microarrays or
mass spectrometry.
Task 1:
Vocabulary – Match the words in the table below with their definitions
1 overlap | A a collection of DNA spots attached to a solid surface for the purpose of expression profiling
2 interchangeable | B main
3 primary | C partially covering the same area
4 coupling | D studies based on a supposed explanation of a scientific problem or phenomenon
5 hypothesis-driven research | E capable of being used in place of each other
6 microarray | F joining together
Major Research Areas
Since
the phage Φ-X174 was sequenced in
1977, the DNA sequences of hundreds of organisms have been decoded and stored
in databases. This data is analyzed to determine genes that code for proteins,
as well as regulatory sequences. A comparison of genes within a species or
between different species can show similarities between protein functions, or
relations between species (the use of molecular systematics
to construct phylogenetic
trees). With the growing amount of data, it long ago became impractical to
analyze DNA sequences manually. Today, computer programs are used to search the
genomes of thousands of organisms, containing billions of nucleotides. These
programs can compensate for mutations
(exchanged, deleted or inserted bases) in the DNA sequence, in order to
identify sequences that are related, but not identical. A variant of this sequence alignment is used in the sequencing
process itself. The so-called shotgun sequencing technique (which was used, for
example, by The Institute for Genomic Research to sequence the first bacterial
genome, Haemophilus influenzae) does not give a
sequential list of nucleotides, but instead the sequences of thousands of small
DNA fragments (each about 600-800 nucleotides long). The ends of these fragments overlap and, when aligned in
the right way, make up the complete genome. Shotgun sequencing yields sequence
data quickly, but the task of assembling the fragments can be quite complicated
for larger genomes. In the case of the Human Genome Project, it took several
months of CPU time (on a circa-2000 vintage DEC Alpha computer) to assemble the fragments. Shotgun
sequencing is the method of choice for virtually all genomes sequenced today,
and genome assembly algorithms are a critical area of bioinformatics research.
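The idea of aligning overlapping fragment ends can be illustrated with a minimal Python sketch. It is a toy greedy merger, not the method used by any real assembler: it assumes error-free fragments from a single strand, and the function names `overlap` and `greedy_assemble` and the parameter `min_len` are invented for this illustration.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that is also a prefix of `b`.

    Returns 0 if the best overlap is shorter than `min_len`.
    """
    max_len = min(len(a), len(b))
    for k in range(max_len, min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(fragments: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the two fragments with the largest overlap.

    Assumption: fragments are error-free and all come from the same strand.
    Returns the remaining contigs (ideally a single sequence).
    """
    # Remove duplicates and fragments fully contained in another fragment.
    frags = list(dict.fromkeys(fragments))
    frags = [f for f in frags if not any(f != g and f in g for g in frags)]

    while len(frags) > 1:
        best_len, best_i, best_j = 0, -1, -1
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_len:
                        best_len, best_i, best_j = k, i, j
        if best_len == 0:
            break  # no remaining overlaps: the contigs cannot be joined
        merged = frags[best_i] + frags[best_j][best_len:]
        frags = [f for idx, f in enumerate(frags) if idx not in (best_i, best_j)]
        frags.append(merged)
    return frags


# Three short made-up fragments whose ends overlap by four bases.
print(greedy_assemble(["GTTAGC", "ATGCGTAC", "GTACGTTA"]))
# prints ['ATGCGTACGTTAGC']
```

The nested loops compare every pair of fragments on every round, which is one reason why assembly becomes expensive for larger genomes.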
Another
aspect of bioinformatics in sequence analysis is the automatic search for genes
and regulatory sequences within a genome.
Not all of the nucleotides within a genome are genes. Within the genome of
higher organisms, large parts of the DNA do not serve any obvious purpose. This
so-called junk DNA may, however, contain unrecognized functional elements.
Bioinformatics helps to bridge the gap between genome and proteome
projects--for example, in the use of DNA sequences for protein identification.
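As a rough illustration of an automatic search for genes, the Python sketch below scans the three forward reading frames of a DNA string for open reading frames (a start codon ATG followed in the same frame by a stop codon). This is a deliberate simplification: real gene finders also examine the reverse strand, introns and statistical signals. The function name `find_orfs` and the parameter `min_codons` are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna: str, min_codons: int = 2) -> list[tuple[int, int, str]]:
    """Return (start, end, sequence) for each ORF on the forward strand.

    `min_codons` is the minimum number of codons before the stop codon,
    counting the ATG start codon itself. `end` is exclusive.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for pos in range(frame, len(dna) - 2, 3):
            codon = dna[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos  # first start codon opens a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (pos - start) // 3 >= min_codons:
                    orfs.append((start, pos + 3, dna[start:pos + 3]))
                start = None  # look for the next ORF in this frame
    return sorted(orfs)


print(find_orfs("CCATGAAATTTTAGGG"))
# prints [(2, 14, 'ATGAAATTTTAG')]
```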
Task 2: Vocabulary – Look at the
following words. Find them in the text and
decide what kind of word they are and then tick the correct box in the table
below. Then try to fill in the remaining boxes with examples of the same basic
word in a different form.
Word in text | noun | verb | adjective | adverb | more
sequenced | | | | |
phylogenetic | | | | |
mutation | | | | |
variant | | | | |
fragment | | | | |
assemble | | | | |
regulatory | | | | |
Genome annotation
In the
context of genomics, annotation is the process of marking the genes and other
biological features in a DNA sequence. The first genome annotation software
system was designed in 1995 by Owen
White, who was part of the team that sequenced and analyzed the first genome of
a free-living organism to be decoded, the bacterium Haemophilus
influenzae. Dr. White built a software system to find the genes (places in the DNA
sequence that encode a protein), the transfer RNA, and other features, and to
make initial assignments of function to those genes. Most current genome
annotation systems work similarly, but the programs available for analysis of
genomic DNA are constantly changing and
improving.
Task 3: Grammar – Decide if the verbs "was designed" and "built"
and "are changing"
are active or passive, then rewrite the sentence
using the same verb in the opposite form.
Computational evolutionary biology
Evolutionary
biology is the study of the origin and descent of species, as well as their
change over time. Informatics has assisted evolutionary biologists in several
key ways; it has enabled researchers to:
• trace the evolution of a large number of organisms by measuring changes in their DNA,
rather than through physical taxonomy or physiological observations alone,
• more recently, compare entire genomes, which permits the study of more complex
evolutionary events, such as gene duplication, lateral gene transfer, and the
prediction of bacterial speciation factors,
• build complex computational models of populations to predict the outcome of the
system over time
• track and share information on an increasingly large number of species and organisms
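A very small example of a computational population model is sketched below in Python. It uses a Wright-Fisher style simulation of random genetic drift, which is a technique chosen for this illustration and is not named in the text. The function name `wright_fisher` and all parameter values are assumptions.

```python
import random


def wright_fisher(pop_size: int, start_freq: float,
                  generations: int, seed: int = 0) -> list[float]:
    """Simulate the frequency of one allele under random drift.

    Each generation, every one of `pop_size` gene copies is drawn at
    random from the previous generation. Returns the allele frequency
    for generation 0 and for each later generation.
    """
    rng = random.Random(seed)  # fixed seed makes the run repeatable
    freq = start_freq
    history = [freq]
    for _ in range(generations):
        copies = sum(1 for _ in range(pop_size) if rng.random() < freq)
        freq = copies / pop_size
        history.append(freq)
    return history


history = wright_fisher(pop_size=50, start_freq=0.5, generations=20)
print(len(history))  # prints 21
```

Running the model with different seeds shows how chance alone can change the outcome of the same starting population.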
Future
work endeavours to reconstruct the now more complex
tree of life.
The area
of research within computer science that uses genetic algorithms is sometimes
confused with computational evolutionary biology, but the two areas are
unrelated.
Task 4: Synonyms – the words
below are synonyms for some of the words in the Computational
evolutionary biology section above. Find the words in the text that they are
synonyms for.
complicated, copying, classification, allows, forecast, complete, beginning,
history
Measuring biodiversity
Biodiversity
of an ecosystem might be defined as the total genomic complement of a particular
environment, from all of the species present, whether it is a biofilm in an abandoned mine, a drop of sea water, a scoop
of soil, or the entire biosphere of the planet Earth. Databases are used to
collect the species names, descriptions, distributions, genetic information,
status and size of populations, habitat needs, and how each organism interacts
with other species. Specialized software programs are used to find, visualize,
and analyze the information, and most importantly, communicate it to other
people. Computer simulations model such things as population dynamics, or
calculate the cumulative genetic health of a breeding pool (in agriculture) or
endangered population (in conservation). One very exciting potential of this
field is that entire DNA sequences, or genomes of endangered species can be
preserved, allowing the results of Nature's genetic experiment to be remembered
in silico, and possibly reused in the future, even if
that species is eventually lost.
1 What is biodiversity?
2 How many examples of biodiversity does the text give?
3 What can computer simulations model?
4 In your own words, what does the writer say is very exciting?
Analysis of _____________________ .
The expression
of many genes can be determined by measuring mRNA levels with multiple
techniques including microarrays, expressed cDNA sequence tag (EST) sequencing, serial analysis of gene
expression (SAGE) tag sequencing, massively parallel signature sequencing
(MPSS), or various applications of multiplexed in-situ hybridization. All of
these techniques are extremely noise-prone and/or subject to bias in the
biological measurement, and a major research area in computational biology
involves developing statistical tools to separate signal from noise in
high-throughput gene expression studies. Such studies are often used to
determine the genes implicated in a disorder: one might compare microarray data from cancerous epithelial cells to data
from non-cancerous cells to determine the transcripts that are up-regulated and
down-regulated in a particular population of cancer cells.
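The comparison of cancerous and non-cancerous cells can be sketched in Python as a simple log2 fold-change calculation. This is only a minimal sketch: it ignores replicates, normalisation and the statistical testing that real studies need to separate signal from noise. The transcript names, the measurement values, the threshold and the pseudocount are all invented for the example.

```python
import math


def classify_transcripts(cancer: dict[str, float], normal: dict[str, float],
                         threshold: float = 1.0,
                         pseudocount: float = 1.0) -> dict[str, tuple[float, str]]:
    """Label each transcript as 'up', 'down' or 'unchanged' in cancer cells.

    The label is based on log2((cancer + pseudocount) / (normal + pseudocount)).
    A threshold of 1.0 corresponds to a two-fold change.
    """
    calls = {}
    for name in sorted(cancer.keys() & normal.keys()):
        ratio = (cancer[name] + pseudocount) / (normal[name] + pseudocount)
        log2_fc = math.log2(ratio)
        if log2_fc >= threshold:
            label = "up"
        elif log2_fc <= -threshold:
            label = "down"
        else:
            label = "unchanged"
        calls[name] = (log2_fc, label)
    return calls


# Hypothetical measurements for three transcripts.
cancer_cells = {"geneA": 799.0, "geneB": 99.0, "geneC": 24.0}
normal_cells = {"geneA": 99.0, "geneB": 99.0, "geneC": 199.0}
for name, (fc, label) in classify_transcripts(cancer_cells, normal_cells).items():
    print(name, fc, label)
# prints:
# geneA 3.0 up
# geneB 0.0 unchanged
# geneC -3.0 down
```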
Task 6: Main point –
What would be a good title for the section above?
Analysis of regulation
Regulation
is the _________ (1) orchestration of events starting with an
__________ (2)-cellular signal and ultimately leading to an increase or
decrease in the activity of one or more protein molecules. Bioinformatics
techniques have been applied to explore various steps in this process. For
example, promoter analysis involves the elucidation and study of sequence
motifs in the genomic region surrounding the coding region of a __________ (3).
These motifs influence the extent to which that region is transcribed into
mRNA. Expression data can be used to __________ (4) gene regulation: one might
compare microarray __________ (5) from a wide variety
of states of an organism to form __________ (6) about the genes involved in
each state. In a single-cell organism, one might compare stages of the cell
cycle, along with various stress conditions (heat shock, starvation, etc.). One
can then apply clustering algorithms to that expression data to determine which
genes are co-expressed. For example, the upstream regions (__________ (7)) of
co-expressed genes can be searched for over-represented regulatory elements.
Take
the words listed below and insert them in the correct place in the text.
extra, hypotheses, complex, promoters, infer, data, gene
Analysis of protein expression
Protein microarrays and high throughput (HT) mass spectrometry (MS)
can provide a snapshot of the proteins present in a biological sample.
Bioinformatics is very much involved in making sense of protein microarray and HT MS data. The former approach faces
problems similar to those of microarrays targeted at
mRNA; the latter involves the problem of matching large amounts of mass data
against predicted masses from protein sequence databases, and the complicated
statistical analysis of samples where multiple, but incomplete, peptides from
each protein are detected.
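Matching mass data against predicted masses can be illustrated with the short Python sketch below. It assumes standard monoisotopic residue masses (rounded to five decimal places) and a ready-made list of candidate peptides. It does not model enzyme digestion, charge states or modifications, and the names `peptide_mass` and `match_masses` are hypothetical.

```python
# Monoisotopic residue masses in daltons (rounded to five decimals).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # one water molecule is added per peptide


def peptide_mass(peptide: str) -> float:
    """Predicted monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def match_masses(observed: list[float], peptides: list[str],
                 tolerance: float = 0.01) -> dict[float, list[str]]:
    """For each observed mass, list the peptides within `tolerance` daltons."""
    predicted = {p: peptide_mass(p) for p in peptides}
    return {
        mass: sorted(p for p, pm in predicted.items() if abs(pm - mass) <= tolerance)
        for mass in observed
    }


# "GA" and "AG" have the same composition, so mass alone cannot separate them.
print(match_masses([146.07, 500.0], ["GA", "AG", "SK"]))
# prints {146.07: ['AG', 'GA'], 500.0: []}
```

The ambiguous first match shows one reason why the statistical analysis of such data is complicated.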
In your OWN WORDS, explain what exactly a microarray is.
Analysis of mutations in cancer
Massive
sequencing efforts are currently underway to identify point mutations in a
variety of genes in cancer. The sheer volume of data produced requires
automated systems to read sequence data, and to compare the sequencing results
to the known sequence of the human genome, including known germline
polymorphisms.
Oligonucleotide microarrays,
including comparative genomic hybridization and single nucleotide polymorphism
arrays, able to probe simultaneously up to several hundred thousand sites
throughout the genome, are being used to identify chromosomal gains and losses
in cancer. Hidden Markov model and change-point analysis methods are being
developed to infer real copy number changes from often noisy data. Further
informatics approaches are being developed to understand the implications of
lesions found to be recurrent across many tumors.
Some
modern tools (e.g. Quantum 3.1) make it possible to change the protein
sequence at specific sites through alterations to its amino acids and to predict
changes in bioactivity after mutations.
Task 9:
Critical thinking –
1 Why do you think "Massive
sequencing efforts are currently underway to identify point mutations in a
variety of genes in cancer"?
2 Why are computers
necessary in the process of reading sequence data?
3 Why are "oligonucleotide microarrays" being
used to identify "chromosomal gains and losses in cancer"?
4 What do you think
the next version of the Quantum tool will be called?
Prediction of protein structure
A One of the key ideas in
bioinformatics is the notion of homology. In the genomic branch of bioinformatics,
homology is used to predict the function of a gene: if the sequence of gene A,
whose function is known, is homologous to the sequence of gene B, whose
function is unknown, one could infer that B may share A's function. In the
structural branch of bioinformatics, homology is used to determine which parts
of a protein are important in structure formation and interaction with other
proteins. In a technique called homology modelling,
this information is used to predict the structure of a protein once the
structure of a homologous protein is known. This currently remains the only way
to predict protein structures reliably.
B Other techniques for predicting
protein structure include protein threading and de novo (from scratch)
physics-based modeling.
C One example of this is the
protein homology between haemoglobin in
humans and the haemoglobin in legumes (leghemoglobin). Both serve the same purpose of transporting
oxygen in the organism. Though both of these proteins have completely different
amino acid sequences, their protein structures are virtually identical, which
reflects their near identical purposes.
D Protein structure prediction is
another important application of bioinformatics. The amino acid sequence of a
protein, the so-called primary structure, can be easily determined from the
sequence on the gene that codes for it. In the vast majority of cases, this
primary structure uniquely determines a structure in its native environment.
(Of course, there are exceptions, such as the bovine spongiform encephalopathy
- aka Mad Cow Disease - prion.)
Knowledge of this structure is vital in understanding the function of the
protein. For lack of better terms, structural information is usually classified
as one of secondary, tertiary and quaternary structure. A viable general
solution to such predictions remains an open problem. As of now, most efforts
have been directed towards heuristics that work most of the time.
Task
10: Jumbled text –
The
section above has been jumbled. Put the paragraphs in their correct order.
Para1 |
|
|
|
|
|
|
|
The core
of comparative genome analysis is the establishment of the correspondence
between genes (orthology analysis) or other genomic
features in different organisms. It is these intergenomic
maps that make it possible to trace
the evolutionary processes responsible for the divergence of two genomes. A multitude of evolutionary events acting at various organizational
levels shape genome evolution. At the lowest level, point mutations
affect individual nucleotides. At a higher level, large chromosomal segments
undergo duplication, lateral transfer, inversion, transposition, deletion and
insertion. Ultimately, whole genomes are involved in processes of
hybridization, polyploidization and endosymbiosis, often leading to rapid speciation. The complexity
of genome evolution poses many exciting challenges to developers of
mathematical models and algorithms, who have recourse to a spectra of
algorithmic, statistical and mathematical techniques, ranging from exact, heuristics,
fixed parameter and approximation algorithms for problems based on parsimony
models to Markov Chain Monte Carlo algorithms for Bayesian analysis of problems
based on probabilistic models.
Many of
these studies are based on homology detection and the computation of protein families.
Task
11: Comprehension –
1 What is "orthology analysis"?
2 How many algorithmic,
statistical and mathematical techniques are mentioned in the section above?
3 Having answered
question 2, what do you think "spectra" means?
4 What algorithms
are used to solve parsimony model problems?
5 What algorithms
are used to solve probabilistic model problems?
Modeling biological systems
Systems
biology involves the use of computer simulations of cellular subsystems (such
as the networks of metabolites and enzymes which comprise metabolism, signal
transduction pathways and gene regulatory networks) to both analyze and
visualize the complex connections of these cellular processes. Artificial life
or virtual evolution attempts to understand evolutionary processes via the
computer simulation of simple (artificial) life forms.
High-throughput image analysis
Computational
technologies are used to accelerate or fully automate the processing,
quantification and analysis of large amounts of high-information-content biomedical
imagery. Modern image analysis systems augment an observer's ability to make
measurements from a large or complex set of images, by improving accuracy, objectivity,
or speed. A fully developed analysis system may completely replace the
observer. Although these systems are not unique to biomedical imagery,
biomedical imaging is becoming more important for both diagnostics and
research. Some examples are:
• high-throughput and high-fidelity quantification and sub-cellular localization (high-content
screening, cytohistopathology)
• morphometrics
• clinical image analysis and visualization
• determining the real-time air-flow patterns in breathing lungs of living animals
• quantifying occlusion size in real-time imagery from the development of and recovery during
arterial injury
• making behavioural observations from extended video
recordings of laboratory animals
• infrared measurements for metabolic activity determination
The
computational biology tool best-known among biologists is probably BLAST, an
algorithm for searching large databases of protein or DNA sequences. NCBI
provides a popular implementation that searches their massive sequence
databases. Bioinformatic meta
search engines (Entrez, Bioinformatic
Harvester) help find relevant information from several databases. There is also free Web-based software designed for structural
bioinformatics, such as STING.
Computer
scripting languages such as Perl and Python are often used to interface with biological
databases and parse output from bioinformatics programs. Communities of
bioinformatics programmers have set up free/open source projects such as EMBOSS,
Bioconductor, BioPerl, BioLinux, BioPython, BioRuby, and BioJava which
develop and distribute shared programming tools and objects (as program
modules) that make bioinformatics easier.
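As a small illustration of the kind of parsing these scripting languages are used for, the Python sketch below reads sequences in the widely used FASTA text format (a header line starting with ">" followed by one or more sequence lines). It uses only the standard library and is not taken from any of the projects named above, which provide far more complete parsers. The function name `parse_fasta` is hypothetical.

```python
from typing import Iterable, Iterator


def parse_fasta(lines: Iterable[str]) -> Iterator[tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines.

    The header is returned without its leading '>'. Sequence lines that
    belong to one record are joined into a single string.
    """
    header = None
    chunks: list[str] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


example = ">seq1 demo\nATGC\nGGTA\n\n>seq2\nTTAA\n"
print(list(parse_fasta(example.splitlines())))
# prints [('seq1 demo', 'ATGCGGTA'), ('seq2', 'TTAA')]
```

Because the function accepts any iterable of lines, an open file object can be passed to it directly.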
Integrated
software workbenches that combine many of the free/open source tools described above
and many others include CLC Free Workbench and VigyaanCD.
Taverna, an open-source bioinformatics workbench that utilises a workflow model of experimental design, is
included as part of the myGRID package of e-science
software. Quantum 3.1 is an example of the bioinformatics post-QSAR technology
applying quantum and molecular physics instead of statistical methods. Genevestigator is an example of how large-scale gene
expression microarray data is used to predict gene
function based on contextual information.
More
recently, SOAP-based interfaces have been developed for a wide variety of
bioinformatics applications such as BLAST, FASTA,
EMBOSS, ClustalW, T-Coffee, MUSCLE and many others.
These are available from the EBI at EBI Web Services.
Task 12:
Question formation –
Make up
ten questions of any type from the three sections above.