What is a gene/protein/enzyme?

Most people are aware of the abbreviation “DNA”, which stands for deoxyribonucleic acid. DNA is the keystone for the development of every living organism and can thus be found in humans, animals, plants, insects, bacteria… It encodes all the genetic information needed to build a cell. DNA consists of four building blocks:

A - adenine
T - thymine
G - guanine
C - cytosine

These individual building blocks are also called “nucleotides”. A feature of DNA is that these nucleotides are paired. Two pairs exist: A/T and G/C. The reasons why these pairs go together are chemical in nature and not important at this stage. This pairing, however, is why DNA is structured as a double helix, a picture some people might have seen before in an animation.

(from: http://ghr.nlm.nih.gov/handbook/illustrations/dnastructure)
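The pairing rule can be tried out in a few lines of Python. This is only a minimal sketch: the function name `complement` and the six-letter example strand are chosen for illustration and are not part of the project.

```python
# Pairing rules from the text: A pairs with T, and G pairs with C.
PAIRS = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complement(strand: str) -> str:
    """Return the partner nucleotide for every position of a DNA strand."""
    return "".join(PAIRS[base] for base in strand.upper())


# Illustrative six-letter strand (an assumption, not a real gene).
print(complement("ATGGGC"))  # prints TACCCG
```

Applying the function twice gives back the original strand, which reflects the fact that each nucleotide has exactly one partner.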

A gene is a specific region of the DNA, and genes vary greatly in length. For example, the human ADH gene we used in the project contains 1125 nucleotides (base pairs).

> human ADH1

A gene usually encodes a protein. How does that work?
What appears to be a random sequence of these four letters turns out to be a highly specific genetic code that is actively translated by the cellular machinery. A gene is basically the blueprint of a protein. When a gene is active in a cell, it is copied into a single-stranded template called messenger RNA (mRNA). In this molecule the nucleotide “T” is replaced by “uracil” (U). So, the initial three nucleotides “ATG” are converted into “AUG”. Such mRNA molecules are recognised by the cell and translated into a new sequence: the protein sequence. This protein sequence consists of new building blocks called “amino acids”.
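On the level of letters, the step from gene to mRNA is a simple substitution. The sketch below assumes the gene is written as its coding sequence (the strand that reads “ATG…”); the function name `transcribe` and the short example are illustrative only.

```python
def transcribe(coding_dna: str) -> str:
    """Write a DNA coding sequence as mRNA by replacing every T with U."""
    return coding_dna.upper().replace("T", "U")


# The first three nucleotides "ATG" become "AUG", as described in the text.
print(transcribe("ATGGGC"))  # prints AUGGGC
```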

The translation of a DNA/mRNA sequence into a protein follows strict rules and, again, a specific code. Anybody with the appropriate decoder can translate these sequences. The code can be represented as a wheel.

Each amino acid is encoded by a triplet of nucleotides (a codon). Most of the 20 amino acids are encoded by more than one triplet. The exceptions are the amino acid “methionine” (M), which is only encoded by the nucleotide triplet “AUG”, and “tryptophan” (W), which is only encoded by “UGG”. Methionine is the first amino acid of a newly translated protein sequence. Termination of a protein sequence requires a STOP codon such as “UGA”, which is found in human ADH.
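The decoder can also be written down as a lookup table. The sketch below builds the standard genetic code from a compact 64-letter string (one-letter amino acid codes, with “*” standing for STOP) and counts how many triplets encode each amino acid. The variable names are illustrative. Note that in the standard code tryptophan (W), like methionine (M), is encoded by a single triplet.

```python
from collections import Counter
from itertools import product

# Standard genetic code with codons ordered UUU, UUC, UUA, UUG, UCU, ...
# (first letter changes slowest). "*" marks a STOP codon.
BASES = "UCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# How many different triplets encode each amino acid (or STOP)?
codons_per_amino_acid = Counter(CODON_TABLE.values())

# Amino acids that are encoded by exactly one triplet.
single_codon = sorted(
    aa for aa, n in codons_per_amino_acid.items() if n == 1 and aa != "*"
)

print(CODON_TABLE["AUG"], CODON_TABLE["GGC"], CODON_TABLE["UGA"])  # prints M G *
print(single_codon)  # prints ['M', 'W']
```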

Let’s translate the human ADH1 gene sequence into the protein sequence:
The first three nucleotides of the coding sequence (CDS) are A, U, G. Thus, the first amino acid is methionine (M).


The second triplet is: GGC. Following the wheel from the inside to the outside (from the first to the third letter, respectively), it appears that these three nucleotides encode the amino acid “glycine” (G). So, the protein sequence starts with M, G.


Continuing like this, we end up with the following protein sequence:

> human ADH1
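The triplet-by-triplet procedure described above can be sketched in Python. The example mRNA is a toy sequence made of the two triplets named in the text (AUG, GGC) followed by the STOP codon UGA; it is not the real ADH1 sequence, and the function name `translate` is chosen for illustration.

```python
from itertools import product

# Standard genetic code; "*" marks a STOP codon.
BASES = "UCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(mrna: str) -> str:
    """Read an mRNA triplet by triplet and stop at the first STOP codon."""
    protein = []
    for start in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[start:start + 3]]
        if amino_acid == "*":  # STOP codon: the protein ends here
            break
        protein.append(amino_acid)
    return "".join(protein)


# Toy mRNA (an assumption for illustration): AUG + GGC + UGA.
print(translate("AUGGGCUGA"))  # prints MG
```

The sketch assumes the sequence already starts at the first triplet of the coding sequence and contains only the letters A, U, G and C.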

As every amino acid is encoded by a triplet of nucleotides, the ADH protein sequence is (1125 nucleotides - 3 nucleotides for the STOP codon) / 3 = 374 amino acids long.
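The same calculation can be checked in a few lines of Python. The numbers are the ones given in the text; the variable names are illustrative.

```python
CDS_LENGTH = 1125   # nucleotides in the human ADH coding sequence (from the text)
CODON_LENGTH = 3    # every amino acid, and the STOP signal, uses three nucleotides

# Remove the STOP codon, then divide the rest into triplets.
coding_nucleotides = CDS_LENGTH - CODON_LENGTH
amino_acids, remainder = divmod(coding_nucleotides, CODON_LENGTH)

# A remainder other than 0 would mean the length is not a whole number of triplets.
print(amino_acids, remainder)  # prints 374 0
```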
The amino acid sequence is called the primary protein sequence. In order to gain functionality, every primary sequence folds and coils into a complex structure following thermodynamic rules. The steps leading to the final tertiary structure of a protein are not important for the current project. More information can be found at:


To make an already complicated story even more complex, it has to be mentioned that even though the protein alcohol dehydrogenase can be found in all organisms, there are species-specific differences between the sequences that have no or only a minor impact on the functionality of the protein. We can see these differences by sequence comparison (alignment) using an algorithm such as ClustalW (http://www.ebi.ac.uk/Tools/clustalw2/index.html).

As can be seen in the partial alignment below, several amino acids differ between the organisms, while other residues are fully conserved (indicated by an ‘*’). The identities of the different organisms are given in the table further down in the text.
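How the ‘*’ marks come about can be sketched in Python. The three short sequences below are made up for illustration and are not taken from the ADH alignment. The sketch also assumes the sequences are already aligned to the same length, which is the hard part that a program such as ClustalW does. The function name `conservation_line` is hypothetical.

```python
def conservation_line(aligned: list[str]) -> str:
    """Mark each alignment column with '*' if all sequences share the same residue."""
    if len({len(seq) for seq in aligned}) != 1:
        raise ValueError("aligned sequences must all have the same length")
    marks = []
    for column in zip(*aligned):
        # A column is fully conserved if it holds one residue and no gap ('-').
        conserved = len(set(column)) == 1 and column[0] != "-"
        marks.append("*" if conserved else " ")
    return "".join(marks)


# Made-up toy sequences (an assumption, not real ADH data).
toy_alignment = ["MGTAGK", "MSTAGK", "MSTVGK"]
for sequence in toy_alignment:
    print(sequence)
print(conservation_line(toy_alignment))  # prints "* * **"
```

Positions 2 and 4 differ between the toy sequences and therefore get no mark, while the other four positions are fully conserved.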