Poster 310 - How random are nucleotide sequences in coding regions?

HGM2001 Poster Abstracts: 9. Genome Informatics

Fumihiko Takeuchi, Kenji Yamamoto, Hiroshi Yoshikura
Research Institute, International Medical Center of Japan, 162-8655, Japan

We studied the real and theoretical frequencies of amino acids in the coding regions of DNA sequences. The real frequency is the frequency of an amino acid in the translated peptide. The theoretical frequency, introduced by King & Jukes, is the frequency calculated from the nucleotide ratio. Each amino acid in each protein was plotted in a 2-D space with the real value on the horizontal axis and the theoretical value on the vertical axis. If the nucleotides t, c, a, g in coding regions were arranged uniformly at random, the plots would lie on the line through the origin at an angle of 45 degrees. Two hundred and seventy protein-encoding genes each from Pyrococcus abyssi, E. coli, Saccharomyces cerevisiae, and Homo sapiens were examined. The plots of the mean real and theoretical amino acid frequencies of the 270 proteins fell near this line, indicating that, by this criterion, the arrangement of nucleotides was close to random in all four species, which belong to different phyla. When the same plot was made for individual amino acids, each amino acid showed a characteristic plot pattern that was common to all four species.

Real and theoretical frequencies of amino acids: Means
The real (x) and theoretical (y) amino acid frequencies are calculated for each protein. The mean values of the real and theoretical frequencies over the 270 proteins (x' and y', respectively) are then plotted in the coordinate plane, with the real frequencies on the horizontal axis and the theoretical frequencies on the vertical axis. Jukes' group calculated y using either a 1:1:1:1 nucleotide ratio (Jukes et al., 1975) or the nucleotide ratio of the whole examined region (King & Jukes, 1969); we instead calculate it for each protein using the nucleotide ratio of that protein's coding region. The plots of (x', y') for the four species are shown in Fig. 1.
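The calculation of x and y for a single coding region can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the standard genetic code, a sequence containing only t, c, a, g, and that stop codons are kept as their own category ('*'), since the text mentions a plot point for stop codons. The function names are hypothetical.

```python
from collections import Counter
from itertools import product

BASES = "tcag"
# Standard genetic code, codons enumerated in t, c, a, g order; '*' is a stop codon.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def real_frequencies(cds):
    """Observed frequency (x) of each amino acid, and of stop, among the codons."""
    cds = cds.lower()
    codons = [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]
    counts = Counter(CODON_TABLE[c] for c in codons)
    return {aa: counts[aa] / len(codons) for aa in set(AMINO)}


def theoretical_frequencies(cds):
    """Expected frequency (y) if codons were built by drawing three nucleotides
    independently according to the nucleotide ratio of this coding region."""
    cds = cds.lower()
    p = {b: cds.count(b) / len(cds) for b in BASES}
    freq = dict.fromkeys(set(AMINO), 0.0)
    for codon, aa in CODON_TABLE.items():
        freq[aa] += p[codon[0]] * p[codon[1]] * p[codon[2]]
    return freq


# Toy sequence (made up for illustration): Met, Ala, stop.
x = real_frequencies("atggcttaa")
y = theoretical_frequencies("atggcttaa")
```

Averaging x and y over all proteins of a species would then give the plotted means x' and y'.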

All the plots, except that of stop codons, come close to the diagonal, indicating that the amino acid frequencies of existing proteins are close to those expected from randomly rearranged nucleotide sequences. A closer look shows that

Cysteine (C), Arginine (R), Histidine (H), and Tryptophan (W) have smaller real frequencies than theoretical ones, and
Glutamic acid (E), Aspartic acid (D), Lysine (K), Alanine (A), and Phenylalanine (F) have larger real frequencies than theoretical ones.

The smaller real frequencies of Cysteine, Arginine, and Histidine match the descriptions in Jukes et al. (1975) and King & Jukes (1969). They explain that Cysteine and Histidine are deficient because these amino acids have special functions. The larger real frequencies of Glutamic acid and Aspartic acid are explained by charge neutrality, and Alanine is explained to be abundant because of its function as a 'filler'.

Our study with the current database confirms the previous observations by Jukes' group. In addition, we find that the match is best in Homo sapiens and Saccharomyces cerevisiae and poorer in E. coli and Pyrococcus abyssi.

Correlations between nucleotides t,c,a,g:
We next consider the correlations between the frequencies of the nucleotides t, c, a, g in coding regions. We first calculate the correlations and give plots visualizing them (Fig. 5). In the plots, each point corresponds to one coding region.
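One way to compute such correlations is sketched below. It assumes the Pearson correlation coefficient (the text does not name the coefficient used) and sequences containing only t, c, a, g; the function names and the toy sequences are hypothetical.

```python
from itertools import combinations
from math import sqrt

BASES = "tcag"


def composition(cds):
    """Frequency of each nucleotide in one coding region."""
    cds = cds.lower()
    return {b: cds.count(b) / len(cds) for b in BASES}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def pairwise_correlations(regions):
    """Correlation between nucleotide frequencies for each of the six pairs,
    where every coding region contributes one data point."""
    comps = [composition(r) for r in regions]
    return {
        (a, b): pearson([c[a] for c in comps], [c[b] for c in comps])
        for a, b in combinations(BASES, 2)
    }


# Toy input (made up); a real analysis would pass the 270 coding regions of a species.
correlations = pairwise_correlations(["ttca", "tcag", "ccgg", "aagt", "tgga"])
```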

An interesting observation is that some pairs have positive or near-zero correlations. In Pyrococcus abyssi, the pairs t and c, a and g, and c and g have near-zero correlations. In E. coli, c and g have a strong positive correlation. In Saccharomyces cerevisiae, all pairs have negative correlations. Each species seems to have its own pattern. Among all pairs, c and g has the largest minimum correlation across the four species, -0.139. Note that, since the frequencies sum to one, the correlation between any pair would be negative if the coding regions were completely random. Indeed, we simulated the frequencies by drawing the frequency of each nucleotide in a trial from its distribution in the sample coding regions, independently of the other nucleotides. Since the sum of the four drawn frequencies can differ from one, we rescaled them by dividing by their sum. For Pyrococcus abyssi, for example, such simulations gave correlations between -0.47 and -0.21.
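The simulation described above might look like the following sketch. It assumes that "according to its distribution" means resampling from the observed per-region values of each nucleotide; a fitted distribution would be an equally plausible reading. The function names and the example values are hypothetical, not data from the study.

```python
import random
from math import sqrt


def simulate_independent(freqs, trials, rng):
    """freqs maps each nucleotide to its observed frequencies, one per coding region.
    Each trial draws one value per nucleotide independently (assumption: by
    resampling the observed values) and rescales the four values to sum to one."""
    sims = {b: [] for b in freqs}
    for _ in range(trials):
        draw = {b: rng.choice(values) for b, values in freqs.items()}
        total = sum(draw.values())
        for b, value in draw.items():
            sims[b].append(value / total)
    return sims


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Made-up observed frequencies, the same for every nucleotide, for illustration only.
observed = {b: [0.20, 0.25, 0.30] for b in "tcag"}
sims = simulate_independent(observed, trials=2000, rng=random.Random(0))
r_cg = pearson(sims["c"], sims["g"])
```

Because the rescaled values are forced to sum to one, independently drawn frequencies end up negatively correlated; this is the baseline against which the observed near-zero and positive correlations stand out.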