POSTER NO: 310 How random are nucleotide sequences in coding regions?
Fumihiko Takeuchi, Kenji Yamamoto, Hiroshi Yoshikura
We studied the real and theoretical frequencies of amino acids in the coding regions of DNA sequences. The real frequency is the frequency of an amino acid in the translated peptide. The theoretical frequency, introduced by King & Jukes, is the frequency calculated from the nucleotide ratio. Each amino acid in each protein was plotted in a 2-D space, with the real value on the horizontal axis and the theoretical value on the vertical axis. If the nucleotides t, c, a, g in coding regions were aligned uniformly at random, the plot would lie on the line that passes through the origin at an angle of 45 degrees. Two hundred and seventy protein-encoding genes each from Pyrococcus abyssi, E. coli, Saccharomyces cerevisiae, and Homo sapiens were examined. The plots of the mean real and theoretical amino acid frequencies of the 270 proteins came near this line, indicating that, by this criterion, the alignment of nucleotides was quite random in all four species, which belong to different phyla. When the same plot was made for individual amino acids, each amino acid showed a characteristic plot pattern that was common to all four species.
Real and theoretical frequencies of amino acids: Means
All the plots, except that of stop codons, come close to the diagonal, indicating that the frequencies of amino acids of existing proteins are close to the ones expected from the randomly re-aligned nucleotide sequences. A closer look shows that
Our study with the current database confirms the previous observations by Jukes' group. In addition, we find that the match is best in Homo sapiens and Saccharomyces cerevisiae and poorer in E. coli and Pyrococcus abyssi.
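The two quantities compared above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure actually used for the poster: the nucleotide ratio is taken from the gene's own coding sequence, the standard genetic code is used for every species, and stop codons are kept as a 21st class (written `*`). The function names `real_frequencies` and `theoretical_frequencies` are hypothetical.

```python
from collections import Counter
from itertools import product

BASES = "tcag"
# Standard genetic code listed in t, c, a, g order; '*' marks stop codons.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def real_frequencies(cds):
    """Observed frequency of each amino acid (and stop) in the translated sequence.

    Assumes `cds` contains only t, c, a, g; a trailing partial codon is ignored.
    """
    cds = cds.lower()
    usable = len(cds) - len(cds) % 3
    counts = Counter(CODON_TABLE[cds[i:i + 3]] for i in range(0, usable, 3))
    total = sum(counts.values())
    return {aa: counts[aa] / total for aa in set(AMINO)}


def theoretical_frequencies(cds):
    """Frequency expected if the same nucleotides were placed independently.

    Each codon gets the product of its three nucleotide frequencies, and the
    codon probabilities are summed per amino acid (assumed reading of the
    'frequency calculated from the nucleotide ratio').
    """
    cds = cds.lower()
    base_freq = {b: cds.count(b) / len(cds) for b in BASES}
    expected = Counter()
    for codon, aa in CODON_TABLE.items():
        expected[aa] += base_freq[codon[0]] * base_freq[codon[1]] * base_freq[codon[2]]
    return dict(expected)
```

Plotting `real_frequencies(cds)[aa]` against `theoretical_frequencies(cds)[aa]` for each amino acid would give one point per amino acid per gene; points on the 45-degree line correspond to agreement between the two values.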
Correlations between nucleotides t,c,a,g:
An interesting observation is that some pairs have positive or near-zero correlations. In Pyrococcus abyssi, the pairs t and c, a and g, and c and g have near-zero correlations. In E. coli, c and g have a strong positive correlation. In Saccharomyces cerevisiae, all pairs have negative correlations. Each species seems to have its own pattern. Among all pairs, c and g has the largest minimum correlation across the four species, -0.139. Note that, since the frequencies sum to one, the correlation between any pair would be negative if the coding regions were completely random. To check this, we simulated the frequencies by drawing the frequency of each nucleotide for a trial from its distribution in the sample coding regions, independently of the other nucleotides. Since the sum of the four drawn values can differ from one, we rescaled them by dividing by their sum. Such simulations gave correlations of -0.47 to -0.21 for Pyrococcus abyssi, for example.
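The simulation described above might look roughly like the following sketch. It assumes that "its distribution in the sample coding regions" means resampling from the observed per-gene frequencies of that nucleotide; the poster does not state this, and the names `pearson` and `simulate_correlations` are hypothetical.

```python
import random
from itertools import combinations
from math import sqrt

BASES = "tcag"


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def simulate_correlations(sample, n_trials=10000, seed=0):
    """Correlations between nucleotide frequencies under independent draws.

    `sample` is a list with one dict per gene, mapping each base in BASES to
    its frequency in that gene. For each trial, every base frequency is drawn
    independently from the observed values (assumed empirical resampling),
    and the four values are rescaled to sum to one.
    """
    rng = random.Random(seed)
    observed = {b: [gene[b] for gene in sample] for b in BASES}
    trials = []
    for _ in range(n_trials):
        draw = {b: rng.choice(observed[b]) for b in BASES}
        total = sum(draw.values())
        trials.append({b: draw[b] / total for b in BASES})
    return {
        (p, q): pearson([t[p] for t in trials], [t[q] for t in trials])
        for p, q in combinations(BASES, 2)
    }
```

Comparing the returned six pairwise values with the correlations measured on the real genes shows how far each pair departs from what the sum-to-one constraint alone would produce.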