Why are there three stop codons but only one start codon?

I was wondering whether there is any specific reason that there are three stop codons but only one start codon in prokaryotic and eukaryotic cytoplasmic mRNAs.


It is important to realize that it is not possible to answer questions of this sort (the evolution of translation) in anything other than a speculative manner. Please bear this in mind, especially if you find the speculation below persuasive.


It is suggested that the redundancy of termination codons is largely similar to that for 'elongation codons' decoded at the A-site of the ribosome, although protection against suppressor tRNAs may also be a factor. The consideration here is, therefore, mainly focussed on the question why there is only a single codon for methionine (initiation and elongation), and I suggest this may have been a combination of the late appearance and infrequent usage of methionine, together with the particular requirements for an initiation tRNA at the P-site of the ribosome, where standard wobble is apparently not possible.

Disposing of a Red Herring

Although alternative initiation codons are sometimes used, this usage differs in prokaryotes and eukaryotes, is very limited, and is not seen with the use of methionine in elongation. I therefore consider that this does not invalidate the question, but would perhaps rephrase it as:

“Why does the genetic code have only one codon assigned to methionine, the initiation codon, whereas other amino acids and termination have generally more than one?”

Time of appearance of and extent of use of Methionine

It is generally thought that the genetic code initially encoded a small number of amino acids, and that this number increased over time. Methionine is one of the most complex (and energetically expensive) amino acids to synthesize (see Wikipedia entry), so is likely to have appeared late. One can argue that most of the codons would have been 'taken' (assigned) by that time. This argument is consistent with the situation for tryptophan, the other amino acid which has only a single codon, and which is also synthetically complex. Furthermore, these are the two least abundant amino acids, so one can argue that their number of codons reflects their relative usage.

Admittedly, there need only be one stop codon per protein, but it is possible that the mechanism of termination evolved before that of methionine-specific initiation. (Initiation might originally have been from the 5'-end of the mRNA.) There is also the question of protection against suppressor tRNAs, but I shall delay discussion of this until the end.

The unique features of the initiator tRNA

Elongation involves several dozen aminoacyl-tRNAs being recognized by an elongation factor which binds them to the A-site of the ribosome. There, wobble at the 3' position of the codon occurs where the redundancy of the genetic code allows and the anticodon of the tRNA has evolved appropriately.

In contrast, initiation involves exclusive recognition of a special tRNA by initiation factors but not elongation factors, and subsequent binding at the P-site. The tRNA has unique structural features that allow it to be N-formylated, and these features are retained in eukaryotic initiator tRNA even though there is no transformylase in eukaryotes (see this SE Question).

This uniqueness may explain why there is only one tRNA for initiation. However, one might ask why there is not a group of methionine codons (e.g. AUA as well as AUG) recognized by the same initiator tRNA through wobble. The reason for this would seem to be that the P-site does not allow third position wobble (as I explain in the answer to a related question), so the initiation met-tRNA would be codon-specific.

Loose Ends

Although perhaps straying a little from the question, one might still ask why there are not alternative elongation codons for Met - e.g. AUA, which could have been 'stolen' from Ile, which already has AUU and AUC. In fact in mammalian mitochondria this codon reassignment has happened, but unfortunately does not help explain things. In mitochondria there is only one tRNAmet for both these codons, consistent with the simplification of tRNAs for the small mammalian mitochondrial genome. Initiation requires the formylation of the methionine.

I will finish by returning to the question of multiple termination codons, which are not really consistent with my suggestion that the extent of redundancy reflects usage. Although termination codons, like elongation codons, are recognized at the A-site, they are recognized by protein release factors rather than tRNAs. This means that they will be affected differently by a mutation in a tRNA that alters the anticodon. For an amino acid codon, an altered non-cognate tRNA would only cause some misreading and an altered cognate tRNA might be neutralized by wobble or by duplicate or redundant tRNA genes. As termination codons are not normally decoded by tRNAs, a suppressor tRNA would cause read-through at all the suppressed codons. The presence of two alternative termination codons means that termination would be likely to occur within about 30 codons 3' to the original termination codon. Variation in the length of the N- and C-termini of proteins is common among related species, and thus need not be deleterious to the cell.
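The figure of "about 30 codons" can be rationalized with a rough calculation. The sketch below is only an illustration and rests on an assumption that is not part of the answer above: codons 3' to the suppressed stop are treated as if drawn uniformly at random from all 64 triplets, so that each has a 2-in-64 chance of being one of the two stop codons still recognized. Real 3' untranslated regions are not random, so the numbers are indicative only.

```python
# Rough estimate of read-through length when one stop codon is suppressed.
# Assumption (illustrative only): downstream codons are drawn uniformly
# from all 64 triplets, which real 3' untranslated regions are not.

TOTAL_CODONS = 4 ** 3      # 64 possible triplets
REMAINING_STOPS = 2        # two stop codons are still read as stop

# Probability that any given downstream codon terminates translation.
p_stop = REMAINING_STOPS / TOTAL_CODONS

# The distance to the first stop is geometrically distributed, mean 1/p.
mean_readthrough_codons = 1 / p_stop


def prob_stop_within(n_codons: int) -> float:
    """Probability of meeting at least one remaining stop within n codons."""
    return 1 - (1 - p_stop) ** n_codons


print(mean_readthrough_codons)            # mean distance in codons
print(round(prob_stop_within(30), 2))     # chance of stopping within 30
```

Under this assumption the mean read-through is 1 / (2/64) = 32 codons, in line with the estimate of about 30.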

Related SE Questions

  • 'Wobble' in Initiation and Elongation - I have contributed an answer to this.

  • Why is AUG the Initiation Codon? - I will respond to this presently.

  • Redundancy of the Genetic Code - I have contributed an answer to this.

  • Initiator and Elongator tRNAs in Translation - I have contributed an answer to this.

The simplest answer is that redundancy has evolved into these codons. If a mutation removes the stop codon, it is better for the cell that translation soon reaches another stop codon, before the protein becomes excessively large.

Here is a better explanation, and if you are very interested in the subject, I would recommend Lehninger Principles of Biochemistry.

Lehninger, A. L., Nelson, D. L., & Cox, M. M. (2000). Lehninger principles of biochemistry. New York: Worth Publishers

Somewhat obviously, there is weaker selection on the stop codons than on the start codons:

The selection affecting stop codons is relatively weak. In particular, comparison of the strength of purifying selection to that on start codons indicates that purifying selection on UAA is slightly lower than that on GUG and UUG start codons and much weaker than purifying selection on AUG, the primary start codon.

The reasons for this difference are a bit more difficult to grasp… but even for stop codons…

UAA is the optimal stop codon based on its higher proportion in highly expressed genes

On the other hand there's drift ("switches") in the stop codons

Fast-evolving genes generally accrue more stop codon switches than slow-evolving genes, suggesting that the higher evolutionary rate involves also the stop codons.

So come back in a few billion years with the same question on the stop codons.

On the other hand it's fairly well-established that

In prokaryotes, the start codon is one of the major translation initiation determinants: replacement of AUG with an alternative start codon, such as GUG, typically leads to a several-fold drop in the translation efficiency

As for the stop codons:

The causes of the observed preference for UAA as the stop codon, particularly in highly expressed, slow-evolving genes remain unknown. One potentially plausible possibility is that UAA is less prone to formation of stable secondary structures in RNA molecules than UAG or UGA which facilitates the release factor access and is likely to be particularly relevant for highly expressed genes. Furthermore, the frequency of readthrough differs for the different stop codons and is the highest for UGA, at least, in E. coli. Because the deleterious effect of readthrough is the greatest for abundant proteins, the difference in readthrough frequencies could, in part, explain the strong preference for UAA in genes encoding such proteins.

So, in summary, the selection reason for the start codon is fairly well understood, but the reason for the observed preference for UAA as an optimal stop codon (in highly expressed genes) is more speculative, which goes hand in hand with its purifying selection not being as strong as that for the AUG start codon.

How can there be 64 codon combinations but only 20 possible amino acids?

Codons are three-letter genetic words, and the language of genes uses four letters (the nitrogenous bases). Hence there are 4 × 4 × 4 = 64 words in the genetic dictionary to represent the 20 amino acids that biological organisms use.


And you must note that more than one codon may code for the same amino acid. This is referred to as degeneracy of the code.

For example, three amino acids are coded by any of six different codons, and that alone uses up 18 of the 64 combinations.

Three of the codons are stop codons.

They do not code for any amino acid.

Instead, they act as signals to end the genetic message carried by messenger RNA.

The 64 codons are distributed among the amino acids as follows:

    1 codon  × 2 amino acids =  2 codons
    2 codons × 9 amino acids = 18 codons
    3 codons × 1 amino acid  =  3 codons
    4 codons × 5 amino acids = 20 codons
    6 codons × 3 amino acids = 18 codons
                                3 stop codons
    -------------------------------------
                       TOTAL = 64 codons
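The tally above can be checked against the standard genetic code. The following is a minimal sketch: the 64-character string is the standard code written in the usual U, C, A, G order for each codon position, with `*` marking stop codons, and the variable names are illustrative.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons ordered UUU, UUC, UUA, UUG, UCU, ...
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Number of codons assigned to each amino acid (stops counted separately).
codons_per_residue = Counter(CODON_TABLE.values())
stop_count = codons_per_residue.pop("*")

# How many amino acids have 1, 2, 3, 4 or 6 codons.
degeneracy_classes = Counter(codons_per_residue.values())

for n_codons in sorted(degeneracy_classes):
    n_amino_acids = degeneracy_classes[n_codons]
    print(n_codons, "codon(s) x", n_amino_acids, "amino acid(s)")

total = sum(n * k for n, k in degeneracy_classes.items()) + stop_count
print("stop codons:", stop_count, "total:", total)
```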

Alternative Start Codons: Non-AUG Translation

Although the most common start codon is AUG, translation initiation can also occur at other codons with a much lower efficiency. Usually, alternative start codons only differ from AUG by one nucleotide (e.g. CUG, GUG and UUG) (Kearse & Wilusz, 2017). Alternative start codons are used by both prokaryotes and eukaryotes, though it is more common in prokaryotes (Asano, 2014). Though the codon may otherwise code for a different amino acid, alternative start codons are still generally translated as methionine in eukaryotes (Kochetov, 2008), and N-formylmethionine in prokaryotes (Belinky, Rogozin & Koonin, 2017).

In eukaryotes, translation usually occurs when 40S ribosomal subunits are recruited to the 5’ cap of mRNA and scan the mRNA in a 5’ to 3’ direction for the AUG start codon. However, the successful recognition of the AUG triplet depends on the nucleotides which surround it (i.e. its nucleotide context) (Kochetov, 2008). The optimal context is called the Kozak motif, and in mammals this is GCCRCCAUGG (where AUG is the start codon and R is a purine, A or G). For non-AUG translation to occur, a good Kozak context around the alternative start codon is critical. Additionally, the success of non-AUG translation is also greatly increased when there is a strong RNA secondary structure approximately 15 nt downstream of the initiation site (Ivanov et al., 2011).
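To make the idea of nucleotide context concrete, the sketch below counts how many of the seven flanking positions around a candidate start codon agree with the mammalian motif GCCRCCAUGG. This simple identity count is an illustrative assumption, not the scoring used in the cited studies, and the function name `kozak_matches` and the example sequences are hypothetical.

```python
# Mammalian Kozak motif from the text; R is the IUPAC code for a purine.
KOZAK = "GCCRCCAUGG"
START_OFFSET = KOZAK.index("AUG")   # start codon begins 6 nt into the motif
IUPAC = {"A": "A", "C": "C", "G": "G", "U": "U", "R": "AG"}


def kozak_matches(mrna, start_pos):
    """Count flanking positions (0 to 7) that agree with the motif.

    start_pos is the index of the first base of the candidate start codon,
    which may be AUG or an alternative such as CUG. The three codon bases
    themselves are not scored. Returns None if the 10-nt window does not
    fit inside the sequence.
    """
    begin = start_pos - START_OFFSET
    if begin < 0 or begin + len(KOZAK) > len(mrna):
        return None
    window = mrna[begin:begin + len(KOZAK)]
    return sum(
        base in IUPAC[symbol]
        for i, (base, symbol) in enumerate(zip(window, KOZAK))
        if not START_OFFSET <= i < START_OFFSET + 3
    )


strong = "GGGCCACCAUGGCU"   # hypothetical AUG in a perfect context
weak = "UUUUUUUCUGAAA"      # hypothetical CUG in a poor context
print(kozak_matches(strong, strong.index("AUG")))
print(kozak_matches(weak, weak.index("CUG")))
```

A fuller treatment would weight positions differently rather than counting them equally.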

In prokaryotes, alternative start codons may be used up to 20% of the time to initiate translation. However, genes starting with AUG are expressed at significantly higher levels compared to other start codons. The AUG start codon is actively maintained through selection, though the purifying selection of start codons is still significantly weaker than the selection of codons in coding sequences. The next most frequently used start codon is GUG, followed by UUG (Belinky, Rogozin & Koonin, 2017). AUG, GUG and UUG may be thought of as the canonical start codons, but in fact translation initiation has been detected from 47 of the 64 triplet codons in E. coli (Hecht et al., 2017). Unlike the scanning model of translation in eukaryotes, in prokaryotes the ribosome binds directly to the Shine-Dalgarno sequence on the mRNA, an important translation initiation signal. When an alternative start codon is used, this tends to be compensated by mutations in the Shine-Dalgarno sequence that result in a stronger translation initiation signal (Belinky, Rogozin & Koonin, 2017). A well-known example of a coding region with a non-AUG start codon is lacI in the E. coli lac operon, which starts with GUG (Farabaugh, 1978).

Although alternative start codon use is less common in eukaryotes than in prokaryotes, it still plays an important biological role. The use of alternative start codons can allow the production of several different proteins from a single gene, each with a different N-terminal domain. This contributes to protein diversity. Different isoforms produced using alternative start codons have distinct biological functions. As the N-terminal domain of a protein often contains signal peptides, using alternative start codons is one way to direct proteins to different compartments (Touriol et al., 2003). Up-regulation of non-AUG translation occurs during development and when cells undergo stress. In addition, misregulation of non-AUG translation contributes to the development of cancer and numerous neurological diseases (Kearse & Wilusz, 2017).

Cancer growth can either be promoted or inhibited by non-AUG translation. An example where cancer growth is promoted is the non-AUG translation of fibroblast growth factor 2 (FGF2). If the AUG start codon is used, an 18-kDa protein isoform is formed which is mostly present in the cytosol or secreted. However, there are four upstream CUG codons which can act as alternative start codons. The protein isoforms produced from these CUG codons are localised to the nucleus. These CUG-isoforms have been shown to promote cell immortalisation in culture and had tumorigenic properties when injected into mice. It is still poorly understood how alternative start codon use is activated in cancer, but it is possibly due to misregulation of the eukaryotic initiation factors (eIFs) in cancers (Kearse & Wilusz, 2017).

In addition to cancer, non-AUG translation also plays a role in a number of neurological diseases called nucleotide repeat disorders. These include Huntington’s disease and fragile X disorders. These diseases are caused by mutations that increase the number of nucleotide repeats in a gene above a certain threshold. For example, in fragile X disorders affected patients have >55 CGG repeats in the 5’ region of the FMR1 gene, whilst those who are unaffected have a mean number of 30 repeats. The increased number of repeats can induce non-AUG translation, causing what is called repeat-associated non-AUG (RAN) translation. The resulting protein products are toxic to cells, resulting in the neurological disease. It is still unclear why RAN translation is induced by the number of nucleotide repeats (Kearse & Wilusz, 2017).

Overall, non-AUG translation has wide implications not only for genome annotation in prokaryotes, but also for protein diversity, disease and cell function in eukaryotes. Yet more research needs to be done in prokaryotes to check for unannotated open reading frames with non-canonical start codons and in eukaryotes to elucidate how the use of alternative start codons is induced or repressed.

Asano, K. (2014) Why is start codon selection so precise in eukaryotes? Translation. 2 (1), e28387. Available from: doi: 10.4161/trla.28387.

Belinky, F., Rogozin, I. B. & Koonin, E. V. (2017) Selection on start codons in prokaryotes and potential compensatory nucleotide substitutions. Scientific Reports. 7 (1), 12422. Available from: doi: 10.1038/s41598-017-12619-6.

Farabaugh, P. J. (1978) Sequence of the lacI gene. Nature. 274 (5673), 765-767. Available from: doi: 10.1038/274765a0.

Hecht, A., Glasgow, J., Jaschke, P. R., Bawazer, L. A., Munson, M. S., Cochran, J. R., Endy, D. & Salit, M. (2017) Measurements of translation initiation from all 64 codons in E. coli. Nucleic Acids Research. 45 (7), 3615-3626. Available from: doi: 10.1093/nar/gkx070.

Ivanov, I. P., Firth, A. E., Michel, A. M., Atkins, J. F. & Baranov, P. V. (2011) Identification of evolutionarily conserved non-AUG-initiated N-terminal extensions in human coding sequences. Nucleic Acids Research. 39 (10), 4220-4234. Available from: doi: 10.1093/nar/gkr007.

Kearse, M. G. & Wilusz, J. E. (2017) Non-AUG translation: a new start for protein synthesis in eukaryotes. Genes & Development. 31 (17), 1717-1731. Available from: doi: 10.1101/gad.305250.117.

Kochetov, A. V. (2008) Alternative translation start sites and hidden coding potential of eukaryotic mRNAs. BioEssays : News and Reviews in Molecular, Cellular and Developmental Biology. 30 (7), 683-691. Available from: doi: 10.1002/bies.20771.

Touriol, C., Bornes, S., Bonnal, S., Audigier, S., Prats, H., Prats, A. & Vagner, S. (2003) Generation of protein isoform diversity by alternative initiation of translation at non-AUG codons. Biology of the Cell. 95 (3), 169-178. Available from: doi: 10.1016/S0248-4900(03)00033-9.


Transcriptome of Amoebophrya sp. ex Karlodinium veneficum

After de novo assembly using Trinity, the RNA-seq data from Amoebophrya ex K. veneficum contained 228,474 sequences, of which 65,096 were longer than 1 kb and 114,938 were longer than 500 bases. Trinity reported 30,970 sequence variants within the same graph component. Based on the distribution of the AT bias histogram (S1 Fig), the sequences were divided in an approximate ratio of one third parasite and two thirds host. There were 74,549 sequences with ≥58% AT content putatively attributed to the parasite [11], of which 38,063 were longer than 500 bases and 23,160 were longer than 1 kb. Of these sequences, 12,793 had additional sequence variants reported by Trinity. Of the sequences with ≥58% AT content, 1,573 contained CTCAAG within the first 30 bases, which matches the 3’ end of the dinoflagellate spliced leader sequence [41]. The remaining sequences with <58% AT content, or the putative host set, had 76,875 sequences longer than 500 bases and 41,936 sequences longer than 1 kb. Of these, 19,170 sequences had additional sequence variants. Of all sequences with <58% AT content, 11,495 contained a six base or longer spliced leader match within the first 30 bases as described above. All parasite sequences were submitted to TSA with the submission number GGWB00000000, and host sequences are under the accession GGWG00000000. In BLASTx searches against the NCBI reference sequence database using putative host sequences (<58% AT content) over 1 kb, 20,725 had hits with an e-value cutoff of 1e-10 or less. The ≥58% AT content sequences over 1 kb long produced 4,908 hits, of which 3,652 were unique after removing other sequences within the same graph component. For Amoebophrya sp. ex Akashiwo sanguinea, the RNA-seq data assembled using Trinity contained 237,751 sequences. The sequences for host and parasite could not be resolved using AT bias alone (S1 Fig). In addition, a robust naïve host library for A. sanguinea is not available for comparison.
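The partition by AT content and the spliced leader check described above can be sketched as follows. This is an illustration rather than the pipeline actually used: the function names are hypothetical, AT content is computed over the full sequence length (ambiguous bases count in the denominator), and only the exact six-base CTCAAG match is tested, not longer spliced leader matches.

```python
AT_THRESHOLD = 0.58        # sequences at or above this are putative parasite
SL_MOTIF = "CTCAAG"        # 3' end of the dinoflagellate spliced leader
SL_WINDOW = 30             # motif must lie within the first 30 bases


def at_fraction(seq):
    """Fraction of A and T bases over the whole sequence length."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    return (seq.count("A") + seq.count("T")) / len(seq)


def classify(seq):
    """Assign a transcript to the putative parasite or host fraction."""
    return "parasite" if at_fraction(seq) >= AT_THRESHOLD else "host"


def has_spliced_leader_end(seq):
    """True if CTCAAG lies entirely within the first 30 bases."""
    return SL_MOTIF in seq.upper()[:SL_WINDOW]


# Hypothetical toy sequences, far shorter than real transcripts.
examples = {"toy_at_rich": "ATATATGCATTA", "toy_gc_rich": "GGCTCAAGGCGCCG"}
for name, seq in examples.items():
    print(name, classify(seq), has_spliced_leader_end(seq))
```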

Features of Amoebophrya sp. ex K. veneficum genome assembly

The N50 size of the ABySS assembly of Amoebophrya sp. ex K. veneficum was 21 kb with 15,484 sequences over 1 kb long, while the SPAdes assembly had an N50 of 33 kb with 13,850 sequences over 1 kb. The SPAdes coverage was distributed around 14 fold, consistent with a genome size of 130 Mb. The ABySS and SPAdes assemblies contained a total of 123 Mb and 138 Mb in contigs over 1 kb respectively. A bacterial contaminant with a 97% SSU rDNA identity to Kordia algicida OT-1 was also found in the assembled data [42]. The bacterial contaminant was assembled into 80 ABySS contigs and 93 SPAdes scaffolds in the genome assemblies and spanned 4.5 Mb of sequence. This bacterial contaminant was not well represented in the transcriptome data with few BLASTn matches to the genomic data for this bacterium. An unknown quantity of host sequence was contained within the assembled genomic data. Host chloroplast sequences were not robustly assembled by ABySS with only 16 sequences (3 over 1 kb) covering 6,613 bases of the 142 kb chloroplast data determined by Gabrielsen et al. [33] using a BLASTn identity cut-off of 90%. The SPAdes scaffolds contained over 55 kb of chloroplast data in 73 scaffolds with only 13 over 1 kb. The K. veneficum host used in this experiment was CCMP 1975 isolated from the Chesapeake Bay which differs in toxin profiles and ITS sequence from K. veneficum isolated from the north Atlantic that was used for chloroplast genome sequencing [43]. The combination of AT bias, sequence size over 1 kb, over ten fold coverage, and matches between transcriptome and genome assembly could be used simultaneously to define genuine parasite genomic sequence.
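For readers unfamiliar with the N50 statistic quoted above, it is the contig length at which contigs of that length or longer contain at least half of the assembled bases. A minimal sketch of the usual definition follows; the function name is illustrative, and the assemblers' own reports may apply length filters that this sketch does not.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths.

    Contigs are sorted from longest to shortest and their lengths summed
    until the running total reaches half of the total assembly length;
    the length of the contig that crosses that point is the N50.
    """
    if not contig_lengths:
        raise ValueError("no contigs given")
    ordered = sorted(contig_lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Hypothetical toy assembly: total 54, half 27, reached at 10 + 9 + 8.
print(n50([2, 3, 4, 5, 6, 7, 8, 9, 10]))
```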

Anomalous alignment results

The BLASTx program reported the number of identities and positives in the pairwise alignment between each translated query and the best subject sequence or top hit. Identities are the count of identical amino acids in the alignment, while the positives are the identities plus all the amino acids with positive scores based on the amino acid scoring matrix (BLOSUM 62). Some BLASTx pairwise amino acid alignments of AT rich sequences contained stop codons when using the default or standard genetic code (NCBI genetic code 1) (Fig 1). The standard genetic code treats UGA, UAA, and UAG as stop codons. In the example shown in Fig 1, interpreting UGA as tryptophan (NCBI genetic code 4) increased the alignment score by three identities and one positive. However, translating UGA as tryptophan did not resolve all the stop codons. Translating UAA and UAG as glutamine (NCBI genetic code 6) increased the positive score by two. One UAA codon, when translated as glutamine was aligned with a methionine, not increasing the score, but removing the in frame stop codon. Thus, in sequences where all three typical stop codons were present in the BLASTx alignment, the union of the identity or positive scores under the two alternative codes yielded higher scores than either alone. However, in the example shown in Fig 1 the increased score accounts for most but not all stop codons. No single genetic code currently implemented in BLASTx simultaneously translates UGA as tryptophan, and UAA and UAG as glutamine, because then a stop codon would not be specified.

The nucleotide sequence from comp102324 was translated as a query under three different genetic codes with the dashes (-) representing identically translated codons. The top amino acid translation uses NCBI genetic code 1, where UGA, UAA, and UAG are all interpreted as stop (*). These codons are shown above the amino acid translation and are boxed. The NCBI genetic code 4 translates UGA as tryptophan (W), increasing the number of pairwise identities in this example by three, and the number of positive matches by one (identified in the midline with +). However, three UAA stop codons still interrupt the open reading frame. The third translation using NCBI code 6 interprets UAA and UAG as glutamine (Q), which does not increase the number of pairwise identities, but does increase the number of positives, or similar amino acids by two. In this example there are no UAG codons. The top blast hit (Candida tenuis ATCC10573 gi 575519749) and pairwise identities or positives when translating all three stop codons as amino acids are shown below the different translations.
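The effect of the three genetic codes on a translated query can be illustrated with a short sketch. It is not the BLASTx implementation: it models only the sense and stop codon assignments in which NCBI genetic codes 4 and 6 differ from code 1 (differences in permitted start codons are ignored), and the example sequence is hypothetical.

```python
from itertools import product

# NCBI genetic code 1 (standard), codons ordered UUU, UUC, UUA, UUG, ...
BASES = "UCAG"
STANDARD_AAS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), STANDARD_AAS)
}

# Reassignments relative to the standard code, keyed by NCBI code number.
CODE_OVERRIDES = {
    1: {},
    4: {"UGA": "W"},                 # UGA read as tryptophan
    6: {"UAA": "Q", "UAG": "Q"},     # UAA and UAG read as glutamine
}


def translate(rna, code=1):
    """Translate an RNA string in frame 0; '*' marks a stop codon.

    Trailing bases that do not form a full codon are ignored. Codons with
    ambiguous bases are not handled and raise KeyError.
    """
    table = {**STANDARD, **CODE_OVERRIDES[code]}
    usable = len(rna) - len(rna) % 3
    return "".join(table[rna[i:i + 3]] for i in range(0, usable, 3))


query = "AUGUGGUGAUAACAAUAG"   # hypothetical: AUG UGG UGA UAA CAA UAG
for code in (1, 4, 6):
    print(code, translate(query, code))
```

As in the example of Fig 1, neither alternative code alone removes every in frame stop from such a sequence.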

The identity and positive scores for top hits were compared between the three genetic codes for all 3,652 unique putative parasite sequences with BLASTx alignments in the initial annotation (Fig 2A). These 3,652 unique sequences were reduced from a total of 4,908 sequences when additional sequence variants were included, and all of the proportions are calculated without including sequences from the same graph component. Over half (54%) of putative parasite sequences had equal scores under the standard code and both alternative codes. When using NCBI genetic code 4 (UGA as tryptophan), 781 sequences had a higher number of identities or positives when compared to the NCBI reference sequence database. Similarly, when using NCBI genetic code 6 (UAA and UAG as glutamine) 1,528 sequences had higher number of identities or positive matches than the standard genetic code (Fig 2). Most queries with better scores when UGA was coded as tryptophan also had better scores when UAA and UAG were coded as glutamine (Fig 2). The increase in the number of identities for a given sequence ranged from one to 38, and the increase in the number of positive matches ranged from one to 51 when the de novo assembly was compared to the reference sequence database.

A. BLAST using the high AT content, putative parasite, sequences as queries and the NCBI reference sequence database as the subject (3,652 hits) or the translated transcriptome data from Amoebophrya sp. ex Akashiwo sanguinea as the subject (9,589 hits). For the comparison with the reference sequence database (3,652 queries with e-values ≤1e-10), roughly half (unshaded portion) had equal scores with all three genetic codes. Of the remaining sequences, most contained UAA or UAG codons which, when translated as glutamine, increased the BLASTx score (light grey shading). Many of these sequences also contained UGA with increased scores when translated as tryptophan (darkest shading contains both UGA and UAA or UAG). A small fraction of the UGA-containing sequences did not have increased scores when UAA or UAG were translated as glutamine (light grey shading).

The proportion of hits with equal and increased scores under different genetic codes was slightly larger in the strain to strain comparison than the proportion when using the reference sequence database (Fig 2B). When putative parasite sequences from Amoebophrya sp. ex K. veneficum were compared with the combined Amoebophrya sp. ex A. sanguinea host and parasite dataset formatted as a database, there were 9,589 novel hits at an e-value cut-off ≤1e-10 (Fig 2B). For the Amoebophrya sp. ex K. veneficum sequences that had hits to the reference sequence database, 4,761 (of the 4,908 with redundancy) sequences also had BLASTx results of 1e-10 or lower in the comparison between strains. Overall, across the BLASTx comparisons between the two Amoebophrya transcriptomes, there were 5,024 instances of UGA that, when translated as tryptophan, increased the alignment score.

StringTie was used to infer transcripts from the SPAdes assembly using the RNAseq reads. StringTie transcripts annotated with Diamond as a rapid BLASTx tool against the non redundant database recapitulated results from de novo assembly of RNAseq reads. A total of 31,094 transcripts were inferred by StringTie, and between 4,190 (genetic code 1) and 4,408 (genetic code 6) had hits to the non redundant database with an e-value cut-off <1e-9. As with the de novo assembled transcripts, the UAA and UAG codons were more frequent than UGA: 38,025 UAA and UAG codons were counted in the top hits, while UGA codons were found 9,369 times.

As a sort of control, three datasets were used as queries: 1) the <58% AT content, or putative host, fraction from the same RNAseq data as the parasite sequences, 2) the ≥58% AT content parasite sequences, and 3) the combined Amoebophrya sp. ex A. sanguinea host and parasite transcriptome assembly. These three datasets were compared to a database of a single species, Perkinsus marinus (which is part of the reference sequence database), using different genetic codes as described above. Using a single species database provides more consistency in the searches and speeds the analysis. For the first dataset, the <58% AT content sequences attributed to K. veneficum, where each query was translated with multiple codes as above, there were 219 instances where in frame stops were found out of 13,314 top BLASTx matches with an e-value cut-off of <1e-10, and 83 examples of increased BLASTx scores (host organelle transcripts are AT biased and treated below). Of these 83 sequences with increased BLASTx scores, eleven were likely parasite sequences based on genomic coverage and near 58% AT content, leaving a total of 72 out of 13,314 total hits with increased scores in this pairwise comparison. When the second dataset, the 4,906 sequences from the ≥58% AT content fraction that had matches to the reference sequence database, was compared to P. marinus, there were 3,480 BLASTx results, of which 1,761 had stop codons in the query sequences, and 1,402 instances where either one or both alternative codes produced better results than the standard genetic code. Finally, in the third dataset of combined Amoebophrya sp. ex A. sanguinea host and parasite sequences, there were 18,757 queries with e-values <1e-10 when compared to P. marinus (top hits only), and 377 instances of stop codons were seen in these queries. Of these, only 158 had increased scores with alternative genetic codes, of which 81 matched a single sequence from P. marinus, XP_002768980. As a test for a non-translated RNA misannotated as protein coding, the corresponding nucleotide sequence from P. marinus (XM_002768934) was used for an RFAM search and BLASTn comparison with Amoebophrya sp. ex A. sanguinea, neither of which revealed conserved nucleotide matches. For sequences likely derived from the K. veneficum host or the combined Amoebophrya sp. ex A. sanguinea sequences, there were infrequent examples of increased scores in BLASTx comparisons with different genetic codes.

Identifying the amino acids associated with typical stop codons

In Amoebophrya sp. ex K. veneficum sequences with increased BLASTx scores against the reference sequence database using alternative genetic codes, the most common amino acids aligned with query UGA codons were tryptophan, leucine, tyrosine, and phenylalanine (Fig 3A). Although increased scores indicated at least one site where UGA as tryptophan increased the pairwise alignment score, the results described in Fig 3 included all alignment positions where a UGA was present, including polar amino acids but not gaps. The default BLASTx BLOSUM62 scoring matrix for tryptophan has only phenylalanine and tyrosine as positive matches; selenocysteine was very infrequent and is not represented (see below). For UAA and UAG codons the most frequent amino acid was glutamine, but glutamate, lysine and arginine were also present and would be counted as positive matches (Fig 3B). The results from FACIL were consistent: UAA and UAG were frequently associated with glutamine or glutamate (Fig 3C). The FACIL results for the standard glutamine codons, CAA and CAG, were similar to those for UAA and UAG. With FACIL analysis, UGA and UGG were associated with tryptophan, leucine, tyrosine, and phenylalanine. As a further test, the proportion of UGA aligning with tryptophan in comparisons between the two parasite strains was calculated for alignments with e-value cut-offs from <1e-10 to <1e-200 using the set of putative parasite genes with increased scores when compared with the reference sequence database. The proportion of tryptophan for UGA codons increased from 72 to 83% as the e-value stringency was increased. A similar test with UAA or UAG codons showed 42% matching glutamine at an e-value cut-off of 1e-10, increasing to 53% at an e-value cut-off of 1e-200.

A. The amino acids found in the top BLASTx alignments to the NCBI reference sequence protein database where scores were increased relative to the standard genetic code when encoding UGA as tryptophan. The amino acids are grouped into hydrophobic, polar, negative or positively charged, and special cases. The UGA commonly was aligned to gaps, but these cases are not shown here. Only alignments to tryptophan, tyrosine, or phenylalanine increase BLASTx positive alignment scores with the matrix used; however, all UGA codons within sequences with increased BLASTx scores are used in this comparison. B. Amino acids associated with UAA and UAG codons in BLASTx comparisons where scores were increased when translating UAA and UAG as glutamine. Gaps were not included in this analysis. C. FACIL analysis of the genetic code. The four possible glutamine and two likely tryptophan codons are shown. The most commonly found amino acids when these codons are translated and compared to the protein family (pfam) database are shown as a sequence logo with height proportional to frequency as inferred by FACIL.

A comparison using all available genetic codes in BLASTx was conducted against the P. marinus database using the 2,006 queries from Amoebophrya sp. ex K. veneficum with increased scores against the reference sequence database when using alternative genetic codes 4 and 6. In this comparison of 19 genetic codes, the highest positive score was found in 1,020 of 1,499 hits when using NCBI code 6 (UAA and UAG as glutamine), followed by genetic code 1 with 251, while 507 sequences had no hit (at an e-value cut-off of 1×10⁻¹⁰). Recoding UGA as tryptophan is common to nine of the 19 different genetic codes tested.
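To make the effect of the alternative codes concrete, the following is a minimal sketch of translating one reading frame under NCBI codes 1, 4 and 6. It only models the reassignments named in the text (UGA as tryptophan for code 4, UAA and UAG as glutamine for code 6); the function name `translate` and the example sequence are illustrative assumptions, not part of the original analysis, which used BLASTx.

```python
from itertools import product

BASES = "UCAG"
# Amino acids of the standard code in UCAG order ('*' marks a stop codon).
STANDARD_AA = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), STANDARD_AA)}

# Reassignments relative to the standard code (NCBI table 1).
CODE_OVERRIDES = {
    1: {},                        # standard code
    4: {"UGA": "W"},              # UGA read as tryptophan
    6: {"UAA": "Q", "UAG": "Q"},  # UAA and UAG read as glutamine
}

def translate(mrna, code=1):
    """Translate an mRNA string in frame 0 without stopping at stop codons."""
    table = {**STANDARD, **CODE_OVERRIDES[code]}
    codons = (mrna[i:i + 3] for i in range(0, len(mrna) - 2, 3))
    return "".join(table[c] for c in codons)

example = "AUGUGAUAAUGGCAGUAG"  # hypothetical sequence: AUG UGA UAA UGG CAG UAG
for code in (1, 4, 6):
    print(code, translate(example, code))
```

Stops are kept as `*` rather than ending translation, so the three translations can be compared position by position.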

Determining the stop codon

Because increases in BLASTx scores were observed for all three typical stop codons, the stop codon was determined from a set of highly conserved sequences. Of the 158 putative parasite sequences in the de novo transcriptome assembly with a zero e-value to NCBI reference sequences, 74 had a complete open reading frame ending in a UGA codon that was in a similar position to the stop codon of the best BLASTx result from the reference sequence database. Five contained upstream in-frame UGA codons, followed by a UGA codon in a position consistent with a stop, again based on comparison to top BLAST hits. One had an apparent intron, but also contained a UGA consistent with a stop codon. The stop codon was predicted to be UGA in all cases and no UAA or UAG codons were found in positions consistent with a stop. The remaining 78 sequences were partial in comparison to the BLAST subjects and were not treated further. Similarly, the stop codon was consistently UGA in ribosomal proteins aligned for phylogenetic analysis [7]. In contrast, the positions of UGA codons inferred to be tryptophan based on BLASTx comparisons were biased towards the start or 5’ end of the mRNA when compared with the total length of the top hit (Fig 4A). This result was consistent when using comparisons to either the reference sequence database or P. marinus.

A. The amino acid position of the last or most 3’ UGA codon in the query sequence alignment was divided by the total length of the top hit in the BLASTx comparison and plotted as a histogram to indicate where the UGA codons were found relative to the total length of the subject sequence. Comparisons to P. marinus (red) and the reference sequence (black) databases are shown for AT biased sequences with increased BLASTx scores when using alternative genetic codes. B. The composition before and after UGA apparently encoding stop versus UGA as tryptophan, based on increased BLASTx scores when UGA was translated as tryptophan. In-frame UGA codons inferred as stop from 457 non-redundant sequences from the parasite fraction were compared to 333 inferred as tryptophan using two sample logo with the t-test. The size of the nucleotide shows the relative enrichment or depletion of each nucleotide in the upstream and downstream positions, but is shown only for nucleotides enriched or depleted at a p-value below 0.05 (see S2 Fig for raw frequency values).
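The relative-position measure described for panel A can be sketched as a simple ratio. This is a simplified illustration only: it takes a list of query codons and a subject length as inputs, whereas the original analysis worked from BLASTx alignment coordinates. The function name and inputs are assumptions.

```python
def last_uga_relative_position(codons, subject_length):
    """Position (1-based, in codons) of the 3'-most UGA divided by subject length.

    Returns None when the query contains no UGA codon.
    """
    positions = [i for i, codon in enumerate(codons, start=1) if codon == "UGA"]
    if not positions:
        return None
    return positions[-1] / subject_length

# Hypothetical query of five codons aligned to a subject of 10 residues.
query = ["AUG", "UGA", "GCU", "UGA", "GGG"]
print(last_uga_relative_position(query, 10))
```

Values near 1 would correspond to a UGA near the end of the subject sequence (consistent with a stop), while smaller values indicate an internal UGA.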

The nucleotide frequency before and after UGA codons differed only slightly between those inferred to encode tryptophan and those inferred as stop (Fig 4B). For UGA codons that likely encode tryptophan based on increased BLASTx scores there was a repeated pattern of increased AT bias at third positions both before and after UGA codons (S2 Fig). This would be expected in coding sequences where second and first positions are more constrained, but was not seen before or after stop codons. The two sample logo program uses a t-test to determine significant enrichment or depletion at p values less than 0.05 for the bases centered around the two different senses of the UGA codon (Fig 4B). This test reveals a general pattern of increased U two and seven bases downstream of the stop codon. However, these results need to be interpreted with caution as the overall abundance of U was below 20.2% at these positions. Inverting the input order to demonstrate enrichment or depletion in coding sequence relative to stop codons showed even lower enrichment values of 15.2% for significantly enriched or depleted bases.

Sources of interrupted reading frames: Organellar genes, RNA editing, and selenocysteine

Organellar transcripts, RNA editing and UGA codons encoding selenocysteine could independently contribute to finding typical stop codons in transcriptome datasets. For example, host organelle transcripts were found in the ≥58% AT content sequences. A total of 26 transcript sequences (21 non-redundant) with AT content from 59–66%, had >90% nucleotide identity to the previously determined K. veneficum chloroplast sequences [33]. Only one sequence had an increased score with the alternative genetic code, and that fragment corresponded to a chloroplast rRNA region.

The mitochondrial genome of dinoflagellates and apicomplexans has only three protein coding genes, coxI, coxIII, and cytB [44]. RNA editing is also a common feature in dinoflagellate mitochondrial transcripts and provides an opportunity to test for RNA editing as a potential reason for in-frame stop codons in Amoebophrya sp. ex K. veneficum transcripts. Because these protein-coding genes are conserved, AT biased, and highly expressed, they can be readily identified based on text searches of annotations followed by manual inspection and sorted between host and parasite using BLASTn searches against the uninfected host data [32,34]. Comparing genomic and expressed sequences, the mitochondrion-encoded transcripts from Amoebophrya sp. ex K. veneficum showed changes consistent with mitochondrial editing (Table 1). The three parasite protein-coding transcripts from the mitochondrion for coxI, coxIII, and cytB, were strongly AT biased (68–69% AT content). Comparing the genome and RNA-seq data for these transcripts showed a total of 57 differences, similar to values seen in core dinoflagellates [44] and Hematodinium sp. [45]. However, in Amoebophrya sp. ex K. veneficum the editing was limited exclusively to A->G and T->C changes.
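As a rough illustration of how genome-versus-transcript differences can be tallied by substitution type, the sketch below counts mismatches between two pre-aligned sequences of equal length. It assumes no insertions or deletions and that the transcript is written in the DNA alphabet (as cDNA); the function name and example sequences are hypothetical.

```python
from collections import Counter

def editing_changes(genomic, transcript):
    """Count substitution types between two pre-aligned, equal-length sequences."""
    if len(genomic) != len(transcript):
        raise ValueError("sequences must be pre-aligned and of equal length")
    return Counter(
        f"{g}->{t}" for g, t in zip(genomic, transcript) if g != t
    )

# Hypothetical genomic and cDNA fragments.
changes = editing_changes("ATTAGCA", "GTCAGCG")
print(dict(changes))
```

If editing were limited to the two types named in the text, the counter would contain only the keys `A->G` and `T->C`.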

Similarly, the rare amino acid selenocysteine is typically encoded by UGA codons that are contingently translated based on features of the 3’ UTR [46]. Only two instances of selenocysteine were found in the BLASTx search; however, automatically generated open reading frames may not properly map UGA to selenocysteine, instead treating this codon as a stop. A previous survey of Oxyrrhis marina found a total of four selenoproteins [47]. Based on text searches of annotations and comparisons to core dinoflagellate transcriptomes, a total of three likely selenocysteine-containing genes were found among the AT biased, putative parasite sequences: selT, selM, and selO, each with one UGA likely encoding selenocysteine. Using SECISaln [46], a selenocysteine insertion sequence was found in the 3’ UTR of selT.

Gene ontology terms

The putative parasite sequences from Amoebophrya sp. ex K. veneficum were divided into two bins based on the optimal BLASTx score when using the reference sequence database. The first contained the 2,900 (1,988 non redundant) sequences where the scores were unchanged with different genetic codes (standard set), and the second contained 2,006 (1,664 non redundant) sequences where the optimal score was found using either the ciliate or mitochondrial genetic codes or both (alternative set) (Fig 2A). These same two sets of sequences were also used for codonw analysis as described in detail below. The overlap between sequences containing UAA or UAG and those with UGA that increased BLASTx scores demonstrated these codons were often detected together on the same transcripts (Fig 1). Because translation with no defined stop codon is very difficult to accomplish in silico, especially with long UTRs, a relational strategy was used to create proxy amino acid sequences for GO analysis. Nucleotide sequences from Amoebophrya sp. ex K. veneficum were matched to translated sequences from Amoebophrya sp. ex A. sanguinea based on top BLASTx hits. The top matching Amoebophrya sp. ex A. sanguinea protein sequences were extracted and then sorted into those that contained typical stop codons in Amoebophrya sp. ex K. veneficum (1,118 sequences with average length of 1,059 amino acids) and those that did not (1,125 sequences with an average length of 629 amino acids), followed by annotation using BLAST2GO. The 1,118 sequences with the alternative codons had 359 KO terms and were enriched (p ≤ 0.05) for microtubule motor function, ion channel activity, and ATP and DNA binding (S3 Fig). For example, most of the dynein heavy chain protein family members (described in detail below), and all three DNA dependent RNA polymerase large subunits I, II, and III demonstrated alternative codon use. The coding sequence most enriched for UGA codons contained 12 UGA codons interspersed with 13 UGG codons and had a BLAST2GO hit to DNA photolyase. On the other hand, the standard set of 1,125 sequences with 717 KO terms was enriched for translation initiation, ribosomal proteins, cation transporters, unfolded protein binding, and threonine type endopeptidase activity (S3 Fig).

Typical stop codons in dynein heavy chain transcripts

With a genome and transcriptome survey in hand and preliminary results from GO terms suggesting potential for codon bias and different categories of transcripts, a tractable gene family was selected which would provide a robust and practical challenge for the assertion that UGA was conditionally translated and often linked with the other two typical stop codons. Another goal was to test if pseudogenes or unspliced introns led to the BLASTx results presented above. Dynein Heavy Chain (DHC) was used as an example since these are amongst the longest transcripts and inference of the 13 to 16 kb transcripts required long sections of genomic assembly. Preliminary analysis of the de novo assembled transcripts suggested high expression levels and a wide range of typical stop codons for sequences annotated as dynein heavy chain. From an annotation perspective the gene family is tractable, has a well organized nomenclature, and there are different roles for the different subclasses [48]. Using 67 queries for Symbiodinium kawagutii gleaned from the CyMoBase website and a cutoff of >3,000 pairwise amino acid alignment to cover the majority of the query sequences, a total of 14 DHC transcripts were found in tBLASTx searches against the StringTie transcripts (Table 2). The identity between S. kawagutii and Amoebophrya sp. ex K. veneficum ranged from 46% over 3,680 amino acids for DHC1, to 67% over 4,717 amino acids for DHC3A. Using the nomenclature of Kollmar [48] and best tBLASTx identity for preliminary orthology assessment, the inventory putatively identified DHC subtypes 1 to 9. Reciprocal best hits between genomes are not yet tractable, as both the S. kawagutii and Amoebophrya genomes are likely not complete. For these dynein sequences, the 3’UTR varied from 365 to 1,807 bases, with an average length of 1,022 bases.

Alternative start codons are different from the standard AUG codon and are found in both prokaryotes (bacteria and archaea) and eukaryotes. Alternate start codons are still translated as Met when they are at the start of a protein (even if the codon encodes a different amino acid otherwise). This is because a separate transfer RNA (tRNA) is used for initiation. [1]
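The rule that an alternate start codon is read as Met at initiation, but as its usual amino acid elsewhere, can be sketched as follows. This is a toy illustration: the function name is hypothetical, and the default set of accepted start codons (AUG, GUG, UUG) is an assumption chosen to mirror common bacterial usage.

```python
from itertools import product

BASES = "UCAG"
STANDARD_AA = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), STANDARD_AA)}

def translate_orf(mrna, start_codons=("AUG", "GUG", "UUG")):
    """Translate an ORF whose first codon is a start codon.

    The first codon always yields 'M' (the initiator tRNA carries Met);
    later codons are decoded with the standard table until a stop.
    """
    codons = [mrna[i:i + 3] for i in range(0, len(mrna) - 2, 3)]
    if not codons or codons[0] not in start_codons:
        raise ValueError("sequence does not begin with an accepted start codon")
    protein = ["M"]
    for codon in codons[1:]:
        amino_acid = STANDARD[codon]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)

# GUG initiates as Met, but the internal GUG is read as valine.
print(translate_orf("GUGGUGUUGUAA"))
```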

Eukaryotes

Alternate start codons (non-AUG) are very rare in eukaryotic genomes. However, naturally occurring non-AUG start codons have been reported for some cellular mRNAs. [2] Seven out of the nine possible single-nucleotide substitutions at the AUG start codon of dihydrofolate reductase were functional as translation start sites in mammalian cells. [3] In addition to the canonical Met-tRNA Met and AUG codon pathway, mammalian cells can initiate translation with leucine using a specific leucyl-tRNA that decodes the codon CUG. [4] [5]

Candida albicans uses a CAG start codon. [6]

Prokaryotes

Prokaryotes use alternate start codons significantly, mainly GUG and UUG. [7]

E. coli uses 83% AUG (3542/4284), 14% (612) GUG, 3% (103) UUG [8] and one or two others (e.g., an AUU and possibly a CUG). [9] [10]

Well-known coding regions that do not have AUG initiation codons are those of lacI (GUG) [11] [12] and lacA (UUG) [13] in the E. coli lac operon. Two more recent studies have independently shown that 17 or more non-AUG start codons may initiate translation in E. coli. [14] [15]

Mitochondria

Mitochondrial genomes use alternate start codons more significantly (AUA and AUU in humans). [7] Many such examples, with codons, systematic range, and citations, are given in the NCBI list of translation tables. [16]

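Standard genetic code (each row group shares a 1st base, columns give the 2nd base, and the 3rd base is listed at the right)

1st | 2nd base U | 2nd base C | 2nd base A | 2nd base G | 3rd
U | UUU (Phe/F) Phenylalanine | UCU (Ser/S) Serine | UAU (Tyr/Y) Tyrosine | UGU (Cys/C) Cysteine | U
U | UUC (Phe/F) | UCC (Ser/S) | UAC (Tyr/Y) | UGC (Cys/C) | C
U | UUA (Leu/L) Leucine | UCA (Ser/S) | UAA Stop (Ochre) [B] | UGA Stop (Opal) [B] | A
U | UUG (Leu/L) [A] | UCG (Ser/S) | UAG Stop (Amber) [B] | UGG (Trp/W) Tryptophan | G
C | CUU (Leu/L) | CCU (Pro/P) Proline | CAU (His/H) Histidine | CGU (Arg/R) Arginine | U
C | CUC (Leu/L) | CCC (Pro/P) | CAC (His/H) | CGC (Arg/R) | C
C | CUA (Leu/L) | CCA (Pro/P) | CAA (Gln/Q) Glutamine | CGA (Arg/R) | A
C | CUG (Leu/L) | CCG (Pro/P) | CAG (Gln/Q) | CGG (Arg/R) | G
A | AUU (Ile/I) Isoleucine | ACU (Thr/T) Threonine | AAU (Asn/N) Asparagine | AGU (Ser/S) Serine | U
A | AUC (Ile/I) | ACC (Thr/T) | AAC (Asn/N) | AGC (Ser/S) | C
A | AUA (Ile/I) | ACA (Thr/T) | AAA (Lys/K) Lysine | AGA (Arg/R) Arginine | A
A | AUG (Met/M) Methionine [A] | ACG (Thr/T) | AAG (Lys/K) | AGG (Arg/R) | G
G | GUU (Val/V) Valine | GCU (Ala/A) Alanine | GAU (Asp/D) Aspartic acid | GGU (Gly/G) Glycine | U
G | GUC (Val/V) | GCC (Ala/A) | GAC (Asp/D) | GGC (Gly/G) | C
G | GUA (Val/V) | GCA (Ala/A) | GAA (Glu/E) Glutamic acid | GGA (Gly/G) | A
G | GUG (Val/V) | GCG (Ala/A) | GAG (Glu/E) | GGG (Gly/G) | G

[A] The codon AUG both codes for methionine and serves as an initiation site: the first AUG in an mRNA's coding region is where translation into protein begins. [17] The other start codons listed by GenBank are rare in eukaryotes and generally code for Met/fMet. [18]
[B] The historical basis for designating the stop codons as amber, ochre and opal is described in an autobiography by Sydney Brenner [19] and in a historical article by Bob Edgar. [20]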
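The redundancy of the standard genetic code can be tallied directly. The sketch below builds the 64-codon table and counts how many codons map to each amino acid or to termination; the variable names are illustrative only.

```python
from collections import Counter
from itertools import product

BASES = "UCAG"
# Amino acids of the standard code in UCAG order ('*' marks a stop codon).
STANDARD_AA = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), STANDARD_AA)}

# Number of codons assigned to each amino acid (or to '*', termination).
degeneracy = Counter(STANDARD.values())

single_codon = sorted(aa for aa, n in degeneracy.items() if n == 1)
print("single-codon amino acids:", single_codon)
print("stop codons:", sorted(c for c, aa in STANDARD.items() if aa == "*"))
```

Only methionine (M) and tryptophan (W) come out with a single codon, while termination has three.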

Engineered initiator tRNAs (tRNA fMet2 with CUA anticodon) have been used to initiate translation at the amber stop codon UAG. [21] This type of engineered tRNA is called a nonsense suppressor tRNA because it suppresses the translation stop signal that normally occurs at UAG codons. One study has shown that the amber initiator tRNA does not initiate translation to any measurable degree from genomically-encoded UAG codons, only plasmid-borne reporters with strong upstream Shine-Dalgarno sites. [22]

One codon, two amino acids – the genetic code has a Shift key

Living things, from bacteria to humans, depend on a workforce of proteins to carry out essential tasks within their cells. Proteins are chains of amino acids that are strung together according to instructions encoded within that most important of molecules – DNA.

The string of “letters” that make up DNA correspond to chains of amino acids, and they are read in threes, with every combination representing one of many amino acids. Until now, scientists believed that this relationship is unambiguous – within any single genome, every three-letter combination maps to one and only one amino acid. This strict one-to-one relationship is a tenet of genetics, but new research shows that it’s not an absolute one.

A team of American scientists have found a surprising exception to this rule, within a sea microbe called Euplotes crassus. In its genome, one particular triplet of DNA letters can stand for one of two different amino acids – cysteine or selenocysteine – even within the same gene. It all depends on context. This is the first time that such dual-coding has been spotted in the genes of any living thing.

Before I go any further, it’s probably a good idea to have a quick primer on the genetic code for non-scientists. Anyone with prior knowledge of genetics can just skip the next four paragraphs. DNA is a chain of four molecules called nucleotides – adenine, cytosine, guanine and thymine, represented by the letters A, C, G and T. These sequences are transcribed into a similar molecule called messenger RNA (mRNA), which contains three of the same nucleotides, but replaces thymine with uracil (U). It’s the information coded by mRNA that is finally translated into proteins.

Proteins are built from 20 different amino acids, chained together in various combinations. In mRNA, every three letters correspond to a specific amino acid. These three-letter combinations are called “codons”, the genetic equivalent of words. For example, the codon CCC (three cytosines in a row) corresponds to the amino acid proline, while AAA (three adenines) corresponds to lysine. And some codons act as full-stops, indicating that the amino acid chain has come to an end.

This genetic code is almost universal. The same codons almost always match up to the same amino acids in tiny bacteria, tall trees and thoughtful humans. There are a few deviations from the universal template, but even then, the differences are relatively minor. Think about computer keyboards – almost all have the same configuration of keys for various letters and symbols, but some will have the @ key in a different place.

The genetic code is redundant, so that several codons represent the same single amino acid, but there are no ambiguities. There are no examples of a single codon within any genome that represents more than one amino acid. That is, until now.

The Euplotes crassus Code

Anton Turanov, Alexey Lobanov and Vladimir Gladyshev from the University of Nebraska have discovered that in Euplotes crassus, the UGA codon can mean either cysteine or selenocysteine, depending on its location in the gene.

In the universal code, UGA is a stop signal but many species use it to signify selenocysteine, an amino acid that isn’t represented in the universal code. This alternative translation of UGA into selenocysteine hinges on a structure called a SECIS element. The SECIS is part of the mRNA molecule itself but sits outside the region that actually codes for amino acids. It’s like a genetic Shift key – its presence changes the meaning of UGA codons that sit before it.

What makes E. crassus unique is the fact that its UGA codons can mean either selenocysteine or cysteine – a choice between two amino acids rather than one amino acid and a stop signal.

Turanov and Lobanov analysed the microbe’s tRNAs, molecules with one end that recognises a specific codon and another that sticks to its corresponding amino acid. These are the decoders that translate strings of codons into strings of amino acids. It turned out that E. crassus has different tRNAs that recognise UGA – one of these matches the codon with cysteine and another matches it with selenocysteine.

Turanov and Lobanov also purified a protein from E. crassus called TR1. Its RNA has a SECIS element and five UGA codons, and the duo found that the first four of these are translated into cysteines and the fifth into selenocysteine. Location is all-important when it comes to working out which interpretation comes out top. When Turanov and Lobanov added lots of UGA codons at sites throughout the TR1 gene, they found the vast majority were translated into cysteines. Only those inserted at the end of the gene, within its final 20 codons and near the SECIS element, were interpreted as selenocysteines.

So the SECIS element, in its Shift-key role, affects the fate of nearby UGAs. To confirm that, Turanov and Lobanov replaced the entire SECIS element in the TR1 gene with an equivalent element from a different gene and a different species. They found that this new SECIS element had a wider zone of influence: when it was introduced, UGA codons that sat outside the final 20 were translated into selenocysteines instead of cysteines.
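The position-dependent reading of UGA described above can be caricatured with a toy rule: a UGA inside a window at the 3' end of the coding sequence is read as selenocysteine (U), and any other UGA as cysteine (C). This is only a sketch under strong assumptions. The real mechanism depends on the SECIS element and the translation machinery, and the function name, the hard cut-off and the way the window is counted are all hypothetical.

```python
def decode_uga_positions(codons, window=20):
    """Toy model: map each UGA codon index to 'U' (Sec) or 'C' (Cys).

    A UGA within the last `window` codons is read as selenocysteine;
    a UGA anywhere else is read as cysteine.
    """
    total = len(codons)
    readings = {}
    for index, codon in enumerate(codons):
        if codon == "UGA":
            readings[index] = "U" if index >= total - window else "C"
    return readings

# Hypothetical gene of 38 codons with one early and one late UGA.
gene = ["AUG", "UGA"] + ["GCU"] * 30 + ["UGA"] + ["GCU"] * 5
print(decode_uga_positions(gene))             # default window of 20 codons
print(decode_uga_positions(gene, window=38))  # a wider "zone of influence"
```

Widening `window` mimics the swapped-in SECIS element with a larger zone of influence: UGAs further upstream are then read as selenocysteine in this toy model.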

So in E. crassus, the UGA codon is not tied to a single fate – it has a choice. It can be interpreted in two different ways, depending on its location and that of the SECIS element that influences it. One codon, two amino acids – it’s a unique set-up and further proof that the genetic code, universal though it almost is, is open to expansion and evolutionary change.

Reference: A. A. Turanov, A. V. Lobanov, D. E. Fomenko, H. G. Morrison, M. L. Sogin, L. A. Klobutcher, D. L. Hatfield, V. N. Gladyshev (2009). Genetic Code Supports Targeted Insertion of Two Amino Acids by One Codon. Science, 323 (5911), 259-261. DOI: 10.1126/science.1164748


Computation of approximate gene positions

Initial predictions of the PCGs are computed using the hidden Markov models (HMMs) and methods from (20), which in turn make use of HMMER (21). These models have been generated based on the amino-acid sequences of the PCGs in RefSeq 63 with an automated method that takes their phylogenetic classification into account. It was shown that the predictions made with these models are specific and sensitive, but lack a precise annotation of the start and stop codon positions (20). Therefore, we subsequently improve the start and stop codon positions of these initial annotations. We first briefly describe how the gene boundaries were selected in the original implementation of MITOS and then introduce our newly developed approach in detail.

Prediction of start and stop codon positions in MITOS

The original implementation of MITOS employs a very simple method to predict start and stop codons of PCGs. Given approximate start and stop positions (provided, e.g., by a BLAST search), the proximity (by default a range of ±6 amino acids) is checked for start and stop codons, respectively, using the genetic code tables of the NCBI Taxonomy (22). If no valid start or stop codon can be identified, MITOS chooses the corresponding approximate position as gene boundary.
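A minimal sketch of such a window search is given below. It is not the MITOS code: the function name, the 0-based nucleotide coordinates and the choice of the nearest matching codon are assumptions made for illustration (the text only states that the proximity is checked and that the approximate position is kept when nothing valid is found).

```python
def refine_boundary(seq, approx, valid_codons, window=6):
    """Look for a valid codon within +/- `window` codons of `approx`.

    `approx` is a 0-based, in-frame nucleotide index. Returns the position
    of the nearest valid codon (an assumed tie-break: the upstream one wins),
    or `approx` itself when no valid codon lies in the window.
    """
    candidates = []
    for offset in range(-window, window + 1):
        pos = approx + 3 * offset
        if 0 <= pos <= len(seq) - 3 and seq[pos:pos + 3] in valid_codons:
            candidates.append(pos)
    if not candidates:
        return approx  # fall back to the approximate position
    return min(candidates, key=lambda p: abs(p - approx))

# Hypothetical sequence with an ATG one codon upstream of the estimate.
print(refine_boundary("CCCATGCCCCCCCCC", approx=6, valid_codons={"ATG"}))
```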

Improved probabilistic prediction of start and stop codon positions

Given an approximate gene position, all codons between the adjacent upstream stop codon and the (in-frame) center point of the initial prediction are considered as potential start sites. Analogously, codons between the (in-frame) center and the nearest downstream stop codon are taken into account as potential stop codon positions. For the determination of these search ranges, full stop codons according to the NCBI genetic code tables are considered. In the following, we denote by S and E the sets of positions that are evaluated as potential start and stop positions, respectively.

The most probable start and stop positions of a gene are determined by maximizing the product of three factors over all possible candidate positions in S and E, respectively: (1) a factor (δ) that depends on the distances of the candidate positions to the estimated start or stop position inferred by comparison with the query model, (2) the (empirical) probability that the codon at the candidate position is a start or stop codon (ϕ), and (3) the (empirical) probability of the resulting gene length (λ).
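The exact definitions of δ, ϕ and λ are not reproduced in this text, so the sketch below illustrates only the maximization step for a start position. The exponential distance penalty, the `scale` parameter, the dictionary of codon probabilities and the length function are all placeholder assumptions, not the quantities used by the authors.

```python
import math

def best_start(seq, candidates, estimate, stop, codon_prob, length_prob, scale=30.0):
    """Pick the candidate start that maximizes delta * phi * lambda.

    delta: assumed exponential decay with distance from the estimated start.
    phi:   probability that the codon at the candidate is a start codon.
    lam:   probability of the resulting gene length in codons.
    """
    def score(pos):
        delta = math.exp(-abs(pos - estimate) / scale)
        phi = codon_prob.get(seq[pos:pos + 3], 0.0)
        lam = length_prob((stop - pos) // 3)
        return delta * phi * lam

    return max(candidates, key=score)

# Hypothetical inputs: in-frame candidates upstream of a stop at index 15.
seq = "ATGAAAGTGAAAAAATAA"
candidates = [0, 3, 6, 9, 12]
codon_prob = {"ATG": 0.8, "GTG": 0.1}   # made-up empirical frequencies
flat_length = lambda n_codons: 1.0      # uninformative length prior

print(best_start(seq, candidates, estimate=6, stop=15,
                 codon_prob=codon_prob, length_prob=flat_length))
```

With a tolerant distance factor the likelier ATG at position 0 is chosen; shrinking `scale` makes the estimate dominate, and the GTG at the estimated position is preferred instead. The same pattern would apply to stop positions with the corresponding candidate set.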

Genetic code development by stop codon takeover

A novel theoretical consideration of the origin and evolution of the genetic code is presented. Code development is viewed from the perspective of simultaneously evolving codons, anticodons and amino acids. Early code structure was determined primarily by thermodynamic stability considerations, requiring simplicity in primordial codes. More advanced coding stages could arise as biological systems became more complex and precise in their replication. To be consistent with these ideas, a model is described in which codons become permanently associated with amino acids only when a codon-anticodon pairing is strong enough to permit rapid translation. Hence all codons are essentially chain-termination or “stop” codons until tRNA adaptors evolve having the ability to bind tightly to them. This view, which draws support from several lines of evidence, differs from the prevalent thinking on code evolution which holds that codons specifying newer amino acids were derived from codons encoding older amino acids.


The genetic code, as arranged in the standard tabular form, displays a non-random structure relating to the characteristics of the amino acids. An alternative arrangement can be made by organizing the code according to aminoacyl-tRNA synthetases (aaRSs), codons, and reverse complement codons, which illuminates a coevolutionary process that led to the contemporary genetic code. As amino acids were added to the genetic code, they were recognized by aaRSs that interact with stereochemically similar amino acids. Single nucleotide changes in the codons and anticodons were favored over more extensive changes, such that there was a logical stepwise progression in the evolution of the genetic code. The model presented traces the evolution of the genetic code accounting for these steps. Amino acid frequencies in ancient proteins and the preponderance of GNN codons in mRNAs for ancient proteins indicate that the genetic code began with alanine, aspartate, glutamate, glycine, and valine, with alanine being in the highest proportions. In addition to being consistent in terms of conservative changes in codon nucleotides, the model also is consistent with respect to aaRS classes, aaRS attachment to the tRNA, amino acid stereochemistry, and to a large extent with amino acid physicochemistry, and biochemical pathways.
