One of my side research projects involves processing large numbers of genomes (specifically, all fully-sequenced prokaryotic genomes). Since I’m playing with the data anyway, sometimes I end up with random questions that can be answered with what I already have on hand. One such question is this: “What is the average length of a prokaryotic gene?” We could figure this out fairly directly, but it’s always best to have a prediction in hand first. After all, if we have no idea what kind of values to expect, how can we trust the accuracy of a more direct (and experimental) method?

So what do we know? There are 4 possible bases (A, G, C, and T) and three such bases make up a codon. This means that each position of the codon can be any of 4 bases, so there are 4*4*4 = 64 possible codons. Of these, 3 are stop codons (meaning that they mark the end of a gene). We generally think of there being only 1 start codon (ATG, coding for methionine), but it turns out that prokaryotes often use other codons instead. Plus, if there are multiple ATG’s in the same stretch of DNA, how do we know which is the actual start?
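The codon arithmetic above is easy to check in a couple of lines of Python (the three stop codons in the standard code are TAA, TAG, and TGA):

```python
from itertools import product

BASES = "ACGT"
STOP_CODONS = {"TAA", "TAG", "TGA"}

# Every ordered triple of bases is a possible codon: 4 * 4 * 4 = 64.
codons = {"".join(triple) for triple in product(BASES, repeat=3)}
print(len(codons))                # 64 possible codons
print(len(codons - STOP_CODONS))  # 61 non-stop codons
```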

For example, take the sequence:

(**Sequence 1**) **ATG** AGT ***TGA*** **ATG** GTA TTG ***TAA*** TTT AGA

This sequence has two potential start sites (in **bold**) and two stop codons (in ***bold italics***). We can unambiguously choose the first stop codon, but without more evidence we have no way of knowing which start codon is the real one.

To get around this, let’s take a conservative approach in calling sequences a “gene”. Instead of anything beginning with a start codon and ending with a stop, let’s take the entire genome and blast it to bits by cutting at every stop codon.

For example:

(**Sequence 2**) **ATG** AGT ***TGA*** **ATG** GTA TTG ***TAA*** TTT AGA CTC GAT CGA TCG CTC GAT CTC ***TAA*** GGA TCG ***TGA*** **ATG** ***TAA***

This sequence would be broken up at every stop site (like TAA or TGA), giving us the set of sequences that lie between stop sites. Of course, most of these will not be genes and many will be extremely short. So while we’re working out a hypothesis, let’s call each of these “maybe-genes” (*mGene* for short). After breaking up Sequence 2 (and ignoring stop codons), we get the set of mGenes:

(**mGene Set 1**) [ **ATG** AGT , **ATG** GTA TTG , TTT AGA CTC GAT CGA TCG CTC GAT CTC , GGA TCG , **ATG** ]
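As a sketch of this cutting procedure, here is a small Python function (names are my own) that reads a sequence codon by codon and starts a new mGene at every stop:

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def mgenes(seq):
    """Cut a DNA string at every stop codon, returning the codon lists
    that lie between stops (trailing codons after the last stop are
    discarded)."""
    codons = [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]
    pieces, current = [], []
    for codon in codons:
        if codon in STOP_CODONS:
            pieces.append(current)  # close off the current mGene
            current = []
        else:
            current.append(codon)
    return pieces

# Sequence 2 from above:
seq2 = "".join("ATG AGT TGA ATG GTA TTG TAA TTT AGA CTC GAT CGA "
               "TCG CTC GAT CTC TAA GGA TCG TGA ATG TAA".split())
for piece in mgenes(seq2):
    print(piece)
```

Running this on Sequence 2 recovers the five mGenes of mGene Set 1, and feeding it two back-to-back stops returns empty lists, which is the length-zero case discussed below.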

These sequences are all way too short to be genes, but this is the idea of what we’re trying to do. If I had blown apart an entire genome, what kind of mGene length distribution could we expect? My first thought was simplistic: if 3 of the 64 codons are stops, then on average you would see ~20 non-stops before hitting the next stop, making an average mGene length of ~20 codons. But there is a minimum mGene length of zero and no maximum length, meaning that the distribution would not be so simple. And what would it mean to be “significantly different” from the expected size of 20 codons? Let’s re-think this a bit.
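The “~20 non-stops” intuition can be made precise. Assuming each codon is an independent draw with all 64 codons equally likely, the number of non-stop codons before the next stop is geometrically distributed with stop probability p = 3/64, so its mean is (1 − p)/p:

```python
p = 3 / 64                 # probability that a random codon is a stop
mean_length = (1 - p) / p  # expected non-stop codons before the next stop
print(mean_length)         # 61/3, roughly 20.33
```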

What is the chance of getting an mGene of length zero? Well, you would need to have two stop codons back-to-back:

(**Sequence 3**) … ***TAA TGA*** … which breaks up into

(**mGene Set 2**) [ … , , … ]

See that thing between the two commas? No? That’s because it’s an *empty string*. There is nothing between those two stop codons, which is equivalent to having an mGene of length 0 between them. So the chance of getting an mGene of length zero should be the same as the chance of getting two stop codons back-to-back, right? This is simply (3/64)*(3/64) = 9/4096 ≈ 0.22%. It isn’t really obvious what this means yet (to me, anyway), so let’s go on to predict the chance of getting an mGene of length one.

Just like for length zero, an mGene of length 1 needs to be bookended by stop codons. So the probability will be the chance of having stop-nonstop-stop, which is just (3/64)*(61/64)*(3/64) = 549/262144 ≈ 0.21%, just a little less than for length zero! It seems we should be able to just continue this on indefinitely to get the chance of getting an mGene of any length *L*:

(**Equation 1**) *P*(*L*) = (3/64)² × (61/64)^*L*

This equation gives you a smaller and smaller value as *L* gets bigger. Okay, but does this make any sense? The implication here is that there are more mGenes of length 0 than of any other length. And another question: if this equation holds, wouldn’t the probabilities have to add up to 1 as *L* → ∞? In equation form:

(**Equation 2**) ∑ *P*(*L*) = ∑ (3/64)² × (61/64)^*L* = 1, summing over *L* = 0, 1, 2, …

We can check this with a quick Python script.

```python
def sum_mGene_lengths(end):
    """Compute the running sum of P(L) from L=0 to L=end."""
    total = (3/64)**2  # the L = 0 term
    for L in range(1, end + 1):
        total += ((3/64)**2) * ((61/64)**L)
    return total

def main():
    while True:
        print(sum_mGene_lengths(int(input('Largest L: '))))

main()
```

And if you run this and try out some numbers you get:

… Clearly this is not summing to 1.

Not even close!
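In fact, there is no need to guess at the limit numerically: Equation 2 is a geometric series, so its sum can be computed in closed form, and it comes out to 3/64 rather than 1:

```python
# Sum of the geometric series (3/64)^2 * (61/64)^L over L = 0, 1, 2, ...
# equals (3/64)^2 / (1 - 61/64) = 3/64.
limit = (3 / 64) ** 2 / (1 - 61 / 64)
print(limit)  # 0.046875, i.e. exactly 3/64 -- nowhere near 1
```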

I must have been thinking about this incorrectly (I do, after all, suck at statistics). However, as I’ll show in a later post, the probability given in Equation 1 does match the results from simulated data. So what’s the deal?

For any mGene of length *L*, Equation 1 gives the probability of such an mGene existing. The probability of an mGene *not* being that length is then 1 − *P*(*L*), which is very nearly 1. So while the probability of an mGene being length *L* and the probability of it *not* being length *L* sum to one, there actually isn’t any reason that the sum of *P*(*L*) over all lengths must be one. The fact that Equation 2 converges to some number well below 1 just means that lengths larger than about 1000 codons are essentially 0% likely to occur, which is equivalent to saying that the chance of an mGene *not* being of length 1000 is essentially 100%.

In closing, Equation 2 was totally wrong, though it illustrates an important point: many (perhaps most) hypotheses in scientific research are *wrong* (though there are definitely shades of wrong-ness), and so much of science is trying to make the most out of a wrong answer. So embrace failure!

So far I have laid out the problem and presented the beginnings of an approach to solving it. We have a probability model in hand (Equation 1), so that in the next post we can test it against simulated and then real genomic data.