Question

If I Have 4 Sequence Runs, 2 In Each Direction, 1 Bp Is Different, On Each, Should I Resequence?

3

14.7 years ago

John ▴ 790

I have a plate of colonies to sequence. I pick 2 colonies and sequence each in Fwd and Rev directions. I get back a single bp difference between the 2 strands. Two reads have a T, two have a C. How should I call this base? Can I call it a Y (C or T) and leave it at that, or do I need to sequence another colony to be sure?

sequencing dna • 3.2k views

updated 13 months ago by Ram 44k • written 14.7 years ago by John ▴ 790

1

Thanks all. Yes, I had two good reads on each strand, and the single bp in one of the colonies was different from that in the other colony (I'd picked 3 originally, but one was just an insert). I went with picking another 2 colonies to be sure. I'm sequencing ~100 markers though, so I was trying to weigh the extra $$ / time of sequencing another colony against the extra information a C or T gives me over a Y. This is only the 15th sequence or so and the first time this has happened, so I'll see how the others turn out before deciding on a general policy.

14.7 years ago by John ▴ 790

1

Actually, what I'd really like to know is: when you go to publish a sequence like this, how much coverage should you have? Is it acceptable to put a sequence with a Y into GenBank because you didn't go to the effort / cost of re-sequencing to resolve it? Or does the Y represent natural variation... and how many would you need to sequence to answer that question... :)

14.7 years ago by John ▴ 790

1

In this case I sequenced more colonies and found a consensus sequence. However, I'm cloning PCR products, so I don't think it is possible to rule out natural variation in the PCR amplicon pool. One good example would be a bacterium with 2 different 16S rRNA gene copies inside a single cell; this could produce 2 populations of PCR products. If the ratio were 1:3, how many cloned colonies would you have to sample in order to observe this natural variation?

14.7 years ago by John ▴ 790
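
A rough way to put numbers on the sampling question above, as a sketch rather than a definitive answer. It assumes that "1:3" means the minor variant makes up 1 in 4 of the cloned amplicons (fraction 0.25), and that each picked colony is an independent draw with no PCR or cloning bias. The function names are made up for illustration.

```python
import math


def prob_both_variants_seen(n_colonies, minor_fraction):
    """Probability that n independently picked colonies include both variants."""
    major_fraction = 1.0 - minor_fraction
    # Subtract the two ways of missing the variation:
    # every colony carries the major variant, or every colony carries the minor one.
    return 1.0 - major_fraction ** n_colonies - minor_fraction ** n_colonies


def colonies_needed(minor_fraction, confidence=0.95):
    """Smallest n for which the minor variant appears at least once
    with probability >= confidence."""
    # Solve (1 - minor_fraction) ** n <= 1 - confidence for n.
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - minor_fraction))


if __name__ == "__main__":
    for n in (2, 4, 8, 11):
        print(n, round(prob_both_variants_seen(n, 0.25), 3))
    print("colonies for 95%:", colonies_needed(0.25))
```

Under these assumptions, two colonies reveal both variants only 37.5% of the time, and it takes 11 colonies to have a 95% chance of picking up the minor variant at least once.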

0

Just to clarify the situation: in each of your two sequencing runs you found a single base difference between the two strands, and that difference was in the same position in both cases?

14.7 years ago by Istvan Albert 101k

0

You should not put unreliable data into GenBank! Either you prove there is natural variation and you submit all the variants, or you make sure you have reliable data and you don't have the problem.

14.7 years ago by Nicojo ★ 1.1k

0

In this particular case, you cannot talk about natural variation: you are sequencing fragments you've cloned into a plasmid! Unless you've contaminated your prep with two colonies from the plate, all the plasmids in one prep should be identical. If you have ambiguous bases, then it's because your sequencing is of poor quality. You should never submit poor-quality data to GenBank.

14.7 years ago by Nicojo ★ 1.1k

Ram · Answer 1 · 2010-03-11

I agree with chrisamiller and PhiS. I'll just add that it also greatly depends on what you will do with your sequence. I understand from your question that:

You have picked only 2 [bacterial] colonies for sequencing
These colonies result from the cloning of a PCR product (?)
They were sequenced using Sanger sequencing

[NOTE: when describing your problem it is very important to give these kinds of details, so please correct me if my assumptions are wrong.] I am guessing that:

You might want to check that the sequence is correct (maybe verifying that your qPCR product is correct)?
You might be cloning a gene (or fragment thereof) in order to express a protein?

[NOTE: here again, these kinds of details are crucial in determining whether you can accept an ambiguous base or not. Please add a comment or edit your post if it is yet another purpose.]

Finally, as Istvan has asked, you need to be clear as to what the difference is: are you looking at a different base call between the two sequenced colonies or between the forward and reverse sequencing events?

If it is the first (i.e. difference between the two colonies) then you need to check the quality of the call at that base (quality scores if you have them, or look at the chromatogram to see if there's a mistake or a double peak etc.). If they are good quality, then you probably have at least these two different variants of the sequence you're targeting.

If it is the second (i.e. difference between the forward and reverse) then you should also look at the quality in each read. If they are bad quality, sequence again. If they are good quality, then I'm scratching my head making a funny face. Start over from scratch.
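
For the quality check, a small sketch of how Phred scores translate into error probabilities, using the standard relation P = 10^(-Q/10). The read names, the scores, and the Q20 cutoff below are assumptions for illustration only; use whatever threshold suits your project.

```python
def phred_to_error_probability(q):
    """Convert a Phred quality score Q into the probability that the base call is wrong."""
    return 10 ** (-q / 10)


def low_quality_reads(calls, min_q=20):
    """Return the names of reads whose call at the disputed position falls below min_q.

    calls: list of (read_name, base, phred_q) tuples (hypothetical layout).
    """
    return [name for name, _base, q in calls if q < min_q]


if __name__ == "__main__":
    # Made-up calls at the disputed position: two colonies, forward and reverse reads.
    calls = [
        ("colony1_fwd", "T", 45),
        ("colony1_rev", "T", 38),
        ("colony2_fwd", "C", 41),
        ("colony2_rev", "C", 12),
    ]
    for name, base, q in calls:
        print(name, base, q, phred_to_error_probability(q))
    print("re-check these reads:", low_quality_reads(calls))
```

If all four calls sit well above the cutoff, a miscall becomes an unlikely explanation and two genuine variants become the more plausible one.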

Now to your question about leaving it ambiguous or not:

If you just wanted to check that the sequence is "fairly" OK, then fine, leave it as a Y.
If you're checking the amplicon of a qPCR reaction, then it is crucial to know if you have only one sequence or two different ones (even if it's a SNP). This will change your interpretation.
If you want to express a protein from this sequence, then you need to check if the difference (T or C) changes the resulting protein sequence: if yes, you need to choose the correct clone. If not, you can go with either.
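
For the protein expression case, checking whether T versus C changes the amino acid can be sketched as below. This assumes the standard genetic code, that you know the reading frame, and that the codon is given on the coding strand; the function name and example codons are illustrative only.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order (TTT, TTC, TTA, ...).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def c_or_t_effect(codon, position_in_codon):
    """Translate the codon with C and with T at the disputed position (0, 1 or 2).

    Returns (amino acid with C, amino acid with T, True if they are the same).
    """
    codon = codon.upper()
    with_c = codon[:position_in_codon] + "C" + codon[position_in_codon + 1:]
    with_t = codon[:position_in_codon] + "T" + codon[position_in_codon + 1:]
    aa_c, aa_t = CODON_TABLE[with_c], CODON_TABLE[with_t]
    return aa_c, aa_t, aa_c == aa_t


if __name__ == "__main__":
    print(c_or_t_effect("GAC", 2))  # third position: GAC vs GAT
    print(c_or_t_effect("CGA", 0))  # first position: CGA vs TGA
```

A difference at the third codon position is often silent (GAC and GAT both encode Asp), whereas CGA versus TGA turns an Arg into a stop codon, in which case picking the right clone matters a great deal.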

score 2 · Answer 2 · 2010-03-11

There are lots of factors to consider here:

1) What do the quality scores tell you about the base call at that position?

2) How deep is your coverage? If you've got 1x coverage, it's possible that you may be seeing a miscalled base. If you're taking consensus from 30x coverage, it's much less likely.

3) You're sequencing from a population. It's completely possible that within this population there are individuals with both alleles that you're describing, right?

Ram · Answer 3 · 2010-03-11

As chrisamiller says, it depends on the details of what you're trying to do. The question is whether what you're seeing is variation due to technical error or due to biological variation.

However, without any additional information, if you've essentially only got 2 reads per sequence with contradictory information at a given position, you can't really call the base with any degree of certainty. In this case, the use of an ambiguity base call (i.e. Y instead of C or T) would be justified, in my view.
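
As an illustration of the ambiguity call, here is a minimal sketch that maps the set of bases observed at one position to its IUPAC nucleotide code. It simply reports which bases were seen; it does not weigh quality scores or read counts, and the function name is made up for this example.

```python
# IUPAC nucleotide codes and the bases each one stands for.
IUPAC_CODES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT",
}

# Reverse lookup: set of observed bases -> single-letter code.
CODE_FOR_BASES = {frozenset(bases): code for code, bases in IUPAC_CODES.items()}


def call_position(observed_bases):
    """Return the IUPAC code covering every base observed at one position."""
    observed = frozenset(base.upper() for base in observed_bases)
    return CODE_FOR_BASES[observed]


if __name__ == "__main__":
    # Two reads say T, two say C, as in the question.
    print(call_position(["T", "T", "C", "C"]))
    # All four reads agree.
    print(call_position(["C", "C", "C", "C"]))
```

With two T reads and two C reads this yields Y, which records the disagreement without claiming to know whether it is technical or biological.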