Question

Can Variants Be Reliably Detected In Duplicated Gene/Region Sequence?



12.3 years ago

Ian 6.1k

I am currently working on a project to detect variants in related yeast strains, which is simple enough, at least for variant calling :).

However, the PI is interested in genes that have been duplicated, e.g. ribosome genes. This means that uniquely mapping reads to the genome results in zero coverage over the genes/regions of interest.

Has anyone done something similar to this?

What would the best mapping strategy be to include duplicated sequence, but also be suitable for variant detection?

Part of me wonders whether this is even possible. The worst I can think of is that reads will be split between duplicates, but a mismatch could also lead to a misplaced read...

I am currently running bowtie with --best -k1 (other settings at their defaults), followed by samtools-based variant detection.

BTW, the reads are colour-space from a SOLiD4.

Thanks!

variant-calling • 3.8k views

updated 12.3 years ago by Giovanni M Dall'Olio 28k • written 12.3 years ago by Ian 6.1k

Ram · Answer 1 · 2013-04-11

I think this is very difficult. Most SNP detection methods do not work correctly in duplicated regions; in fact, a common best practice is to remove all reads that map to multiple regions of the genome before SNP calling. Such reads are likely to come from copy number variations, so it is better to remove them, as any SNP called from them is likely to be a false positive.

Quoted from the 1000 Genomes paper:

"We restricted most variant calling to the ‘accessible genome’, defined as that portion of the reference sequence that remains after excluding regions with many ambiguously placed reads or unexpectedly high or low numbers of aligned reads".

Regions with "unexpectedly high or low numbers of aligned reads" are likely to be duplications and deletions. So, in 1000 Genomes, they excluded these regions, along with those containing many ambiguously placed reads, before doing the calling. I am sure that if you take any other article presenting novel SNP data (HapMap, etc.), you will be able to find a similar sentence in the Methods. Hopefully that will be enough to convince your PI :-)
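To make the "unexpectedly high or low" idea concrete, here is a minimal sketch of how such positions could be flagged from per-position depth. The function name `flag_inaccessible` and the 0.5x/1.5x-of-median thresholds are my own illustrative assumptions, not the criteria actually used by 1000 Genomes.

```python
from statistics import median


def flag_inaccessible(depths, low=0.5, high=1.5):
    """Return 0-based indices whose depth is far from the median depth.

    low and high are hypothetical multipliers of the median; positions with
    depth below low * median or above high * median are flagged.
    """
    med = median(depths)
    return [i for i, d in enumerate(depths) if d < low * med or d > high * med]


# Toy depth profile: a collapsed duplication (about 2x depth) and a deletion.
depths = [30, 31, 29, 30, 62, 60, 61, 30, 2, 30]
print(flag_inaccessible(depths))  # [4, 5, 6, 8]
```

In practice the depth would come from your alignments, and the thresholds would need tuning to the coverage distribution of your own data.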

For example, let's imagine that the following sequence got duplicated in the genome:

AACCTTGG

The resulting genome will look like:

AACCTTGGnnnnnnnnnnnAACCTTGG

After a number of generations, the nucleotide in the 5th position of the second duplicated region mutates:

AACCTTGGnnnnnnnnnnnAACCGTGG

If you call SNPs without taking into account that this sequence may be duplicated, reads from both copies pile up on one locus, and you will believe that there is a single copy of this sequence containing a SNP in the 5th position with a frequency of 50%. But this would be a false positive: there is no SNP at a single locus, only a difference between the two copies of a duplication.
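The toy example can be reproduced in a few lines. This is only a sketch under simplifying assumptions: the reference contains a single copy of the sequence, every read spans the whole 8-base region, both copies are sequenced to the same depth, and there are no sequencing errors. The helper names `pileup_collapsed` and `alt_fractions` are made up for illustration.

```python
from collections import Counter

reference = "AACCTTGG"  # assumed: reference holds only one copy
copy_a = "AACCTTGG"     # original copy in the sample
copy_b = "AACCGTGG"     # duplicated copy, mutated at position 5


def pileup_collapsed(copies, reads_per_copy):
    """Pile full-length reads from every copy onto one locus, counting bases."""
    counts = [Counter() for _ in copies[0]]
    for seq in copies:
        for pos, base in enumerate(seq):
            counts[pos][base] += reads_per_copy
    return counts


def alt_fractions(ref, counts):
    """Map 1-based position to non-reference base fraction, where non-zero."""
    fractions = {}
    for pos, (ref_base, c) in enumerate(zip(ref, counts), start=1):
        total = sum(c.values())
        alt = total - c[ref_base]
        if alt:
            fractions[pos] = alt / total
    return fractions


counts = pileup_collapsed([copy_a, copy_b], reads_per_copy=20)
print(alt_fractions(reference, counts))  # {5: 0.5}
```

A naive caller looking at this pileup sees what looks like a heterozygous site at position 5, even though each copy on its own is not polymorphic.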