Can Variants Be Reliably Detected In Duplicated Gene/Region Sequence?
11.7 years ago
Ian 6.1k

I am currently working on a project to detect variants in related yeast strains, which is simple enough, at least for variant calling :).

However, the PI is interested in genes that have been duplicated, e.g. ribosome genes. This means that uniquely mapping reads to the genome results in zero coverage over the genes/regions of interest.

Has anyone done something similar to this?

What would the best mapping strategy be to include duplicated sequence, but also be suitable for variant detection?

Part of me wonders whether this is even possible. The worst I can think of is that reads will be split between the duplicates, but a mismatch could also lead to a read being placed on the wrong copy...

I am currently using bowtie with --best -k 1 (other settings left at their defaults), followed by samtools-based variant detection.

BTW, the reads are colour-space reads from a SOLiD4.

Thanks!

11.7 years ago

I think this is very difficult. Most SNP-calling methods do not work correctly in duplicated regions; in fact, a common best practice is to remove all reads that map to multiple places in the genome before calling SNPs. Such reads are likely to come from copy number variants or other repeated sequence, so any SNP called from them is likely to be a false positive.

Quoted from the 1000 Genomes paper:

"We restricted most variant calling to the ‘accessible genome’, defined as that portion of the reference sequence that remains after excluding regions with many ambiguously placed reads or unexpectedly high or low numbers of aligned reads".

Regions with "unexpectedly high or low numbers of aligned reads" are likely to be duplications and deletions. So, in 1000 Genomes, regions with ambiguously placed reads are excluded before calling. I am sure that if you take any other article presenting novel SNP data (HapMap, etc.), you will find a similar sentence in the Methods. Hopefully that will be enough to convince your PI :-)
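To make the "unexpectedly high or low" idea concrete, here is a minimal Python sketch that flags windows whose read depth is far from the median. The function name, the window depths, and the 0.5x / 1.5x cut-offs are all assumptions chosen for illustration; they are not the criteria used by the 1000 Genomes Project.

```python
from statistics import median

def flag_unexpected_depth(window_depths, low_factor=0.5, high_factor=1.5):
    """Return indices of windows whose depth is far from the median depth.

    low_factor and high_factor are illustrative thresholds (assumptions),
    not values taken from any published pipeline.
    """
    typical = median(window_depths)
    flagged = []
    for i, depth in enumerate(window_depths):
        if depth < low_factor * typical or depth > high_factor * typical:
            flagged.append(i)
    return flagged

# Made-up per-window depths: window 3 looks like a collapsed duplication,
# window 5 like a deletion or an unmappable region.
depths = [30, 31, 29, 62, 30, 2, 28]
print(flag_unexpected_depth(depths))  # [3, 5]
```

Windows flagged this way would be candidates for exclusion from SNP calling, or for separate treatment as possible copy number changes.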

For example, let's imagine that the following sequence got duplicated in the genome:

AACCTTGG

The resulting genome will look like:

AACCTTGGnnnnnnnnnnnAACCTTGG

After a number of generations, the nucleotide at the 5th position of the second copy mutates (T to G):

AACCTTGGnnnnnnnnnnnAACCGTGG

If you call SNPs without taking into account that this sequence may be duplicated, reads from both copies pile up on a single location, and you will believe that there is a single copy of the sequence carrying a SNP at the 5th position with an allele frequency of 50%. But this would be a false positive: there is no SNP at one locus, only a fixed difference between the two copies of a duplication.
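The toy example above can be reproduced in a few lines of Python. This is only a sketch under simplifying assumptions: reads are full-length and error-free, both copies are sequenced equally, and the reference contains a single copy of the sequence. The function names are made up for this illustration.

```python
from collections import Counter

copy_a = "AACCTTGG"     # original copy
copy_b = "AACCGTGG"     # duplicated copy, mutated at position 5
reference = "AACCTTGG"  # assumption: the reference holds only one copy

def pileup_allele_fractions(reference, reads):
    """Per-position base fractions for full-length reads stacked on the reference."""
    fractions = []
    for pos in range(len(reference)):
        counts = Counter(read[pos] for read in reads)
        total = sum(counts.values())
        fractions.append({base: n / total for base, n in counts.items()})
    return fractions

def apparent_snps(reference, reads):
    """Map 1-based position -> non-reference allele fractions seen in the pileup."""
    result = {}
    for pos, fr in enumerate(pileup_allele_fractions(reference, reads), start=1):
        alt = {base: f for base, f in fr.items() if base != reference[pos - 1]}
        if alt:
            result[pos] = alt
    return result

# Ten reads from each copy all land on the single reference copy.
reads = [copy_a] * 10 + [copy_b] * 10
print(apparent_snps(reference, reads))  # {5: {'G': 0.5}}
```

The pileup shows a "heterozygous" G at position 5 at 50%, even though every individual copy in the genome is fixed for one base.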
