Question

Primer design for variants in duplicated genes?

0


7.9 years ago

emyli ▴ 10

Hi there,

I have whole genome sequencing data that I am using to look for novel variants. I have filtered the data and have a short list of potential novel variants, so now I want to validate by PCR amplification whether they are truly present in my DNA sample or are some sort of sequencing artefact. However, while trying to design gene-specific primers for a number of these variants, I am finding that the primer pairs amplify multiple genomic regions of identical size. When I align these regions and the surrounding DNA sequence, they are almost identical, i.e. it seems the variants are located in duplicated genes. This could of course be why these variants came up in my WGS analysis in the first place: the reads may not have aligned properly. Has anyone encountered a similar issue? Is there a way to validate these variants? Any advice would be much appreciated!

sequencing next-gen alignment validation • 2.8k views

ADD COMMENT • link updated 3.8 years ago by asalimih ▴ 60 • written 7.9 years ago by emyli ▴ 10

1


In order to help you, please provide some details: how did you align the data, what organism, which variant caller, did you apply a MAPQ threshold, give an example of a region that you cannot get clear bands from (coordinates), how did you make the primers (did you BLAST them)?

ADD REPLY • link 7.9 years ago by ATpoint 89k

0


Try using primer-blast to generate primers if you have not already.

How large are the repetitive regions and what does the distribution of variants look like in the region? You might be able to Sanger sequence it if you can find a unique flanking sequence.

ADD REPLY • link 7.9 years ago by Daniel E Cook ▴ 280

score 2 · Answer 1 · 2017-11-04

The issues you face relate to the fact that the majority of the genome shares sequence similarity with other regions of the genome. Much of this is indeed related to gene duplication events, with the duplicated copies either acquiring new functionality over time due to mutations or losing function and becoming pseudogenes. As a rough idea, there are up to 50,000 identified pseudogenes (who knows exactly), which can be divided into:

processed pseudogenes: the pseudogene derives from the transcribed mRNA of the original gene, reverse-transcribed and inserted back into the genome, so it lacks introns
unprocessed pseudogenes: the pseudogene consists of the genomic sequence of the original gene, introns included

This translates into issues with exome-seq because the probes or primers used for sequence pull-down in exome-seq are not designed with these issues of sequence similarity in mind. The same ambiguity affects read alignment in general. When you align the data, you can set high thresholds for things like read length and mapping quality (MAPQ), but then you'll see very low coverage over regions of high sequence similarity. On the other hand, if you relax the thresholds, you run the risk of misalignment and of making false-positive or false-negative variant calls.
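To make the MAPQ trade-off concrete, here is a minimal sketch that filters SAM records on the MAPQ field (column 5 of the SAM format). The function name `filter_sam_by_mapq`, the threshold of 30, and the toy records are assumptions for illustration only; in practice you would likely use your aligner's or samtools' own filtering rather than hand-parsing.

```python
def filter_sam_by_mapq(sam_lines, min_mapq=30):
    """Yield SAM header lines plus alignments with MAPQ >= min_mapq.

    MAPQ is the 5th tab-separated column of a SAM alignment line.
    The default threshold of 30 is an arbitrary example value.
    """
    for line in sam_lines:
        if line.startswith("@"):
            # Header lines are passed through untouched.
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            # Not a complete alignment record; skip it.
            continue
        if int(fields[4]) >= min_mapq:
            yield line


# Toy records (hypothetical): r2 sits in a duplicated region and was
# given MAPQ 0 because it maps equally well elsewhere.
toy_sam = [
    "@SQ\tSN:chr1\tLN:1000",
    "r1\t0\tchr1\t100\t60\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII",
    "r2\t0\tchr1\t200\t0\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII",
]

strict = list(filter_sam_by_mapq(toy_sam, min_mapq=30))
relaxed = list(filter_sam_by_mapq(toy_sam, min_mapq=0))
print(len(strict) - 1, "alignment(s) kept at MAPQ >= 30")
print(len(relaxed) - 1, "alignment(s) kept at MAPQ >= 0")
```

With the strict threshold the ambiguous read is dropped (lower coverage over the duplicated region); with the relaxed one it is kept, along with the risk that it belongs to the other copy.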

What to do? To validate findings, you need to ensure that you design primers that uniquely target the region surrounding the variant being studied. If you cannot find a unique region in close proximity, you'll have to think about doing:

long-range PCR
Sanger or Roche 454 sequencing (long reads...)
MLPA
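As a rough illustration of what "uniquely target" means, the sketch below counts exact matches of a candidate primer, on both strands, in a target sequence and in its near-identical copy. All names (`count_primer_sites`, `is_unique_to`) and the toy sequences are hypothetical, and exact matching is a simplification: real primers can still anneal with mismatches, so a tool such as Primer-BLAST against the whole genome remains the proper check.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A/C/G/T only)."""
    return seq.translate(str.maketrans("ACGTacgt", "TGCAtgca"))[::-1]


def count_overlapping(text, pattern):
    """Count occurrences of pattern in text, allowing overlaps."""
    count = 0
    start = text.find(pattern)
    while start != -1:
        count += 1
        start = text.find(pattern, start + 1)
    return count


def count_primer_sites(primer, sequences):
    """Return {name: exact-match count on either strand} per sequence."""
    primer = primer.upper()
    rc = reverse_complement(primer)
    hits = {}
    for name, seq in sequences.items():
        seq = seq.upper()
        n = count_overlapping(seq, primer)
        if rc != primer:  # avoid double-counting palindromic primers
            n += count_overlapping(seq, rc)
        hits[name] = n
    return hits


def is_unique_to(primer, sequences, target):
    """True if the primer matches the target exactly once and nothing else."""
    hits = count_primer_sites(primer, sequences)
    return hits.get(target, 0) == 1 and all(
        n == 0 for name, n in hits.items() if name != target
    )


# Toy example (made-up sequences): the copies differ at a single base.
regions = {
    "gene":    "ATGGCCTTAGCAGGTCCATTGACCGTA",
    "paralog": "ATGGCCTTAGCAGGTCCATTGACTGTA",
}

shared_primer = "GCCTTAGCAGG"     # lies in the identical stretch
specific_primer = "CCATTGACCGTA"  # spans the differing base

print(count_primer_sites(shared_primer, regions))
print(is_unique_to(shared_primer, regions, "gene"))
print(is_unique_to(specific_primer, regions, "gene"))
```

A primer that sits entirely in the identical stretch hits both copies, which matches the multiple same-sized products described in the question; only a primer overlapping a base that differs between the copies is specific in this toy case.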

If you want assistance in designing the best possible primers, then please follow the standard operating procedure that I wrote back in 2012, with which my colleagues and I have had success: "Designing a single set of primers and probe for a genomic region of interest". You can skip step 5.6, as it requires Primer Express and is only needed to develop the probe that's used in addition to a primer pair in real-time PCR.

Kevin