Question

CNV or Paralog?

0

Entering edit mode

7.5 years ago

Mark ★ 1.6k

Hi everyone,

Hope this isn't a silly question. I have identified a region of my bacterial genome which contains higher than average coverage. Using bedtools genomecov I have calculated and plotted the coverage of a specific gene: https://i.imgur.com/pMs5bbZ.png note: X axis is the position in the genome, Y axis is the coverage per nucleotide.

The average coverage for this isolate is ~250. This particular gene hits around 1500 but then dips to ~0.

My question is how do I differentiate between a potential CNV and a paralogous alignment? The graph above is hinting at paralogous sequence. However there are two copies of the gene in the genome of the bacteria.

alignment gene next-gen genome • 2.2k views

ADD COMMENT • link 7.5 years ago by Mark ★ 1.6k

1

Entering edit mode

Hi, that looks a little unusual. If you can obtain the sequence of that region, then run it through nucleotide BLAST. That would add a little it more information to help you decide.

ADD REPLY • link 7.5 years ago by Kevin Blighe 89k

0

Entering edit mode

Hi Kevin,

Thanks for the reply. I've blasted the high coverage region which is ~300bp. The top ~20 hits are all to the same bacterium I'm studying and the same gene. I'm not sure how the BLAST results are going to help, can you clarify?

ADD REPLY • link 7.5 years ago by Mark ★ 1.6k

1

Entering edit mode

Thanks for the update.

If it were a true paralog, you would have seen BLAST results relating to other regions of the same genome. Just to be sure, relax the BLAST parameters and re-run, and then investigate the results again.

ADD REPLY • link 7.5 years ago by Kevin Blighe 89k

0

Entering edit mode

Right, I follow your train of thought now.

If I specify my reference(which is highly annotated and on NCBI) in the blast parameters, the result is 2 hits to different regions of the reference genome. However, as I mentioned in the first post, I already know the genome contains 2 copies of this particular gene.

I think it is a sign of paralog, but it doesn't fully explain the super high coverage. If I were to assume 50% of the coverage can be attributed to each copy of the gene, the depth is sitting at 750 for that particular region, which is 3X the average depth.

I'm going to call it a paralog and be done with it. Appreciate your help Kevin! Thank you

ADD REPLY • link 7.5 years ago by Mark ★ 1.6k

1

Entering edit mode

Okay, no problem, but then obviously you will not be able to tell which reads have aligned correctly and which have not. So, any variant and CNV calls cannot really be trusted, even if they have very high qualities. This is where the use of short read NGS becomes very problematic.

Some of the extra reads that you didn't expect, even when considering that it's a paralog, may be related to PCR or optical duplicates.

Best of luck

Kevin

ADD REPLY • link 7.5 years ago by Kevin Blighe 89k

0

Entering edit mode

Hi Kevin,

I agree completely, short read data is problematic when it comes to these sorts of analysis. Originally my supervisors wanted to call these CNV but I was extremely suspicious because only part of the gene had high coverage. Which is why I dug deeper and posted this question.

PCR or optical duplicates would example the high coverage for that specific region. Any advice on how to detect or investigate optical/PCR duplicates?

ADD REPLY • link 7.5 years ago by Mark ★ 1.6k

0

Entering edit mode

Yes, Picard has a function MarkDuplicates, which will give you this information. It runs in JAVA. To detect duplicates over a specific region, you will have to sub-set your aligned BAM file to extract reads over just that region. I believe that samtools view can do that, as mentioned here: C: Isolating reads from specific region from bam file

ADD REPLY • link 7.5 years ago by Kevin Blighe 89k