We have encountered a strange pattern in a BAM file generated from amplicon sequencing (Nextera XT, Illumina MiSeq).
As you can see in the middle of the BAM file, the coverage forms a rectangular block. Half of the reads end at the right edge of the rectangle, while the other half end at the left edge.
What could cause such an abnormal coverage distribution? Could it be a structural variant?
We found the real cause of this pattern: a duplication event of a 200 bp region within that rectangular area. We BLASTed the unmapped parts of the reads at the ends of the rectangle and found that they match perfectly to a region inside this area.
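For anyone who wants to repeat this check, here is a minimal sketch of how the unaligned read ends could be pulled out for BLAST. It assumes the unmapped parts show up as soft clips (`S` in the CIGAR string) and that the alignments are available as SAM text, for example from `samtools view`. The function names, the `min_len` cutoff and the FASTA header format are made up for illustration; this is not the exact procedure we used.

```python
import re

# Splits a CIGAR string such as "5S10M" into (length, operation) pairs.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def soft_clipped_segments(sam_line, min_len=20):
    """Yield (read_name, side, sequence) for each soft-clipped end of one SAM record."""
    fields = sam_line.rstrip("\n").split("\t")
    name, cigar, seq = fields[0], fields[5], fields[9]
    if cigar == "*" or seq == "*":
        return
    # Hard-clipped bases (H) are absent from SEQ, so drop those operations first.
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar) if op != "H"]
    if ops and ops[0][1] == "S" and ops[0][0] >= min_len:
        yield name, "left", seq[:ops[0][0]]
    if len(ops) > 1 and ops[-1][1] == "S" and ops[-1][0] >= min_len:
        yield name, "right", seq[-ops[-1][0]:]

def to_fasta(sam_lines, min_len=20):
    """Build a FASTA string of the clipped segments, ready to paste into BLAST."""
    records = []
    for line in sam_lines:
        if line.startswith("@"):  # skip SAM header lines
            continue
        for name, side, clip in soft_clipped_segments(line, min_len):
            records.append(f">{name}_{side}_clip\n{clip}")
    return "\n".join(records)
```

Passing the SAM lines for the region of interest to `to_fasta` gives one FASTA record per clipped end. If most of the clipped sequences hit the same nearby location, as they did for us, that points to a duplication rather than contamination.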
If it's amplicon sequencing, wouldn't you expect uneven coverage that corresponds to your amplicons?
Also, these are not randomly sheared libraries. The Nextera transposase cuts at certain sites, so you should expect to see more fragments at specific sequences.
Dear igor,
You are right about the uneven coverage of Nextera kits. We see such changes especially at GC-rich sites. However, we have not come across this pattern in exome sequencing. As you know, both the exome sequencing kit (Nextera Rapid Capture Exome) and the Nextera kits use the same transposase.
I repeated the alignment and realignment steps with BWA and GATK. The results are quite different now.
The upper image was taken after alignment and realignment in CLC Genomics Workbench.
The middle image was taken after alignment with BWA and realignment with GATK.
The bottom image was taken after alignment with BWA only.
Could it be due to partial duplication of this segment to somewhere else?
Hmm, well that certainly did something, but your pileup still looks weird.
Now I'm thinking perhaps it wasn't an indel, but some contamination. I would definitely start by taking the sequence of DNA that mapped there and BLASTing it. I would also consider loading a mappability track for your reference genome to see if mappability in that region is lower than usual.
Thank you, John. I will now check the rest of these trimmed reads to see whether they align somewhere else. However, I don't understand what kind of contamination could cause this pattern.
Could it be cDNA contamination, coupled with an isoform that isn't in your gene track? Check Ensembl to view known isoforms, and look for soft-clipping that matches up with the previous or next exons.
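One way to look for that kind of soft-clipping is to tally where the clips start on the reference and see whether they pile up at a few positions. Below is a rough sketch, assuming SAM text input (for example from `samtools view`); the function name `clip_breakpoints` and the left/right labels are just illustrative choices.

```python
import re
from collections import Counter

# Splits a CIGAR string such as "5S10M" into (length, operation) pairs.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
# CIGAR operations that advance the position on the reference.
REF_CONSUMING = set("MDN=X")

def clip_breakpoints(sam_lines):
    """Count soft-clip breakpoints keyed by (chromosome, 1-based position, side)."""
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@"):  # skip SAM header lines
            continue
        fields = line.rstrip("\n").split("\t")
        chrom, pos, cigar = fields[2], int(fields[3]), fields[5]
        if cigar == "*":  # unmapped read, no alignment to inspect
            continue
        ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar) if op != "H"]
        if not ops:
            continue
        ref_len = sum(n for n, op in ops if op in REF_CONSUMING)
        if ops[0][1] == "S":
            # Left clip: the first aligned base sits at POS.
            counts[(chrom, pos, "left")] += 1
        if len(ops) > 1 and ops[-1][1] == "S":
            # Right clip: the last aligned base sits at POS + ref_len - 1.
            counts[(chrom, pos + ref_len - 1, "right")] += 1
    return counts
```

Calling `clip_breakpoints(lines).most_common(10)` lists the positions with the most clipped reads. Breakpoints that line up with annotated exon boundaries would support the cDNA idea, while two sharp edges inside a single exon or intron would argue against it.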
Dear Chris, thank you for your reply. We are not expecting cDNA in our sample, though: it contains only PCR products amplified from genomic DNA with a long-range primer set. I also tried removing repeats with the built-in module of CLC Genomics Workbench. Unfortunately, it didn't change the coverage pattern at all.
thanks for following up, interesting