Hi,
I aligned my first Nanopore reads on the hg38 reference. Then, I used bedtools genomecov
to get an idea about the mean coverage over the genome.
I noticed some bases have a very high coverage >20,000X , whereas the mean coverage is ~10X . I extracted the corresponding alignments (20,000X) from the bam
file, and used IGV to have a look.
As you can see, the very-high covered region has many insertions (>100bp). We are not in a telomeric region. Why all the reads has this profile, and why so many reads mapped here ? I also checked the flags of the alignments here :
number flag
2532 2064
2488 2048
7367 272
7479 256
936 16
1023 0
There are many not primary alignments / secondary / supplementary alignments. But also, 1023 + 936 primary alignments, which gives a very high coverage anyway.
We have the same profile here, but in a telomeric region, with many repetitive motifs.
It seems that this is telomeric region of the chromosome and all the reads of clipped reads (I think its hard-clipped). I have also observed the same in my nanopore data. These are mapping artifacts in repeatative regions of the genome. You either need to identify these kind of regions in the BAM file and remove them prior to variant calling. Or, you can easily identify these regions from the resulting VCF file as well. In a VCF file, you need to look at the site mean depth (either using vcftools or bcftools) and then filter those sites having unusual depth of coverage.
Check out the way to do it from here: https://speciationgenomics.github.io/filtering_vcfs/ (check out these two sections: Calculate mean depth per site section and Variant mean depth)
Additionally, most variant callers only consider primary alignments, so you are also safe by not removing them as most of the reads are secondary. But there are other regions in the genome that may such artifacts, so its better to filter them off.