Very high coverage Nanopore alignment Hg38
1
0
Entering edit mode
20 months ago
pablo ▴ 310

Hi,

I aligned my first Nanopore reads on the hg38 reference. Then, I used bedtools genomecov to get an idea about the mean coverage over the genome.

I noticed some bases have a very high coverage >20,000X , whereas the mean coverage is ~10X . I extracted the corresponding alignments (20,000X) from the bam file, and used IGV to have a look.

As you can see, the very-high covered region has many insertions (>100bp). We are not in a telomeric region. Why all the reads has this profile, and why so many reads mapped here ? I also checked the flags of the alignments here :

  number  flag
   2532   2064
   2488   2048
   7367   272
   7479   256
    936   16
   1023   0

There are many not primary alignments / secondary / supplementary alignments. But also, 1023 + 936 primary alignments, which gives a very high coverage anyway.

chr22

We have the same profile here, but in a telomeric region, with many repetitive motifs.

enter image description here

IGV nanopore alignment • 1.0k views
ADD COMMENT
0
Entering edit mode

It seems that this is telomeric region of the chromosome and all the reads of clipped reads (I think its hard-clipped). I have also observed the same in my nanopore data. These are mapping artifacts in repeatative regions of the genome. You either need to identify these kind of regions in the BAM file and remove them prior to variant calling. Or, you can easily identify these regions from the resulting VCF file as well. In a VCF file, you need to look at the site mean depth (either using vcftools or bcftools) and then filter those sites having unusual depth of coverage.

Check out the way to do it from here: https://speciationgenomics.github.io/filtering_vcfs/ (check out these two sections: Calculate mean depth per site section and Variant mean depth)

Additionally, most variant callers only consider primary alignments, so you are also safe by not removing them as most of the reads are secondary. But there are other regions in the genome that may such artifacts, so its better to filter them off.

ADD REPLY
0
Entering edit mode
20 months ago

This will likely be an artifact, as you know. Check the Mapping quality MQ of these reads and use a MQ filter of 30 for example to avoid repeat regions.

This is also why people use MQ and maxDepth in their SNP calling. That said, I've never seen a region as bad as that.

ADD COMMENT

Login before adding your answer.

Traffic: 2420 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6