Question

Illumina shows SNPs, but Nanopore doesn't

3.4 years ago

al.grgr • 0

Hello all,

I have a puzzle that I cannot solve. Maybe some of you have encountered this as well. I have a set of Nanopore (SQK-LSK109) and Illumina (NextSeq 500, 2x150 bp) reads for the same organism.

I assembled the genome de novo with Flye 2.7 using: flye --nano-raw ~/nanopore.fastq -o ~/flye_assembly/ -g 35m -t 40

Total length: 32363893
Fragments: 86
Fragments N50: 4454457
Largest frg: 9249238
Scaffolds: 0
Mean coverage: 27
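
For readers unfamiliar with the "Fragments N50" figure in the Flye summary, here is a minimal sketch of how an N50 is usually computed from a list of contig lengths. The function name and the toy lengths are hypothetical and are not taken from this assembly.

```python
def n50(lengths):
    """Return the N50: the length L such that contigs of length >= L
    together contain at least half of the total assembled bases."""
    if not lengths:
        raise ValueError("no contig lengths given")
    total = sum(lengths)
    running = 0
    # Walk from the longest contig down until half the total is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Hypothetical toy contig lengths, not the 86 fragments reported above.
toy_contigs = [50, 40, 30, 20, 10]
print(n50(toy_contigs))
```

An N50 of about 4.45 Mb on a 32 Mb assembly therefore means that half of the assembled bases sit in fragments at least that long.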
Then I mapped those raw Nanopore reads onto the assembly using minimap2: minimap2 -ax map-ont ~/flye_assembly.fa ~/nanopore.fastq

330418 + 0 in total (QC-passed reads + QC-failed reads)
214492 + 0 primary
0 + 0 secondary
115926 + 0 supplementary
0 + 0 duplicates
0 + 0 primary duplicates
330418 + 0 mapped (100.00% : N/A)
214492 + 0 primary mapped (100.00% : N/A)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (N/A : N/A)
0 + 0 with itself and mate mapped
0 + 0 singletons (N/A : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
average coverage bam file, covered regions = 32.2968
I produced a sorted BAM file (mapping quality > 20).
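
The post does not say which command applied the mapping-quality filter, so the following is only a sketch of what "mapping quality > 20" means at the level of SAM records. The function name and the toy records are hypothetical. Note that the threshold here is strict (MAPQ > 20), as worded in the post, whereas `samtools view -q 20` keeps records with MAPQ >= 20.

```python
def filter_sam_by_mapq(sam_lines, threshold=20):
    """Yield header lines and mapped alignments with MAPQ > threshold."""
    for line in sam_lines:
        if line.startswith("@"):
            yield line  # keep header lines untouched
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a valid SAM alignment record
        flag = int(fields[1])
        mapq = int(fields[4])  # MAPQ is the fifth SAM column
        if flag & 0x4:
            continue  # read is unmapped
        if mapq > threshold:
            yield line


# Hypothetical toy records for illustration only.
toy_sam = [
    "@SQ\tSN:contig_1\tLN:1000",
    "read1\t0\tcontig_1\t1\t60\t4M\t*\t0\t0\tACGT\t!!!!",
    "read2\t0\tcontig_1\t5\t20\t4M\t*\t0\t0\tACGT\t!!!!",
    "read3\t16\tcontig_1\t9\t3\t4M\t*\t0\t0\tACGT\t!!!!",
]
for record in filter_sam_by_mapq(toy_sam):
    print(record.split("\t")[0])
```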
For Illumina data, I trimmed reads using popoolation_1.2.2/basic-pipeline/trim-fastq.pl (>70 bp, >30 q)

number of trimmed reads (x2): 10290402
total number of trimmed nucleotides (x2): 1476293771

I mapped the Illumina reads onto the Flye assembly using BWA-MEM (mapping quality > 20)

19111484 + 0 in total (QC-passed reads + QC-failed reads)
19096237 + 0 primary
0 + 0 secondary
15247 + 0 supplementary
0 + 0 duplicates
0 + 0 primary duplicates
19111484 + 0 mapped (100.00% : N/A)
19096237 + 0 primary mapped (100.00% : N/A)
19096237 + 0 paired in sequencing
9558345 + 0 read1
9537892 + 0 read2
18940426 + 0 properly paired (99.18% : N/A)
19079555 + 0 with itself and mate mapped
16682 + 0 singletons (0.09% : N/A)
89147 + 0 with mate mapped to a different chr
89147 + 0 with mate mapped to a different chr (mapQ>=5)
average coverage bam file, covered regions = 84.8934
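
As a rough cross-check of the reported Illumina depth, expected coverage is simply total sequenced bases divided by assembly length. The sketch below assumes that "(x2)" in the trimming summary means the nucleotide count applies to each of the two mate files, so the total is twice that figure. That reading is an assumption, although it matches a mean trimmed read length of about 143 bp.

```python
def expected_coverage(total_bases, genome_length):
    """Naive depth estimate: sequenced bases divided by genome length."""
    if genome_length <= 0:
        raise ValueError("genome_length must be positive")
    return total_bases / genome_length


assembly_length = 32_363_893              # "Total length" from the Flye summary
trimmed_bases_per_mate = 1_476_293_771    # assumption: the figure is per mate file
illumina_bases = 2 * trimmed_bases_per_mate

print(round(expected_coverage(illumina_bases, assembly_length), 1))
```

Under that assumption the naive estimate comes out slightly above 90x, which is in the same range as the 84.9x measured after the mapping-quality filter.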
I opened both BAMs in IGV and saw that a great many SNPs and indels (at 100% frequency) supported by the Illumina reads are completely absent from the Nanopore data (see the example in the snapshot below: top track is Nanopore, bottom track is Illumina).

[Image: IGV snapshot, Nanopore track on top, Illumina track below]

I think it is really strange to see such clear discrepancies. Is this normal? Does anyone have a clue what I am missing here? I'd appreciate any suggestions.

cheers, alex

illumina SNPs nanopore • 2.3k views

ADD COMMENT • link 3.4 years ago by al.grgr • 0

How certain are you that the same organism is represented in both datasets? Like you, I think the two datasets should match much better if they were indeed from the same organism.

ADD REPLY • link 3.4 years ago by Istvan Albert 102k

The DNA is from a fungus which I grew from the same spore stock.

ADD REPLY • link 3.4 years ago by al.grgr • 0

score 1 · Answer 1 · 2021-12-13

3.4 years ago

colindaven 7.4k

I think it's a technology specific problem but am not really sure.

As I understand it, you created the assembly using nanopore alone (not hybrid, or Illumina corrected asm).

You then mapped nanopore reads to the nanopore asm.

Then you mapped Illumina reads to the nanopore asm.

Correct ?

In this case, you are probably seeing sequences specific to one technology (unlikely with so many), or, more probably, a mapping problem with the Illumina short reads (though you rightly check for this with the MQ20 filter). I'd be fascinated to see why you are picking up more SNV differences rather than indels (which I'd expect).

Edit: my key point is - why should nanopore find SNVs in its own reads vs its own assembly (from those very reads)?

You could try taking the same/related fungus genome and mapping Illumina and nanopore reads to this.

Also, you could use an SNV caller (Longshot for ONT; freebayes etc. for Illumina) to quickly check the SNVs.
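
Once both call sets exist as VCF files, they can be compared site by site. The following is a minimal sketch, assuming plain-text VCF lines from each caller against the same assembly. The function names and toy records are hypothetical, and the sketch does not left-align or normalise indels, so equivalent indels written differently by the two callers would be counted as different.

```python
def load_variant_keys(vcf_lines):
    """Collect (chrom, pos, ref, alt) keys from VCF text lines."""
    keys = set()
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and VCF headers
        chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
        for allele in alt.split(","):  # one key per ALT allele
            keys.add((chrom, int(pos), ref, allele))
    return keys


def compare_call_sets(illumina_vcf_lines, nanopore_vcf_lines):
    """Split variants into shared and technology-specific sets."""
    illumina = load_variant_keys(illumina_vcf_lines)
    nanopore = load_variant_keys(nanopore_vcf_lines)
    return {
        "shared": illumina & nanopore,
        "illumina_only": illumina - nanopore,
        "nanopore_only": nanopore - illumina,
    }


# Hypothetical toy records for illustration only.
header = "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO"
illumina_vcf = [
    header,
    "contig_1\t100\t.\tA\tG\t50\tPASS\t.",
    "contig_1\t250\t.\tC\tT\t50\tPASS\t.",
    "contig_2\t42\t.\tG\tGA\t50\tPASS\t.",
]
nanopore_vcf = [header, "contig_1\t100\t.\tA\tG\t40\tPASS\t."]

for name, keys in compare_call_sets(illumina_vcf, nanopore_vcf).items():
    print(name, len(keys))
```

A large "illumina_only" set of fixed differences against an assembly built from the Nanopore reads is the pattern described in the question.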

ADD COMMENT • link 3.4 years ago by colindaven 7.4k

Thanks for the reply! Yes, I used only raw Nanopore reads to produce the assembly (I tried both with and without Flye's built-in polishing iterations; it does not make a difference). Then I mapped the raw Nanopore reads and the trimmed/filtered Illumina reads back to the assembly.

Indeed, you are right: why would the Nanopore track show any difference, since its data was used to produce the assembly in the first place? I guess my question is really about the Illumina data: why would Illumina show so many SNPs and indels (see below, there are indeed lots of insertions)?

[Image: IGV snapshot showing insertions]

cheers, alex

ADD REPLY • link 3.4 years ago by al.grgr • 0

Try polishing your initial assembly with Racon two or three times (repeated polishing), then with Pilon, to improve the assembly quality.

I bet a lot of your "SNVs" will disappear.
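
The polishing suggestion could be laid out roughly as below. This is only a sketch that builds the shell commands as strings without running them. All file names are hypothetical, the thread count is arbitrary, and `pilon` is assumed to be available as a wrapper script (as in some package installs; otherwise it is run via `java -jar` on the Pilon jar). Note that Pilon needs the Illumina reads remapped to the Racon-polished sequence, not to the original assembly.

```python
def polishing_commands(assembly, ont_reads, illumina_r1, illumina_r2,
                       racon_rounds=3, threads=8):
    """Return shell commands for repeated Racon polishing followed by Pilon."""
    commands = []
    current = assembly
    for i in range(1, racon_rounds + 1):
        overlaps = f"racon_round{i}.paf"      # hypothetical file names
        polished = f"racon_round{i}.fasta"
        # Map the ONT reads to the current draft, then polish it with Racon.
        commands.append(
            f"minimap2 -x map-ont -t {threads} {current} {ont_reads} > {overlaps}")
        commands.append(
            f"racon -t {threads} {ont_reads} {overlaps} {current} > {polished}")
        current = polished
    # Remap the Illumina reads to the final Racon output before running Pilon.
    bam = "illumina_vs_racon.bam"
    commands.append(f"bwa index {current}")
    commands.append(
        f"bwa mem -t {threads} {current} {illumina_r1} {illumina_r2} "
        f"| samtools sort -o {bam} -")
    commands.append(f"samtools index {bam}")
    commands.append(
        f"pilon --genome {current} --frags {bam} --output pilon_polished")
    return commands


for cmd in polishing_commands("flye_assembly.fa", "nanopore.fastq",
                              "illumina_R1.fastq", "illumina_R2.fastq",
                              racon_rounds=2):
    print(cmd)
```

Each Racon round takes the previous round's output as its target, which is what "repeat polishing" amounts to here.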

I believe this problem will become a lot easier with Nanopore's new R10.3 and R10.4 chemistries, which produce far fewer homopolymer errors.

Some reading for you on ONT errors:

https://www.nature.com/articles/s41467-020-20340-8

https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02282-6

ADD REPLY • link 3.4 years ago by colindaven 7.4k

Thanks, I'll look into it.

ADD REPLY • link 3.4 years ago by al.grgr • 0

my key point is - why should nanopore find SNVs in its own reads vs its own assembly (from those very reads)?

Keep in mind that Flye produces a polished consensus sequence.

ADD REPLY • link 3.4 years ago by andres.firrincieli 3.9k

True, but if he polishes the ONT assembly as described, I bet most of these "differences" will disappear.

ADD REPLY • link 3.4 years ago by colindaven 7.4k

From Flye you get only a consensus sequence (the assembly), not the corrected reads.

ADD REPLY • link 3.4 years ago by andres.firrincieli 3.9k

Sure, if you want corrected reads (note that haplotype information, e.g. for phasing, is not maintained), go for Canu for long-read-only correction.

Alternatively, for hybrid correction of long reads by short, https://github.com/DecodeGenetics/Ratatosk worked pretty well for me.

ADD REPLY • link 3.4 years ago by colindaven 7.4k

Indeed, in the Nanopore BAM track I did see occasional "heterozygous" sites.

ADD REPLY • link 3.4 years ago by al.grgr • 0

score 0 · Answer 2 · 2021-12-16

Just to conclude this post: I discovered the hard way that my Illumina samples were horribly mixed up with respect to the Nanopore ones, hence all these SNPs between Illumina and Nanopore. When the correct Illumina sample was mapped onto the Nanopore assembly, it looked very clean SNP-wise (though with many single-nucleotide insertions). I hope this mistake did not mislead you into thinking that Nanopore introduces lots of SNPs.

cheers, alex