Question

Nanopore Data Quality Check

1

3 months ago

Umer ▴ 130

Hello.

I have received my first ever nanopore sequences. They come from the genomic DNA of a fungal species. I have experience analyzing Illumina sequences but none with ONT.

Background Information:

Organism: Fusarium
Target: Genome Assembly
I have Illumina data too for the same samples.

We received the data in the following files:

fast5.pass.tar
fast5.fail.tar
fastq.pass.tar
fastq.fail.tar

What I did:

Used only the fastq.pass data for downstream analysis, merging all FASTQ files into one FASTQ file per sample.
Ran NanoStat on the raw FASTQ
Ran Porechop on the raw FASTQ
Ran NanoStat on the Porechop output

and I got the following results

Nanostat results on raw and porechop processed fastq

When I run FastQC on my raw nanopore data, it shows adapter content consisting of polyA and polyG (image attached).

fastQC adapters

Even after running Porechop, these polyA and polyG stretches still show up in the FastQC report. I see some overrepresented sequences in the FastQC report for the raw data, but after running Porechop there are none.

MultiQC report is as below.

MultiQC report Adapter Content

Assembling the same sample with Flye 2.9.4-b1799 (3 iterations), before and after running QC, gives the following results.

Flye Assembly

QUESTIONS: keeping the target "Genome Assembly and annotation" in mind

1. Is it necessary to run Porechop on the raw data again? The report from the sequencing company says that they removed the adapters and did basic QC on the data.
2. Is it important to remove these polyA and polyG adapters? Will they affect the assembly?
3. If YES to question 2, which tool can remove both? And should I run that tool on data already processed by Porechop?
4. Based on the assembly stats, N50 increased with a negligible increase in the number of contigs. What is your opinion: should I use the Porechop-processed data for downstream analysis, or just the raw FASTQ files?
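Since the last question hinges on N50, here is a minimal Python sketch of how that statistic is computed from contig lengths. The function name `n50` and the two lists of lengths are made up for illustration and are not taken from the Flye output in this thread.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    N50 is the length L such that the sequences of length >= L
    together contain at least half of all bases.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    return 0  # empty input


# Hypothetical contig lengths: same total size, same number of contigs.
before = [500, 400, 300, 200, 100]
after = [900, 300, 200, 50, 50]
print(n50(before), n50(after))  # prints: 400 900
```

The toy numbers show that N50 can change a lot while the contig count and total assembly size stay the same, so it is probably worth comparing total length and largest contig alongside N50 before deciding between the two inputs.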

QC Fungi genome-assembly Nanopore • 500 views

ADD COMMENT • link 3 months ago by Umer ▴ 130

0

Use a proper program for nanopore data: https://github.com/a-slide/pycoQC

Do you actually see those poly-A/G stretches in your data? Perhaps that is some sort of artifact of FastQC?

ADD REPLY • link 3 months ago by GenoMax 147k

0

I checked and there are long stretches of A and G in the FASTQ files.
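A check like this can be made reproducible with a short script. The following is a minimal Python sketch, not part of any tool mentioned in this thread; the function names, the 20-base threshold, and the 50-base end window are arbitrary assumptions. It counts reads that contain a long A or G run and, among those, reads where every run sits near a read end, which might hint at tails or adapter remnants rather than genomic sequence. It assumes plain 4-line FASTQ records.

```python
import re
from itertools import islice


def read_sequences(fastq_lines):
    """Yield the sequence line of each 4-line FASTQ record."""
    for seq in islice(fastq_lines, 1, None, 4):
        yield seq.strip().upper()


def homopolymer_summary(sequences, bases="AG", min_run=20, end_window=50):
    """Count reads with homopolymer runs and reads whose runs are all terminal."""
    # Builds a pattern such as "A{20,}|G{20,}".
    pattern = re.compile("|".join(f"{b}{{{min_run},}}" for b in bases))
    total = with_run = terminal_only = 0
    for seq in sequences:
        total += 1
        hits = list(pattern.finditer(seq))
        if not hits:
            continue
        with_run += 1
        # A run is "terminal" if it starts or ends within end_window of a read end.
        if all(m.start() < end_window or m.end() > len(seq) - end_window
               for m in hits):
            terminal_only += 1
    return {"reads": total, "with_run": with_run, "terminal_only": terminal_only}


# Tiny in-memory example: r1 has a 3' A tail, r2 an internal G run, r3 nothing.
demo = [
    "@r1", "ACGT" * 30 + "A" * 25, "+", "I" * 145,
    "@r2", "ACGT" * 20 + "G" * 22 + "ACGT" * 20, "+", "I" * 182,
    "@r3", "ACGT" * 40, "+", "I" * 160,
]
print(homopolymer_summary(read_sequences(demo)))
```

For a real file, an open text handle can be passed instead of the list, for example `gzip.open("sample.fastq.gz", "rt")` (file name hypothetical).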

I did run pycoQC on the summary.txt file for the same data. HTML file: Google Drive link

This does not show anything related to adapters.

ADD REPLY • link 3 months ago by Umer ▴ 130

1

It has been a while since I worked with fungal sequences, but long stretches of poly-A/-G nonetheless sound suspicious. Perhaps someone else will have some input.

If you want to remove poly-A/-G, then bbduk.sh (or, for that matter, fastp) should be able to do this. The question, though, is whether they are real and should be left alone.

The report from sequencing company says that they removed the adapters and did basic QC on the data.

If that is the case then running any additional chopping is likely not warranted.

ADD REPLY • link 3 months ago by GenoMax 147k

0

Having a few poly-A and poly-G stretches in the reads is not such a problem. The real issue is whether they end up in the assemblies. Why not do a few k-mer analyses, or even BLAST or grep, to find out whether long stretches of your assemblies are problematic? I doubt they are.
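A grep-style check of an assembly could look roughly like the Python sketch below. It is only an illustration: `parse_fasta`, `find_runs`, the 15-base threshold, and the toy contigs are assumptions, not output from any real assembly. It reports the contig, 1-based start position, base, and length of every homopolymer run, so each hit can then be inspected or BLASTed in context.

```python
import re


def parse_fasta(lines):
    """Yield (name, sequence) pairs from FASTA-formatted lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        elif line:
            chunks.append(line.upper())
    if name is not None:
        yield name, "".join(chunks)


def find_runs(lines, min_run=15):
    """Return (contig, 1-based start, base, run length) for each homopolymer run."""
    # One base followed by at least min_run - 1 copies of the same base.
    pattern = re.compile(r"([ACGT])\1{%d,}" % (min_run - 1))
    return [
        (name, m.start() + 1, m.group(1), m.end() - m.start())
        for name, seq in parse_fasta(lines)
        for m in pattern.finditer(seq)
    ]


# Toy assembly: contig_1 carries a 27-base A run split over two FASTA lines.
toy = [
    ">contig_1",
    "ACGT" + "A" * 20,
    "A" * 7 + "CGTGGGG",
    ">contig_2",
    "TTGACC",
]
for hit in find_runs(toy):
    print(hit)  # prints: ('contig_1', 5, 'A', 27)
```

With a real assembly, the list can be replaced by an open file handle such as `open("assembly.fasta")`. Joining the lines per contig before searching matters, because a plain text search can miss runs that wrap across line breaks.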

I don't do these checks on our nanopore assemblies and have never been confronted by errors on fungal or plant genomes.

That said - I think the nanopore tool QC ecosystem could definitely be improved more, especially considering adapters. It's worth noting there is a fork of porechop which was still being maintained when I last looked - https://github.com/bonsai-team/Porechop_ABI

ADD REPLY • link 3 months ago by colindaven 7.0k

0

I just did a quick CTRL+F search in one of the assembly.fasta files and found that:

A is present in runs of up to 27 consecutive bases (a string like AAAAAAAAAAAAAAAA), giving 3 results.

G is present in runs of up to 23 consecutive bases (a string like GGGGGGGGGGGGGG), giving 1 result, and 3 results at 20 bases.

Will these be problematic? If yes, how should I remove them?

ADD REPLY • link 3 months ago by Umer ▴ 130