Hello.
I have received my first-ever nanopore sequences. They come from the genomic DNA of a fungal species. I have experience analyzing Illumina sequences but none with ONT data.
Background Information:
- Organism: Fusarium
- Target: Genome Assembly
- I have Illumina data too for the same samples.
We received the data in the following files:
- fast5.pass.tar
- fast5.fail.tar
- fastq.pass.tar
- fastq.fail.tar
What I did:
- Used only the fastq.pass data for downstream analysis, merging all FASTQ files into one FASTQ file per sample.
- Ran NanoStat on the raw FASTQ
- Ran Porechop on the raw FASTQ
- Ran NanoStat on the Porechop output
and I got the following results
When I run FastQC on my raw nanopore data, it shows that I have adapter content, which is polyA and polyG (image attached).
Even after running Porechop, these polyA and polyG stretches were still showing up in the FastQC report. I see some overrepresented sequences in the raw-data FastQC report, but after running Porechop there are none.
MultiQC report is as below.
Assembly with Flye 2.9.4-b1799 for the same sample, using 3 iterations, before and after running QC gives the following results.
QUESTIONS (keeping the target "Genome Assembly and annotation" in mind):
- Is it necessary to run Porechop on the raw data again? The report from the sequencing company says that they removed the adapters and did basic QC on the data.
- Is it important to remove these polyA and polyG adapters? Will it affect the assembly?
- If YES to question 2, which tool can do both? And should I run this tool on data already processed by Porechop?
- Based on the assembly stats, N50 increased with a negligible increase in the number of contigs. What is your opinion: should I use the Porechop-processed data for downstream analysis, or just the raw FASTQ files?
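For anyone comparing the two assemblies by hand, here is a minimal sketch of how N50 is conventionally computed from contig lengths. The function names `contig_lengths` and `n50` are made up for illustration and are not part of Flye or any other tool mentioned here; the assembler's own report remains the reference.

```python
def contig_lengths(fasta_lines):
    """Return the length of every record in an iterable of FASTA lines."""
    lengths, current = [], None
    for line in fasta_lines:
        line = line.strip()
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if current is not None:
                lengths.append(current)
            current = 0
        elif line and current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def n50(lengths):
    """Length L such that contigs of length >= L hold at least half of all bases."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    return 0  # empty input
```

With a real file this could be called as `n50(contig_lengths(open("assembly.fasta")))`. Because N50 depends on the total assembly size, it is worth comparing total length and contig count alongside it when judging the raw versus Porechop-processed runs.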
Use a proper program for nanopore data: https://github.com/a-slide/pycoQC
Do you actually see those poly-A/G stretches in your data? Perhaps that is some sort of FastQC artifact?
I checked and there are long stretches of A and G in the FASTQ files.
I did run pycoQC on the summary.txt file for the same data (HTML file: Google Drive link).
This does not show anything related to adapters.
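One way to make the "long stretches of A and G" check more informative is to record where in each read the runs occur, since runs concentrated at read ends look more like a technical artifact than runs in the middle of reads. The following is only a rough sketch: it assumes plain 4-line FASTQ records, and the function names and the thresholds `min_len=20` and `edge=50` are arbitrary choices for illustration, not recommendations from any tool.

```python
import re
from collections import Counter


def classify_runs(seq, base, min_len=20, edge=50):
    """Yield 'start', 'end' or 'internal' for each run of `base` >= min_len."""
    pattern = re.compile(f"{base}{{{min_len},}}")  # e.g. "A{20,}"
    for match in pattern.finditer(seq):
        if match.start() < edge:
            yield "start"
        elif len(seq) - match.end() < edge:
            yield "end"
        else:
            yield "internal"


def summarize_fastq(lines, bases="AG", min_len=20, edge=50):
    """Count homopolymer runs per (base, position class) over FASTQ lines."""
    counts = Counter()
    for i, line in enumerate(lines):
        if i % 4 == 1:  # second line of each 4-line record is the sequence
            seq = line.strip().upper()
            for base in bases:
                for where in classify_runs(seq, base, min_len, edge):
                    counts[(base, where)] += 1
    return counts
```

For a merged file this could be run as `summarize_fastq(open("sample.fastq"))`, or with `gzip.open("sample.fastq.gz", "rt")` for compressed input; both file names are placeholders.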
It has been a while since I worked with fungal sequences but long stretches of poly-A/-G's nonetheless sound suspicious. Perhaps someone else will have an input.
If you want to remove poly-A/-G then bbduk.sh (or for that matter fastp) should be able to do this. The question is whether they are real and should be left alone. If that is the case, then running any additional chopping is likely not warranted.
Having a few poly-A and poly-G stretches in the reads is not such a problem. The real issue is whether they are included in the assemblies. Why not do a few k-mer analyses, or even BLAST or grep, to find out if long stretches of your assemblies are problematic? I doubt it.
I don't do these checks on our nanopore assemblies and have never been confronted by errors on fungal or plant genomes.
That said - I think the nanopore tool QC ecosystem could definitely be improved more, especially considering adapters. It's worth noting there is a fork of porechop which was still being maintained when I last looked - https://github.com/bonsai-team/Porechop_ABI
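As a rough illustration of the grep-style check on an assembly, the sketch below lists every homopolymer run of at least `min_len` bases with its contig and position. The names `read_fasta` and `homopolymer_runs` and the default `min_len=15` are assumptions made for this example only.

```python
import re


def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Keep only the first word of the header as the contig name.
            name, chunks = (line[1:].split() or [""])[0], []
        elif line and name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def homopolymer_runs(lines, min_len=15):
    """Yield (contig, 1-based start, base, run length) for runs >= min_len."""
    # One base followed by at least min_len - 1 copies of itself.
    pattern = re.compile(r"([ACGT])\1{%d,}" % (min_len - 1))
    for name, seq in read_fasta(lines):
        for match in pattern.finditer(seq.upper()):
            yield name, match.start() + 1, match.group(1), len(match.group(0))
```

Something like `for hit in homopolymer_runs(open("assembly.fasta")): print(*hit, sep="\t")` would print one line per run. Unlike a text search in an editor, this joins wrapped sequence lines first, so runs that span a line break in the FASTA file are not missed.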
I just did a quick CTRL+F search in one of the assembly.fasta files and found that:
- A is present in runs of up to 27 consecutive bases; searching for the string AAAAAAAAAAAAAAAA gives 3 results
- G is present in runs of up to 23 consecutive bases; searching for the string GGGGGGGGGGGGGG gives 1 result, and at 20 bases it gives 3 results
Will these be problematic? If yes, how should I remove them?