We are sequencing a bacterial genome with a Gridion machine from ONT. As already expected, the error rate was quite high and I noticed lots of insertions and deletions compared to the reference genome.
Although I reckon the sequencing and the mapping to the reference genome both went well, I was wondering if 'polishing' the fastq files could improve the mapping stats. For instance, I checked in this post that the quality of the first 40-50 nucleotides in the reads tend to be low. Also I wanted to evaluate if/ to which extent selecting reads of of higher quality (e.g. 12 (phred scale), which is actually the median of reads quality) would enhance alignment.
I am now wondering how could I evaluate the mapping of the reads after polishing/ filtering the fastq files described above. I initially checked the alignment with tablet, but I am more after a quantitative (other than visual) assessment.
So far I have used samtools flagstat
which gave me:
7014141 + 0 in total (QC-passed reads + QC-failed reads)
781294 + 0 secondary
409167 + 0 supplementary
0 + 0 duplicates
6836429 + 0 mapped (97.47% : N/A)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (N/A : N/A)
0 + 0 with itself and mate mapped
0 + 0 singletons (N/A : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
I guess the percentage of mapped reads (97.47%) can be useful, but it is already a high mapping rate and I am not really expecting that trimming the first nucleotides will increase the mapping rate (please correct me in case I am interpreting the samtools flagstat
output wrongly).
I also found a tool called Qualimap, though it appears to be computationally very expensive and the command line tool does not appear to work on linux.
Has anyone already carried out this analysis i.e. how can one assess the improvement in the mapping to the reference genome after polishing the fastq files?
Sequence alignment was done with minimap2
and indexing, sorting with samtools
I have found this tool called MUMmer which has a function called
dnadiff
. I think I can then export the consensus sequence from the alignment using the 'raw' reads and that of the 'polished reads' and compare each of them to the reference in order to check if there was an improvement. Any suggestion?