Question

How to evaluate sequence alignment (e.g. number of indels) of ONT data after data preprocessing

6.2 years ago

BCArg ▴ 90

We are sequencing a bacterial genome on a GridION machine from ONT. As expected, the error rate was quite high, and I noticed many insertions and deletions relative to the reference genome.

Although I reckon the sequencing and the mapping to the reference genome both went well, I was wondering whether 'polishing' the fastq files could improve the mapping stats. For instance, I read in this post that the quality of the first 40-50 nucleotides of the reads tends to be low. I also wanted to evaluate whether, and to what extent, selecting reads of higher quality (e.g. at least 12 on the Phred scale, which is the median read quality) would improve the alignment.
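The trimming and filtering described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names (`mean_read_quality`, `read_fastq`, `trim_and_filter`) are hypothetical, the FASTQ is assumed to use four lines per record with Phred+33 encoding, and the mean quality is computed by averaging per-base error probabilities rather than the Phred scores themselves. The two reads in the example are made up.

```python
import io
import math

def mean_read_quality(qual, offset=33):
    """Mean Phred quality, obtained by averaging per-base error probabilities."""
    probs = [10 ** (-(ord(c) - offset) / 10) for c in qual]
    return -10 * math.log10(sum(probs) / len(probs))

def read_fastq(handle):
    """Parse 4-line FASTQ records from an open text handle (assumed layout)."""
    while True:
        header = handle.readline().rstrip()
        if not header:
            return
        seq = handle.readline().rstrip()
        handle.readline()  # '+' separator line
        qual = handle.readline().rstrip()
        yield header[1:], seq, qual

def trim_and_filter(records, head_trim=50, min_quality=12.0):
    """Drop the first `head_trim` bases, then keep reads whose remaining
    bases reach a mean quality of at least `min_quality`."""
    for name, seq, qual in records:
        seq, qual = seq[head_trim:], qual[head_trim:]
        if qual and mean_read_quality(qual) >= min_quality:
            yield name, seq, qual

# Made-up two-read example (not real ONT data): '5' is Q20, '#' is Q2.
example = io.StringIO(
    "@good\nACGTACGTAC\n+\n5555555555\n"
    "@bad\nACGTACGTAC\n+\n##########\n"
)
kept = list(trim_and_filter(read_fastq(example), head_trim=2, min_quality=12.0))
print(kept)
```

Running the same alignment on the output of such a filter and on the raw reads gives two BAM files whose statistics can then be compared directly.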

I am now wondering how I could evaluate the mapping of the reads after the polishing/filtering of the fastq files described above. I initially checked the alignment with Tablet, but I am after a quantitative (rather than visual) assessment.

So far I have used samtools flagstat which gave me:

7014141 + 0 in total (QC-passed reads + QC-failed reads)
781294 + 0 secondary
409167 + 0 supplementary
0 + 0 duplicates
6836429 + 0 mapped (97.47% : N/A)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (N/A : N/A)
0 + 0 with itself and mate mapped
0 + 0 singletons (N/A : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

I guess the percentage of mapped reads (97.47%) can be useful, but it is already a high mapping rate and I do not really expect that trimming the first nucleotides will increase it (please correct me if I am misinterpreting the samtools flagstat output).
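One point about reading this output: the "in total" and "mapped" lines count alignment records, so both include the secondary and supplementary alignments. A rough Python sketch of the per-read rate, assuming every secondary and supplementary record is a mapped alignment (the helper names `parse_flagstat` and `primary_mapping_rate` are hypothetical, and only the four relevant lines of the output above are used):

```python
import re

# The four relevant lines copied from the flagstat output above.
FLAGSTAT = """\
7014141 + 0 in total (QC-passed reads + QC-failed reads)
781294 + 0 secondary
409167 + 0 supplementary
6836429 + 0 mapped (97.47% : N/A)
"""

def parse_flagstat(text):
    """Collect the QC-passed counts for the categories needed below."""
    pattern = r"(\d+) \+ (\d+) (in total|secondary|supplementary|mapped)\b"
    counts = {}
    for line in text.splitlines():
        m = re.match(pattern, line)
        if m:
            counts[m.group(3)] = int(m.group(1))
    return counts

def primary_mapping_rate(counts):
    """Remove secondary/supplementary records from both numerator and denominator.
    Assumption: all such records are mapped alignments."""
    extra = counts["secondary"] + counts["supplementary"]
    primary_total = counts["in total"] - extra
    primary_mapped = counts["mapped"] - extra
    return primary_mapped, primary_total, primary_mapped / primary_total

mapped, total, rate = primary_mapping_rate(parse_flagstat(FLAGSTAT))
print(f"{mapped} of {total} reads mapped ({rate:.2%})")
```

Under that assumption, 5,645,968 of 5,823,680 reads are mapped, roughly 96.9%, so the per-read rate is slightly lower than the 97.47% reported per record.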

I also found a tool called Qualimap, though it appears to be computationally very expensive and the command-line tool does not appear to work on Linux.

Has anyone already carried out this kind of analysis? That is, how can one assess the improvement in the mapping to the reference genome after polishing the fastq files?

Sequence alignment was done with minimap2; sorting and indexing were done with samtools.

alignment sequencing next-gen • 2.3k views

ADD COMMENT • link updated 6.1 years ago by colindaven 7.8k • written 6.2 years ago by BCArg ▴ 90

I have found a tool called MUMmer, which has a function called dnadiff. I think I can export the consensus sequence from the alignment of the 'raw' reads and from that of the 'polished' reads, and then compare each of them to the reference to check whether there was an improvement. Any suggestions?

ADD REPLY • link 6.2 years ago by BCArg ▴ 90

score 0 · Answer 1 · 2019-07-23

Qualimap might not have been adapted for ONT data, so it may not be appropriate here.
If you want to easily correct the ONT reads, do an assembly with Canu; it will output corrected reads as part of the assembly process if all goes well.
97% is already a great alignment rate.
MUMmer is good but does not deliver SAM output AFAIK.
There is a new tool called Filtlong which might help you filter reads if you like.
If you want more stats on the alignment, try
```
samtools stats
```

then run `multiqc` on the output to get a nice report.
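To compare a raw and a filtered run by numbers rather than by eye, the text output of `samtools stats` can be reduced to a few values. Below is a minimal Python sketch, assuming the tab-separated layout described in the header comments of that output: `SN` lines hold summary numbers, and `ID` lines hold indel length, number of insertions and number of deletions. The function name `summarize_stats` is hypothetical and the sample values are made up, not taken from the data in this thread.

```python
def summarize_stats(text):
    """Extract error rate and indel counts from `samtools stats` text output.
    Assumed layout: 'SN<TAB>key:<TAB>value...' and
    'ID<TAB>length<TAB>insertions<TAB>deletions'."""
    summary = {}
    insertions = deletions = 0
    for line in text.splitlines():
        fields = line.split("\t")
        if fields[0] == "SN" and len(fields) >= 3:
            summary[fields[1].rstrip(":")] = fields[2]
        elif fields[0] == "ID" and len(fields) >= 4:
            insertions += int(fields[2])
            deletions += int(fields[3])
    mapped_bases = int(summary["bases mapped (cigar)"])
    return {
        "error_rate": float(summary["error rate"]),
        "insertions": insertions,
        "deletions": deletions,
        "indels_per_kb": 1000 * (insertions + deletions) / mapped_bases,
    }

# Made-up excerpt for illustration only.
sample = (
    "SN\tbases mapped (cigar):\t1000000\t# more accurate\n"
    "SN\terror rate:\t5.000000e-02\t# mismatches / bases mapped (cigar)\n"
    "ID\t1\t3000\t4000\n"
    "ID\t2\t500\t700\n"
)
result = summarize_stats(sample)
print(result)
```

Running this on the stats files from the raw and the filtered alignments gives an error rate and an indel count per kilobase of mapped sequence for each, which is the kind of before/after comparison asked about.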