This is a basic question, but I couldn't find an answer anywhere. Traditionally, the quality score of a base in a FASTQ file encodes the probability that the base call is wrong. This is reasonable for Illumina, where the typical sequencing error is a single-base substitution (for example, an "A" appears in a read where there should be a "G"). However, for some sequencing platforms, such as those of Oxford Nanopore Technologies (ONT), deletions and insertions are also frequent. In a FASTQ file, each base of a read has exactly one symbol denoting its quality.
For example, the problem arises in this case: there is a read with the sequence ATTGCTAC. Suppose, to simplify, that every base is called correctly with 100% probability, but there is a highly probable insertion of TAT between G and C. How can the probability of this insertion be encoded in FASTQ if each quality symbol corresponds strictly to one base of the read?
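To make the one-symbol-per-base constraint concrete, here is a minimal Python sketch that decodes a quality string into per-base error probabilities. It assumes the common Phred+33 encoding; the record name `@example_read` and the quality string `IIIIIIII` are made up for illustration (Q40 stands in for "certain", since a true 100% cannot be written as a finite Phred score).

```python
def phred33_to_error_probs(quality):
    """Convert a Phred+33 quality string to per-base error probabilities."""
    # Q = ord(symbol) - 33, and P(error) = 10 ** (-Q / 10)
    return [10 ** (-(ord(symbol) - 33) / 10) for symbol in quality]

# Hypothetical FASTQ record for the read ATTGCTAC (name and qualities are invented).
record = ["@example_read", "ATTGCTAC", "+", "IIIIIIII"]
sequence, quality = record[1], record[3]

# Exactly one quality symbol per called base: there is no slot
# for an event located between two bases, such as between G and C.
assert len(sequence) == len(quality)

error_probs = phred33_to_error_probs(quality)
for base, p in zip(sequence, error_probs):
    print(base, f"{p:.6f}")
```

Each printed line pairs one base with one probability, which is exactly why an event between two bases has nowhere to go in this layout.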
My main questions are: 1) Do FASTQ files with ONT reads incorporate the probabilities of insertions and deletions, or do they take into account only the probabilities of single-base substitutions? 2) If indel probabilities are encoded in FASTQ, how exactly is this done?
I would be grateful for any help.
I think such information comes from the mapper/aligner (in reference-based assemblies). Read about the SAM format (the most widely used alignment format), especially CIGAR strings.
shelkmike
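As a rough illustration of the CIGAR suggestion, the sketch below parses a CIGAR string and counts inserted and deleted bases. The CIGAR `4M3D4M` is a hypothetical alignment of the read ATTGCTAC to a reference that carries three extra bases between G and C; SAM records that as a deletion from the reference (`D`). Note that a CIGAR string describes the indels an aligner chose to place, not their probabilities.

```python
import re

# One CIGAR operation: a length followed by an operation code from the SAM spec.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def parse_cigar(cigar):
    """Split a CIGAR string into (length, operation) pairs."""
    return [(int(length), op) for length, op in CIGAR_OP.findall(cigar)]

def indel_bases(cigar):
    """Return (inserted, deleted) base counts relative to the reference."""
    ops = parse_cigar(cigar)
    inserted = sum(n for n, op in ops if op == "I")
    deleted = sum(n for n, op in ops if op == "D")
    return inserted, deleted

# Hypothetical alignment: 4 aligned bases (ATTG), 3 reference bases
# missing from the read, then 4 more aligned bases (CTAC).
example_cigar = "4M3D4M"
print(parse_cigar(example_cigar))
print(indel_bases(example_cigar))
```

In a real workflow a library such as pysam would normally handle SAM/BAM parsing; the regular expression here is only meant to show what the string contains.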