Question

How are probabilities of insertions and deletions encoded in FASTQ?

1

6.5 years ago

shelkmike ★ 1.7k

This is a basic question; however, I couldn't find an answer anywhere. Traditionally, the quality score of a base in FASTQ indicates the probability that this base is wrong. This is reasonable for Illumina, where the typical sequencing error is a single base substitution (for example, "A" occurs in a sequencing read where "G" in fact should be). However, for some sequencing machines, like the sequencing machines of Oxford Nanopore Technologies (ONT), deletions and insertions are also frequent. In a FASTQ file, each base of a read has exactly one symbol denoting its quality.
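For reference, the standard Phred relationship between a quality symbol and an error probability can be sketched in a few lines of Python. This is only an illustration: it assumes the common ASCII offset of 33, and the function names are made up for this example.

```python
PHRED_OFFSET = 33  # assumption: Sanger / Illumina 1.8+ style encoding


def phred_to_error_prob(q: int) -> float:
    """Convert a Phred score Q to an error probability p = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def decode_quality_string(qual: str, offset: int = PHRED_OFFSET) -> list[float]:
    """Return one error probability per quality symbol (one per base)."""
    return [phred_to_error_prob(ord(symbol) - offset) for symbol in qual]


# 'I' is Q40, '5' is Q20, '+' is Q10 and '!' is Q0 under offset 33.
probs = decode_quality_string("I5+!")
print(probs)
```

Note that the output has exactly one probability per symbol, which is the constraint the question is about.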

For example, the problem arises in this case: there is a read with the sequence ATTGCTAC. The probability that each base is correct is 100% (let's simplify), but an insertion of TAT between G and C is quite likely. How can the probability of this insertion be encoded in FASTQ if each quality symbol in FASTQ corresponds strictly to one base of a read?

My main questions are: 1) Do FASTQ files with ONT reads incorporate probabilities of insertions and deletions, or do they take into account only probabilities of single base substitutions? 2) If probabilities of indels are encoded in FASTQ, how exactly is it done?

I will be grateful for any help.

fastq phred quality score nanopore pacbio indels • 3.1k views

ADD COMMENT • link updated 5 months ago by Jeremy Leipzig 23k • written 6.5 years ago by shelkmike ★ 1.7k

0

I think such information comes from the mapper/aligner (in reference-based assemblies). Read about the SAM format (the most widely used alignment format), especially CIGAR strings.
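As a rough illustration of what a CIGAR string records, here is a small Python sketch that totals the bases per operation. The CIGAR string used is hypothetical, and `count_cigar_ops` is a name invented for this example, not part of any library.

```python
import re
from collections import Counter

# One CIGAR element is a length followed by an operation letter from the SAM spec.
CIGAR_ELEMENT = re.compile(r"(\d+)([MIDNSHP=X])")


def count_cigar_ops(cigar: str) -> Counter:
    """Sum the lengths of each CIGAR operation (I = insertion, D = deletion)."""
    counts = Counter()
    for length, op in CIGAR_ELEMENT.findall(cigar):
        counts[op] += int(length)
    return counts


# Hypothetical alignment: 4 matched bases, a 3-base insertion, 4 matched bases,
# a 1-base deletion, then 2 matched bases.
ops = count_cigar_ops("4M3I4M1D2M")
print(ops["I"], ops["D"])
```

These counts describe differences between the read and the reference after alignment; they say nothing about whether a difference is a sequencing error or a real variant.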

ADD REPLY • link 6.5 years ago by cpad0112 21k

score 3 · Answer 1 · 2019-02-28

3

6.5 years ago

Devon Ryan 105k

No, FASTQ files only contain per-base call quality scores. There's no information about the likelihood of an InDel.

InDels tend to be randomly distributed in nanopore data, with the exception of an enrichment in homopolymer stretches.

ADD COMMENT • link 6.5 years ago by Devon Ryan 105k

0

Thank you. Also, can you give a link to a source where I can read about this?

ADD REPLY • link 6.5 years ago by shelkmike ★ 1.7k

0

Any review of ONT data should talk about InDel distributions; I've even seen ONT talk about it in their presentations. For Phred scores, there won't be anything that mentions that.

ADD REPLY • link 6.5 years ago by Devon Ryan 105k

0

Thank you once again. Sorry for the doubts, but if nothing mentions it, how do you know that probabilities of indels are not reflected in FASTQ in some way?

ADD REPLY • link 6.5 years ago by shelkmike ★ 1.7k

2

Because FASTQ files aren't structured in a way that would permit that.
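To make the structural point concrete, here is a minimal Python sketch of a FASTQ record, assuming the common four-line layout and using the read from the question with made-up quality symbols. `FastqRecord` and `parse_fastq_record` are illustrative names only.

```python
from typing import NamedTuple


class FastqRecord(NamedTuple):
    name: str
    sequence: str
    quality: str


def parse_fastq_record(text: str) -> FastqRecord:
    """Parse a single four-line FASTQ record (assumes no line wrapping)."""
    header, sequence, separator, quality = text.strip().split("\n")
    if not header.startswith("@") or not separator.startswith("+"):
        raise ValueError("not a four-line FASTQ record")
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality lengths differ")
    return FastqRecord(header[1:], sequence, quality)


record = parse_fastq_record("@read1\nATTGCTAC\n+\nIIIIIIII\n")
# Each quality symbol pairs with exactly one called base.
pairs = list(zip(record.sequence, record.quality))
print(pairs)
```

There is no slot between the G and the C where a separate insertion probability could be stored; every symbol belongs to a base that was actually called.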

ADD REPLY • link 6.5 years ago by Devon Ryan 105k

1

FASTQ quality scores encode the probability of this specific nucleotide being in that specific position of the read. They don't know anything about variants (neither SNPs nor indels), because you only get that by comparing the read to the reference genome.

ADD REPLY • link 6.5 years ago by WouterDeCoster 48k

0

By indels I mean not genomic variants, but sequencing errors which result in insertion or deletion of a sequence in a sequencing read compared to the genome.

ADD REPLY • link 6.5 years ago by shelkmike ★ 1.7k

1

Well, at the location of a false-deletion-sequencing-error the pore essentially skipped a few nucleotides - didn't read them carefully enough. This might result in a lower quality for the nucleotides surrounding this fake-deletion, but this doesn't inform us about what the cause might be - the indel. So no, the probability of an indel is not explicitly encoded in the fastq.

ADD REPLY • link 6.5 years ago by WouterDeCoster 48k

0

It is possible for Q scores to reflect homopolymer errors, but not indels. I don't know enough about ONT to say how the signal is processed to produce Q scores. The example the OP provided would not fit the category of a homopolymer.

ADD REPLY • link 6.5 years ago by Jeremy Leipzig 23k

0

The sequences in the FASTQ files represent one molecule. The terms insertion and deletion only make sense in comparison to something.

The quality scores given in the FASTQ files are the result of comparing the signal to the noise. At each measurement point there must be a signal. It is not possible that you sometimes measure nothing ("because there is a deletion") and get a signal again at a later point. The DNA molecule is continuous and has no gaps.

ADD REPLY • link 6.5 years ago by finswimmer 16k

score 1 · Answer 2 · 2025-02-25

1

5 months ago

shelkmike ★ 1.7k

Answering my own question.
1) The Phred quality scores of Oxford Nanopore reads take into account the probabilities of indel errors (https://labs.epi2me.io/quality-scores/).
2) Same for PacBio reads (personal communication with PacBio tech support).

ADD COMMENT • link 5 months ago by shelkmike ★ 1.7k

0

Yes, they calibrate the base qualities called off the machine so that they add up to the real-world empirical quality of your read, but there's nothing that they can tell you in a FASTQ that says "hey, this TAT is likely an artefactual insertion, rather than 3 substitutions".
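One way to picture "adding up" per-base qualities into a read-level figure is to average the error probabilities rather than the Q values. The Python sketch below does that under the assumption of an ASCII offset of 33; it is an illustration with invented function names, not a description of how any basecaller calibrates its scores.

```python
import math


def expected_errors(qual: str, offset: int = 33) -> float:
    """Sum of per-base error probabilities: the expected number of wrong bases."""
    return sum(10 ** (-(ord(symbol) - offset) / 10) for symbol in qual)


def mean_read_quality(qual: str, offset: int = 33) -> float:
    """Read-level Q score derived from the mean per-base error probability."""
    if not qual:
        raise ValueError("empty quality string")
    mean_prob = expected_errors(qual, offset) / len(qual)
    return -10 * math.log10(mean_prob)


# Two Q40 bases and two Q10 bases: the low-quality bases dominate the result.
read_q = mean_read_quality("II++")
print(round(read_q, 2), round(expected_errors("II++"), 4))
```

The result is a single number per read. It can be compared with the error rate observed after alignment, but it still cannot say which kind of error (substitution, insertion or deletion) is expected at a given position.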

ADD REPLY • link 5 months ago by Jeremy Leipzig 23k