I am working on a project that requires me to identify multiple FASTQ files with low quality. What would be a good starting point for this sort of data search?
Based on FastQC scores. The boxplots produced by FastQC display per-base quality scores; usually, a score of 30 or above is considered good quality. Is there a way to extract the multiple files (~100) that don't pass this threshold?
I don't have the data available; I want to identify such datasets. The overall aim is to determine which factors influence FASTQ data quality. For that, I already have a set of features available. All I need is labeled quality measurements from hundreds to thousands of FASTQ files.
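One way to collect those labels once FastQC has been run on each file: FastQC writes a summary.txt inside each <sample>_fastqc output folder, with one tab-separated line per module (STATUS, module name, filename). A minimal sketch that scans a results directory and collects the files whose "Per base sequence quality" module did not pass — the directory layout is an assumption; adjust the glob to your setup:

```python
# Sketch: scan FastQC summary.txt files and collect datasets whose
# per-base quality module is WARN or FAIL. Assumes FastQC was run with
# --extract so each sample has a <sample>_fastqc/summary.txt.
from pathlib import Path

MODULE = "Per base sequence quality"  # FastQC module name as written in summary.txt

def failing_files(fastqc_dir):
    """Return the fastq filenames whose per-base quality is WARN or FAIL."""
    bad = []
    for summary in Path(fastqc_dir).glob("*_fastqc/summary.txt"):
        for line in summary.read_text().splitlines():
            status, module, filename = line.split("\t")
            if module == MODULE and status in ("WARN", "FAIL"):
                bad.append(filename)
    return bad
```

The same loop can be extended to record every module's status per file, which gives you a labeled table rather than just a pass/fail list.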
You can simply generate simulated data with any features you like. Use randomreads.sh from BBTools or a similar read simulator.
Illumina quality parameters (from the randomreads.sh documentation):
maxq=36       Upper bound of quality values.
midq=28       Approximate average of quality values.
minq=20       Lower bound of quality values.
q=            Sets maxq, midq, and minq to the same value.
adderrors=t   Add substitution errors based on quality values, after mutations.
qv=4          Vary the base quality of reads by up to this much to simulate tile effects.
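To turn that into a labeled dataset, you can sweep midq across the range around the Q30 threshold and simulate one file per setting. A sketch that only builds the command lines (ref=, reads=, and out= are standard randomreads.sh arguments, but check your BBTools version; the file naming and quality sweep are illustrative choices, not anything prescribed by the tool):

```python
# Sketch: generate randomreads.sh command lines that sweep the average
# quality (midq) so the simulated files span low to high quality.
# minq/maxq are set +/-8 around midq; qv=4 adds tile-like variation.
def simulation_commands(ref="genome.fa", n_files=10, reads=100_000):
    cmds = []
    for i in range(n_files):
        midq = 15 + i * 3  # average quality: Q15, Q18, Q21, ...
        cmds.append(
            f"randomreads.sh ref={ref} reads={reads} "
            f"minq={max(2, midq - 8)} midq={midq} maxq={midq + 8} qv=4 "
            f"out=sim_midq{midq}.fq.gz"
        )
    return cmds

for cmd in simulation_commands(n_files=3):
    print(cmd)
```

Each output file then carries a known average quality in its name, which gives you the ground-truth label for free when you later run FastQC over the batch.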
Bad in what way?
Do you already have data and want to identify the bad ones, or do you need to download files that are bad, e.g. from GEO? I don't quite follow.