I am working on a project that requires me to identify multiple FASTQ files with low quality. What would be a good starting point for this sort of data search?
Based on FastQC scores. The boxplots produced by FastQC display per-base quality scores; usually, a score of 30 or above is considered good quality. Is there a way to extract the multiple files (~100) that don't pass this threshold?
I don't have the data available; I want to identify such datasets. The overall aim is to determine which factors influence FASTQ data quality. For that, I already have a set of features available. All I need is labeled quality measurements from hundreds to thousands of FASTQ files.
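One way to collect those labels once FastQC has been run on each file: FastQC writes a summary.txt inside each <sample>_fastqc output folder, with one tab-separated line per module (STATUS, module name, filename). A minimal sketch that scans a results directory and collects the files whose "Per base sequence quality" module did not pass — the directory layout is an assumption; adjust the glob to your setup:

```python
# Sketch: scan FastQC summary.txt files and collect datasets whose
# per-base quality module is WARN or FAIL. Assumes FastQC was run with
# --extract so each sample has a <sample>_fastqc/summary.txt.
from pathlib import Path

MODULE = "Per base sequence quality"  # FastQC module name as written in summary.txt

def failing_files(fastqc_dir):
    """Return the fastq filenames whose per-base quality is WARN or FAIL."""
    bad = []
    for summary in Path(fastqc_dir).glob("*_fastqc/summary.txt"):
        for line in summary.read_text().splitlines():
            status, module, filename = line.split("\t")
            if module == MODULE and status in ("WARN", "FAIL"):
                bad.append(filename)
    return bad
```

The same loop can be extended to record every module's status per file, which gives you a labeled table rather than just a pass/fail list.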
You can simply generate simulated data with any features you like. Use randomreads.sh from BBTools or a similar read simulator.
Illumina quality parameters (from the randomreads.sh documentation):
maxq=36       Upper bound of quality values.
midq=28       Approximate average of quality values.
minq=20       Lower bound of quality values.
q=            Sets maxq, midq, and minq to the same value.
adderrors=t   Add substitution errors based on quality values, after mutations.
qv=4          Vary the base quality of reads by up to this much to simulate tile effects.
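To turn that into a labeled dataset, you can sweep midq across the range around the Q30 threshold and simulate one file per setting. A sketch that only builds the command lines (ref=, reads=, and out= are standard randomreads.sh arguments, but check your BBTools version; the file naming and quality sweep are illustrative choices, not anything prescribed by the tool):

```python
# Sketch: generate randomreads.sh command lines that sweep the average
# quality (midq) so the simulated files span low to high quality.
# minq/maxq are set +/-8 around midq; qv=4 adds tile-like variation.
def simulation_commands(ref="genome.fa", n_files=10, reads=100_000):
    cmds = []
    for i in range(n_files):
        midq = 15 + i * 3  # average quality: Q15, Q18, Q21, ...
        cmds.append(
            f"randomreads.sh ref={ref} reads={reads} "
            f"minq={max(2, midq - 8)} midq={midq} maxq={midq + 8} qv=4 "
            f"out=sim_midq{midq}.fq.gz"
        )
    return cmds

for cmd in simulation_commands(n_files=3):
    print(cmd)
```

Each output file then carries a known average quality in its name, which gives you the ground-truth label for free when you later run FastQC over the batch.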
Bad in what way?
Do you already have data and want to identify the bad ones, or do you need to download files that are bad, e.g. from GEO? I don't quite follow.