23 months ago
tanbiswas6
Hi
I am doing WES data analysis and it failed at per base sequence content. It also has some sequence duplication. Below is a snapshot of my data.
Please let me know how to process this file.
Thank you.
You have no adenine at the 4th base across all sequences. In general, the first 7 to 8 bases are bad. If that's an option for you, just trim them off or ignore them.
Thanks for the suggestion. Can you please suggest how to remove the first 7-8 bases without disturbing the rest of each read in the file?
Thanks.
There is likely no need to do any processing at this point. If a problem with the data shows up in downstream analysis, you can come back and dig into this more. FastQC limits are designed for plain genomic sequencing. Depending on the kind of experiment, there may be "failures" on one or more tests. This does not automatically mean that the data has a problem or is bad.
While it is a bit odd to have a majority of T's at cycle 4, the data may still be fine.
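If you want to look at the composition at the first few cycles yourself, it can be tallied directly from the FASTQ records. This is only a rough sketch, not how FastQC computes its plot: the function name base_composition is made up for illustration, and it assumes plain 4-line FASTQ records (no wrapped sequence lines).

```python
from collections import Counter

def base_composition(fastq_lines, n_cycles=8):
    """Return, for each of the first n_cycles positions, the fraction of each base."""
    counts = [Counter() for _ in range(n_cycles)]
    for i, line in enumerate(fastq_lines):
        if i % 4 != 1:  # the sequence is the 2nd line of every 4-line record
            continue
        seq = line.strip().upper()
        for pos, base in enumerate(seq[:n_cycles]):
            counts[pos][base] += 1
    fractions = []
    for c in counts:
        total = sum(c.values())
        # Empty dict if no read was long enough to cover this position
        fractions.append({b: c[b] / total for b in "ACGTN"} if total else {})
    return fractions

# Tiny made-up example: both reads carry a T at cycle 4 (index 3)
example = ["@r1", "ACGTACGT", "+", "IIIIIIII",
           "@r2", "TTGTACGA", "+", "IIIIIIII"]
comp = base_composition(example, n_cycles=4)
```

For a real file you would pass an open file handle instead of the list, e.g. base_composition(open("sample.fastq")), where sample.fastq is a placeholder name.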
Yes. That's where my concern is. I know that the other reads are fine, but if I use this file without removing those reads, won't there be some problem during data analysis or publishing?
Most likely not, but if you want to be absolutely safe you can trim away the first 7 bases of all reads. Tools like seqtk can do that.
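To make clear what such trimming does: the same number of leading characters is dropped from both the sequence line and the quality line of every record, and nothing else is touched. A minimal sketch, assuming plain 4-line FASTQ records; trim_fastq_prefix is a hypothetical helper for illustration, not seqtk's implementation, and for real data a dedicated tool is the better choice.

```python
def trim_fastq_prefix(fastq_lines, n=7):
    """Drop the first n bases (and their quality scores) from every FASTQ record."""
    out = []
    for i, line in enumerate(fastq_lines):
        line = line.rstrip("\n")
        if i % 4 in (1, 3):  # line 2 = sequence, line 4 = qualities
            line = line[n:]
        out.append(line)  # header and '+' lines pass through unchanged
    return out

# Made-up record: 10 bases in, 3 bases out after trimming 7
record = ["@r1", "ACGTACGTAC", "+", "IIIIIIIIII"]
trimmed = trim_fastq_prefix(record, n=7)
```

Note that reads shorter than n end up empty in this sketch; a real tool would normally let you discard those.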
How do you know that? Since FastQC sub-samples your data (it does not look at every read in your file), you at least have enough reads with that pattern in the sample it takes.
You can use bbduk.sh from the BBMap suite to trim the first 7-8 bases like so