Hi,
I have a dataset generated on an Illumina MiSeq. FastQC failed the "Per base sequence quality" and "Sequence length distribution" modules.
I did a quality trim using a sliding window of 5, a step of 2 and a minimum quality score of 20, and filtered out reads shorter than 70 bp. This removes low-quality bases, but when I look at the sequence length distribution, I notice that the number of reads of length 300 was reduced from nearly 500,000 to 160,000. I'd appreciate any advice on this.
Thanks
Can you provide details (images) of your FastQC results? Can you elaborate on what you mean by "number of reads of length 300 were reduced from nearly 500,000 to 160,000"?
Thanks a lot. I'm new to this kind of analysis and really appreciate your advice.
The images below show the initial per-base qualities and sequence length distribution:
http://www.freeimagehosting.net/upl.php
After trimming and filtering out reads < 70 bp:
I hope the images are clear. Am I doing the correct thing? If you need any more clarification, please ask me.
Thanks Sumudu
Thanks! Those look fine to me. I might not have trimmed so aggressively (I usually use a Phred cutoff of 5), but otherwise that looks correct.
If you want, you can also trim the reads beyond position 250. You may get better alignment.
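To illustrate why the count of 300 bp reads drops after trimming, here is a minimal Python sketch of the kind of sliding-window trim described in the question (window 5, step 2, minimum quality 20, minimum length 70). The thread does not say which trimming tool was used, so the details are assumptions: Phred+33 quality encoding, the window's mean quality as the criterion, and cutting the read at the start of the first failing window. The function names and the optional `crop` parameter (for the fixed trim at 250) are hypothetical, not taken from any particular tool.

```python
def phred_scores(quality_string, offset=33):
    """Convert a FASTQ quality string to Phred scores (assumes Phred+33)."""
    return [ord(ch) - offset for ch in quality_string]


def sliding_window_trim(seq, qual, window=5, step=2, min_q=20):
    """Cut the read at the start of the first window whose mean quality < min_q.

    Assumption: real trimmers differ in details (which end is scanned,
    where exactly the cut is placed), so treat this as an illustration only.
    """
    scores = phred_scores(qual)
    for start in range(0, max(len(scores) - window + 1, 0), step):
        win = scores[start:start + window]
        if sum(win) / len(win) < min_q:
            return seq[:start], qual[:start]
    return seq, qual


def trim_and_filter(seq, qual, min_len=70, crop=None, **kwargs):
    """Optionally crop to a fixed length, quality-trim, then drop short reads.

    Returns (seq, qual), or None if the trimmed read is shorter than min_len.
    """
    if crop is not None:
        seq, qual = seq[:crop], qual[:crop]
    seq, qual = sliding_window_trim(seq, qual, **kwargs)
    if len(seq) < min_len:
        return None
    return seq, qual


# A 300 bp read with 200 high-quality bases ('I' = Q40) and a 100 bp
# low-quality tail ('#' = Q2).
read_seq = "A" * 300
read_qual = "I" * 200 + "#" * 100
trimmed = trim_and_filter(read_seq, read_qual)
print(len(trimmed[0]))
```

Under these assumptions, any 300 bp read with a low-quality tail comes out shorter, so it moves from the 300 bp bin into a shorter bin of the length distribution rather than being lost. A read that is high quality along its full length stays at 300 bp, and only reads that end up below 70 bp are discarded.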