I have used a trimmer to quality-trim my reads. My raw reads were 126-150 bp. I did not set a minimum length to filter out short reads, so my processed file contains reads as short as 0-4 bp. I need to do a reference-based assembly using these paired-end reads, and I am afraid of losing pairing information by filtering out the shorter reads.
What should my minimum length be? I have 40x coverage with the raw reads.
Run FastQC on your files and see where the quality drops off along the read length. I'm assuming you have 2x150 bp reads, so your quality will probably start to drop off around 125 bp. You will lose some pairs when filtering on both quality and length. Ideally, keep only the pairs in which both reads pass both QC steps, and then align to your reference.
Thanks for your reply! I used fastq_quality_trimmer from the FASTX-Toolkit. I trimmed just one file, the one with my forward reads, using a quality threshold of 30. And yes, my reads are Illumina, 150 bp in length.
What do you think is a good length to set as a filter?
I am afraid of losing paired information by filtering out the shorter reads.
Very short reads may make the assembly difficult or unusable. You could use reformat.sh from the BBMap suite to filter your trimmed paired-end reads while keeping them in sync: reformat.sh in1=read1.fq in2=read2.fq out1=filt1.fq out2=filt2.fq ml=30 (change ml=, the minimum length, as needed).
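To make "keeping them in sync" concrete, here is a minimal Python sketch of a pair-aware length filter. It is not how reformat.sh is implemented; the function names (read_fastq, filter_pairs) are made up for illustration, and the policy of dropping a pair when either mate is too short is an assumption you may want to change. It also assumes plain 4-line FASTQ records with both files in the same read order.

```python
from itertools import islice


def read_fastq(handle):
    """Yield (header, sequence, quality) tuples from a 4-line-per-record FASTQ stream."""
    handle = iter(handle)  # works for open files and for plain lists of lines
    while True:
        block = list(islice(handle, 4))
        if len(block) < 4:
            return  # end of input (or a truncated final record)
        header, seq, _plus, qual = (line.rstrip("\n") for line in block)
        yield header, seq, qual


def filter_pairs(reads1, reads2, min_len=30):
    """Yield (mate1, mate2) only when BOTH mates are at least min_len bases long.

    Assumption: a pair is dropped as a unit if either mate is too short,
    so the two outputs always stay in the same order and of equal size.
    """
    for r1, r2 in zip(reads1, reads2):
        if len(r1[1]) >= min_len and len(r2[1]) >= min_len:
            yield r1, r2
```

Because each pair is tested and written as a unit, the two output files can never drift out of step, which is exactly what goes wrong when the forward and reverse files are filtered separately.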
Thanks for your suggestion! From your post, I infer that reformat.sh would retain only reads that come in pairs, with a minimum length of 30 (if ml=30). Is this right?
Yes to the first question, and yes to the second. bbduk.sh is the scan/trim program in the BBMap suite. If you are doing reference-based assembly, then you can afford to trim at a lower quality threshold. Q20 or better may be fine.
Can you compare the speeds of bbduk.sh and the FASTX-Toolkit quality trimmer? I found FASTX to be pretty slow.
Perhaps I'll try both Q20 and Q30 separately and see how good my depth and breadth of coverage are.
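For comparing the two trimming thresholds, mean depth and breadth of coverage can be computed from a per-base depth table. The sketch below assumes three tab-separated columns (reference name, position, depth) with zero-depth positions included, as `samtools depth -a` produces; the function name coverage_stats and the min_depth parameter are illustrative choices, not part of any tool.

```python
def coverage_stats(depth_lines, min_depth=1):
    """Return (mean_depth, breadth) from lines of 'ref<TAB>pos<TAB>depth'.

    breadth is the fraction of listed positions with depth >= min_depth.
    Assumption: every reference position is listed, including depth 0.
    """
    total = covered = depth_sum = 0
    for line in depth_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        _ref, _pos, depth = line.split("\t")[:3]
        depth = int(depth)
        total += 1
        depth_sum += depth
        if depth >= min_depth:
            covered += 1
    if total == 0:
        return 0.0, 0.0
    return depth_sum / total, covered / total
```

Running this on the alignments from the Q20-trimmed and Q30-trimmed reads gives two pairs of numbers that can be compared directly.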
It sounds like you may have trimmed too severely. What quality threshold did you use, with which program, and what are you planning to use the data for?
Thanks for your reply, Brian! I used fastq_quality_trimmer from the FASTX-Toolkit and trimmed with a quality threshold of 30. I want to use these reads for a reference-based genome assembly.
Q30 is too high for almost all purposes. I would suggest trying a much lower threshold, like Q20 or Q15, and seeing how it impacts the assembly. I also do not recommend FASTX, as it is (as you have found) quite slow and cannot handle paired reads.
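To see why the threshold matters so much for read length, here is a simplified Python sketch of 3'-end quality trimming. It is not the exact algorithm of fastq_quality_trimmer or bbduk.sh; it assumes Phred+33 quality encoding, and the function name trim_3prime is hypothetical.

```python
def trim_3prime(seq, qual, threshold=20, offset=33):
    """Remove bases from the 3' end while their Phred score is below threshold.

    Assumption: qualities are Phred+33 encoded (offset=33).
    Returns the trimmed (sequence, quality) pair, possibly empty.
    """
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - offset < threshold:
        end -= 1
    return seq[:end], qual[:end]


# Example read: five bases at Q40 ('I'), three at Q20 ('5'), two at Q2 ('#').
seq, qual = "ACGTACGTAC", "IIIII555##"
print(len(trim_3prime(seq, qual, threshold=20)[0]))  # 8 bases survive at Q20
print(len(trim_3prime(seq, qual, threshold=30)[0]))  # 5 bases survive at Q30
```

A read whose bases all fall below the threshold is trimmed to length 0, which is how 0 bp reads end up in the output when no minimum length is set.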