Hi,
I'm working with SOLiD reads. I trimmed them by quality score and ended up with an average read length of 21 nt (before trimming, the average was 41 nt). I'm worried that I trimmed the reads a bit too much.
I looked at the quality report after trimming. About 4.4% of the reads are 1 bp long, 3.8% are 2 bp, 4% are 3 bp, and 4.3% are 4 bp. But nearly 14% of the reads are 50 bp long and 8% are 35 bp long. Every other length accounts for approx. 2% each.
Note: The max length of the reads is 50bp.
Since 16.5% (4.4 + 3.8 + 4 + 4.3) of my reads are shorter than 5 bp, I'm worried about whether this read file can be used for mapping against a reference genome (human). Can anyone shed some light on this, or maybe point to papers that deal with such issues? Thanks!
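For anyone who wants to redo this arithmetic for a different cutoff, here is a minimal sketch. It assumes the length report is available as a simple length-to-percent mapping; the dictionary below holds only the values quoted in the post, and the function name is made up for illustration.

```python
# Percent of reads at each length after trimming, taken from the post.
# Lengths that the post does not list are omitted, so this is a partial histogram.
length_pct = {1: 4.4, 2: 3.8, 3: 4.0, 4: 4.3, 35: 8.0, 50: 14.0}


def pct_shorter_than(hist, cutoff):
    """Sum the percentages of all lengths strictly below cutoff (hypothetical helper)."""
    return sum(pct for length, pct in hist.items() if length < cutoff)


short_pct = pct_shorter_than(length_pct, 5)
print(f"{short_pct:.1f}% of reads are shorter than 5 bp")
```

With the full histogram from the report, the same function could be called with a cutoff such as 30 to see how many reads would remain usable.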
I use CLC for my analysis. I have paired reads (35 bp for forward and 50 bp for reverse), so when I trim, the reads are bound to fall below 35 bp. Also, in a read like T131210.12, the trimming results in T131210, as '.' is an ambiguous nucleotide analogous to 'N' in base nucleotides. So if the '.' is near the beginning of the sequence, you are bound to lose a lot of bp in a read. That partly explains why the trimming is so heavy. I'm also using a Phred score of 17 as the limit, which I thought was all right as 70% of the sequences had a quality >= 15. So are you saying that mapping with such heavy trimming could give bad results?
If you use a color-space-aware aligner, dots in the sequence do not affect the alignment. Quality trimming is also independent of the sequence. In general it is better not to overdo trimming; I would usually rather keep a read longer with lower-quality ends than trim it back to enforce an average read quality. Again, it depends on the aligner, but many tools will readily clip the ends if that makes for a better alignment.
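As a rough illustration of light end trimming, the sketch below only drops trailing calls under a cutoff and leaves an isolated low-quality call inside the read alone. This is a deliberately simple rule chosen for the example, not what CLC or any particular trimmer does; the quality values are made up, and the cutoff of 17 is the one mentioned in the thread.

```python
def trim_3prime(quals, min_q=17):
    """Return how many calls remain after dropping trailing calls below min_q."""
    keep = len(quals)
    while keep > 0 and quals[keep - 1] < min_q:
        keep -= 1
    return keep


def error_prob(q):
    """Convert a Phred score to an error probability: P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


# Made-up per-call qualities for one read.
quals = [30, 28, 25, 12, 27, 22, 16, 10, 8]
print(trim_3prime(quals))         # 6
print(f"{error_prob(17):.3f}")    # 0.020
```

Only the last three calls are removed here; the low call at the fourth position stays, which is the kind of read an aligner that clips ends can still handle.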
I will try to redo the analysis based on what you said and see how the results look. But how do I know which is the better measure?
One more thing: it is natural to lose some reads because of trimming. By "lose" I mean that trimming makes them too short to be mapped correctly. Normally people don't use reads below 30 bp and simply discard them. Can you tell me what percentage of your total reads passed the filtering?
I'm not exactly sure what you mean by pass. But if you are referring to how many reads were trimmed, 97.13% were trimmed (at least by 1 bp).
Assuming the reads you have are of poor quality, if you are losing 10-20% of them to trimming (read length falling below 30 bp), I would think that is fine. But if you are losing 70% of the reads, it might be a problem because you won't get enough coverage.
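A quick way to get that number is to count the trimmed reads that fall under the 30 bp cutoff. The sketch below assumes you can export the trimmed read lengths as a list of integers; the lengths shown are made up for illustration, and the function name is hypothetical.

```python
def pct_lost(trimmed_lengths, min_len=30):
    """Percent of reads whose trimmed length falls below min_len."""
    if not trimmed_lengths:
        raise ValueError("no reads given")
    short = sum(1 for n in trimmed_lengths if n < min_len)
    return 100.0 * short / len(trimmed_lengths)


# Made-up trimmed lengths, for illustration only.
example = [50, 35, 21, 4, 1, 50, 35, 28, 30, 12]
print(f"{pct_lost(example):.0f}% of reads fall below 30 bp")
```

A result in the 10-20% range would match the "fine" case above, while something near 70% would point to a coverage problem.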