Question

Ideally What Should The Reads Length Be Before Mapping To A Reference?

11.8 years ago

Jordan ★ 1.3k

Hi,

I'm working on reads from SOLiD data. I trimmed the reads using quality scores and ended up with reads with an average length of 21 nt (before trimming, the average length was 41 nt). I'm worried that I trimmed the reads too much.

I looked at the quality report after trimming. About 4.4% of the reads are 1 bp long, 3.8% are 2 bp, 4% are 3 bp, and 4.3% are 4 bp. But nearly 14% of the reads are 50 bp long and 8% are 35 bp long. Each of the other lengths accounts for approx. 2% of the reads.

Note: The max length of the reads is 50bp.

Since 16.5% (4.4 + 3.8 + 4 + 4.3) of my reads are shorter than 5 bp, I'm worried about whether this read file can be used for mapping against a reference genome (human). Can anyone shed some light on this, or maybe point to papers that discuss how to deal with such issues? Thanks!
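
As a quick check of the arithmetic above, here is a minimal Python sketch that sums the reported percentages below a length cutoff. The dictionary holds only the length bins quoted in this post (the remaining lengths, at roughly 2% each, are left out), and the function name is made up for illustration.

```python
def percent_shorter_than(length_percent, min_len):
    """Sum the percentage of reads whose length is below min_len."""
    return sum(pct for length, pct in length_percent.items() if length < min_len)


# Percent of reads per read length (bp), as quoted in the post.
# Lengths not listed here (approx. 2% each) are omitted.
reported = {1: 4.4, 2: 3.8, 3: 4.0, 4: 4.3, 35: 8.0, 50: 14.0}

short_pct = percent_shorter_than(reported, 5)
print(f"{short_pct:.1f}% of reads are shorter than 5 bp")
```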

mapping trimming • 2.6k views

ADD COMMENT • link updated 11.8 years ago by Ashutosh Pandey 12k • written 11.8 years ago by Jordan ★ 1.3k

score 0 · Answer 1 · 2013-02-13

11.8 years ago

Ashutosh Pandey 12k

You trimmed a lot. Personally, I wouldn't go below 35 bp in any case. I am not sure why the trimming resulted in the loss of half of each read. Maybe you are doing something wrong. Make sure that you are using the right Phred score for trimming. Most of the reads should only lose 5-10 bases at the end. I think you should look into the different Phred scores for different platforms.

ADD COMMENT • link 11.8 years ago by Ashutosh Pandey 12k

I use CLC for my analysis. I have paired reads (35 bp for forward and 50 bp for reverse), so when I trim, the forward reads are bound to end up below 35 bp. Also, in a read like T131210.12, the trimming results in T131210, because '.' is an ambiguous call, analogous to 'N' in base space. So if the '.' is near the beginning of the sequence, you are bound to lose a lot of bp in a read. That partly explains why there is such heavy trimming.
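
To illustrate the effect described above, here is a rough Python sketch that cuts a color-space read at the first '.' call. It only mimics the behaviour described in this post; it is not CLC's actual trimming algorithm, and the function name is hypothetical.

```python
def truncate_at_ambiguous(colorspace_read, ambiguous="."):
    """Keep everything before the first ambiguous color call.

    Assumption for illustration only: the read is cut at the first '.',
    so a '.' near the start removes most of the read.
    """
    return colorspace_read.partition(ambiguous)[0]


print(truncate_at_ambiguous("T131210.12"))
print(truncate_at_ambiguous("T1.31210012"))
```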

And I'm using a Phred score of 17 as the limit, which I thought was alright as 70% of the sequences had a quality >=15. So you are saying that mapping with such heavy trimming could give bad results?
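
For reference, a Phred score Q corresponds to an estimated error probability of 10^(-Q/10). A small Python sketch of that standard conversion follows; the function name is just illustrative.

```python
def phred_to_error_prob(q):
    """Convert a Phred quality score to the estimated per-call error probability."""
    return 10 ** (-q / 10)


for q in (15, 17, 20):
    print(f"Q{q}: error probability = {phred_to_error_prob(q):.4f}")
```

By this conversion, a cutoff of 17 corresponds to an error probability of about 2% per call, and a quality of 15 to about 3%.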

ADD REPLY • link 11.8 years ago by Jordan ★ 1.3k

If you use a color-space-aware aligner, then having dots in the sequence does not affect the alignment. The quality trimming is also independent of the sequence. In general, it is better not to overdo trimming. I would say it is usually better to keep a read longer, with lower qualities at the end, than to trim it back to enforce an average read quality. Again, it depends on the aligner, but many tools will readily clip the ends if that makes for a better alignment.

ADD REPLY • link 11.8 years ago by Istvan Albert 102k

I will try to redo the analysis based on what you said and see how the results look. But how do I know which is the better measure?

ADD REPLY • link 11.8 years ago by Jordan ★ 1.3k

One more thing: it is natural to lose some reads because of trimming. By "lose" I mean that trimming makes them too short to be mapped correctly. Normally people don't use reads below 30 bp and simply discard them. Can you tell me what percentage of your total reads passed the filtering test?
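
A minimal Python sketch of such a length filter, assuming the trimmed reads are already loaded as plain strings. The function name and the toy reads are made up for illustration; the 30 bp cutoff is the rule of thumb mentioned above.

```python
def filter_by_length(reads, min_len=30):
    """Return the reads that are at least min_len long and the percent that passed."""
    kept = [read for read in reads if len(read) >= min_len]
    pct_passed = 100.0 * len(kept) / len(reads) if reads else 0.0
    return kept, pct_passed


# Toy example: four trimmed reads of length 50, 35, 29 and 4.
toy_reads = ["A" * 50, "C" * 35, "G" * 29, "T" * 4]
kept, pct_passed = filter_by_length(toy_reads, min_len=30)
print(f"{len(kept)} of {len(toy_reads)} reads passed ({pct_passed:.1f}%)")
```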

ADD REPLY • link 11.8 years ago by Ashutosh Pandey 12k

I'm not exactly sure what you mean by pass. But if you are referring to how many reads were trimmed, 97.13% were trimmed (at least by 1 bp).

ADD REPLY • link 11.8 years ago by Jordan ★ 1.3k

Assuming the reads you have are of poor quality, if you are losing 10-20% of the reads because of trimming (that is, their length drops below 30 bp), I would think it is fine. But if you are losing 70% of the reads, then it might be a problem because you won't get enough coverage.
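
To see how read loss affects coverage, a back-of-the-envelope estimate is coverage = (number of reads x mean read length) / genome size. The Python sketch below applies it with made-up numbers: the read count, mean length and genome size are placeholders, not values from this thread, and it assumes the surviving reads keep the same mean length.

```python
def expected_coverage(n_reads, mean_read_len, genome_size):
    """Rough expected depth: total sequenced bases divided by genome size."""
    return n_reads * mean_read_len / genome_size


# Hypothetical inputs for illustration only.
n_reads = 100_000_000
mean_read_len = 40
genome_size = 3_100_000_000  # approximate human genome size in bp

full = expected_coverage(n_reads, mean_read_len, genome_size)
for fraction_lost in (0.1, 0.2, 0.7):
    remaining = full * (1 - fraction_lost)
    print(f"lose {fraction_lost:.0%} of reads: about {remaining:.2f}x coverage")
```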

ADD REPLY • link 11.8 years ago by Ashutosh Pandey 12k