Question

Why Must Longer Reads Be Trimmed or Divided Into 36 bp?

0

13.0 years ago

Liuyunlong ▴ 130

In the paper "Diversity of Human Copy Number Variation and Multicopy Genes", there is a step in the read preprocessing pipeline, before the reads are mapped to the reference genome: "All reads exceeding 36 base pairs (bp) in length were truncated to 36 bp, or divided into their constituent nonoverlapping 36-bp sequences to eliminate potential mapping biases between genomes sequenced at different read lengths." Is this necessary, and why 36 bp? If I have a dataset in which most reads across all samples are about 95-100 bp long after QC, can I just trim all reads to a uniform length such as 95 bp? If I use mrFAST and divide longer reads into 36-bp pieces, which tools can handle paired-end sequences so that the mates stay paired after being divided?

Or is it necessary only because of a limitation of the alignment tool, mrFAST? BWA handles reads of different lengths dynamically and doesn't have this problem, right? Any advice would be helpful. Thanks.

cnv short aligner • 3.3k views

ADD COMMENT • link updated 13.0 years ago by Gustavo ▴ 530 • written 13.0 years ago by Liuyunlong ▴ 130

1

What is it that you want to do with the mapped reads?

ADD REPLY • link 13.0 years ago by Sean Davis 27k

1

To predict copy number variation with read-depth-based methods.

ADD REPLY • link 13.0 years ago by Liuyunlong ▴ 130

0

Do they have 50bp SOLiD reads? Maybe they just wanted to make sure they've gotten rid of all the adapters, so they chose a fixed restrictive length of 36 for everything.

ADD REPLY • link 13.0 years ago by Damian Kao 16k

score 5 · Answer 1 · 2012-01-04

5

13.0 years ago

Chris Miller 22k

From a quick glance at the abstract of that paper, I'm guessing that they wanted to be able to directly compare the results across many samples that were sequenced with different read lengths. Under typical circumstances, there shouldn't be any reason to split your reads up. In fact, longer reads allow you to map into repetitive regions that shorter reads can't access. This enhances your ability to detect CNV in these potentially unstable regions.

ADD COMMENT • link 13.0 years ago by Chris Miller 22k

0

Thanks, I agree. On the other hand, if reads of different lengths from different samples (e.g. 36, 76, 100, 120 bp) are mapped and used to compute read depth, do the predicted CNV results carry any bias other than the potential mapping biases?

ADD REPLY • link 13.0 years ago by Liuyunlong ▴ 130

0

I'm not sure exactly what you mean by that. The read depth in repetitive regions is going to be affected by the length of the reads. If this isn't explicitly corrected for, you might inadvertently call CNAs that don't exist. In your case, where all reads are 95-100 bp, I wouldn't worry about that slight difference, but I would still choose an algorithm that explicitly corrects for mappability.
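To make "correction for mappability" a bit more concrete, here is a minimal sketch of one very simple form it could take: scale each window's read count by the fraction of that window that is uniquely mappable at your read length, and mask windows where that fraction is too low to trust. The function name, the per-window inputs, and the 0.5 cutoff are all assumptions made for illustration; this is not how any particular CNV caller does it.

```python
def mappability_corrected_depth(read_counts, mappability, min_mappability=0.5):
    """Scale per-window read counts by the mappable fraction of each window.

    read_counts: reads counted in each genomic window (hypothetical input).
    mappability: fraction (0..1) of each window that is uniquely mappable
                 at the read length in use (hypothetical input).
    Windows below min_mappability are returned as None (masked) rather than
    corrected, because dividing by a tiny fraction would inflate noise.
    """
    corrected = []
    for count, fraction in zip(read_counts, mappability):
        if fraction < min_mappability:
            corrected.append(None)
        else:
            corrected.append(count / fraction)
    return corrected


if __name__ == "__main__":
    counts = [100, 50, 10]
    mappable = [1.0, 0.5, 0.1]
    print(mappability_corrected_depth(counts, mappable))
```

Because mappability depends on read length, the `mappability` values would need to be computed separately for each read length before samples are compared.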

ADD REPLY • link 13.0 years ago by Chris Miller 22k

score 1 · Answer 2 · 2012-01-05

Another possible consideration: certain short read mappers can accept only a small number of mismatches to the reference before they fail to map the read. Longer reads have a higher probability of accruing mismatches for a given error rate... and for some technologies the error rate increases with read length.
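As a rough illustration of that effect, the sketch below computes the probability that a read stays within a fixed mismatch limit, assuming independent errors at a constant per-base rate (a simple binomial model). The 1% error rate and the limit of 2 mismatches are made-up example values, not the settings of any specific mapper, and the model ignores true variants and position-dependent error rates.

```python
from math import comb


def prob_within_mismatch_limit(read_length, error_rate, max_mismatches):
    """P(number of errors <= max_mismatches) under a binomial error model."""
    return sum(
        comb(read_length, k)
        * error_rate ** k
        * (1 - error_rate) ** (read_length - k)
        for k in range(max_mismatches + 1)
    )


if __name__ == "__main__":
    # Example values only: 1% per-base error, at most 2 mismatches allowed.
    for length in (36, 76, 100):
        p = prob_within_mismatch_limit(length, error_rate=0.01, max_mismatches=2)
        print(f"{length} bp: P(mappable under limit) = {p:.3f}")
```

With a fixed mismatch limit, the probability falls as the read gets longer, which is the effect described above.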

It is therefore possible for longer reads to have a lower mapping rate than shorter ones (although reads that are too short have higher mapping ambiguity). A simple way to make samples comparable is to trim the reads as described.
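For what it's worth, truncating or splitting reads is easy to script. The following is a minimal sketch, not the pipeline used in the paper: it works on already-parsed (name, sequence, quality) values, drops any trailing remainder shorter than the chunk length, and uses a `_partN` plus `/1` and `/2` naming scheme that is purely my own assumption for keeping mates matched. Whether a given paired-end aligner treats such sub-read pairs sensibly is a separate question that this does not answer.

```python
def truncate_read(seq, qual, length=36):
    """Keep only the first `length` bases and their quality values."""
    return seq[:length], qual[:length]


def split_read(name, seq, qual, length=36):
    """Yield non-overlapping chunks of exactly `length` bases.

    A trailing remainder shorter than `length` is discarded.
    """
    for i in range(len(seq) // length):
        start = i * length
        end = start + length
        yield f"{name}_part{i}", seq[start:end], qual[start:end]


def split_pair(name, mate1, mate2, length=36):
    """Split both mates and yield chunk pairs that share one chunk name.

    mate1 and mate2 are (seq, qual) tuples. Chunk i of mate 1 is paired with
    chunk i of mate 2 (a hypothetical convention); extra chunks from the
    longer mate are dropped so both outputs keep the same number of records.
    """
    parts1 = split_read(name, *mate1, length=length)
    parts2 = split_read(name, *mate2, length=length)
    for (chunk_name, s1, q1), (_, s2, q2) in zip(parts1, parts2):
        yield (chunk_name + "/1", s1, q1), (chunk_name + "/2", s2, q2)


if __name__ == "__main__":
    seq, qual = "ACGT" * 25, "I" * 100  # a 100-bp example read
    for record in split_read("read1", seq, qual):
        print(record[0], len(record[1]))
```

Writing the two halves of each yielded pair to two files in the same order keeps the files in sync, which is what most paired-end tools expect.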