Why Must Longer Reads Be Trimmed or Divided into 36 bp?
12.9 years ago
Liuyunlong ▴ 130

In the paper "Diversity of Human Copy Number Variation and Multicopy Genes", before reads are mapped to the reference genome, there is a step in the read preprocessing pipeline: "All reads exceeding 36 base pairs (bp) in length were truncated to 36 bp, or divided into their constituent nonoverlapping 36-bp sequences to eliminate potential mapping biases between genomes sequenced at different read lengths." Is this necessary, and why 36 bp? If I have a dataset in which most reads in all samples are about 95-100 bp long after QC, can I just trim all reads to a uniform length such as 95 bp? If I use mrFAST and divide longer reads into 36-bp pieces, which tools can help me keep paired-end reads paired after they are divided?

Or is it necessary only because of a limitation of the alignment tool, mrFAST? BWA handles reads of different lengths dynamically and doesn't have this problem, right? Any advice will be helpful. Thanks.

cnv short aligner • 3.2k views

What is it that you want to do with the mapped reads?


To predict copy number variation with read-depth-based methods.


Do they have 50bp SOLiD reads? Maybe they just wanted to make sure they've gotten rid of all the adapters, so they chose a fixed restrictive length of 36 for everything.

12.9 years ago

From a quick glance at the abstract of that paper, I'm guessing that they wanted to be able to directly compare the results across many samples that were sequenced with different read lengths. Under typical circumstances, there shouldn't be any reason to split your reads up. In fact, longer reads allow you to map into repetitive regions that shorter reads can't access. This enhances your ability to detect CNV in these potentially unstable regions.


Thanks, I agree. On the other hand, if reads of different lengths from different samples (e.g. 36, 76, 100, 120 bp) are mapped and used to call read depth, are the predicted CNV results biased in any way other than the potential mapping biases?


I'm not sure exactly what you mean by that. The read depth in repetitive regions is going to be affected by the length of the reads. If this isn't explicitly corrected for, you might inadvertently call CNAs that don't exist. In your case, where all reads are 95-100 bp, I wouldn't worry about that slight difference, but I would still choose an algorithm that explicitly corrects for mappability.

12.9 years ago
Gustavo ▴ 530

Another possible consideration: certain short read mappers can accept only a small number of mismatches to the reference before they fail to map the read. Longer reads have a higher probability of accruing mismatches for a given error rate... and for some technologies the error rate increases with read length.
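To put rough numbers on this, here is a minimal sketch that assumes sequencing errors are independent with the same per-base rate everywhere in the read (a simplification, since real error rates usually rise toward the 3' end). The function name, the 1% error rate, and the 2-mismatch limit are illustrative assumptions, not settings of any particular mapper.

```python
from math import comb

def prob_exceeds_mismatch_limit(read_len, error_rate, max_mismatches):
    """Probability that a read carries more than `max_mismatches` errors.

    Assumes each base is wrong independently with probability `error_rate`
    (a binomial model; hypothetical simplification for illustration).
    """
    p_within_limit = sum(
        comb(read_len, k) * error_rate**k * (1 - error_rate) ** (read_len - k)
        for k in range(max_mismatches + 1)
    )
    return 1.0 - p_within_limit

# Compare a 36 bp and a 100 bp read under the same assumed settings.
for length in (36, 100):
    p = prob_exceeds_mismatch_limit(length, error_rate=0.01, max_mismatches=2)
    print(f"{length} bp: P(more than 2 errors) = {p:.4f}")
```

With a fixed mismatch limit, the longer read is the one more likely to go over it, which is the effect described above.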

As a result, it is possible for longer reads to have a lower mapping rate than shorter ones (of course, reads that are too short have higher mapping ambiguity). A simple method for making samples comparable is to trim the reads as described.
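A rough sketch of what that trimming or dividing step could look like, in plain Python. This is not the paper's pipeline or an mrFAST feature: the function names, the choice to drop a leftover tail shorter than 36 bp, and the `_chunkN/1`, `_chunkN/2` naming scheme for keeping track of mates are all assumptions made for illustration.

```python
def split_read(seq, qual, size=36):
    """Cut a read into non-overlapping chunks of exactly `size` bases.

    Any leftover tail shorter than `size` is dropped (an assumption).
    Reads shorter than `size` therefore yield no chunks.
    """
    n_chunks = len(seq) // size
    return [
        (seq[i * size:(i + 1) * size], qual[i * size:(i + 1) * size])
        for i in range(n_chunks)
    ]

def split_pair(name, mate1, mate2, size=36):
    """Split both mates of a pair and tag every chunk with its origin.

    `mate1` and `mate2` are (sequence, quality) tuples. Chunk names follow
    a hypothetical scheme, `<name>_chunk<index>/<mate>`, so the original
    pair can be recovered from the name after mapping.
    """
    records = []
    for mate_no, (seq, qual) in enumerate((mate1, mate2), start=1):
        for idx, (s, q) in enumerate(split_read(seq, qual, size)):
            records.append((f"{name}_chunk{idx}/{mate_no}", s, q))
    return records

# Example: a 100 bp pair gives two 36 bp chunks per mate (28 bp discarded).
mate = ("ACGT" * 25, "I" * 100)
for rec_name, rec_seq, _ in split_pair("read1", mate, mate):
    print(rec_name, len(rec_seq))
```

Truncating instead of dividing is the same idea with only the first chunk kept. Whether a given aligner treats chunks named this way as a proper pair depends on the tool, so check its documentation before relying on it.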
