In "Diversity of Human Copy Number Variation and Multicopy Genes", before mapping reads to the reference genome, there is a step in the read preprocessing pipeline:
"All reads exceeding 36 base pairs (bp) in length were truncated to 36 bp, or
divided into their constituent nonoverlapping 36-bp sequences to eliminate potential mapping biases
between genomes sequenced at different read lengths."
Is this necessary, and why 36 bp? If I have a dataset in which most reads across all samples are about 95-100 bp after QC, can I just trim all reads to a uniform length such as 95 bp? And if I use mrFAST and split longer reads into 36-bp pieces, which tools can help me keep paired-end reads properly paired after splitting?
Or is this step only necessary because of a limitation of the alignment tool, mrFAST? BWA handles reads of different lengths dynamically, so it shouldn't have this problem, right?
Any advice will be helpful. Thanks.
Do they have 50bp SOLiD reads? Maybe they just wanted to make sure they've gotten rid of all the adapters, so they chose a fixed restrictive length of 36 for everything.
From a quick glance at the abstract of that paper, I'm guessing that they wanted to be able to directly compare the results across many samples that were sequenced with different read lengths. Under typical circumstances, there shouldn't be any reason to split your reads up. In fact, longer reads allow you to map into repetitive regions that shorter reads can't access. This enhances your ability to detect CNV in these potentially unstable regions.
Thanks, agreed. On the other hand, if reads of different lengths from different samples (e.g., 36, 76, 100, 120 bp) were mapped and used to call read depth, would the predicted CNV results have any bias other than the potential mapping biases?
I'm not sure exactly what you mean by that. The read depth in repetitive regions is going to be affected by the length of the reads. If this isn't explicitly corrected for, you might inadvertently call CNVs that don't exist. In your case, where all reads are 95-100 bp, I wouldn't worry about that slight difference, but I would still choose an algorithm that does explicit correction for mappability.
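To make that concrete, here is a rough sketch of what a mappability correction looks like (my own toy example, not the routine any particular caller uses; the bin counts and mappability fractions are made up):

```python
# Toy sketch: correct per-bin read depth by a precomputed mappability track
# before CNV calling. Inputs are hypothetical - in practice the counts come
# from your BAMs and the mappability fractions from a track matching your
# read length.

def correct_depth(bin_counts, mappability, min_mappability=0.5):
    """Divide raw per-bin read counts by the fraction of uniquely
    mappable positions in each bin; mask bins that are mostly unmappable."""
    corrected = []
    for count, mapp in zip(bin_counts, mappability):
        if mapp < min_mappability:
            corrected.append(None)          # too repetitive to trust
        else:
            corrected.append(count / mapp)  # scale up depth lost to multi-mapping
    return corrected

# A repetitive bin (mappability 0.4) is masked rather than mistaken for a
# deletion; a half-mappable bin has its depth rescaled.
print(correct_depth([100, 60, 40], [1.0, 0.5, 0.4]))
# [100.0, 120.0, None]
```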
Another possible consideration: certain short read mappers can accept only a small number of mismatches to the reference before they fail to map the read. Longer reads have a higher probability of accruing mismatches for a given error rate... and for some technologies the error rate increases with read length.
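As a quick back-of-the-envelope illustration (a toy binomial model with made-up numbers, not a description of any specific aligner):

```python
from math import comb

def p_exceeds_mismatch_cap(read_len, per_base_error, max_mismatches):
    """Probability that a read accrues more than `max_mismatches` errors,
    assuming independent per-base errors (binomial model)."""
    p_within = sum(
        comb(read_len, k) * per_base_error**k * (1 - per_base_error)**(read_len - k)
        for k in range(max_mismatches + 1)
    )
    return 1 - p_within

# Compare a 36 bp read and a 100 bp read at a 1% per-base error rate
# with a cap of 2 mismatches.
for read_len in (36, 100):
    print(read_len, round(p_exceeds_mismatch_cap(read_len, 0.01, 2), 3))
```

With those numbers, roughly 0.6% of 36 bp reads exceed the cap versus about 8% of 100 bp reads, so the longer reads lose measurably more alignments even though each base is equally accurate.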
It is possible for longer reads to have a lower mapping rate than shorter ones (of course, reads that are too short have higher mapping ambiguity). A simple method for making samples comparable is to trim the reads as described.
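If you do decide to go the mrFAST/36-bp route, you don't need a special tool to keep the mates in sync; something as simple as the sketch below, run on R1 and R2 separately, keeps the chunk names parallel. This is my own toy script, not from the paper (and I believe Trimmomatic's CROP step can handle the simpler fixed-length truncation):

```python
# Toy sketch: split each FASTQ read into non-overlapping 36 bp chunks,
# appending the same chunk index to the read name. Provided R1 and R2 have
# the same read length, running this on each file separately produces
# chunked files with the same number of records in the same order, so the
# mates stay in sync.

def split_fastq(in_path, out_path, chunk=36):
    with open(in_path) as fin, open(out_path, "w") as fout:
        while True:
            header = fin.readline().rstrip()
            if not header:
                break                       # end of file
            seq = fin.readline().rstrip()
            fin.readline()                  # '+' separator line
            qual = fin.readline().rstrip()
            name = header.split()[0]        # drop any comment after the read name
            n_chunks = len(seq) // chunk    # discard the incomplete tail
            for i in range(n_chunks):
                s, e = i * chunk, (i + 1) * chunk
                fout.write(f"{name}_part{i}\n{seq[s:e]}\n+\n{qual[s:e]}\n")

# split_fastq("sample_R1.fastq", "sample_R1.36bp.fastq")
# split_fastq("sample_R2.fastq", "sample_R2.36bp.fastq")
```

Also, as far as I know, pure read-depth CNV calling doesn't actually use the pairing information, so many pipelines simply map the chunks as single-end reads.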
What is it that you want to do with the mapped reads?
To predict copy number variation with read-depth-based methods.