Does any one know whether a assembler can get the accurate assembly of the two ends of the sequence?
Thanks!
Sorry for not making myself clear. Actually, I want to know how the assemblers deal with the end of sequence. How do they know they have reached the end of the sequence while doing assembly? Can they give out accurate end of the sequence?
Why would you need an assembler if you are referring to stitching together R1/R2 reads from a fragment, if they overlap in the middle. bbmerge.sh from BBMap suite or FLASH are appropriate tools to do that read merge.
If you are referring to assembling two ends of a contig to get a closed circle then that is a different question. Please add some additional explanation by editing this post.
I edited this post. Actually, I sequenced the whole genome of a virus. I want to know whether de novo assembly can give out accurate sequence at the end of the genome.
Tadpole seems to do a very good job of viral assembly, so I recommend it for that purpose. Your post's title is misleading, though:
"Assembly of two ends of the sequence"
As far as I can tell, this has nothing to do with your goal, so I suggest you change it. Tadpole can typically generate a good viral assembly from shotgun reads, but your title is completely unrelated to this task.
Edit - I misinterpreted your question. It appears that you want to assemble the outermost ends of your genome... I interpreted as being related to read pairs. Anyway, Tadpole has no problem assembling the ends of viral contigs, as long as they are concordant.
While the ends part is true if OP is willing to try tadpole then it would not hurt to do a second assembly with BBtools. Who knows OP may get a better assembly.
Read* ReadPair::fastMerge(){
Read* rcRight = mRight->reverseComplement();
int len1 = mLeft->length();
int len2 = rcRight->length();
// use the pointer directly for speed
const char* str1 = mLeft->mSeq.mStr.c_str();
const char* str2 = rcRight->mSeq.mStr.c_str();
const char* qual1 = mLeft->mQuality.c_str();
const char* qual2 = rcRight->mQuality.c_str();
// we require at least 30 bp overlapping to merge a pair
const int MIN_OVERLAP = 30;
bool overlapped =false;
int olen = MIN_OVERLAP;
int diff= 0;
// the diff count for 1 high qual + 1 low qual
int lowQualDiff = 0;
while(olen <= min(len1, len2)){diff= 0;
lowQualDiff = 0;
bool ok =true;
int offset = len1 - olen;
for(int i=0;i<olen;i++){
if(str1[offset+i]!= str2[i]){
diff++;
// one is >= Q30 and the other is <= Q15
if((qual1[offset+i]>='?'&& qual2[i]<='0')||(qual1[offset+i]<='0'&& qual2[i]>='?')){
lowQualDiff++;}
// we disallow high quality diff, and only allow up to 3 low qual diff
if(diff>lowQualDiff || lowQualDiff>=3){
ok =false;break;}}}
if(ok){
overlapped =true;break;}
olen++;}
if(overlapped){
int offset = len1 - olen;
int mergedLen = offset + len2;
stringstream ss;
ss << mLeft->mName <<" merged offset:"<< offset <<" overlap:"<< olen <<" diff:"<<diff;
string mergedName = ss.str();
string mergedSeq = mLeft->mSeq.mStr.substr(0, offset) + rcRight->mSeq.mStr;
string mergedQual = mLeft->mQuality.substr(0, offset) + rcRight->mQuality;
// quality adjuction and correction for low qual diff
for(int i=0;i<olen;i++){
if(str1[offset+i]!= str2[i]){
if(qual1[offset+i]>='?'&& qual2[i]<='0'){
mergedSeq[offset+i]= str1[offset+i];
mergedQual[offset+i]= qual1[offset+i];}else{
mergedSeq[offset+i]= str2[i];
mergedQual[offset+i]= qual2[i];}}else{
// add the quality of the pair to make a high qual
mergedQual[offset+i]= qual1[offset+i] + qual2[i] - 33;}}
delete rcRight;return new Read(mergedName, mergedSeq, "+", mergedQual);}
delete rcRight;return NULL;}
BTW, I was wondering whether anyone can tell me how the assemblers deal with the end of sequence. How do they know they have reached the end of the sequence and can they give out accurate end sequence?
Have you benchmarked this against other read-merging programs, to see how well it performs, in terms of true-positive and false-positive merges? Or, in terms of speed? From reading the code, it does not appear that it would perform very well by either metric...
Why would you need an assembler if you are referring to stitching together R1/R2 reads from a fragment, if they overlap in the middle.
bbmerge.sh
from BBMap suite or FLASH are appropriate tools to do that read merge.If you are referring to assembling two ends of a contig to get a closed circle then that is a different question. Please add some additional explanation by editing this post.
I edited this post. Actually, I sequenced the whole genome of a virus. I want to know whether de novo assembly can give out accurate sequence at the end of the genome.
Besides, the two ends of the genome have relative low coverage than the middle part.
You may want to take a look at
tadpole.sh
from BBMap suite for doing the assembly (denovo-assembly with novel genes inserted ).Tagging: Brian Bushnell
Brian can provide additional insight.
Tadpole seems to do a very good job of viral assembly, so I recommend it for that purpose. Your post's title is misleading, though:
"Assembly of two ends of the sequence"
As far as I can tell, this has nothing to do with your goal, so I suggest you change it. Tadpole can typically generate a good viral assembly from shotgun reads, but your title is completely unrelated to this task.
Edit - I misinterpreted your question. It appears that you want to assemble the outermost ends of your genome... I interpreted as being related to read pairs. Anyway, Tadpole has no problem assembling the ends of viral contigs, as long as they are concordant.
While the ends part is true if OP is willing to try
tadpole
then it would not hurt to do a second assembly with BBtools. Who knows OP may get a better assembly.please provide more details regarding your data.