Entering edit mode
7.0 years ago
bioplanet
▴
60
Hi,
I was using bbmerge with standard parameters to join paired-end sequencing files and it was working very well, e.g. an example:
BBMerge version 37.66
Total time: 17.237 seconds.
Pairs: 2478660
Joined: 2272340 91.676%
Ambiguous: 197355 7.962%
No Solution: 8965 0.362%
Too Short: 0 0.000%
Avg Insert: 427.5
Standard Deviation: 8.0
Mode: 428
Insert range: 40 - 481
90th percentile: 428
75th percentile: 428
50th percentile: 428
25th percentile: 428
10th percentile: 428
Got some new data now to analyze and, unfortunately, the overall quality was not that good (definitely worse than the first round). I did the trimming but still it did not improve much...Anyway, I was asked to proceed anyway and I need to merge the reads. I tried the same thing, but look at the result:
BBMerge version 37.66
Total time: 2.011 seconds.
Pairs: 320821
Joined: 472 0.147%
Ambiguous: 320122 99.782%
No Solution: 227 0.071%
Too Short: 0 0.000%
Avg Insert: 424.0
Standard Deviation: 25.0
Mode: 427
Insert range: 110 - 428
90th percentile: 427
75th percentile: 427
50th percentile: 427
25th percentile: 427
10th percentile: 427
As you can see, now bbmerge considers almost all of my reads "ambiguous" and does not join them! Is there a way around it? Maybe by lowering a threshold or something?
Merge first and then trim. That is @Brian's normal recommendation.
That said, if the reads don't overlap (i.e. the inserts are longer than number of sequencing cycles) then merging will not work.
Thanks genomax (also upvoted some previous posts). The thing is that even if I try to merge before trimming, they still remain "ambiguous" and discarded. As far as I can tell (don't know if this is correct), merging is basically done by comparing the quality strings on each pair.
For example, the pair below was deemed ambiguous:
You should look at the
bbmap/docs/guides/BBMergeGuide.txt
included in the distribution.Is this NextSeq data? Poly-G's in read 2 may be fantom basecalls.
Amplicon-seq data, it is the same construct being used (~430 nts long).
Another example:
Hi bioplanet, the first one does not merge in leeHom but the second does:
Even then, it seems really weird. It seems that there is an overlap but it is hard to see visually, and blasting it returns cloning vector.
So is this then a software problem? Because it seemed it was working so nicely in the first round of experiments (where I could merge 98% or so of them), so I did not give it too much thought... Is leeHom better then? (I mean, of course for a newly introduced person to genomics like me, I can't really tell, in the paper of BBmerge it said it outperforms everyone else...
can you check where you think the second example could merge? because I do not really see it
If you have bad data going in then no matter what tool you use you are not going to be able to salvage it. Since you say that these are amplicons have you tried to do a multiple sequence alignment with the reference?
Agree with @Gabriel. The read in the example above looks like it is part of some vector. So perhaps not real data?
@bioplanet could you try running your data through leeHom and check how it performs?
Gabriel R. : Why is your merged read smaller than the two original reads?
the original reads are 250bp and "merged" is 426 for an overlap of 74 in the middle. Even then, this overlap seems sketchy.
Hmm. Merged read you posted above is only 206 bp long. Can you check?
now? AACCACAACTAGAATGCAGTGAAAAAAATGCTTTATTTGTGAAATTTGTGATGCTATTGCTTTATTTGTAACCATTATAAGCTGCGATAAACAAGTTTAGTTACGCTAGTTTCGCGTACGAAGCCCTGCAGGCTACTTGGCGTCCCTTCTCGGTGCTTTGTACGCCGAAAAATGGCCAAACACCACGGCACGCCCCGGGGAAAGCCACTTGTGAGTACGAGGCCCGAGATATCGTTATGCCCCAACAGGAGCTATTCACCCGCGGAGGACAATACTTCAAGGCGGTTCGAAGCAACCCGCTCCCGCAAACCTCACTCTTAATCAATTCACTCCTTGTACAGCTCGTCCATGCCGCCGGTGCAGTGGCGGCCCTCGGCGCGTTCGTACTGCTCCACGACGGTCTAGTCCTCGTCGTGCCAGGTGATGT
I see the overlap when doing a MSA with your merged read and the two originals. Perhaps your tool is able to do the merge because it is meant for ancient DNA. The overlap is not great though. Considering this is amplicon data such gymnastics should not be needed for the overlap.
Many thanks to all of you! I will try leeHom and maybe FLASH (just saw this) and see how it looks, will post again!
If BBMerge is not merging these reads then I would be skeptical if other packages work (leeHom is a special case since it is allowing for multiple mismatches in short stretches to account for ancient/fragmented DNA). You should very carefully check your results.
Hello everyone!
So, I also checked FLASH and still I cannot get overlap; so I am guessing it was not a problem of BBmerge, rather there is some problem with the data itself. I will treat them seperately for time being and maybe the next round of experiments will be more informative.
Many thanks to all!
Please validate comments/answers on your past questions. Otherwise we have no way to know if the solutions suggested worked. Up-voting/accepting answers (green check mark) is the way to do this.
out of curiosity, could you post the data somewhere?