Question

bbmerge not joining paired-end reads

1

Entering edit mode

7.1 years ago

bioplanet ▴ 60

Hi,

I was using bbmerge with standard parameters to join paired-end sequencing files and it was working very well, e.g. an example:

BBMerge version 37.66
Total time: 17.237 seconds.

 Pairs:                 2478660
 Joined:                2272340     91.676%
 Ambiguous:             197355      7.962%
 No Solution:           8965        0.362%
 Too Short:             0           0.000%

 Avg Insert:            427.5
 Standard Deviation:    8.0
 Mode:                  428

 Insert range:          40 - 481
 90th percentile:       428
 75th percentile:       428
 50th percentile:       428
 25th percentile:       428
 10th percentile:       428

Got some new data now to analyze and, unfortunately, the overall quality was not that good (definitely worse than the first round). I did the trimming but still it did not improve much...Anyway, I was asked to proceed anyway and I need to merge the reads. I tried the same thing, but look at the result:

BBMerge version 37.66
Total time: 2.011 seconds.

 Pairs:                 320821
 Joined:                472         0.147%
 Ambiguous:             320122      99.782%
 No Solution:           227         0.071%
 Too Short:             0           0.000%

 Avg Insert:            424.0
 Standard Deviation:    25.0
 Mode:                  427

 Insert range:          110 - 428
 90th percentile:       427
 75th percentile:       427
 50th percentile:       427
 25th percentile:       427
 10th percentile:       427

As you can see, now bbmerge considers almost all of my reads "ambiguous" and does not join them! Is there a way around it? Maybe by lowering a threshold or something?

bbmerge alignment • 5.8k views

ADD COMMENT • link 7.1 years ago by bioplanet ▴ 60

2

Entering edit mode

Merge first and then trim. That is @Brian's normal recommendation.

That said, if the reads don't overlap (i.e. the inserts are longer than number of sequencing cycles) then merging will not work.

ADD REPLY • link 7.1 years ago by GenoMax 148k

0

Entering edit mode

Thanks genomax (also upvoted some previous posts). The thing is that even if I try to merge before trimming, they still remain "ambiguous" and discarded. As far as I can tell (don't know if this is correct), merging is basically done by comparing the quality strings on each pair.

For example, the pair below was deemed ambiguous:

@M00316:13:000000000-BDR9L:1:1101:26221:16273 1:N:0:CATGCTTA
AGCCACAACTAGAATGCAGTGAAAAGGGTGCTTTATTTGTGAAATTTGTGATGCTATTGCTTTATTTGTAACCATTATAAGCTGCAATAAACAAGTTTAGTTACGCTAGGTTCGCGTACGAAGCCCTGCAGGCTACTTGGCGTGCGTTCTCGGTGCTTGGTGCGCCGAGAAGTGGCCAGGAACGACGGCACGCGCCGGGGAAGGCCACTTGTGGGGACGGGGCCAGAGATAGCGTTGTGAGCCAACAGGG
+
>1>A?CFA1CFB113BABB1BA3310000BBBB1DAFBD2BD2DDDDD2BBDD11DD2D1BAD1DD2DD22DA1AD2BB2D11BB01B1111B/B1DBDDDDDBB>BBB0BB1>>>>>/>?/?>0<BB?0BB?0BB11BB0>?/</???FG?<?//<11??/1>><>--<./==<../.../..;-==9-;=-:--;;@@;--;--;-//;:;-;-9-;9;;;-:;;;9/://;;-9-;////;-9;;;-

@M00316:13:000000000-BDR9L:1:1101:26221:16273 2:N:0:CATGCTTA
ACATCACGTCGCACGAGGAGGGCTAGAGCGTCGTGGAGCGGTACGAACGCGCCGGGGGCGGCGGCTCGACCGGCGGGGTGGACGGGCTGTACAAGGGGTGGATTGGGTGAGAGTGAGGTGTGCGGGAGCGGGTGGGTTCGAGCCGGGTCGCGCGATTGTGGTGGGGGGGGTAGTAGCTGCTGTTGGGTGGGGACGATGGGGGGGGCGTGGGGCTGGCGAGTGGGTGTGGCGGGGGGGGGGGGGGGGTGGT
+
>>1>1@@111>>110A0A0AA00000111//BAABB//0/AA/>B//>>>>/>>/>>E//>/---=;:-:-=;-=A--9;;-/;-;;9;;////;-:-;;9;//---;--;//;;/;;/-;;/-;;-;-;;99-;----;99--;-;;999--9-9-;//-;--;-;;9--:-99;////;;9/;;@-;-----;----9---;@----;9--------;9;;;--99:-9;@-;-;9;--;-;;--9--

ADD REPLY • link 7.1 years ago by bioplanet ▴ 60

1

Entering edit mode

You should look at the bbmap/docs/guides/BBMergeGuide.txt included in the distribution.

Is this NextSeq data? Poly-G's in read 2 may be fantom basecalls.

ADD REPLY • link 7.1 years ago by GenoMax 148k

0

Entering edit mode

Amplicon-seq data, it is the same construct being used (~430 nts long).

Another example:

@M00316:13:000000000-BDR9L:1:1101:22760:19050 1:N:0:CATGCTTA
AACCACAACTAGAATGCAGTGAAAAAAATGCTTTATTTGTGAAATTTGTGATGCTATTGCTTTATTTGTAACCATTATAAGCTGCGATAAACAAGTTTAGTTACGCTAGTTTCGCGTACGAAGCCCTGCAGGCTACTTGGCGTCCCTTCTCGGTGCTTTGTACGCCGAAAAATGGCCAAAAACTACGGCACGCTCCGGGGAAAGCGACTTGTGAGTACGAGGCCAGAGATGTCGTTATGACCCAACAGGA
+
A1AAAAA?AFFB1A3BBBB1BBD31B0A0BDBAD1D22D2BADADB1DDBADDDB2DDDDBADDFGHDDA2211BDDGGHD11BB/>B//11@/B1BBBB1BBBBB???BBBB??>/>/>???/B/BB<<BB?B1BB1BB@>?/</<<FGHGGA/>FFF>>>>><>><---..==///::.:::FC==;99;;9..==AG;;-;;-;;;;;/;9;9:9;;;@-9-;;;;;/;/;;;9;;9/9---;9;;-
@M00316:13:000000000-BDR9L:1:1101:22760:19050 2:N:0:CATGCTTA
ACATCACCTGGCACGACGAGGACTAGACCGTCGTGGAGCAGTACGAACGCGCCGAGGGCCGCCACTGCACCGGCGGCATGGACGAGCTGTACAAGGAGTGAATTGATTAAGAGTGAGGTTTGCGGGAGCGGGTTGCTTCGAACCGCCTTGAAGTATTGTCCTCCGCGGGTGAATAGCTCCTGTTGGGGGGTAACGATATGTGGGGCCTCGTACTCGCGAGTGGGTTGGCCCGGGGCGGGCGGTGGGGTTG
+
1>>>>B@111111111AAAAA00B011A1/ABAABB//A/BDB1B//AAA>>>>>>>E??>>///111B10>>>>B//<???1>->-<><1000=.<./=000=00<0:;=;0;=0==/::=0-==9=.=A-9;;;999;9-9-;-;9;;;///;//;//////;:;--;-;//;;/9/9/;/9;;@99-;-/;;-:--////;;--9-;9-9-/----;-;;9-9--/;-;;;;-;9;;9;9;;-;;--

ADD REPLY • link 7.1 years ago by bioplanet ▴ 60

1

Entering edit mode

Hi bioplanet, the first one does not merge in leeHom but the second does:

@M00316:13:000000000-BDR9L:1:1101:22760:19050
AACCACAACTAGAATGCAGTGAAAAAAATGCTTTATTTGTGAAATTTGTGATGCTATTGCTTTATTTGTAACCATTATAAGCTGCGATAAACAAGTTTAGTTACGCTAGTTTCGCGTACGAAGCCCTGCAGGCTACTTGGCGTCCCTTCTCGGTGCTTTGTACGCCGAAAAATGGCCAAACACCACGGCACGCCCCGGGGAAAGC
+
A1AAAAAAFFB1A3BBBB1BBD31B0A0BDBAD1D22D2BADADB1DDBADDDB2DDDDBADDFGHDDA2211BDDGGHD11BB/>B//11@/B1BBBB1BBBBB???BBBB??>/>/>???/B/BB<<BB?B1BB1BB@>?/</<<FGHGGA/>FFF>>>>><>><---..==//%JY..Y%]]&\Z$XKZ%MMM\T;/X=%Z

Even then, it seems really weird. It seems that there is an overlap but it is hard to see visually, and blasting it returns cloning vector.

ADD REPLY • link 7.1 years ago by Gabriel R. ★ 2.9k

0

Entering edit mode

So is this then a software problem? Because it seemed it was working so nicely in the first round of experiments (where I could merge 98% or so of them), so I did not give it too much thought... Is leeHom better then? (I mean, of course for a newly introduced person to genomics like me, I can't really tell, in the paper of BBmerge it said it outperforms everyone else...

ADD REPLY • link 7.1 years ago by bioplanet ▴ 60

0

Entering edit mode

can you check where you think the second example could merge? because I do not really see it

ADD REPLY • link 7.1 years ago by Gabriel R. ★ 2.9k

0

Entering edit mode

If you have bad data going in then no matter what tool you use you are not going to be able to salvage it. Since you say that these are amplicons have you tried to do a multiple sequence alignment with the reference?

Agree with @Gabriel. The read in the example above looks like it is part of some vector. So perhaps not real data?

ADD REPLY • link 7.1 years ago by GenoMax 148k

0

Entering edit mode

@bioplanet could you try running your data through leeHom and check how it performs?

ADD REPLY • link 7.1 years ago by Gabriel R. ★ 2.9k

0

Entering edit mode

Gabriel R. : Why is your merged read smaller than the two original reads?

ADD REPLY • link 7.1 years ago by GenoMax 148k

1

Entering edit mode

the original reads are 250bp and "merged" is 426 for an overlap of 74 in the middle. Even then, this overlap seems sketchy.

ADD REPLY • link 7.1 years ago by Gabriel R. ★ 2.9k

0

Entering edit mode

Hmm. Merged read you posted above is only 206 bp long. Can you check?

ADD REPLY • link 7.1 years ago by GenoMax 148k

0

Entering edit mode

now? AACCACAACTAGAATGCAGTGAAAAAAATGCTTTATTTGTGAAATTTGTGATGCTATTGCTTTATTTGTAACCATTATAAGCTGCGATAAACAAGTTTAGTTACGCTAGTTTCGCGTACGAAGCCCTGCAGGCTACTTGGCGTCCCTTCTCGGTGCTTTGTACGCCGAAAAATGGCCAAACACCACGGCACGCCCCGGGGAAAGCCACTTGTGAGTACGAGGCCCGAGATATCGTTATGCCCCAACAGGAGCTATTCACCCGCGGAGGACAATACTTCAAGGCGGTTCGAAGCAACCCGCTCCCGCAAACCTCACTCTTAATCAATTCACTCCTTGTACAGCTCGTCCATGCCGCCGGTGCAGTGGCGGCCCTCGGCGCGTTCGTACTGCTCCACGACGGTCTAGTCCTCGTCGTGCCAGGTGATGT

ADD REPLY • link 7.1 years ago by Gabriel R. ★ 2.9k

2

Entering edit mode

I see the overlap when doing a MSA with your merged read and the two originals. Perhaps your tool is able to do the merge because it is meant for ancient DNA. The overlap is not great though. Considering this is amplicon data such gymnastics should not be needed for the overlap.

ADD REPLY • link 7.1 years ago by GenoMax 148k

1

Entering edit mode

Many thanks to all of you! I will try leeHom and maybe FLASH (just saw this) and see how it looks, will post again!

ADD REPLY • link 7.1 years ago by bioplanet ▴ 60

2

Entering edit mode

If BBMerge is not merging these reads then I would be skeptical if other packages work (leeHom is a special case since it is allowing for multiple mismatches in short stretches to account for ancient/fragmented DNA). You should very carefully check your results.

ADD REPLY • link 7.1 years ago by GenoMax 148k

0

Entering edit mode

Hello everyone!

So, I also checked FLASH and still I cannot get overlap; so I am guessing it was not a problem of BBmerge, rather there is some problem with the data itself. I will treat them seperately for time being and maybe the next round of experiments will be more informative.

Many thanks to all!

ADD REPLY • link 7.1 years ago by bioplanet ▴ 60

2

Entering edit mode

Please validate comments/answers on your past questions. Otherwise we have no way to know if the solutions suggested worked. Up-voting/accepting answers (green check mark) is the way to do this.