BBMap and Samtools show different values for mean inner distance between mate pairs
0
0
Entering edit mode
5.6 years ago
hopedwall • 0

I'm working on a project that uses paired-end samples where the inner distance between mate pairs is not provided, so I tried using BBMap (as suggested in other posts) in order to get an estimate. However, when I tried using samtools like this samtools stats output.sam | grep "insert size average" I got a different result. Am I doing something wrong? Thanks!

rna-seq next-gen • 1.5k views
ADD COMMENT
0
Entering edit mode

Could you please show us the result of both tools?

ADD REPLY
0
Entering edit mode

Sure, here it is.

BBMap output:

Pairing data:           pct pairs       num pairs       pct bases          num bases

mated pairs:             47.9200%            9584        47.9200%            1916800
bad pairs:               51.3550%           10271        51.3550%            2054200
insert size avg:          590.76


Read 1 data:            pct reads       num reads       pct bases          num bases

mapped:                  99.9300%           19986        99.9300%            1998600
unambiguous:              1.9000%             380         1.9000%              38000
ambiguous:               98.0300%           19606        98.0300%            1960600
low-Q discards:           0.0000%               0         0.0000%                  0

perfect best site:       74.4050%           14881        74.4050%            1488100
semiperfect site:        74.7200%           14944        74.7200%            1494400
rescued:                  2.0550%             411

Match Rate:                   NA               NA         6.0331%            1990950
Error Rate:              25.5429%            5105        93.9669%           31009376
Sub Rate:                 6.6296%            1325         0.0190%               6264
Del Rate:                20.8246%            4162        93.9437%           31001726
Ins Rate:                 1.6211%             324         0.0042%               1386
N Rate:                   0.0000%               0         0.0000%                  0


Read 2 data:            pct reads       num reads       pct bases          num bases

mapped:                  99.3450%           19869        99.3450%            1986900
unambiguous:              2.0800%             416         2.0800%              41600
ambiguous:               97.2650%           19453        97.2650%            1945300
low-Q discards:           0.0000%               0         0.0000%                  0

perfect best site:       69.2500%           13850        69.2500%            1385000
semiperfect site:        69.2500%           13850        69.2500%            1385000
rescued:                  4.6350%             927

Match Rate:                   NA               NA         5.0753%            1967294
Error Rate:              30.2934%            6019        94.9247%           36794728
Sub Rate:                12.4868%            2481         0.0463%              17952
Del Rate:                21.6267%            4297        94.8741%           36775122
Ins Rate:                 1.7615%             350         0.0043%               1654
N Rate:                   0.0000%               0         0.0000%                  0

Total time:             42.471 seconds.

Samtools output:

SN      insert size average:    2113.6
ADD REPLY
0
Entering edit mode

Hello again,

please do not post screenshots to show us the output of a program or any file content. Instead copy&paste the text and use the formatting bar (especially the code option) to present your post better.

code_formatting

Thanks!

fin swimmer

ADD REPLY
0
Entering edit mode

Sorry about that, I'll edit asap! [edit: done]

ADD REPLY
0
Entering edit mode

What command line did you use for BBMap? Looks like you are using a small subset of reads for this estmate? Since you are using actual alignment information statistics reported by BBMap should be accurate.

That said following statistic is not very good.

bad pairs:               51.3550%

Something seems to be off with your data. Are these data trimmed as pair?

You can also create a histogram of the insert sizes using

ihist=<file>            Write histogram of insert sizes (for paired reads).
ADD REPLY
0
Entering edit mode

I used BBMap in this way:

bbmap.sh in=read_1.fa in2=read_2.fa ref=genome.fa out=bbout.sam

I'm using simulated reads, generated from a single gene. Could this be the issue? What surprises me is that the value returned by samtools is completely different from the one returned by BBMap. I need the average distance between pairs in order to do some further analysis, but I'm not so sure on which one to choose.

ADD REPLY
0
Entering edit mode

I'm using simulated reads, generated from a single gene. Could this be the issue?

Possibly. How did you generate the simulated reads? Were they generated from that single gene sequence? Is genome.fa real full genome? It is possible that some reads generated from single gene may be mapping to places they did not originate from which may be leading to that strange number from samtools.

I need the average distance between pairs in order to do some further analysis, but I'm not so sure on which one to choose.

Value BBMap is giving you should be from actual alignments. I am not sure how Samtools calculates the value it is reporting.

ADD REPLY
0
Entering edit mode

I'm using this repository to generate rna-seq paired-end samples, from gene ENSG00000280145, human chromosome 21. Since I'm running BBMap locally on my machine, I needed shorter reads.

ADD REPLY
0
Entering edit mode

BBTools has a tool called randomreads.sh that can simulate data. You can give that a try. I would suggest simulating reads from entire chromosome 21 to be a bit more realistic about what the real data will look like and to align simulated reads to that chromosome. randomreads.sh will allow you to generate reads that fit a profile (and even an insert size range).

What are you ultimately planning to do with average insert sizes?

ADD REPLY
0
Entering edit mode

Thanks, I'll give it a try! I'm working on Alternative Splicing event detection from paired-end samples and figured the average insert size might be useful.

ADD REPLY

Login before adding your answer.

Traffic: 1671 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6