Hello,
I ran a demultiplexing analysis on a PacBio sequencing data file. It returned 20 BAM files, one for each of 20 bacteria. The analysis also produced this file:
IdxFirst  IdxCombined  IdxFirstNamed  IdxCombinedNamed  Counts  MeanScore
6         6            bc1008         bc1008            62939   73
20        20           bc1023         bc1023            51303   70
21        21           bc1024         bc1024            62978   69
22        22           bc1026         bc1026            48417   70
23        23           bc1027         bc1027            17737   70
24        24           bc1028         bc1028            34801   71
25        25           bc1029         bc1029            38043   67
27        27           bc1031         bc1031            113230  69
....
For example, for the first bacterium, bc1008, it found 62939, which I take to correspond to 62939 contigs.
Then I converted the BAM files to FASTA. I ran gt seqstat from the GenomeTools library on each file to get more statistics (N50, mean size...). For the first file, corresponding to the first bacterium (bc1008), I get:
# number of contigs: 222576
# total contigs length: 2071178900
# mean contig size: 9305.49
# contig size first quartile: 6629
# median contig size: 8811
# contig size third quartile: 11619
# longest contig: 113113
# shortest contig: 51
# contigs > 500 nt: 217752 (97.83 %)
# contigs > 1K nt: 214604 (96.42 %)
# contigs > 10K nt: 84423 (37.93 %)
# contigs > 100K nt: 6 (0.00 %)
# contigs > 1M nt: 0 (0.00 %)
# N50: 10722
# L50: 70163
# N80: 7754
# L80: 138032
It finds 222576 contigs, which is completely different from the number of contigs found by the demultiplexing analysis. I can't figure out why...
Any suggestion?
Did you use the PacBio tools bam2fastx and lima for this conversion/demultiplexing? If not, I would recommend using those specific tools.
I used the samtools package to convert the BAM files to FASTA. I will try bam2fastx and let you know whether it works.
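
Whichever converter is used, a quick sanity check is to count the FASTA records and the distinct ZMWs (wells) they come from, and compare both numbers with the Counts column. The sketch below is only an illustration: it assumes the read names follow the usual PacBio pattern movie/zmw/start_end, which may not hold for every file, and the function name and demo read names are made up for the example.

```python
import io
import sys


def count_records_and_zmws(handle):
    """Return (number of FASTA records, number of distinct movie/ZMW pairs).

    Assumption: headers look like '>movie/zmw/start_end'. Headers with
    fewer than two '/'-separated fields are counted as records only.
    """
    n_records = 0
    zmws = set()
    for line in handle:
        if not line.startswith(">"):
            continue  # sequence line, not a header
        n_records += 1
        fields = line[1:].split()
        if not fields:
            continue  # header with no name
        parts = fields[0].split("/")
        if len(parts) >= 2:
            zmws.add((parts[0], parts[1]))
    return n_records, len(zmws)


if __name__ == "__main__":
    if len(sys.argv) > 1:
        # Usage: python count_zmws.py bc1008.fasta  (hypothetical file name)
        with open(sys.argv[1]) as fh:
            records, zmw_count = count_records_and_zmws(fh)
    else:
        # Tiny in-memory demo with made-up read names.
        demo = io.StringIO(
            ">m1/10/0_100\nACGT\n>m1/10/150_300\nACGT\n>m1/11/0_50\nAC\n"
        )
        records, zmw_count = count_records_and_zmws(demo)
    print("records:", records)
    print("distinct ZMWs:", zmw_count)
```

If the number of distinct ZMWs is close to the Counts value while the number of records is several times larger, the two tools are probably counting different things (reads per well versus wells) rather than disagreeing about the data.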