I am running BAMboozle to anonymize variant sequences using the GRCh37 human reference genome on my bam files. My bam files originally are 2-3 GB but when I get the output bam file from BAMboozle it is 500-600 Kb. Does BAMboozle decrease the size of the bam file. I would say no from what I have looked into it, so is the task not getting completed because the reference genome used is the incorrect one or is there another reason. Appreciate any help.
The tool will only keep primary alignments and will discard unaligned reads. Then it simplifies alignments to be exact matches so there will be a whole lot less data to store, in addition, simpler data compresses much better.
So the output ought to be substantially smaller, and the size reduction would depend on the specifics of the alignments. What percent were primary alignments, how many variants, and how many unaligned reads were there.
Run samtools flagstat before and after bamboozling.
First, thanks for your reply. I ran samtools flagstat before BAMboozle and got the following metrics:
41946670 + 0 in total (QC-passed reads + QC-failed reads)
6556652 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
41935169 + 0 mapped (99.97% : N/A)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (N/A : N/A)
0 + 0 with itself and mate mapped
0 + 0 singletons (N/A : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
After BAMboozle, the stats are the following
247042 + 0 in total (QC-passed reads + QC-failed reads)
220252 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
235541 + 0 mapped (95.34% : N/A)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (N/A : N/A)
0 + 0 with itself and mate mapped
0 + 0 singletons (N/A : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
This seems odd that the reads that passed QC go from 41 million to 247,042. This is also keeping unmapped reads so I imagine mapped reads are even less. Is this a normal reduction in output from BAMboozle?
BAMboozle is somehow bringing the reads down to a smaller number. You also have a large amount of secondary alignments for these reads. Do you have a link for this package?
Running you above command on my bam file before BAMboozle I get:
35119839
After BAMboozle, the command outputs:
3003
Its confusing because BAMboozle (link provided above in response to GenoMax) just requires the bam file and the reference genome fasta. I have tried GENCODE, NCBI and two other version of GRCh37 HUMAN fasta files and still the output bam file is KiB rather than GiB
I am curious about these two options in BAMboozle. Does this mean it will remove these two types of reads if you don't include those options. Can you try by adding one or both below?
--keepsecondary Keep secondary alignments in output bam file.
--keepunmapped Keep ummapped reads in output bam file.
How many original reads did you have in the fastq files?
yes, if we don't specify this option these reads are not kept. Also I figured out the problem. The fasta file has to be the exact match to the gtf reference genome file that I used to align the samples. So I used GRCh38.101 from ENSEMBL to align my files so I needed to use the fasta file of the version of the genome. I am so sorry for the mistake and am really appreciative of all your guys' help
First, thanks for your reply. I ran samtools flagstat before BAMboozle and got the following metrics:
After BAMboozle, the stats are the following
This seems odd that the reads that passed QC go from 41 million to 247,042. This is also keeping unmapped reads so I imagine mapped reads are even less. Is this a normal reduction in output from BAMboozle?
BAMboozle is somehow bringing the reads down to a smaller number. You also have a large amount of secondary alignments for these reads. Do you have a link for this package?
https://github.com/sandberg-lab/dataprivacy
there is no QC there, that's just a wording,
I don't think the output is correct. frankly, you might be running it incorrectly, or something else is amiss
you can easily check the alignments and see whether it keeps the alignments it should
for example, in my reading, your total number of alignments after bamboozling should be
the above command counts alignments after removing multi-mapped (
-q 1
) unmapped (-F 4
), secondary (-F 256
), and, supplementary (-F 2048
) alignmentsthen on the alignments that it keeps the tool is supposed to replace variants with exact matches the CIGAR should be all
150M
Running you above command on my bam file before BAMboozle I get:
35119839
After BAMboozle, the command outputs:
3003
Its confusing because BAMboozle (link provided above in response to GenoMax) just requires the bam file and the reference genome fasta. I have tried GENCODE, NCBI and two other version of GRCh37 HUMAN fasta files and still the output bam file is KiB rather than GiB
I am curious about these two options in BAMboozle. Does this mean it will remove these two types of reads if you don't include those options. Can you try by adding one or both below?
How many original reads did you have in the fastq files?
yes, if we don't specify this option these reads are not kept. Also I figured out the problem. The fasta file has to be the exact match to the gtf reference genome file that I used to align the samples. So I used GRCh38.101 from ENSEMBL to align my files so I needed to use the fasta file of the version of the genome. I am so sorry for the mistake and am really appreciative of all your guys' help