Hi,
I would like to subset a large bam file so that I can do the variant calling in parallel (using freebayes).
I have tried to subset the bamfile based on subsets of the reference sequence. First, I subset the fasta reference sequence into fasta files containing the equal amount of chromosomes. Then, I used the following command
samtools view file.bam -T subset.fasta -b > subset.bam
to subset the bam file so that it extracts only the reads that map to the chromosomes included in the respective reference subset:
When I view subset.bam in IGV it only shows the chromosomes that are included in the respective reference sequence which tells me that it worked. However, when I do the variant calling, I get an error saying that freebayes was unable to find a FASTA index entry for the chromosomes that don't belong to that reference subset. Obviously, they are not there because I removed them when I subset the reference.
Any suggestions on what might have gone wrong? If not, does anybody know a different way to subset a bam file into equal parts based on chromosome IDs?
might be due to the header present ?
I usually manually add the header to only include those of the sequences in the subset because indeed some software complains on the fact they see sequences being listed in the header for which there is no data in the bam/sam file
extract the headers from the subset bam file and inspect which ones are present .
Personally I loop over all sequences I want to have included in the subset and add data for each one by one. I'm not sure if the
-T
option is doing what you think it mightWhy not just use freebayes-parallel?
Check this github thread: https://github.com/ekg/freebayes/issues/465