How to subset bamfile based on chromosome IDs
1
0
Entering edit mode
5.1 years ago
mernster • 0

Hi,

I would like to subset a large bam file so that I can do the variant calling in parallel (using freebayes).

I have tried to subset the bamfile based on subsets of the reference sequence. First, I subset the fasta reference sequence into fasta files containing the equal amount of chromosomes. Then, I used the following command samtools view file.bam -T subset.fasta -b > subset.bam to subset the bam file so that it extracts only the reads that map to the chromosomes included in the respective reference subset:

When I view subset.bam in IGV it only shows the chromosomes that are included in the respective reference sequence which tells me that it worked. However, when I do the variant calling, I get an error saying that freebayes was unable to find a FASTA index entry for the chromosomes that don't belong to that reference subset. Obviously, they are not there because I removed them when I subset the reference.

Any suggestions on what might have gone wrong? If not, does anybody know a different way to subset a bam file into equal parts based on chromosome IDs?

bam subset freebayes • 1.6k views
ADD COMMENT
0
Entering edit mode

might be due to the header present ?

I usually manually add the header to only include those of the sequences in the subset because indeed some software complains on the fact they see sequences being listed in the header for which there is no data in the bam/sam file

extract the headers from the subset bam file and inspect which ones are present .

Personally I loop over all sequences I want to have included in the subset and add data for each one by one. I'm not sure if the -T option is doing what you think it might

ADD REPLY
0
Entering edit mode

Why not just use freebayes-parallel?

Check this github thread: https://github.com/ekg/freebayes/issues/465

ADD REPLY
0
Entering edit mode
5.1 years ago
ATpoint 85k

The command should be e.g. samtools view -b -o subset_chr1.bam file.bam chr1 The purpose of -T is to include the reference fasta sequence into the file, e.g. for writing CRAM format. I think what you did was to include the reference sequence of a specific chromosome into the fasta and that messed things up.

ADD COMMENT

Login before adding your answer.

Traffic: 1629 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6