Split Bam Files By Region For Parallel Variant Calling
2
3
Entering edit mode
12.3 years ago
William ★ 5.3k

To speed up variant calling (SNP / Indel) on a set of large BAM files I want to do the variant calling in paralell on a Sun Grid Engine cluster.

I already managed to split the BAMs by reference (chromosome) using https://github.com/pezmaster31/bamtools/ . Afterward I use vcf tools to concatenate the chromosome VCF files to one sample VCF file.

These bamfiles are still relatively big so I would like to split the BAMS further on large N polymer sequences in the reference. I already have a file with the location of these large N polymer sequences in our reference.

Is there a tool that already has the functionality to split BAM files by N polymer sequences?

Or what would be the easiest way to use this region information to split BAM files?

bam • 8.2k views
ADD COMMENT
0
Entering edit mode

Is I/O your limiting factor here? if not, just place a single bam on a network-accessible disk, and use samtools to extract the reads you need. (samtools view file.bam 2:1234-5678 | variantcaller)

ADD REPLY
0
Entering edit mode

I can't pipe the samtools output, I need to pass a bam file path to an 3rd party variant caller.

ADD REPLY
0
Entering edit mode

Not a direct answer but GATK 2 has support for reduced bams now so they might reduce your bam sizes for multi sample calling.

but of course if your issue is getting more chunks to parallelise then I think the samtools -L bedfile seems like a good idea

ADD REPLY
3
Entering edit mode
12.3 years ago
Vivek ★ 2.7k

You don't necessarily have to split the bam file to run your jobs in parallel on the grid. Variant callers like GATK unified genotyper can operate on a single chromosome of a BAM if you just specify the region like "-L chr:1-1000.."

ADD COMMENT
0
Entering edit mode

I thought the iontorrent modified GATK variant caller didn't support this but seems to do work with bed files. So I can use the whole bams together with bed files.

ADD REPLY
2
Entering edit mode
12.3 years ago

You can use samtools: http://samtools.sourceforge.net/

In samtools view, there is a -L parameter that'll allow you to extract reads that overlap with a specified BED file:

-L FILE  output alignments overlapping the input BED FILE

Convert your N-polymer coordinates into a bed file and extract reads that only overlap those coordinates with the -L parameter. You'll have to make a bed file that only has regions without the N-polymer coordinates to extract everything else. I don't think there is an option to exclude regions in samtools.

ADD COMMENT
0
Entering edit mode

Hi Damian,

Your post is almost 3 years old but I am currently thinking about this problem.

If you split your chromosome into 1Kb for example, what about the reads that start in the 0-1Kb chunk and finish in the 1Kb-2Kb chunks? There are in both chunks right? Isn't that a problem? Doesn't it induce errors?

ADD REPLY
0
Entering edit mode

I am not sure how samtools handles the -L parameter or if it even still implements it. If samtools will grab any reads that intersects (totally or partially) the region defined the bed file, then it should be fine. The only problem is that you might get duplicated calls among consecutive region's vcfs which you can check for and take out.

ADD REPLY
0
Entering edit mode

Thanks for the answer.

Yes, then I want to merge my vcfs.

I was just worried that 2 vcfs could disagree about an event (like 0-1Kb chunk vcf says that position 1008 is a sequencing error and 1Kb-2Kb chunk vcf says that it is a SNP).

ADD REPLY

Login before adding your answer.

Traffic: 2620 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6