Question

Ibwa Optimization

1

12.6 years ago

dwaggott ▴ 290

iBWA looks Awesome!

A few questions:

(1) Are there any scripts or tool recommendations for creating custom reference/remap files? Specifically, something that takes a VCF as input?
(2) What strategy do you recommend for using custom references based on sample genotypes? Would the primary be hs37lite.fa and the alternate be all the additional observed alleles, similar to the dbsnp137 reference provided? Or should I be doing something more complicated, e.g., phasing to get the two alternate haplotype references?
(3) Roughly, what's your variant calling pipeline when using iBWA? Do you think I can drop GATK and the indel realignment step and just use samtools?
(4) Are there any recommended methods or caveats for improving iBWA alignment accuracy (e.g., fastq trimming/filtering), speed (e.g., pBWA), or BAM size (e.g., GobyBWA)?

Best, Daryl

bwa alignment vcf bam • 2.9k views

updated 12.6 years ago by tgi.tabbott ▴ 230 • written 12.6 years ago by dwaggott ▴ 290

2

For those of us who didn't know, iBWA is "a fork of Heng Li’s BWA aligner with support for iteratively adding alternate haplotypes, reference patches, and variant hypotheses."

12.6 years ago by Steve Lianoglou 5.2k

0

thanks for the clarification

12.6 years ago by JC 13k

score 3 · Answer 1 · 2013-01-02

(1) Reference/remap files

We use joinx to create these: http://gmt.genome.wustl.edu/joinx/current/. The usage is as follows:

joinx create-contigs -v my_variants.vcf -r my_refseq.fa -o my_new_contigs.fa -R my_new_contigs.fa.remap --flank=99

This creates a new reference/remap pair with one sequence per variant* in my_variants.vcf (variants are relative to my_refseq.fa) with 99 bp of flanking sequence on either side of the variant.

*The command is currently set up to only create sequences for variants that have an identifier (e.g., rsid). This was fine for making the dbsnp reference but is probably not ideal for general use cases. I will make the identifier requirement optional and maybe add a few more options (like skipping sites that fail filters and some things described in the next point) shortly.
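
To make the "one sequence per variant with 99 bp flanking" idea concrete, here is a minimal Python sketch of how such a contig could be assembled from a reference sequence and a single VCF allele. This is only an illustration of the concept, not joinx's actual implementation; the function name `make_alt_contig` and its parameters are hypothetical, and it says nothing about the layout of the .remap file.

```python
def make_alt_contig(ref_seq, pos, ref, alt, flank=99):
    """Return the ALT allele with up to `flank` reference bases on each side.

    ref_seq : sequence of one reference chromosome/contig (string)
    pos     : 1-based position of the first REF base (VCF POS column)
    ref     : REF allele from the VCF record
    alt     : one ALT allele from the VCF record
    """
    start = pos - 1            # convert the 1-based VCF position to a 0-based index
    end = start + len(ref)     # index of the first base after the REF allele

    # Sanity check: the VCF must have been called against this reference.
    if ref_seq[start:end].upper() != ref.upper():
        raise ValueError("REF allele does not match the reference at this position")

    left = ref_seq[max(0, start - flank):start]   # clipped at the contig start
    right = ref_seq[end:end + flank]              # clipped at the contig end
    return left + alt + right


# Tiny example with a 2 bp flank so the pieces are easy to see.
example = make_alt_contig("ACGTACGTAC", pos=5, ref="A", alt="G", flank=2)
```

Near the ends of a chromosome the flank is simply shorter in this sketch; how joinx handles that case is not stated here.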

(2) Refs based on sample genotypes

What you said (hs37lite.fa as the primary, output of joinx create-contigs as the alternate) is how we do it. Right now, joinx doesn't look at the genotype data; contigs are created for every alternate allele in the ALT field for each site in the vcf file (whether or not the variants appear in a GT call). I will add some options to do things like only process alleles that are present in GT calls, and maybe allow some basic filtering based on INFO/FORMAT fields (e.g., DP > 20). In any case, I don't think you need to worry about phasing in the vcf sample data (GT=1/2 vs GT=1|2); you will get both sequences created either way.
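
Until options like these exist in joinx, one workaround is to pre-filter the VCF yourself before running create-contigs. The following is a rough Python sketch, assuming a tab-delimited VCF with a FORMAT column containing GT and DP; the function names and the `min_dp` threshold are illustrative choices, not part of joinx. It keeps a site only if at least one sample calls an ALT allele with sufficient depth. It does not trim uncalled alleles out of the ALT field, so joinx would still create contigs for every ALT allele at the sites that remain.

```python
def called_alt_indices(gt):
    """Return the ALT allele indices (1, 2, ...) present in a GT string.

    Phased (1|2) and unphased (1/2) genotypes are treated the same way.
    """
    alleles = gt.replace("|", "/").split("/")
    return {int(a) for a in alleles if a.isdigit() and int(a) > 0}


def keep_record(line, min_dp=20):
    """True if any sample calls an ALT allele with DP greater than min_dp."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 10:
        return False                      # no FORMAT/sample columns present
    keys = fields[8].split(":")
    for sample in fields[9:]:
        values = dict(zip(keys, sample.split(":")))
        dp = values.get("DP", ".")
        has_alt = bool(called_alt_indices(values.get("GT", "")))
        if has_alt and dp.isdigit() and int(dp) > min_dp:
            return True
    return False


def filter_vcf_lines(lines, min_dp=20):
    """Yield header lines unchanged plus the records that pass keep_record."""
    for line in lines:
        if line.startswith("#") or keep_record(line, min_dp):
            yield line
```

To use it on files, something like `out.writelines(filter_vcf_lines(open("my_variants.vcf")))` would write the reduced VCF, which can then be passed to joinx with -v.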

(3) Variant calling pipeline

I wouldn't change what you're doing right away. I would suggest running things through your existing pipeline to see how the results vary. Most of my own testing has just used samtools for variant calling after aligning with ibwa (not necessarily because I think that is the best approach). If you want to generate sequences from existing sample data in a VCF, then I would definitely not suggest simplifying your existing calling strategy for generating the initial set of variants, as you will want your alternate hypotheses to be as accurate as possible.
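
One simple way to see how the results vary is to compare the call sets from the two runs site by site. Below is a small Python sketch, assuming both pipelines emit standard tab-delimited VCF; the function names are made up for illustration. It compares variants by (chromosome, position, REF, ALT) only, so the same indel written in two different representations would be counted as different unless both files are normalized first.

```python
def variant_keys(lines):
    """Collect (chrom, pos, ref, alt) keys from VCF lines, one per ALT allele."""
    keys = set()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue                                  # skip headers and blank lines
        chrom, pos, _id, ref, alts = line.rstrip("\n").split("\t")[:5]
        for alt in alts.split(","):
            if alt != ".":                            # "." means no ALT allele
                keys.add((chrom, int(pos), ref, alt))
    return keys


def compare_callsets(lines_a, lines_b):
    """Return the variants shared by both call sets and those unique to each."""
    a, b = variant_keys(lines_a), variant_keys(lines_b)
    return {"shared": a & b, "only_a": a - b, "only_b": b - a}
```

For example, `compare_callsets(open("bwa_calls.vcf"), open("ibwa_calls.vcf"))` (hypothetical file names) gives three sets whose sizes are a quick first summary of what changed between the two alignments.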

(4) Optimizations

Any pre-processing that works for bwa (trimming, filtering) should work the same for ibwa. The only differences between ibwa and stock bwa 0.5.9 are in sampe, so any methods that speed up "bwa aln" that yield equivalent .sai files are applicable (pbwa might be an option here). GobyBWA looks like it has its own file formats, so that will not work well. Lastly, ibwa sampe does have a -t option to support multi-threading. There have been some other sampe threading patches to bwa that work a bit better (at the expense of using more memory) than what I did in ibwa, but the -t option is worth trying if you become angry about the wall clock time used by ibwa sampe.