Is it possible to extract the reference sequence from the BWA index files?
2
3
Entering edit mode
8.9 years ago
Alex Richter ▴ 210

I have a series of BWA index files ( *.amb *.ann *.bwt, etc) but the original reference fasta sequences are not available. Assuming that it's a complete burrows-wheeler index, the complete reference should be encoded in those indices.

Does anyone know if/how the fasta can be re-extracted from them?

alignment bwa • 4.4k views
ADD COMMENT
0
Entering edit mode
ADD REPLY
6
Entering edit mode
8.9 years ago
matted 7.8k

This is an interesting question. It definitely should be possible, so I guess the question may be how to do it with existing tools and the least amount of pain.

I looked into it, and it seems that from the original bwtsw code (that bwa relies on, particularly for the indexing steps), the "packed" fasta file (pac) is the key file to use. This is supported by looking at the order of the indexing substeps of fa2pac, pac2bwt, and finally bwt2sa.

So as an informed guess I searched for "pac2fasta", and (surprisingly to me) found an existing utility in the TMAP package. I believe, based on some quick tests, that the binary format for their pac file is the same as bwa uses (and bwtsw), aside from some embedded version numbers. The only catch is that TMAP has a binary annotation file ($ref.tmap.anno) that stores the chromosome names and lengths, whereas bwa uses a plaintext annotation file ($ref.ann). So if this is really what you want to do, it looks like you'll have to hardcode things (that's what I did to test with a small single chromosome reference) or write some code to make a TMAP annotation file from a bam header or bwa .ann file (by working through tmap_refseq_write_header in tmap_refseq.c).

ADD COMMENT
0
Entering edit mode

Thanks! Since I hopefully won't need to do this much, I'll do what you suggest, and munge in the annotations.

ADD REPLY
0
Entering edit mode

Hi Alex, did you find a way to get this done? I also have the same problem. Thanks.

ADD REPLY
0
Entering edit mode

Hi! Did any of you succeed? I would like your help in doing so either. Thanks!

ADD REPLY
0
Entering edit mode

tmap .pac generate by tmap index seen to be double the size, which I think have both forward sequence and it's complementary sequence, as tmap pac2fasta print. tmap $ref.tmap.anno, I think is bwa .amb+.ann; but they do share the same way convert base to "concatenated 2-bit encoded sequence" as comment in bwa source code, bwa.c.

ADD REPLY
1
Entering edit mode
17 months ago
evolighting ▴ 20

It's been a long time, but someone asked me the same question recently, so maybe I can answer it.

Get fasta file from index could be done, as original sequeces is needed at process such as " Generate CIGAR".

bns_get_seq, bns_fetch_seq in bntseq.c is what you want;

And it could be done with scritpt,using the same logic as bwa;

check my test:

$ python3 pac2fasta.py test.fa

9324aced81ecf188650e516992749a48  test.fa.origin
9324aced81ecf188650e516992749a48  test.fa

https://github.com/evolighting/pac2fasta

ADD COMMENT

Login before adding your answer.

Traffic: 2131 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6