Question

Recreate reference fasta from bam file

1

Entering edit mode

6.8 years ago

Chris Miller 22k

To make a long story short, I have a bam file that is aligned to the human reference plus a specific contig (human gene + viral vector introduced into a mouse).

The fasta file that was used for alignment got inadvertently deleted, but I have good coverage across the contig. Is there a straightforward way to use the reads aligning to that contig to recreate the contig's fasta file? (All the info should be there!)

bam fasta reference • 3.6k views

ADD COMMENT • link updated 6.8 years ago by h.mon 35k • written 6.8 years ago by Chris Miller 22k

0

Entering edit mode

Interesting question! I'd love to see what people think of this!

ADD REPLY • link 6.8 years ago by Ram 44k

0

Entering edit mode

reformat.sh in=your.bam out=fasta.fa (from BBMap suite) will create a fasta file. I am not sure if it will be the original reference you wish to recover. You could test with a small(ish) example.

ADD REPLY • link 6.8 years ago by GenoMax 149k

0

Entering edit mode

That gives each read in fasta format, not the entire contig.I guess that could be assembled in subsequent steps or something..

ADD REPLY • link 6.8 years ago by Chris Miller 22k

0

Entering edit mode

You are right. Sorry a mental lapse on my part.

Do you have the index for what ever aligner that was used? I wonder if there is a way to recreate the fasta from it.

ADD REPLY • link 6.8 years ago by GenoMax 149k

0

Entering edit mode

Nope - the directory with the fasta and the indices was nuked. FWIW, I'm pretty sure that I can recreate this fastq by digging up the vector and gene sequences and spending an hour or two doing manual surgery. It just seems like there oughta be a better way!

ADD REPLY • link 6.8 years ago by Chris Miller 22k

1

Entering edit mode

6.8 years ago

h.mon 35k

The easiest I can think of is extract the reads from the region and reassemble, but this would recreate a "wrong" reference.

Another option would be to recreate the region using the CIGAR field. Of course, this has already been implemented by Pierre a long time ago - the surprise here is it isn't java. This implementation apparently works for simple CIGAR.

There is another thread discussing the issue ( Is it possible to reconstruct alignment from CIGAR and MD strings alone? ), where Istvam mentions a thread announcing sam2pariwise - maybe it will work for you? This thread indicates that the CIGAR + MD are required for unambiguous reference reconstruction.

ADD COMMENT • link 6.8 years ago by h.mon 35k

GenoMax · Accepted Answer · 2018-05-09

2

Entering edit mode

6.8 years ago

microfuge ★ 2.0k

Although reassembly seems the preferred option I thought if Pilon which polishes a reference with standard illumina reads could work . Just for fun I tried following with an ecoli MG1655 bam file.

Got the length and name of reference chromosome from bam header (in this case only one)
Created a fake reference with the same name and same number of nucleotides (but all nucleotides were A ; I tried with N and it did not work)
Then use pilon and gave the input bam and the fake reference (composed of all A nucleotides) ( I gave as a single end bam but paired should produce better results)
- Pilon generated a fasta (pilon.fasta default name)
- Ran blast2sequence (NCBI;Online) with actual ecoli MG1655 genome)
- Following were the results (99% coverage and 99% identity)
- Dot Plot (https://ibb.co/ifFzjS)

- Alignment details (https://ibb.co/hNCejS)

ADD COMMENT • link updated 6.8 years ago by GenoMax 149k • written 6.8 years ago by microfuge ★ 2.0k

0

Entering edit mode

Thanks for that test. May be useful to @Chris.

Problem is one can't be 100% sure (which you have confirmed) that you recovered the exact fasta genome used originally. But this is close enough.