Question

Convert Bowtie Output To Bwa

3

Entering edit mode

12.4 years ago

Konrad Rudolph ▴ 140

I have a BAM file produced by bowtie with multiple alignments per read (--all option specified). Unfortunately, a tool I use which requires the bwa-flavoured BAM format as input – in particular, it requires that multiple hits are reported in the XA tag of the first read, rather than being reported separately.

Is there a tool that can do the conversion from bowtie to bwa output? I’d write it myself but I’m afraid of missing another subtle difference (i.e. are multiple hits of the same read guaranteed to be consecutive in the output of bowtie? etc.)

bowtie bwa sam • 4.1k views

ADD COMMENT • link updated 11.1 years ago by Walter • 0 • written 12.4 years ago by Konrad Rudolph ▴ 140

1

Entering edit mode

https://github.com/lh3/bwa/blob/master/xa2multi.pl is this waht you want? it expands the xa from bwa sam files.

ADD REPLY • link 12.4 years ago by jingtao09 ▴ 110

0

Entering edit mode

@jingtao09 I need it the other way round. :-( Furthermore, that script actually looks for an XA tag that contains the prefix Z: … I’m not sure what that is. It doesn’t seem to be the format I need though.

ADD REPLY • link updated 5.0 years ago by Ram 44k • written 12.4 years ago by Konrad Rudolph ▴ 140

0

Entering edit mode

Z: means the value follows are strings. by look at the definition of XA - chr,pos,CIGAR,NM; I think you can do this,

// keep the last line parsed in record and compare to the  current parsed line.
vector=empty
lastline=getline()
while current=getline():
     if the current is same as last
        assert( current is  mapped) 
        compile chr,pos,CIGAR,NM and append it into a vector
     else:
        output the lastline with all its XA records in vectors.
        empty the vector
     last=current

output lastline with XA vector
//  note: if the newly compiled sam file doesnt work, you might also make sure the XA:Z tag are in the same column of standard bwa-sam file

ADD REPLY • link updated 5.0 years ago by Ram 44k • written 12.4 years ago by jingtao09 ▴ 110

0

Entering edit mode

@jingtao09 That’s effectively what I’m doing at the moment.

ADD REPLY • link updated 5.0 years ago by Ram 44k • written 12.4 years ago by Konrad Rudolph ▴ 140

Ram · Answer 1 · 2012-06-29

In case anybody’s interested, this is the script I ended up using. Unfortunately, it’s extremely slow (I really cannot overstate that) and doesn’t parallelise. A better approach would be to use the .bai index file to collect all reads on a chromosome and run one job per chromosome in parallel.

Furthermore, the output is .sam not .bam, you need to use samtools view -Sb to create a .bam file from that, and provide header information or a reference sequence.

#!/bin/bash

# Read a BAM file produced by Bowtie and produce a BAM file in the format used by BWA.
# Notably, this means accumulating multiple hits of the same read and putting all but
# the first alignment into the 'XA' optional tag of the first hit.

set -e
set -u

function get { cut -f$1; }

infile="$1"
outfile="$2"

rm -f "${outfile}"

prev=''
prevseq=''
more_pos=''

samtools view "${infile}" | while read line; do
    seq=$(get 10 <<< "${line}")

    if [ "${seq}" != "${prevseq}" ]; then
        # Write output.
        if [ "${prev}" != "" ]; then
            # Get rid of Bowtie-specific 'XA' tag (we simply discard it).
            prev="$(sed 's/XA:i:[[:digit:]]*[[:space:]]*//' <<< "${prev}")"
            if [ "${more_pos}" = "" ]; then
                echo "${prev}"
            else
                echo "${prev} XA:Z:${more_pos}"
            fi
        fi
        prevseq="${seq}"
        prev="${line}"
        more_pos=''
    else
        # Store this hit as an additional hit of the first one with equal sequence.
        chr=$(get 3 <<< "${line}")
        pos=$(get 4 <<< "${line}")
        cigar=$(get 6 <<< "${line}")
        nm=$(get 12- <<< "${line}" | sed 's/.*NM:\([^[:space:]]*\).*/\1/')
        # Apparently nm is supposed to be a numeric value, not of the format 'i:0' ...
        nm=${nm/i:/}
        more_pos="${more_pos}${chr},${pos},${cigar},${nm};"
    fi
done > "${outfile}"

# Write last sequence.

# Get rid of Bowtie-specific 'XA' tag (we simply discard it).
prev="$(sed 's/XA:i:[[:digit:]]*[[:space:]]*//' <<< "${prev}")"
if [ "${more_pos}" = "" ]; then
    echo "${prev}" >> "${outfile}"
else
    echo "${prev} XA:Z:${more_pos}" >> "${outfile}"
fi

Ram · Answer 2 · 2013-10-24

Here's my python take on this. It has much the same functionality of the bash script posted by Konrad Rudolph. I believe it's faster but at the very least it feels much faster thanks to the progressbar. It should be easy to comment out any of the progressbar lines if you don't want them. It's rather light on checking that the input file formatting is correct.

import progressbar
import re
import subprocess
import sys

infile = open(sys.argv[1])
outfile = open(sys.argv[2], "w")

curID = None
bestHit = None
otherHits = []

def writeRead(read, otherHits):
    if "XA" in read: raise Exception("not implemented")

    if len(otherHits) > 0:
        read = read + "\tXA:Z:{}".format("".join(otherHits))
    outfile.write(read+"\n")


# check the length of the input sam file - only important if you want it for the progress meter
print "calculating input sam file size..."
lineCount = subprocess.check_output("wc -l {}".format(sys.argv[1]), shell=True)
lineCount = int(lineCount.strip().split()[0])
print lineCount

pbar = progressbar.ProgressBar(widgets=[progressbar.Percentage(), ' ', 
    progressbar.Bar(), ' ', progressbar.ETA()], maxval=lineCount).start()

print "converting..."

for i, line in enumerate(infile):
    if i % 1000 == 0:
        pbar.update(i)

    if line.startswith("@"):
        outfile.write(line)
        continue

    fields = line.strip().split()
    if curID is not None and fields[0] == curID:
        strand = "-" if int(fields[1]) & 0x10 else "+"

        nmmatch = re.match(".*NM:i:(\w*).*", line)
        if not nmmatch: raise Exception("read missing NM field:", line)
        nm = nmmatch.groups(1)[0]

        otherHit = "{chr},{strand}{pos},{cigar},{nm};".format(chr=fields[2], strand=strand, 
            pos=fields[3], cigar=fields[5], nm=nm)

        otherHits.append(otherHit)
    else:
        # write out previous read
        if bestHit is not None:
            writeRead(bestHit, otherHits)

        # save new read
        curID = fields[0]
        bestHit = line.strip()
        otherHits = []

writeRead(bestHit, otherHits)

pbar.finish()