I have a BAM file produced by bowtie with multiple alignments per read (--all option specified). Unfortunately, a tool I use which requires the bwa-flavoured BAM format as input – in particular, it requires that multiple hits are reported in the XA tag of the first read, rather than being reported separately.
Is there a tool that can do the conversion from bowtie to bwa output? I’d write it myself but I’m afraid of missing another subtle difference (i.e. are multiple hits of the same read guaranteed to be consecutive in the output of bowtie? etc.)
@jingtao09 I need it the other way round. :-( Furthermore, that script actually looks for an XA tag that contains the prefix Z: … I’m not sure what that is. It doesn’t seem to be the format I need though.
Z: means the value follows are strings. by look at the definition of XA - chr,pos,CIGAR,NM; I think you can do this,
// keep the last line parsed in record and compare to the current parsed line.
vector=empty
lastline=getline()
while current=getline():
if the current is same as last
assert( current is mapped)
compile chr,pos,CIGAR,NM and append it into a vector
else:
output the lastline with all its XA records in vectors.
empty the vector
last=current
output lastline with XA vector
// note: if the newly compiled sam file doesnt work, you might also make sure the XA:Z tag are in the same column of standard bwa-sam file
In case anybody’s interested, this is the script I ended up using. Unfortunately, it’s extremely slow (I really cannot overstate that) and doesn’t parallelise. A better approach would be to use the .bai index file to collect all reads on a chromosome and run one job per chromosome in parallel.
Furthermore, the output is .sam not .bam, you need to use samtools view -Sb to create a .bam file from that, and provide header information or a reference sequence.
#!/bin/bash
# Read a BAM file produced by Bowtie and produce a BAM file in the format used by BWA.
# Notably, this means accumulating multiple hits of the same read and putting all but
# the first alignment into the 'XA' optional tag of the first hit.
set -e
set -u
function get { cut -f$1; }
infile="$1"
outfile="$2"
rm -f "${outfile}"
prev=''
prevseq=''
more_pos=''
samtools view "${infile}" | while read line; do
seq=$(get 10 <<< "${line}")
if [ "${seq}" != "${prevseq}" ]; then
# Write output.
if [ "${prev}" != "" ]; then
# Get rid of Bowtie-specific 'XA' tag (we simply discard it).
prev="$(sed 's/XA:i:[[:digit:]]*[[:space:]]*//' <<< "${prev}")"
if [ "${more_pos}" = "" ]; then
echo "${prev}"
else
echo "${prev} XA:Z:${more_pos}"
fi
fi
prevseq="${seq}"
prev="${line}"
more_pos=''
else
# Store this hit as an additional hit of the first one with equal sequence.
chr=$(get 3 <<< "${line}")
pos=$(get 4 <<< "${line}")
cigar=$(get 6 <<< "${line}")
nm=$(get 12- <<< "${line}" | sed 's/.*NM:\([^[:space:]]*\).*/\1/')
# Apparently nm is supposed to be a numeric value, not of the format 'i:0' ...
nm=${nm/i:/}
more_pos="${more_pos}${chr},${pos},${cigar},${nm};"
fi
done > "${outfile}"
# Write last sequence.
# Get rid of Bowtie-specific 'XA' tag (we simply discard it).
prev="$(sed 's/XA:i:[[:digit:]]*[[:space:]]*//' <<< "${prev}")"
if [ "${more_pos}" = "" ]; then
echo "${prev}" >> "${outfile}"
else
echo "${prev} XA:Z:${more_pos}" >> "${outfile}"
fi
Here's my python take on this. It has much the same functionality of the bash script posted by Konrad Rudolph. I believe it's faster but at the very least it feels much faster thanks to the progressbar. It should be easy to comment out any of the progressbar lines if you don't want them. It's rather light on checking that the input file formatting is correct.
import progressbar
import re
import subprocess
import sys
infile = open(sys.argv[1])
outfile = open(sys.argv[2], "w")
curID = None
bestHit = None
otherHits = []
def writeRead(read, otherHits):
if "XA" in read: raise Exception("not implemented")
if len(otherHits) > 0:
read = read + "\tXA:Z:{}".format("".join(otherHits))
outfile.write(read+"\n")
# check the length of the input sam file - only important if you want it for the progress meter
print "calculating input sam file size..."
lineCount = subprocess.check_output("wc -l {}".format(sys.argv[1]), shell=True)
lineCount = int(lineCount.strip().split()[0])
print lineCount
pbar = progressbar.ProgressBar(widgets=[progressbar.Percentage(), ' ',
progressbar.Bar(), ' ', progressbar.ETA()], maxval=lineCount).start()
print "converting..."
for i, line in enumerate(infile):
if i % 1000 == 0:
pbar.update(i)
if line.startswith("@"):
outfile.write(line)
continue
fields = line.strip().split()
if curID is not None and fields[0] == curID:
strand = "-" if int(fields[1]) & 0x10 else "+"
nmmatch = re.match(".*NM:i:(\w*).*", line)
if not nmmatch: raise Exception("read missing NM field:", line)
nm = nmmatch.groups(1)[0]
otherHit = "{chr},{strand}{pos},{cigar},{nm};".format(chr=fields[2], strand=strand,
pos=fields[3], cigar=fields[5], nm=nm)
otherHits.append(otherHit)
else:
# write out previous read
if bestHit is not None:
writeRead(bestHit, otherHits)
# save new read
curID = fields[0]
bestHit = line.strip()
otherHits = []
writeRead(bestHit, otherHits)
pbar.finish()
https://github.com/lh3/bwa/blob/master/xa2multi.pl is this waht you want? it expands the xa from bwa sam files.
@jingtao09 I need it the other way round. :-( Furthermore, that script actually looks for an
XA
tag that contains the prefixZ:
… I’m not sure what that is. It doesn’t seem to be the format I need though.Z:
means the value follows are strings. by look at the definition of XA - chr,pos,CIGAR,NM; I think you can do this,@jingtao09 That’s effectively what I’m doing at the moment.