Question

splitting a bam file by XA tag into multiple lines

0

Entering edit mode

2.8 years ago

erl1977 ▴ 60

Hi all,

I've been mapping some samples with BWA and have a question as to how to handle the output. Specifically, I am interested in those reads that are mapping to multiple regions of the genome. This is given as the XA tag in the sam/bam file. I would like to be able to split the XA tags up so each mapping location (the supplemental alignments) are all on there own line within the bam file. So it would be something like this:

the original bam output:

MG01HX02:912:H5HNTCCX2:6:1101:7760:13271:N:0:NGACCAAT   16      chr1       1053181 0       16M     *       0       0       TGGTGGAGGCAGAGGN        IAIIIIIIIIFAFAA#        XT:A:R  NM:i:1  X0:i:24 X1:i:0  XM:i:1  XO:i:0  XG:i:0  MD:Z:15T0       XA:Z:chr2,+1388224,16M,1;chr5.+479696,16M,1

the new one that I would like to get:

MG01HX02:912:H5HNTCCX2:6:1101:7760:13271:N:0:NGACCAAT   16      chr1       1053181 0       16M     *       0       0       TGGTGGAGGCAGAGGN        IAIIIIIIIIFAFAA#        XT:A:R  NM:i:1  X0:i:24 X1:i:0  XM:i:1  XO:i:0  XG:i:0  MD:Z:15T0       
MG01HX02:912:H5HNTCCX2:6:1101:7760:13271:N:0:NGACCAAT   16      chr2       1388224 0       16M     *       0       0       TGGTGGAGGCAGAGGN        IAIIIIIIIIFAFAA#        XT:A:R  NM:i:1  X0:i:24 X1:i:0  XM:i:1  XO:i:0  XG:i:0  MD:Z:15T0       
MG01HX02:912:H5HNTCCX2:6:1101:7760:13271:N:0:NGACCAAT   16      chr5       479696 0       16M     *       0       0       TGGTGGAGGCAGAGGN        IAIIIIIIIIFAFAA#        XT:A:R  NM:i:1  X0:i:24 X1:i:0  XM:i:1  XO:i:0  XG:i:0  MD:Z:15T0

I can't seem to find a simple way of doing this, or being able to use some of the standard tools out there (samtools, etc.). I could perhaps write something in perl, but have been trying to avoid doing that. Thanks for any help

bam BWA • 1.2k views

ADD COMMENT • link updated 2.1 years ago by Ram 45k • written 2.8 years ago by erl1977 ▴ 60

0

Entering edit mode

looks like a XY problem. What do you want to do after that ?

ADD REPLY • link 2.8 years ago by Pierre Lindenbaum 166k

0

Entering edit mode

basically, I want to be able to get the number of times each read maps to the genome, and where each read maps to the genome. So, to be a little more specific here, I am actually going to be mapping to a whole host of microbe genomes. I know that most reads will map to multiple microbes. but, if I use a program such as bedtools coverage to count the instances, it will only consider the 'primary' alignment and not the supplemental alignment, yet I want to count each read that goes to each microbe genome. The best way that I can think of doing this is to split the supplemental alignments into separate lines in the bam/sam, and go from there.

Also, I am using bwa all/samse for the mapping as these are short reads (<30nts). So, I do get the XA tag rather than the SA tag which I think comes when bwa mem is used (not totally sure about that).

ADD REPLY • link 2.8 years ago by erl1977 ▴ 60

0

Entering edit mode

these 3 lines should already exist in your BAM file. at least with SA tag this would be the case, XA from what I know is before SA was standardized. try grepping your existing BAM for the read name "MG01HX02:912:H5HNTCCX2:6:1101:7760:13271:N:0:NGACCAAT"

ADD REPLY • link 2.8 years ago by cmdcolin ★ 4.2k