BWA 0.7.12 and Unique reads, -r, -c options, XA: and SA: optional tags on SAM output,
2
3
Entering edit mode
9.4 years ago
theodore ▴ 90

So, I am using the following version of BWA

Program: bwa (alignment via Burrows-Wheeler transformation)
Version: 0.7.12-r1039

I got through a lot threads read answers and so on... still I am confused.

I am performing ChIP or 4C seq. I get single end reads. I would like to get uniquely mapped reads. For the 4C I am trimming the viewpoint.

This is the command I am using:

bwa mem -t 8 -M -k 19 -r 1 -c 1 $ref_genome $1/$base1".trimmed" > $3/$NOEXT1".sam"

and those are the explanation for every string I included:

  • -c keep those that have unique alignment (Discard a MEM if it has more than INT occurence in the genome. This is an insensitive parameter. [10000])
  • -r by reducing it to 1, I get bwa mem to work things out more intensively, like the --best command for bowtie (Larger value yields fewer seeds, which leads to faster alignment speed but lower accuracy)
  • -k seed length (I thing that 32 is to big????)
  • -M keep it compatible with picard tools
  • -t threads to use

What are your opinion on the -c and -r strings?

Then I get to see the produced file and it has sequences that are marked with the optional field of XA:Z and SA:Z

XA: Alternative hits; format: (chr,pos,CIGAR,NM;)

EDIT: In any case I have failed to find the XT:A:U optional flag

SA: Z, not sure what it is but, it almost always coincides with the 256 flag = not primary alignment so I believe that this is NOT a unique aligned read.

The question is mostly on XA:. There is no XA:U if it exist it is XA:Z and is followed by 3 or more genomic coordinates, and that makes me believe that this is not a uniquely mapped read.

To wrap it up I remove from the SAM file with the following:

samtools view -h -q 1 -F 4 -F 256 $3/$NOEXT1".sam" | grep -v "XA:Z" | grep -v "SA:Z" > $3/$NOEXT1"_mem.sam"
  • -F 4: remove not mapped
  • -F 256: remove not primary aligned
  • -q 1: remove reads with MAPQ quality less than 1 (I thing that this is quite necessary)

I am rephrasing.

Does anyone have any info our input regarding the SA: and XA: tags, especially in the absence of the XA:U tag? Does the removal of those tags makes sense or I am loosing valuable reads?

EDIT:

Here are some reads as an example:

Those include both the SA: and the XA optional flag

BIOMICS-HISEQHI:522:HCYM5ADXX:2:2115:14837:88360        16      chr12   87311997        1       34M17S  *       0       0       TCTATAAACACCTCTAAGAAAATAAACTAGAAAATCCAGATGAAATGGATA     BBBBBBABB>A<B>B>A?7BBB?A=0BBA?A?BBBBBBBA=BBBBBBBB?B     NM:i:1  MD:Z:0A33       AS:i:33 XS:i:31 SA:Z:chr2,161581931,+,32M19S,1,0;       XA:Z:chr9,-104599379,51M,5;chr3,+170653467,51M,5;
BIOMICS-HISEQHI:522:HCYM5ADXX:2:1102:3525:68438 16      chr12   87311999        0       32M17S  *       0       0       TATAAACACCTCTAAGAAAATAAACTAGAAAATCCAGATGAAATGGATA       AA==5.DB8BB8=/??*<D99D??9?D<BEBD@FDD????E<E?E<CAC       NM:i:0  MD:Z:32 AS:i:32 XS:i:30 SA:Z:chr2,161581931,+,32M17S,0,0;       XA:Z:chr9,-104599381,49M,4;chr3,+170653467,49M,4;chr12,+46991828,49M,5;
BIOMICS-HISEQHI:522:HCYM5ADXX:2:1209:6904:71888 256     chr17   59076122        0       33M18H  *       0       0       TTTTCCCATTCTGTAGATTGTCTGTTTACTCTG       IGGIIIIIIIIGA@GGGHFIIIIIFFHEIHIHF       NM:i:0  MD:Z:33 AS:i:33 XS:i:32 SA:Z:chr11,22585338,+,18S33M,0,0;       XA:Z:chr4,-151801895,19S32M,0;
(END)

Those got only the XA: optional flag

BIOMICS-HISEQHI:522:HCYM5ADXX:2:2104:11639:55952        0       chr1    8120580 1       15S31M5S        *       0       0       TCTTGGGATGGACATGTGAGCTGAAATGGCGCCATTGCACTCCAGCTTGGG     GIIIGCGFGCGHEGFHIICCGIHDD>HB<CHHGAHHIIHHEHHFFFFFEDD     NM:i:0  MD:Z:31 AS:i:31 XS:i:29 XA:Z:chr8,+98304845,30M21S,1;chr20,+51943978,23S28M,0;chr16,+70407077,14S37M,2;chr8,+55962122,15S36M,2;chrX,-70173856,26M25S,0;
BIOMICS-HISEQHI:522:HCYM5ADXX:2:2105:16548:32340        0       chr1    8120580 6       15S34M  *       0       0       CCTTGGGATGGACATGTGAGCTGAAATGGCGCCATTGCACTCCAGCCTG       HHIIHIEIBF1?DHGGDDDF<FB?BGIHI>FGI<EHC@CCHEFE>@BCC       NM:i:0  MD:Z:34 AS:i:34 XS:i:30 XA:Z:chr8,+98304845,30M19S,0;chr16,+70407077,14S35M,1;chr8,+55962122,15S34M,1;
BIOMICS-HISEQHI:522:HCYM5ADXX:2:2106:5931:90491 0       chr1    8120580 6       14S32M  *       0       0       CTTGGGATGTACATGTGAGCTGAAATGGCGCCATTGCACTCCAGCC  @H4EC381)*1*:?B:*?:0:?BD?G*?8'-<-5C4=;;D.?7A>E  NM:i:0  MD:Z:32 AS:i:32 XS:i:28 XA:Z:chr16,+70407077,13S33M,1;chr8,+55962122,14S32M,1;
BIOMICS-HISEQHI:522:HCYM5ADXX:2:2108:7261:70911 0       chr1    8120580 0       15S36M  *       0       0       TCTTGGGATGGACATGTGAGCTGAAATGGCGCCATTGCACTCCAGTCTGGG     :>33@@1(.=?1>99?3.=33.6=3:>)=;533-5=9;;3>9?;?8=?337     NM:i:2  MD:Z:30C3A1     AS:i:30 XS:i:29 XA:Z:chr8,+98304845,30M21S,1;chr1,-8936944,28M23S,0;chr16,+70407077,14S37M,2;chr8,+55962122,15S36M,2;chr10,-95309935,26M25S,0;
BIOMICS-HISEQHI:522:HCYM5ADXX:2:2113:12753:98209        0       chr1    8120580 9       15S36M  *       0       0       CCTTGGGATGGACATGTGAGCTGAAATGGCGCCATTGCACTCCAGCCTGAG     ><6)<2:86-97;3:?>?==<?>?1=?3>75;?3=?;?;=???;?==>7;>     NM:i:0  MD:Z:36 AS:i:36 XS:i:30 XA:Z:chr8,+98304845,30M21S,0;chr16,+70407077,14S37M,2;

Any help would be appreciated

BWA SAM • 24k views
ADD COMMENT
0
Entering edit mode

No one? Anyone?

ADD REPLY
0
Entering edit mode

I will probably also post this question over to SeqAnswers

ADD REPLY
0
Entering edit mode

Personally I find your post quite confusing, it's unclear to me what you are asking... To the question "What are your opinion on the -c and -r strings? I haven't played with them but default seems ok for my needs. "Am, I missing something, Do I do something wrong?" hard to tell without more background about your task...

ADD REPLY
0
Entering edit mode

what would you think a proper background would be?

experiment?

propose of the analysis?

regarding the "what am I missing" question, it refers mostly to the observation of the existence of the SA:Z and XA:Z, optional flags, although I have followed the recommendations described in nummerous posts in the forums.

I could add that I have single end reads...

ADD REPLY
0
Entering edit mode

Dear Theodore;

I am having equal difficulty filtering for uniquely aligned reads using bwa0.7.10. I simply want to extract all reads that align exactly once to the genome. In STAR, MAPQ is calculated as a function of how many times it can be potentially aligned, so I can simply filter based on that. I'm not sure what strategy is appropriate for bwa, where MAPQ is calculated differently. Would outputing only the highest alignments do the trick? I am hoping you have made progress on this front.

Regards,
Julien

ADD REPLY
0
Entering edit mode

do you have figured out the question to extract all reads that align exactly once to the genome successfully ?

ADD REPLY
0
Entering edit mode

if I am not mistaken the following as noted above should do it

try to map with:

bwa mem -t 8 -M -k 19 -r 1 -c 1

and then from the sam file, before converting it to bam do:

samtools view -h -q 1 -F 4 -F 256 $3/$NOEXT1".sam" | grep -v "XA:Z" | grep -v "SA:Z" > $3/$NOEXT1"_mem.sam"

the -F 256 applies only if the -M has been used during mapping

I hope that it helps

ADD REPLY
4
Entering edit mode
9.0 years ago
Carambakaracho ★ 3.3k

Not sure whether I understood your original question, but in case what you're asking is "what do the XA and SA flags mean" this is what bwa announced:

Release 0.7.5 (29 May, 2013):

  • Other hits part of a chimeric alignment are now reported in the SA tag, conforming to the SAM spec v1.5. (See also SAM format specifications)

Release 0.7.9 (19 May, 2014):

  • Output alternative hits in the XA tag if there are not so many of them. This is a BWA-backtrack feature

It took me a while to figure this out as well.

ADD COMMENT
2
Entering edit mode
6.4 years ago
Samir ▴ 210

Copied from my reply at an another forum


For much faster and stable$ implementation, sambamba can handle this using following one-liner:

sambamba view -t 12 -h -f bam -F "mapping_quality >= 1 and not (unmapped or secondary_alignment) and not ([XA] != null or [SA] != null)" test.bwa-mem.bam -o test-uniq.bam

See manpage and sambamba synatx for filters for details on further usage.

$: Besides grep based search being slow, it may return false positive hits (@gringer's reply above). I also do not find awk based approach reliable because fields like XA, SA, etc. are optional fields in SAM format, and not positionally fixed fields.

ADD COMMENT

Login before adding your answer.

Traffic: 2214 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6