So, I am using the following version of BWA
Program: bwa (alignment via Burrows-Wheeler transformation)
Version: 0.7.12-r1039
I got through a lot threads read answers and so on... still I am confused.
I am performing ChIP or 4C seq. I get single end reads. I would like to get uniquely mapped reads. For the 4C I am trimming the viewpoint.
This is the command I am using:
bwa mem -t 8 -M -k 19 -r 1 -c 1 $ref_genome $1/$base1".trimmed" > $3/$NOEXT1".sam"
and those are the explanation for every string I included:
-c
keep those that have unique alignment (Discard a MEM if it has more than INT occurence in the genome. This is an insensitive parameter. [10000])-r
by reducing it to 1, I get bwa mem to work things out more intensively, like the --best command for bowtie (Larger value yields fewer seeds, which leads to faster alignment speed but lower accuracy)-k
seed length (I thing that 32 is to big????)-M
keep it compatible with picard tools-t
threads to use
What are your opinion on the -c
and -r
strings?
Then I get to see the produced file and it has sequences that are marked with the optional field of XA:Z
and SA:Z
XA
: Alternative hits; format: (chr,pos,CIGAR,NM;)
EDIT: In any case I have failed to find the XT:A:U
optional flag
SA: Z
, not sure what it is but, it almost always coincides with the 256 flag = not primary alignment so I believe that this is NOT a unique aligned read.
The question is mostly on XA:
. There is no XA:U
if it exist it is XA:Z
and is followed by 3 or more genomic coordinates, and that makes me believe that this is not a uniquely mapped read.
To wrap it up I remove from the SAM file with the following:
samtools view -h -q 1 -F 4 -F 256 $3/$NOEXT1".sam" | grep -v "XA:Z" | grep -v "SA:Z" > $3/$NOEXT1"_mem.sam"
-F 4
: remove not mapped-F 256
: remove not primary aligned-q 1
: remove reads with MAPQ quality less than 1 (I thing that this is quite necessary)
I am rephrasing.
Does anyone have any info our input regarding the SA:
and XA:
tags, especially in the absence of the XA:U
tag? Does the removal of those tags makes sense or I am loosing valuable reads?
EDIT:
Here are some reads as an example:
Those include both the SA:
and the XA
optional flag
BIOMICS-HISEQHI:522:HCYM5ADXX:2:2115:14837:88360 16 chr12 87311997 1 34M17S * 0 0 TCTATAAACACCTCTAAGAAAATAAACTAGAAAATCCAGATGAAATGGATA BBBBBBABB>A<B>B>A?7BBB?A=0BBA?A?BBBBBBBA=BBBBBBBB?B NM:i:1 MD:Z:0A33 AS:i:33 XS:i:31 SA:Z:chr2,161581931,+,32M19S,1,0; XA:Z:chr9,-104599379,51M,5;chr3,+170653467,51M,5;
BIOMICS-HISEQHI:522:HCYM5ADXX:2:1102:3525:68438 16 chr12 87311999 0 32M17S * 0 0 TATAAACACCTCTAAGAAAATAAACTAGAAAATCCAGATGAAATGGATA AA==5.DB8BB8=/??*<D99D??9?D<BEBD@FDD????E<E?E<CAC NM:i:0 MD:Z:32 AS:i:32 XS:i:30 SA:Z:chr2,161581931,+,32M17S,0,0; XA:Z:chr9,-104599381,49M,4;chr3,+170653467,49M,4;chr12,+46991828,49M,5;
BIOMICS-HISEQHI:522:HCYM5ADXX:2:1209:6904:71888 256 chr17 59076122 0 33M18H * 0 0 TTTTCCCATTCTGTAGATTGTCTGTTTACTCTG IGGIIIIIIIIGA@GGGHFIIIIIFFHEIHIHF NM:i:0 MD:Z:33 AS:i:33 XS:i:32 SA:Z:chr11,22585338,+,18S33M,0,0; XA:Z:chr4,-151801895,19S32M,0;
(END)
Those got only the XA:
optional flag
BIOMICS-HISEQHI:522:HCYM5ADXX:2:2104:11639:55952 0 chr1 8120580 1 15S31M5S * 0 0 TCTTGGGATGGACATGTGAGCTGAAATGGCGCCATTGCACTCCAGCTTGGG GIIIGCGFGCGHEGFHIICCGIHDD>HB<CHHGAHHIIHHEHHFFFFFEDD NM:i:0 MD:Z:31 AS:i:31 XS:i:29 XA:Z:chr8,+98304845,30M21S,1;chr20,+51943978,23S28M,0;chr16,+70407077,14S37M,2;chr8,+55962122,15S36M,2;chrX,-70173856,26M25S,0;
BIOMICS-HISEQHI:522:HCYM5ADXX:2:2105:16548:32340 0 chr1 8120580 6 15S34M * 0 0 CCTTGGGATGGACATGTGAGCTGAAATGGCGCCATTGCACTCCAGCCTG HHIIHIEIBF1?DHGGDDDF<FB?BGIHI>FGI<EHC@CCHEFE>@BCC NM:i:0 MD:Z:34 AS:i:34 XS:i:30 XA:Z:chr8,+98304845,30M19S,0;chr16,+70407077,14S35M,1;chr8,+55962122,15S34M,1;
BIOMICS-HISEQHI:522:HCYM5ADXX:2:2106:5931:90491 0 chr1 8120580 6 14S32M * 0 0 CTTGGGATGTACATGTGAGCTGAAATGGCGCCATTGCACTCCAGCC @H4EC381)*1*:?B:*?:0:?BD?G*?8'-<-5C4=;;D.?7A>E NM:i:0 MD:Z:32 AS:i:32 XS:i:28 XA:Z:chr16,+70407077,13S33M,1;chr8,+55962122,14S32M,1;
BIOMICS-HISEQHI:522:HCYM5ADXX:2:2108:7261:70911 0 chr1 8120580 0 15S36M * 0 0 TCTTGGGATGGACATGTGAGCTGAAATGGCGCCATTGCACTCCAGTCTGGG :>33@@1(.=?1>99?3.=33.6=3:>)=;533-5=9;;3>9?;?8=?337 NM:i:2 MD:Z:30C3A1 AS:i:30 XS:i:29 XA:Z:chr8,+98304845,30M21S,1;chr1,-8936944,28M23S,0;chr16,+70407077,14S37M,2;chr8,+55962122,15S36M,2;chr10,-95309935,26M25S,0;
BIOMICS-HISEQHI:522:HCYM5ADXX:2:2113:12753:98209 0 chr1 8120580 9 15S36M * 0 0 CCTTGGGATGGACATGTGAGCTGAAATGGCGCCATTGCACTCCAGCCTGAG ><6)<2:86-97;3:?>?==<?>?1=?3>75;?3=?;?;=???;?==>7;> NM:i:0 MD:Z:36 AS:i:36 XS:i:30 XA:Z:chr8,+98304845,30M21S,0;chr16,+70407077,14S37M,2;
Any help would be appreciated
No one? Anyone?
I will probably also post this question over to SeqAnswers
Personally I find your post quite confusing, it's unclear to me what you are asking... To the question "What are your opinion on the
-c
and-r
strings? I haven't played with them but default seems ok for my needs. "Am, I missing something, Do I do something wrong?" hard to tell without more background about your task...what would you think a proper background would be?
experiment?
propose of the analysis?
regarding the "what am I missing" question, it refers mostly to the observation of the existence of the SA:Z and XA:Z, optional flags, although I have followed the recommendations described in nummerous posts in the forums.
I could add that I have single end reads...
Dear Theodore;
I am having equal difficulty filtering for uniquely aligned reads using bwa0.7.10. I simply want to extract all reads that align exactly once to the genome. In STAR, MAPQ is calculated as a function of how many times it can be potentially aligned, so I can simply filter based on that. I'm not sure what strategy is appropriate for bwa, where MAPQ is calculated differently. Would outputing only the highest alignments do the trick? I am hoping you have made progress on this front.
Regards,
Julien
do you have figured out the question to extract all reads that align exactly once to the genome successfully ?
if I am not mistaken the following as noted above should do it
try to map with:
and then from the sam file, before converting it to bam do:
the
-F 256
applies only if the-M
has been used during mappingI hope that it helps