I'm trying to filter out reads that were primarily aligned to one genomic region (if a read has secondary alignment I want to keep the primary). I've tried to do this with two methods- (i) filter out reads with MAPQ=0, which should have been aligned to more than one region. (ii) read with XA tag which should indicate alternative hits. However, there are read with MAPQ 0 but without XA tag (example 1) and reads with MAPQ>0 with XA tag.
Example 1- MAPQ 0 but without XA tag SRR1338241.3 83 chr5 11625 0 76M = 11507 -194 CCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCT #####################################?8:8:BD?:EC8)9DEF?AB?<a++28ddd??d3dd?<? nm:i:0="" md:z:76="" as:i:76="" xs:i:76<="" p="">
Example 2- MAPQ>0 with XA tag SRR1338241.239 99 chr1 14643 25 76M = 14706 139 GTCAGAGCAATGGCCCAAGTCTGGGTCTGGGGGGGAAGGTGTCATGGAGCCCCCTACGATTCCCAGTCGTCCTCGT C@CFFFFFHHHHHIIIJJJFHIJJJEGGJJJJJDBDDDD8@CDDEDDDDCBDDDDDDDDDDDEDDDDDDBDDDDDD NM:i:1 MD:Z:10C65 AS:i:71 XS:i:71 XA:Z:chr9,+14754,76M,1;chr15,-102516447,76M,2;chr16,+64329,13M1I62M,2;chr12,-90907,76M,3;chr2,-114356295,76M,3;
Does anyone has any explanation to this discrepancy? or any other suggestion to do it?
Can you give your command line you used ?
For mapping? bwa-0.7.12/bwa mem -t 16 hg19.bwa_index/hg19 file_1.fastq file_2.fastq > file.sam 2> file.log Thanks
Thank you , so your 2 examples of reads are examples after filtering methods ? Haw did you filter MAPQ 0 and XA ? Did you used command line (like this in this post BWA mem how to know if a read is mapped uniquely? ) ?
Thanks. No its before filtering. My problem is not that I have technical problem to filter it out (which can be done easily by awk or grep), my problem is that I'm actually don't understand how it make sense that some reads have MAPQ 0 without XA tag and others have XA with MAPQ greater than 0. So I believe that I'm missing something here, and maybe there is easier way to filter these out.
If i remember correctly your second case ( XA with MAPQ greater than 0 ) you can detect homologous genes with. For the MAPQ 0 in general it's low complexity region like in your example. So it's interesting if you work on this type of problematic ? (May be i m wrong ^^ )
Maybe, but since I have to know that my read were mapped correctly I have to filter these cases out.
Hello, you may want to take a look here: Bwa Mapping Qualities Of 0
This also makes for interesting reading: More madness with MAPQ scores (a.k.a. why bioinformaticians hate poor and incomplete software documentation)