hi, everyone! I am working on illumina pair-end sequencing
After mapping by bwa, I got a pair of reads with MAPQ=0, with both reads mapped to more than one place. But the Template Length is OK, and I find this pair of reads in this sam file only once.
So, should we dump this mapping pair?
SPECIFIC_ID 83 Y 58900709 0 92M = 58900679 -122 AGTGCATTCCATTCCAGTCTCTTCAGTTCGATTCCATTCCATTCGTTTCGATTCCTTTCCATTCCAGCCCATTCCATTCCATTGCATTCCTT DDDCDCDDD DDDDDDDDDDDDEEC?FFFFFHHHHHJJIJIHJIJJJJIJJJJJJIJHGJJIHFJJJIGIJJIHIJIJHJJIJJIJJIFHFFH XT:A:R NM:i:1 SM:i:0 AM:i:0 X0 :i:9 X1:i:16 XM:i:1 XO:i:0 XG:i:0 MD:Z:83C8 :i:9 X1:i:16 XM:i:1 XO:i:0 XG:i:0 MD:Z:83C8
SPECIFIC_ID 163 Y 58900679 0 92M = 58900709 122 AGAACCTTCCATTACACTCCCTTCCATTCCAGTGCATTCCATTCCAGTCTCTTCAGTTCGATTCCATTCCATTCGTTTCGATTCCTTTCCAT FHGHHIIJJ JGJIIJIJJIIJJIIJEIGEGHIIJJIJIGIIIF<fgiijjhjieejijiiijhehehigggghegghjc=ehf@deef?>CD XT:A:R NM:i:0 SM:i:0 AM:i:0 X0 :i:2 X1:i:18 XM:i:0 XO:i:0 XG:i:0 MD:Z:92
Thanks very much!
Thanks Samuel! Usually I will filter sam file by myself, often by MAPQ and Template Length. So you mean if I leave it there, the calling quality would contain mapping quality information?
Yes, in GATK for example you have the RMS of the mapping quality supporting that site in the MQ field. You should subsequently filter your sites according to this (and other metrics :-) )
OK, I got it. Thanks!
I had the same confusion when I started with NGS analysis. I used to filter out unmapped and ambiguosly mapped reads. But then I realised that most of the downstream tools including samtools and GATK make use of mapping quality and other features like concordancy in the mapping of paired reads to make sure that the SNPs and Indels generated are from reads that are confidently placed on the genome. Moreover, these anomalous reads will be important as they are used by some programs to identify large indels. For example, Pindel uses read pairs where one read was confidently aligned but the other read couldnt be aligned. It then takes the other unaligned read from the BAM file and try to look for insertion and deletion. So at any point you should have all these information in the bam file. I dont filter out anything from the BAM file.
Thanks ashutoshmits! Base on your statement, we should not filter bam by MAPQ. But how about Template Length? We know that reads length should be within some range based on Gel electrophoresis.
I have never seen someone taking care of he template length. I dont mean that what you are trying to do is wrong but I assume if you are using standard platform for sequencing including SOLiD , Illumina then you normally get reads with specified size. Even if there are junk reads they wont get aligned with higher confidence or at all. BTW, I just read your original question. You can throw the reads in case both the reads from the same pair are not aligned. They wont be useful at all but there is no problem in keeping them unless you want to reduce your BAM size little bit.
I see. Thanks a lot!