Hi,
I am new to NGS analysis. Recently, I aligned Illumina paired-end short reads against the mouse reference genome using BWA. I only used the trimming option (-q 15) for BWA and left the other parameters unchanged. Now I have SAM files with many different FLAG values in the second column. I know the FLAG tells us about the orientation of the paired reads and whether both reads of a pair were mapped. Also, the fifth column (MAPQ) gives the mapping quality score. I have the following questions:
1) Can I directly use these BAM (SAM) files, which still contain unmapped reads, low mapping quality reads and improperly placed read pairs, for variant calling (SNPs, indels) with tools such as samtools, GATK, etc.? Will these programs automatically discard or ignore irrelevant alignments, such as two reads of the same pair mapped to different chromosomes (possible in a few cancer samples, but not in the one I am analyzing)? Or does one have to write a script to extract the read pairs that are confidently and properly mapped and use only those for further analysis?
2) If the second case above is true, then:
a) Should I only consider read pairs where both reads are mapped and the pair follows the insert size distribution?
b) There could be cases where one read of the pair is confidently mapped (assume MAPQ >= 20) while the other falls into a repetitive region of the genome and is not mapped, or is not mapped for some other reason. I assume I should keep the mapped read in this case, otherwise I may lose a lot of reads. Is that right?
In general, you should check the input requirements of each tool you plan to use. The rule of thumb is that there is no universal set of default filtering options: one man's trash is another's treasure. For example, improperly paired reads could also indicate inversions or other structural variations.
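If you do end up writing your own filter, here is a minimal Python sketch of how the FLAG and MAPQ columns could be used to sort alignments into the categories discussed above. The bit values come from the SAM format specification. The function names, the category labels, the MAPQ cutoff of 20 and the example records are my own assumptions for illustration, not something any variant caller requires.

```python
# SAM FLAG bits, as defined in the SAM format specification.
PAIRED = 0x1         # read is part of a pair
PROPER_PAIR = 0x2    # aligner considers the pair properly aligned
UNMAPPED = 0x4       # this read is unmapped
MATE_UNMAPPED = 0x8  # the mate is unmapped


def parse_sam_line(line):
    """Return (flag, rname, mapq, rnext) from one SAM alignment line."""
    fields = line.rstrip("\n").split("\t")
    return int(fields[1]), fields[2], int(fields[4]), fields[6]


def classify(line, min_mapq=20):
    """Assign one alignment to a category (labels are illustrative only)."""
    flag, rname, mapq, rnext = parse_sam_line(line)
    if flag & UNMAPPED:
        return "unmapped"
    if mapq < min_mapq:
        return "low_mapq"
    if not flag & PAIRED:
        return "unpaired"
    if flag & MATE_UNMAPPED:
        return "mate_unmapped"
    # RNEXT is "=" when the mate is on the same reference, "*" when unknown.
    if rnext not in ("=", "*", rname):
        return "mate_on_other_chromosome"
    if flag & PROPER_PAIR:
        return "proper_pair"
    return "improper_pair"


# Hypothetical records (SEQ and QUAL replaced by "*" to keep them short).
records = [
    ["r1", "99", "chr1", "100", "60", "50M", "=", "300", "250", "*", "*"],
    ["r2", "73", "chr1", "500", "37", "50M", "=", "500", "0", "*", "*"],
    ["r3", "133", "chr1", "500", "0", "*", "=", "500", "0", "*", "*"],
    ["r4", "65", "chr1", "900", "60", "50M", "chr7", "1200", "0", "*", "*"],
    ["r5", "99", "chr1", "100", "5", "50M", "=", "300", "250", "*", "*"],
]
lines = ["\t".join(r) for r in records]
labels = [classify(line) for line in lines]

for record, label in zip(records, labels):
    print(record[0], label)
```

Which categories you keep depends on the downstream tool and the question you are asking. For SNP and indel calling you might keep "proper_pair" and "mate_unmapped", while for structural variation the "improper_pair" and "mate_on_other_chromosome" reads are the interesting ones. For real files, the -f, -F and -q options of samtools view do the same kind of FLAG and MAPQ selection without a custom script.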