Hi,
This question has probably been asked many times but it is still not very clear to me from what I read online.
For example I have two different outputs from samtools flagstat. Comparing these two samples, which one is better or worse? or what makes it good or bad? what i should look for in the flagstat output?
First One
137147 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
137147 + 0 mapped (100.00%:nan%)
137147 + 0 paired in sequencing
66552 + 0 read1
70595 + 0 read2
44296 + 0 properly paired (32.30%:nan%)
57734 + 0 with itself and mate mapped
79413 + 0 singletons (57.90%:nan%)
14598 + 0 with mate mapped to a different chr
668 + 0 with mate mapped to a different chr (mapQ>=5)
Second one:
120672 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
120672 + 0 mapped (100.00%:nan%)
120672 + 0 paired in sequencing
59601 + 0 read1
61071 + 0 read2
38796 + 0 properly paired (32.15%:nan%)
47462 + 0 with itself and mate mapped
73210 + 0 singletons (60.67%:nan%)
10266 + 0 with mate mapped to a different chr
508 + 0 with mate mapped to a different chr (mapQ>=5)
Furthermore, tophat alignment summary.txt file made me even more confusing. It sounds like the quality of the mapping is very bad?
Left reads:
Input: 3118050
Mapped: 24264 ( 0.8% of input)
of these: 12910 (53.2%) have multiple alignments (1 have >1000)
Right reads:
Input: 3118050
Mapped: 23492 ( 0.8% of input)
of these: 13094 (55.7%) have multiple alignments (2 have >1000)
0.8% overall read alignment rate.
Aligned pairs: 9112
of these: 4163 (45.7%) have multiple alignments
and: 606 ( 6.7%) are discordant alignments
0.3% concordant pair alignment rate.
Those are both pretty bad, assuming you're not expecting a lot of rearrangements. Did you not rRNA deplete the samples? Also, did you trim these prior to alignment?
Only trimming I did was to remove calls whose Phred score is lower than 20. I also allowed the aligner to find the multi-hits as i am mainly interested in finding repeats.
And how did you decide that this is bad? Is it because of the properly paired rate is too low?
What can i do to improve this?
Since you mentioned elsewhere using tophat2, I assumed you were looking for differentially expressed genes (the alignment in that case should produce fewer multimappers). If you're looking for repeats, then those are probably much better results (the two files from which you pasted output from flagstat are largely equivalent).
If by allowing the aligner to find multi-hits you mean by using the -g option, I am afraid that, at least with some TopHat2 versions, you are not going to change your "overall read alignment rate" in align_summary file. If you want to get more new alignments maybe you should try with the bowtie or fix_mapping_ordering parameters (minscore, read_edit_distance, and so on)