I'm comparing using STAMPY vs BWA to map some reads to a reference genome.
Now that I've generated two sam files how do I tell how well the mapping did?
I'm comparing using STAMPY vs BWA to map some reads to a reference genome.
Now that I've generated two sam files how do I tell how well the mapping did?
Convert the .sam to .bam with samtools view. .sams are huge. samtools flagstat will take a .bam, and tell you how many reads mapped total.
One can generate some pretty useful stats using Picard tools:
http://picard.sourceforge.net/picard-metric-definitions.shtml#AlignmentSummaryMetrics
http://picard.sourceforge.net/picard-metric-definitions.shtml#DuplicationMetrics
If this is a capture experiment, you can also try:
http://picard.sourceforge.net/picard-metric-definitions.shtml#HsMetrics
e.g. you can count the mapped reads and use the percentage of mapped reads as an indicator of how well the mapping tool performed.
In order to assure, you don't have an M somewhere else (read-ID, mitochondrial chromosome, ...), or even reads mapping to multiple positions in the genome (both would basically destroy your statistics), I recommend to do something like this:
perl -F'\t' -ane 'if($F[5]=~/M/){print "$F[0]\n";};' file.sam | sort | uniq | wc
short explanation: if the 6th column contains a 'M' (a match exists), print the first column (the read-ID). then sort and uniq the read-IDs, to assure you count multiple mapped reads only once. and then the word-count as ever.
Use grep to find reads that match to the reference. Generally speaking, only lines with matches contain 'M' at position 6 in each line . Use cut -f 6 file.sam| grep M | wc -l
In order to assure, you don't have an M somewhere else (read-ID, mitochondrial chromosome, ...), or even reads mapping to multiple positions in the genome (both would basically destroy your statistics), I recommend to do something like this:
perl -F't' -ane 'if($F[5]=~/M/){print "$F[0]n";};' file.sam | sort | uniq | wc -l
short explanation: if the 6th column contains a 'M' (a match exists), print the first column (the read-ID). then sort and uniq the read-IDs, to assure you count multiple mapped reads only once. and then the word-count as ever.
In order to assure, you don't have an M somewhere else (read-ID, mitochondrial chromosome, ...), or even reads mapping to multiple positions in the genome (both would basically destroy your statistics), I recommend to do something like this:
perl -F'\t' -ane 'if($F[5]=~/M/){print "$F[0]\n";};' file.sam | sort | uniq | wc -l
short explanation: if the 6th column contains a 'M' (a match exists), print the first column (the read-ID). then sort and uniq the read-IDs, to assure you count multiple mapped reads only once. and then the word-count as ever.
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
I think that's probably true. You could use awk to filter out reads that have mapping quality of 0 from your .sam. I'm not sure that you'd expect a great deal of difference in how many reads have mapping quality of 0 between two aligners, since that's mostly about what your reference sequence is, and how long your reads are.
I fear samtools flagstat counts reads mapping multiple times not unique. Correct me, if I'm wrong.
Well...even a small difference that (might) happen, are differences, right? ;)
Well...even small differences that (might) happen, are differences, right? ;)
Quick Question: Does bam file need to be sorted in order to get the correct flagstat file?
The sorting does not have an effect on flagstat, I do not believe.
I would recommend you to check it. Run flagstat on both files and compare the results.
Just confirmed.BAM file does not need to be sorted for flagstat!