Question

How To Tell How Well A Sam File Was Mapped

2

Entering edit mode

13.4 years ago

Applet ▴ 150

I'm comparing using STAMPY vs BWA to map some reads to a reference genome.

Now that I've generated two sam files how do I tell how well the mapping did?

bwa mapping sam samtools • 10k views

ADD COMMENT • link updated 10.3 years ago by Biostar 20 • written 13.4 years ago by Applet ▴ 150

score 4 · Answer 1 · 2012-03-05

4

Entering edit mode

13.4 years ago

Swbarnes2 ★ 1.6k

Convert the .sam to .bam with samtools view. .sams are huge. samtools flagstat will take a .bam, and tell you how many reads mapped total.

ADD COMMENT • link 13.4 years ago by Swbarnes2 ★ 1.6k

1

Entering edit mode

I think that's probably true. You could use awk to filter out reads that have mapping quality of 0 from your .sam. I'm not sure that you'd expect a great deal of difference in how many reads have mapping quality of 0 between two aligners, since that's mostly about what your reference sequence is, and how long your reads are.

ADD REPLY • link 13.4 years ago by Swbarnes2 ★ 1.6k

0

Entering edit mode

I fear samtools flagstat counts reads mapping multiple times not unique. Correct me, if I'm wrong.

ADD REPLY • link 13.4 years ago by David Langenberger 11k

0

Entering edit mode

Well...even a small difference that (might) happen, are differences, right? ;)

ADD REPLY • link 13.4 years ago by David Langenberger 11k

0

Entering edit mode

Well...even small differences that (might) happen, are differences, right? ;)

ADD REPLY • link 13.4 years ago by David Langenberger 11k

0

Entering edit mode

Quick Question: Does bam file need to be sorted in order to get the correct flagstat file?

ADD REPLY • link 13.3 years ago by Alby ▴ 90

0

Entering edit mode

The sorting does not have an effect on flagstat, I do not believe.

ADD REPLY • link 13.3 years ago by Sean Davis 27k

0

Entering edit mode

I would recommend you to check it. Run flagstat on both files and compare the results.

ADD REPLY • link 13.3 years ago by David Langenberger 11k

0

Entering edit mode

Just confirmed.BAM file does not need to be sorted for flagstat!

ADD REPLY • link 13.3 years ago by Alby ▴ 90

Ram · Answer 2 · 2012-03-06

3

Entering edit mode

13.4 years ago

Sean Davis 27k

One can generate some pretty useful stats using Picard tools:

http://picard.sourceforge.net/picard-metric-definitions.shtml#AlignmentSummaryMetrics

http://picard.sourceforge.net/picard-metric-definitions.shtml#DuplicationMetrics

If this is a capture experiment, you can also try:

http://picard.sourceforge.net/picard-metric-definitions.shtml#HsMetrics

ADD COMMENT • link updated 5.9 years ago by Ram 45k • written 13.4 years ago by Sean Davis 27k

score 2 · Answer 3 · 2012-03-05

2

Entering edit mode

13.4 years ago

David Langenberger 11k

e.g. you can count the mapped reads and use the percentage of mapped reads as an indicator of how well the mapping tool performed.

ADD COMMENT • link 13.4 years ago by David Langenberger 11k

score 2 · Answer 4 · 2012-03-05

In order to assure, you don't have an M somewhere else (read-ID, mitochondrial chromosome, ...), or even reads mapping to multiple positions in the genome (both would basically destroy your statistics), I recommend to do something like this:

perl -F'\t' -ane 'if($F[5]=~/M/){print "$F[0]\n";};' file.sam | sort | uniq | wc

short explanation: if the 6th column contains a 'M' (a match exists), print the first column (the read-ID). then sort and uniq the read-IDs, to assure you count multiple mapped reads only once. and then the word-count as ever.

score 0 · Answer 5 · 2012-03-05

0

Entering edit mode

13.4 years ago

Thomasvangurp ▴ 10

Use grep to find reads that match to the reference. Generally speaking, only lines with matches contain 'M' at position 6 in each line . Use cut -f 6 file.sam| grep M | wc -l

ADD COMMENT • link 13.4 years ago by Thomasvangurp ▴ 10

0

Entering edit mode

In order to assure, you don't have an M somewhere else (read-ID, mitochondrial chromosome, ...), or even reads mapping to multiple positions in the genome (both would basically destroy your statistics), I recommend to do something like this:

perl -F't' -ane 'if($F[5]=~/M/){print "$F[0]n";};' file.sam | sort | uniq | wc -l

short explanation: if the 6th column contains a 'M' (a match exists), print the first column (the read-ID). then sort and uniq the read-IDs, to assure you count multiple mapped reads only once. and then the word-count as ever.

ADD REPLY • link 13.4 years ago by David Langenberger 11k

0

Entering edit mode

In order to assure, you don't have an M somewhere else (read-ID, mitochondrial chromosome, ...), or even reads mapping to multiple positions in the genome (both would basically destroy your statistics), I recommend to do something like this:

perl -F'\t' -ane 'if($F[5]=~/M/){print "$F[0]\n";};' file.sam | sort | uniq | wc -l

short explanation: if the 6th column contains a 'M' (a match exists), print the first column (the read-ID). then sort and uniq the read-IDs, to assure you count multiple mapped reads only once. and then the word-count as ever.

ADD REPLY • link 13.4 years ago by David Langenberger 11k

0

Entering edit mode

That doesn't work if you have an M somewhere else, e.g. in your read name

ADD REPLY • link 13.4 years ago by Christof Winter ★ 1.1k

0

Entering edit mode

@Christof thanks. Edited the response

ADD REPLY • link 13.4 years ago by Thomasvangurp ▴ 10