Question

SAM/BAM file filtering

0

Entering edit mode

8.0 years ago

#### ▴ 220

I want to filter out reads from SAM & BAM file separately which are :

1.uniquely mapping &

2,with 0 mismatches

For SAM files I have tried this :

awk '{for(i=1;i<=NF;i++) if($i~/NH:i:1\t/ || $i~/NH:i:1$/) print}' reads.sam >reads_filter.sam

But it didn't work.

bam • 6.7k views

ADD COMMENT • link updated 8.0 years ago by d-cameron ★ 2.9k • written 8.0 years ago by #### ▴ 220

0

Entering edit mode

Hint: grep -w (not that I find this to be a good approach to begin with).

ADD REPLY • link 8.0 years ago by Devon Ryan 104k

0

Entering edit mode

I have another related query which is on sam file tags . XM & NM , my sam file contains both the tags and if I want to filter my file with 0 mismatches then which tag should I consider ? XM or NM?

Also, the tags mentioned in SAM file obtained from different aligners varies ??

ADD REPLY • link 8.0 years ago by #### ▴ 220

0

Entering edit mode

I have finally used following command to filter the reads. awk '/XM:i:0[\t$]/ {print}' input.sam >output.sam

Now I am trying to convert the output.sam to bam file samtools view -S output.sam -b -o output.bam

but its throwing following message:

[sam_header_line_parse] expected '@XY', got [@HD VN:1.0 SO:unsorted] Hint: The header tags must be tab-separated. [samopen] no @SQ lines in the header.

what should I do to separate the tags based on tab ?

ADD REPLY • link 8.0 years ago by #### ▴ 220

2

Entering edit mode

You need to include the SAM header in your output file if you want to convert it to BAM format. Something like

samtools view -H  input.sam > output.sam && awk '/XM:i:0[\t$]/ {print}' input.sam >> output.sam

should do the trick.

ADD REPLY • link 8.0 years ago by d-cameron ★ 2.9k

0

Entering edit mode

I used the following and it worked as well:

awk 'substr ($0,1,1) == "@" || /XM:i:0[\t$]/ {print}' input.sam >filter.sam

where : substr ($0,1,1) == "@ print header in the output

ADD REPLY • link 8.0 years ago by #### ▴ 220

score 2 · Answer 1 · 2016-11-26

2

Entering edit mode

8.0 years ago

Tonor ▴ 480

Uniquely mapped reads: Extract uniquely mapped reads from bam file

ADD COMMENT • link 8.0 years ago by Tonor ▴ 480

score 1 · Answer 2 · 2016-11-26

1

Entering edit mode

8.0 years ago

d-cameron ★ 2.9k

The NH tag is the number of hits.You are looking for NM tag which is the Number of Mismatches

http://samtools.github.io/hts-specs/SAMtags.pdf

ADD COMMENT • link 8.0 years ago by d-cameron ★ 2.9k

0

Entering edit mode

NM is actually the edit distance. If the OP is actually interested in mismatches rather than edit distance, Reformat from the BBMap package has an option for that, the "subfilter" flag. Reformat doesn't currently filter on MAPQ, though; I'll add that.

ADD REPLY • link 8.0 years ago by Brian Bushnell 20k

score 1 · Answer 3 · 2016-11-26

1

Entering edit mode

8.0 years ago

Pierre Lindenbaum 164k

using samjs: https://github.com/lindenb/jvarkit/wiki/SamJS

 java -jar dist/samjs.jar -e 'record.getAttribute("NM")!=null && record.getAttribute("NM").intValue()==0 && record.getMappingQuality()>=5'  in.bam

ADD COMMENT • link 8.0 years ago by Pierre Lindenbaum 164k

0

Entering edit mode

this tool has moved to Picard and known as FilterSamReads but I am not sure how to filter uniquely mapped reads using this

ADD REPLY • link 8.0 years ago by #### ▴ 220

0

Entering edit mode

How do you define 'uniquely mapped'? The default alignment mode of bwa reports the best alignment with a mapping quality score which, according to the SAM spec, is a phred-scaled score just like base quality scores. Filtering out reads that map to multiple locations is what

record.getMappingQuality()>=5

is doing (although personally, I use a more conservative threshold of mapq of 10 instead of 5).

ADD REPLY • link 8.0 years ago by d-cameron ★ 2.9k

0

Entering edit mode

I am using botwtie2 aligner

ADD REPLY • link 8.0 years ago by #### ▴ 220

score 0 · Answer 4 · 2016-11-27

0

Entering edit mode

8.0 years ago

#### ▴ 220

I have another related query which is on sam file tags . XM & NM , my sam file contains both the tags and if I want to filter my file with 0 mismatches then which tag should I consider ? XM or NM?

Also, the tags mentioned in SAM file obtained from different aligners varies ??

ADD COMMENT • link 8.0 years ago by #### ▴ 220

2

Entering edit mode

Aligners are free to include (or not include) any tags they like. Lowercase tags and tags starting with X, Y or Z are aligner-specific tags and two aligners would write the same XM tag and give them completely different meanings.

filter my file with 0 mismatches

That depends on exactly what you want. If an alignment has 2 inserted bases, but all the aligned bases match perfectly, (eg CIGAR of 50M2I50M) do you want to include these reads? If so (and you are using bwa), you want the XM (and possibly the XN) tags. If you want to exclude reads that have any indels in their alignment, then you want the NM tag as that counts insertions and deletions as well as mismatches between aligned bases.

ADD REPLY • link 8.0 years ago by d-cameron ★ 2.9k

0

Entering edit mode

Lowercase tags and tags starting with X, Y or Z are aligner-specific tags and two aligners would write the same XM tag and give them completely different meanings.

I had to double-check that; somehow I never noticed that lowercase tags were allowed. Good to know!

ADD REPLY • link 8.0 years ago by Brian Bushnell 20k