SAM/BAM file filtering
4
0
Entering edit mode
8.0 years ago
#### ▴ 220

I want to filter out reads from SAM & BAM file separately which are :

1.uniquely mapping &

2,with 0 mismatches

For SAM files I have tried this :

awk '{for(i=1;i<=NF;i++) if($i~/NH:i:1\t/ || $i~/NH:i:1$/) print}' reads.sam >reads_filter.sam

But it didn't work.

bam • 6.7k views
ADD COMMENT
0
Entering edit mode

Hint: grep -w (not that I find this to be a good approach to begin with).

ADD REPLY
0
Entering edit mode

I have another related query which is on sam file tags . XM & NM , my sam file contains both the tags and if I want to filter my file with 0 mismatches then which tag should I consider ? XM or NM?

Also, the tags mentioned in SAM file obtained from different aligners varies ??

ADD REPLY
0
Entering edit mode

I have finally used following command to filter the reads. awk '/XM:i:0[\t$]/ {print}' input.sam >output.sam

Now I am trying to convert the output.sam to bam file samtools view -S output.sam -b -o output.bam

but its throwing following message:

[sam_header_line_parse] expected '@XY', got [@HD VN:1.0 SO:unsorted] Hint: The header tags must be tab-separated. [samopen] no @SQ lines in the header.

what should I do to separate the tags based on tab ?

ADD REPLY
2
Entering edit mode

You need to include the SAM header in your output file if you want to convert it to BAM format. Something like

samtools view -H  input.sam > output.sam && awk '/XM:i:0[\t$]/ {print}' input.sam >> output.sam

should do the trick.

ADD REPLY
0
Entering edit mode

I used the following and it worked as well:

awk 'substr ($0,1,1) == "@" || /XM:i:0[\t$]/ {print}' input.sam >filter.sam

where : substr ($0,1,1) == "@ print header in the output

ADD REPLY
2
Entering edit mode
8.0 years ago
Tonor ▴ 480

Uniquely mapped reads: Extract uniquely mapped reads from bam file

ADD COMMENT
1
Entering edit mode
8.0 years ago
d-cameron ★ 2.9k

The NH tag is the number of hits.You are looking for NM tag which is the Number of Mismatches

http://samtools.github.io/hts-specs/SAMtags.pdf

ADD COMMENT
0
Entering edit mode

NM is actually the edit distance. If the OP is actually interested in mismatches rather than edit distance, Reformat from the BBMap package has an option for that, the "subfilter" flag. Reformat doesn't currently filter on MAPQ, though; I'll add that.

ADD REPLY
1
Entering edit mode
8.0 years ago

using samjs: https://github.com/lindenb/jvarkit/wiki/SamJS

 java -jar dist/samjs.jar -e 'record.getAttribute("NM")!=null && record.getAttribute("NM").intValue()==0 && record.getMappingQuality()>=5'  in.bam
ADD COMMENT
0
Entering edit mode

this tool has moved to Picard and known as FilterSamReads but I am not sure how to filter uniquely mapped reads using this

ADD REPLY
0
Entering edit mode

How do you define 'uniquely mapped'? The default alignment mode of bwa reports the best alignment with a mapping quality score which, according to the SAM spec, is a phred-scaled score just like base quality scores. Filtering out reads that map to multiple locations is what

record.getMappingQuality()>=5

is doing (although personally, I use a more conservative threshold of mapq of 10 instead of 5).

ADD REPLY
0
Entering edit mode

I am using botwtie2 aligner

ADD REPLY
0
Entering edit mode
8.0 years ago
#### ▴ 220

I have another related query which is on sam file tags . XM & NM , my sam file contains both the tags and if I want to filter my file with 0 mismatches then which tag should I consider ? XM or NM?

Also, the tags mentioned in SAM file obtained from different aligners varies ??

ADD COMMENT
2
Entering edit mode

Aligners are free to include (or not include) any tags they like. Lowercase tags and tags starting with X, Y or Z are aligner-specific tags and two aligners would write the same XM tag and give them completely different meanings.

filter my file with 0 mismatches

That depends on exactly what you want. If an alignment has 2 inserted bases, but all the aligned bases match perfectly, (eg CIGAR of 50M2I50M) do you want to include these reads? If so (and you are using bwa), you want the XM (and possibly the XN) tags. If you want to exclude reads that have any indels in their alignment, then you want the NM tag as that counts insertions and deletions as well as mismatches between aligned bases.

ADD REPLY
0
Entering edit mode

Lowercase tags and tags starting with X, Y or Z are aligner-specific tags and two aligners would write the same XM tag and give them completely different meanings.

I had to double-check that; somehow I never noticed that lowercase tags were allowed. Good to know!

ADD REPLY

Login before adding your answer.

Traffic: 2465 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6