I have a related question about SAM file tags, specifically XM and NM. My SAM file contains both tags. If I want to keep only reads with 0 mismatches, which tag should I filter on: XM or NM?
Also, do the tags written to a SAM file vary between aligners?
I ended up using the following command to filter the reads:
awk '/XM:i:0[\t$]/ {print}' input.sam >output.sam
Now I am trying to convert output.sam to a BAM file:
samtools view -S output.sam -b -o output.bam
but it's throwing the following message:
[sam_header_line_parse] expected '@XY', got [@HD VN:1.0 SO:unsorted]
Hint: The header tags must be tab-separated.
[samopen] no @SQ lines in the header.
What should I do to make the header tags tab-separated?
NM is actually the edit distance. If the OP is actually interested in mismatches rather than edit distance, Reformat from the BBMap package has an option for that, the "subfilter" flag. Reformat doesn't currently filter on MAPQ, though; I'll add that.
How do you define 'uniquely mapped'? The default alignment mode of bwa reports the best alignment with a mapping quality score which, according to the SAM spec, is a phred-scaled score just like base quality scores. Filtering out reads that map to multiple locations is what
record.getMappingQuality()>=5
is doing (although personally, I use a more conservative MAPQ threshold of 10 instead of 5).
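For anyone working on the command line rather than in Java, here is a minimal awk sketch of the same MAPQ threshold. It assumes a plain-text, tab-separated SAM file with MAPQ in column 5; `filter_mapq` is a made-up helper name and the demo records are invented. If samtools is available, `samtools view -q 10` applies the same kind of cutoff.

```shell
# Keep header lines plus alignments whose MAPQ (column 5) is >= the threshold.
# filter_mapq is a hypothetical helper; usage: filter_mapq MIN_MAPQ < in.sam > out.sam
filter_mapq() {
    awk -v min="$1" -F'\t' '/^@/ || $5 >= min'
}

# Tiny invented SAM file: one header line and three alignments (MAPQ 60, 0, 7).
mapq_demo=$(mktemp)
printf '@SQ\tSN:chr1\tLN:1000\n' > "$mapq_demo"
printf 'r1\t0\tchr1\t10\t60\t4M\t*\t0\t0\tACGT\tIIII\n' >> "$mapq_demo"
printf 'r2\t0\tchr1\t20\t0\t4M\t*\t0\t0\tACGT\tIIII\n' >> "$mapq_demo"
printf 'r3\t0\tchr1\t30\t7\t4M\t*\t0\t0\tACGT\tIIII\n' >> "$mapq_demo"

# With a threshold of 10 only the header and r1 are printed.
filter_mapq 10 < "$mapq_demo"
```

Keeping the `@` lines in the output means the filtered file can still be converted to BAM afterwards.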
I have a related question about SAM file tags, specifically XM and NM. My SAM file contains both tags. If I want to keep only reads with 0 mismatches, which tag should I filter on: XM or NM?
Also, do the tags written to a SAM file vary between aligners?
Aligners are free to include (or not include) any tags they like. Lowercase tags and tags starting with X, Y or Z are aligner-specific, so two aligners could both write an XM tag and give it completely different meanings.
filter my file with 0 mismatches
That depends on exactly what you want. If an alignment has 2 inserted bases, but all the aligned bases match perfectly (e.g. a CIGAR of 50M2I50M), do you want to include these reads? If so (and you are using bwa), you want the XM (and possibly the XN) tags. If you want to exclude reads that have any indels in their alignment, then you want the NM tag, as that counts insertions and deletions as well as mismatches between aligned bases.
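To make the difference concrete, here is a small awk sketch that filters on either tag. It assumes bwa-style tags as described above; `keep_tag_zero` is a hypothetical helper name and the three demo alignments are invented (r1 is a perfect match, r2 has one inserted base but no mismatches, r3 has one mismatch).

```shell
# keep_tag_zero TAG: keep header lines and alignments carrying TAG:i:0.
# Optional fields start at column 12, so each one is compared exactly
# rather than searching the whole line with a regex.
keep_tag_zero() {
    awk -v tag="$1" -F'\t' '
        /^@/ { print; next }
        {
            want = tag ":i:0"
            for (i = 12; i <= NF; i++)
                if ($i == want) { print; next }
        }'
}

# Invented demo records; the tag values follow the bwa meaning discussed above.
tag_demo=$(mktemp)
printf '@SQ\tSN:chr1\tLN:1000\n' > "$tag_demo"
printf 'r1\t0\tchr1\t10\t60\t5M\t*\t0\t0\tACGTA\tIIIII\tNM:i:0\tXM:i:0\n' >> "$tag_demo"
printf 'r2\t0\tchr1\t20\t60\t2M1I2M\t*\t0\t0\tACGTA\tIIIII\tNM:i:1\tXM:i:0\n' >> "$tag_demo"
printf 'r3\t0\tchr1\t30\t60\t5M\t*\t0\t0\tACGTA\tIIIII\tNM:i:1\tXM:i:1\n' >> "$tag_demo"

keep_tag_zero XM < "$tag_demo"   # header, r1 and r2 (the insertion is tolerated)
keep_tag_zero NM < "$tag_demo"   # header and r1 only
```

Filtering on NM drops r2 because the inserted base adds to the edit distance even though no aligned base mismatches.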
Lowercase tags and tags starting with X, Y or Z are aligner-specific, so two aligners could both write an XM tag and give it completely different meanings.
I had to double-check that; somehow I never noticed that lowercase tags were allowed. Good to know!
Hint:
grep -w
(not that I find this to be a good approach to begin with).
I have a related question about SAM file tags, specifically XM and NM. My SAM file contains both tags. If I want to keep only reads with 0 mismatches, which tag should I filter on: XM or NM?
Also, do the tags written to a SAM file vary between aligners?
I ended up using the following command to filter the reads:
awk '/XM:i:0[\t$]/ {print}' input.sam >output.sam
Now I am trying to convert output.sam to a BAM file:
samtools view -S output.sam -b -o output.bam
but it's throwing the following message:
[sam_header_line_parse] expected '@XY', got [@HD VN:1.0 SO:unsorted]
Hint: The header tags must be tab-separated.
[samopen] no @SQ lines in the header.
What should I do to make the header tags tab-separated?
You need to include the SAM header in your output file if you want to convert it to BAM format. Something like
should do the trick.
I used the following and it worked as well:
awk 'substr ($0,1,1) == "@" || /XM:i:0[\t$]/ {print}' input.sam >filter.sam
where substr($0,1,1) == "@" prints the header lines to the output.
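One caveat with the pattern as posted: inside a bracket expression `$` is a literal dollar sign, not an end-of-line anchor, so `[\t$]` only matches when XM:i:0 is followed by a tab. A read whose last field is XM:i:0 would be dropped. The sketch below shows this on an invented three-read file and compares it with an alternation, `(\t|$)`; the file names are temporary and made up for the demo.

```shell
# Invented demo file: r1 has XM:i:0 as its last field, r2 has it followed by
# another tag, r3 has two mismatches.
xm_demo=$(mktemp)
printf '@HD\tVN:1.0\tSO:unsorted\n' > "$xm_demo"
printf '@SQ\tSN:chr1\tLN:1000\n' >> "$xm_demo"
printf 'r1\t0\tchr1\t10\t60\t4M\t*\t0\t0\tACGT\tIIII\tNM:i:0\tXM:i:0\n' >> "$xm_demo"
printf 'r2\t0\tchr1\t20\t60\t4M\t*\t0\t0\tACGT\tIIII\tXM:i:0\tNM:i:0\n' >> "$xm_demo"
printf 'r3\t0\tchr1\t30\t60\t4M\t*\t0\t0\tACGT\tIIII\tXM:i:2\tNM:i:2\n' >> "$xm_demo"

posted_out=$(mktemp)
anchored_out=$(mktemp)

# Pattern as posted: "$" is literal inside [...], so r1 is not matched.
awk 'substr($0,1,1) == "@" || /XM:i:0[\t$]/' "$xm_demo" > "$posted_out"

# Alternation: XM:i:0 followed by a tab OR by the end of the line.
awk '/^@/ || /XM:i:0(\t|$)/' "$xm_demo" > "$anchored_out"

grep -vc '^@' "$anchored_out"   # number of alignments kept by the second form
```

Both forms keep the `@` header lines, which is what the BAM conversion needs.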