Efficiently extract alignments with NH:i:1 field
3
2
Entering edit mode
9.4 years ago
rubic ▴ 270

Pretty basic question.

I have bam files in which the only information that indicates whether an alignment is unique is it's NH field. I'm interested in keeping only unique alignments, hence only alignments with NH:i:1 field. Can samtools do this? If not is there any other efficient way other than using grep?

Thanks

alignment RNA-Seq • 13k views
ADD COMMENT
0
Entering edit mode

Grep is pretty darn efficient. I can't think of a faster way to look at each read than to look at each read. Fundamentally any alternative would do the same.

ADD REPLY
0
Entering edit mode

Would grep really be as efficient as using samtools view if that was an option?

ADD REPLY
0
Entering edit mode

I meant to run samtools view and pipe through grep. If samtools has a filter flag for your tag of interest that would be a tiny bit faster. If your tag of interest is obscure then you've got to grep it. grep has a non-regex fixed mode that is very fast, but nothing is going to read a BAM faster than samtools view.

ADD REPLY
0
Entering edit mode

let's assume my aligner writes the sam output to standard output. Is there a one-liner that would pipe and grep and this directly into a bam file (including the header)?

For example, if to pipe my alignment output to bam I use:

align <fastq> | samtools view -Sb - > <bam>

How can I change it to pipe only alignments with NH = 1?

ADD REPLY
1
Entering edit mode

This should work, but there are probably faster ways...

align <fastq> | grep -P '^@|NH:i:1\b' | samtools view -Sb - > <bam>
ADD REPLY
0
Entering edit mode

Thanks thackl. It is indeed pretty slow. If anyone is following and knows a samtools command which is faster for achieving this that'll be great.

ADD REPLY
2
Entering edit mode

If you want bam2bam the following grep is about 10x faster than my previous idea. Roughly 1 minute / GB bam on my system. Depending on what type of alignments are in your bam, the samtools filter flags (no unmapped, only primary, no supplement - these cannot be unique mappings anyway) might speed up things as well.

(samtools view -H in.bam; samtools view -F 2308 in.bam | grep -w 'NH:i:1') | samtools view -bS - > out.bam
ADD REPLY
1
Entering edit mode

Pardon my ignorance in linux. Is the full command you're suggesting this:

align <fastq> | samtools view -Sb (samtools view -H in.bam; samtools view -F 2308 in.bam | grep -w 'NH:i:1') | samtools view -bS - > out.bam
ADD REPLY
3
Entering edit mode
9.4 years ago

No idea about how this compares speed-wise, but python is a bit more flexible and easier to use when filtering flags, I find. The pysam packages uses a wrapper around samtools to read in and write out BAM files.

I wrote a script once for checking all kinds of flags, tried to strip it down to what you might be interested in (haven't tested it though)

#!/usr/bin/env python

import sys
import argparse
import getopt
import pysam
import os

def get_args():
    parser=argparse.ArgumentParser(description='filter BAM files')
    parser.add_argument('--BAMfile', '-in', type=str, required=True, help="BAM file")
    parser.add_argument('--outfile', '-out', type=str, required=True, help = 'Name for output file')

    args=parser.parse_args()
    return args

def remove_multiAlignments(ReadTagsEntry, KeepInfo):
    '''Checks for NH and XS tag to remove reads that aligned more than once'''
    
    keepRead = KeepInfo
    
    if 'XS' in ReadTagsEntry:
        keepRead=False
    elif 'NH' in ReadTagsEntry and ReadTagsEntry[1] > 1:
        keepRead=False
    
    return(keepRead)

def main():
    args = get_args()
    
    infile = pysam.Samfile(args.BAMfile, "rb")
    out = pysam.Samfile(args.outfile , "wb", template = infile )
    
    keep = True

    for DNAread in infile.fetch():
        intags = DNAread.tags
         
         for entry in intags:
            keep = remove_multiAlignments(entry, keep)

        if keep:
            out.write(DNAread)

        keep = True

    out.close()

if __name__ == '__main__':
main()
ADD COMMENT
0
Entering edit mode
9.4 years ago
5heikki 11k
I don't remember the exact flags but basically samtools view -bS yourFile | awk '$X != ""' where x corresponds to the column that shouldn't be empty..
ADD COMMENT
0
Entering edit mode
8.2 years ago
abl0719 • 0

The filter function from Bamtools (https://github.com/pezmaster31/bamtools) may be a better choice here. Since the bam file is no decompressed, it should be faster than grep.

ADD COMMENT
0
Entering edit mode

Good idea, but actually, on my computer, bamtools is about 3 times slower than grep:

$ time (bamtools filter -in example.bam -tag NM:1 > /dev/null )

real    0m36.750s
user    0m36.512s
sys 0m0.216s

$ time (samtools view -h example.bam | grep '^@|NM:i:0' > /dev/null)

real    0m12.197s
user    0m14.432s
sys 0m0.788s

(I ran the commands multiple times, so the BAM file is already in the cache memories of whichever component of the computer its operating system.)

I am using the bamtools binary from Debian 8 (Jessie).

$ bamtools --version

bamtools 2.3.0
Part of BamTools API and toolkit
Primary authors: Derek Barnett, Erik Garrison, Michael Stromberg
(c) 2009-2012 Marth Lab, Biology Dept., Boston College

$ dpkg -l bamtools
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name                      Version           Architecture      Description
+++-=========================-=================-=================-========================================================
ii  bamtools                  2.3.0+dfsg-2      amd64             toolkit for manipulating BAM (genome alignment) files
ADD REPLY

Login before adding your answer.

Traffic: 1122 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6