Bam : Extract Only Alignment With A Specific Length
5
1
Entering edit mode
13.3 years ago

Hi,

Is it possible to extract from a BAM file, only alignments with a specifi lentgh (ex: length > 60 ) ?

Thanks,

N.

bam rna alignment extraction length • 14k views
ADD COMMENT
7
Entering edit mode
13.3 years ago
Marvin ▴ 900

Assuming "length of alignment" is measured on the reference, this will do it for SAM:

perl -lane '$l = 0; $F[5] =~ s/(\d+)[MX=DN]/$l+=$1/eg; print if $l > 60'

Combine with samtools as needed, e.g:

samtools view -h foo.bam | perl -lane '$l = 0; $F[5] =~ s/(\d+)[MX=DN]/$l+=$1/eg; print if $l > 60 or /^@/' | samtools view -bS - > bar.bam
ADD COMMENT
5
Entering edit mode
13.3 years ago

Here is a solution using the picard API:

import java.io.File;

import net.sf.samtools.SAMFileReader;
import net.sf.samtools.SAMFileWriter;
import net.sf.samtools.SAMFileWriterFactory;
import net.sf.samtools.SAMRecord;
import net.sf.samtools.SAMRecordIterator;
import net.sf.samtools.SAMFileReader.ValidationStringency;

public class Biostar12433 {

    public static void main(String[] args) throws Exception
        {
        if(args.length!=1) return;
        final int LENGTH=60;
        SAMFileReader inputSam=new SAMFileReader(new File(args[0]));
        inputSam.setValidationStringency(ValidationStringency.SILENT);
        SAMRecordIterator iter=inputSam.iterator();
        SAMFileWriterFactory sf=new SAMFileWriterFactory();
        sf.setCreateIndex(false);
        SAMFileWriter sw=sf.makeSAMWriter(inputSam.getFileHeader(),false,System.out);

        while(iter.hasNext())
            {
            SAMRecord rec=iter.next();
            if(rec.getReadUnmappedFlag()) continue;
            int length=rec.getAlignmentEnd()-rec.getAlignmentStart();
            if(length<LENGTH)
                {
                continue;
                }
            sw.addAlignment(rec);
            }
        iter.close();
        inputSam.close();
        sw.close();
        }

}

Compilation:

javac -cp path/to/sam-1.35.jar:path/to/picard-tools-1.35/picard-1.35.jar Biostar12433.java

Execution:

 java -cp path/to/sam-1.35.jar:path/to/picard-tools-1.35/picard-1.35.jar Biostar12433 file.bam > result.sam

Here I've used rec.getAlignmentEnd()-rec.getAlignmentStart(); but you could also use getUnclippedStart/getUnclippedEnd. See http://picard.sourceforge.net/javadoc/net/sf/samtools/SAMRecord.html

ADD COMMENT
0
Entering edit mode

nice! I am still in the world of parsing out the CIGAR strings, need to change that

ADD REPLY
0
Entering edit mode

6.5 years later: use samjdk : http://lindenb.github.io/jvarkit/SamJdk.html . Something like:

java -jar dist/samjdk.jar -e 'return record.getReadUnmappedFlag() || rec.getEnd()-rec.getStart()>=60;' in.bam
ADD REPLY
5
Entering edit mode
13.3 years ago
Sven Bilke ▴ 70

Here is yet another version, in C, for a change. It implements different length functions and it should be fairly easy to accommodate to whatever your needs are.

#include <stdio.h>
#include <samtools/sam.h>


int len_query(const bam1_t *b) {
  // This one includes soft-clipped nucleotides
  return  bam_cigar2qlen(&b->core, bam1_cigar(b));
}

int len_genome(const bam1_t *b) {
  // length in genome-cooredinates
  return  bam_calend(&b->core, bam1_cigar(b)) - b->core.pos;
}

int len_align(const bam1_t *b)  {
  // pedestrian way, fill in whatever choice you like, currently caculates number of (non-insert, non-delete) 
  // aligned (match & missmatch) nt's of the query
  int i;
  uint32_t *cigar = bam1_cigar(b);
  int l  = 0;
  for(i=0; i < b->core.n_cigar; ++i) {
    const int opl  = cigar[i] >> BAM_CIGAR_SHIFT;
    const int op   = cigar[i] & BAM_CIGAR_MASK;
    if(op == BAM_CMATCH)   l+= opl;
  }
  return l;
}


int main(int argc, char **argv) {
  int (*len_func)(const bam1_t *);
  if(argc < 4) {
    fprintf(stderr, "Insuficient arguments, Usage: %s len mode in.bam out.bam\n", argv[0]);
    exit(-1);
  } 
  switch(*argv[2]) {
  case 'Q': 
    len_func = len_query;
    break;

  case 'A': 
    len_func = len_align;
    break;

  case 'G': 
    len_func = len_genome;
    break;

  default:
    fprintf(stderr, "unsupported mode, known are _Q_uery-length, _A_ligned length and _G_enomic length\n");
    exit(-1);
    break;
  }

  int len=atoi(argv[1]);

  samfile_t *I = samopen(argv[3], "rb", 0);
  if(!I) {
    printf("Can not open sam input %s\n", argv[3]);
    exit(-1);
  }
  samfile_t *O = samopen(argv[4], "wb", I->header);
  if(!O) {
    printf("Can not open sam output %s\n", argv[4]);
    exit(-1);
  }

  bam1_t *b = bam_init1();

  while (samread(I, b) >= 0) {
    int l = len_func(b);
    if(len != len_func(b)) continue;
    samwrite(O, b);
  }
  samclose(I);
  samclose(O);
  bam_destroy1(b);
  exit(0);
}
ADD COMMENT
0
Entering edit mode

--Hi,

nice usefull code, is there a way to add filters on chromosome and range such as: %s len mode chr range in.bam out.bam

thx

ADD REPLY
3
Entering edit mode
13.3 years ago
Ido Tamir 5.2k

Same as Pierres in Scala.

save as filter.scala

start with:

sh filter.scala in.bam out.bam 60

Because the problem is not really complex, it does not show off scalas benefits, apart from being able to call it as a script.

#!/bin/sh
cp='sam-1.50.jar;picard-1.50.jar'
exec scala -classpath $cp "$0" "$@"
!#

import scala.collection.JavaConverters._
import net.sf.samtools._
import net.sf.picard.filter._
import net.sf.picard.io.IoUtil
import net.sf.samtools.util.CloserUtil
import net.sf.samtools.SAMFileReader.ValidationStringency
import java.io.File

class SamFilteredCopy(inName: String, outName: String, filterOut: SAMRecord => Boolean) {
     val input = new SAMFileReader(new File(inName))
     val output = if(outName == "stdout"){
         new SAMFileWriterFactory().makeSAMWriter(input.getFileHeader, false, System.out)
     }else{
         val outFile = new java.io.File(outName)
         IoUtil.assertFileIsWritable(outFile)
         new SAMFileWriterFactory().makeSAMOrBAMWriter(input.getFileHeader, false, outFile)
     }
     input.setValidationStringency(ValidationStringency.SILENT)
     for(record <- input.iterator.asScala if filterOut(record)){
         output.addAlignment(record)            
     }  
     CloserUtil.close(input)
     CloserUtil.close(output)
}

def filter(length: Int)(record: SAMRecord): Boolean = {
    record.getAlignmentEnd - record.getAlignmentStart + 1 > length 
}

if(args.length < 3){
    error("need three arguments: inName, outName (stdout), size")
}
new SamFilteredCopy(args(0),args(1),filter(args(2).toInt) _)
ADD COMMENT
0
Entering edit mode

Love the scala use in bioinformatics. Wish it was more prevalent!

ADD REPLY
1
Entering edit mode
13.3 years ago
Ketil 4.1k

I think this depends on the aligner you use. E.g. bwa will report (AFAICS) the full sequence, and use soft clipping at the ends, while gsMapper will use hard clipping, and report only the matched sequence.

I think you will have to parse the CIGAR field and count (sum up) the number of M's.

ADD COMMENT
0
Entering edit mode

I used bowtie with -S option and then samtools to convert sam to bam

ADD REPLY
0
Entering edit mode

You can't just count the M operators - at very least you must also include the X and = characters (mismatch and match), which are a more explicit alternative to M (match or mismatch). In fact you probably should count the I operations too.

It might be simpler to take the read's sequence and deduct any soft trimming (the number of S operations in the CIGAR string)?

ADD REPLY

Login before adding your answer.

Traffic: 1375 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6