Question

Bam : Extract Only Alignment With A Specific Length

1

Entering edit mode

13.1 years ago

Nicolas Rosewick 11k

Hi,

Is it possible to extract from a BAM file, only alignments with a specifi lentgh (ex: length > 60 ) ?

Thanks,

N.

bam rna alignment extraction length • 13k views

ADD COMMENT • link updated 13.1 years ago by Marvin ▴ 900 • written 13.1 years ago by Nicolas Rosewick 11k

score 7 · Answer 1 · 2011-10-07

Assuming "length of alignment" is measured on the reference, this will do it for SAM:

perl -lane '$l = 0; $F[5] =~ s/(\d+)[MX=DN]/$l+=$1/eg; print if $l > 60'

Combine with samtools as needed, e.g:

samtools view -h foo.bam | perl -lane '$l = 0; $F[5] =~ s/(\d+)[MX=DN]/$l+=$1/eg; print if $l > 60 or /^@/' | samtools view -bS - > bar.bam

score 5 · Answer 2 · 2011-09-22

Here is a solution using the picard API:

import java.io.File;

import net.sf.samtools.SAMFileReader;
import net.sf.samtools.SAMFileWriter;
import net.sf.samtools.SAMFileWriterFactory;
import net.sf.samtools.SAMRecord;
import net.sf.samtools.SAMRecordIterator;
import net.sf.samtools.SAMFileReader.ValidationStringency;

public class Biostar12433 {

    public static void main(String[] args) throws Exception
        {
        if(args.length!=1) return;
        final int LENGTH=60;
        SAMFileReader inputSam=new SAMFileReader(new File(args[0]));
        inputSam.setValidationStringency(ValidationStringency.SILENT);
        SAMRecordIterator iter=inputSam.iterator();
        SAMFileWriterFactory sf=new SAMFileWriterFactory();
        sf.setCreateIndex(false);
        SAMFileWriter sw=sf.makeSAMWriter(inputSam.getFileHeader(),false,System.out);

        while(iter.hasNext())
            {
            SAMRecord rec=iter.next();
            if(rec.getReadUnmappedFlag()) continue;
            int length=rec.getAlignmentEnd()-rec.getAlignmentStart();
            if(length<LENGTH)
                {
                continue;
                }
            sw.addAlignment(rec);
            }
        iter.close();
        inputSam.close();
        sw.close();
        }

}

Compilation:

javac -cp path/to/sam-1.35.jar:path/to/picard-tools-1.35/picard-1.35.jar Biostar12433.java

Execution:

 java -cp path/to/sam-1.35.jar:path/to/picard-tools-1.35/picard-1.35.jar Biostar12433 file.bam > result.sam

Here I've used rec.getAlignmentEnd()-rec.getAlignmentStart(); but you could also use getUnclippedStart/getUnclippedEnd. See http://picard.sourceforge.net/javadoc/net/sf/samtools/SAMRecord.html

score 5 · Answer 3 · 2011-10-07

Here is yet another version, in C, for a change. It implements different length functions and it should be fairly easy to accommodate to whatever your needs are.

#include <stdio.h>
#include <samtools/sam.h>


int len_query(const bam1_t *b) {
  // This one includes soft-clipped nucleotides
  return  bam_cigar2qlen(&b->core, bam1_cigar(b));
}

int len_genome(const bam1_t *b) {
  // length in genome-cooredinates
  return  bam_calend(&b->core, bam1_cigar(b)) - b->core.pos;
}

int len_align(const bam1_t *b)  {
  // pedestrian way, fill in whatever choice you like, currently caculates number of (non-insert, non-delete) 
  // aligned (match & missmatch) nt's of the query
  int i;
  uint32_t *cigar = bam1_cigar(b);
  int l  = 0;
  for(i=0; i < b->core.n_cigar; ++i) {
    const int opl  = cigar[i] >> BAM_CIGAR_SHIFT;
    const int op   = cigar[i] & BAM_CIGAR_MASK;
    if(op == BAM_CMATCH)   l+= opl;
  }
  return l;
}


int main(int argc, char **argv) {
  int (*len_func)(const bam1_t *);
  if(argc < 4) {
    fprintf(stderr, "Insuficient arguments, Usage: %s len mode in.bam out.bam\n", argv[0]);
    exit(-1);
  } 
  switch(*argv[2]) {
  case 'Q': 
    len_func = len_query;
    break;

  case 'A': 
    len_func = len_align;
    break;

  case 'G': 
    len_func = len_genome;
    break;

  default:
    fprintf(stderr, "unsupported mode, known are _Q_uery-length, _A_ligned length and _G_enomic length\n");
    exit(-1);
    break;
  }

  int len=atoi(argv[1]);

  samfile_t *I = samopen(argv[3], "rb", 0);
  if(!I) {
    printf("Can not open sam input %s\n", argv[3]);
    exit(-1);
  }
  samfile_t *O = samopen(argv[4], "wb", I->header);
  if(!O) {
    printf("Can not open sam output %s\n", argv[4]);
    exit(-1);
  }

  bam1_t *b = bam_init1();

  while (samread(I, b) >= 0) {
    int l = len_func(b);
    if(len != len_func(b)) continue;
    samwrite(O, b);
  }
  samclose(I);
  samclose(O);
  bam_destroy1(b);
  exit(0);
}

score 3 · Answer 4 · 2011-09-22

Same as Pierres in Scala.

save as filter.scala

start with:

sh filter.scala in.bam out.bam 60

Because the problem is not really complex, it does not show off scalas benefits, apart from being able to call it as a script.

#!/bin/sh
cp='sam-1.50.jar;picard-1.50.jar'
exec scala -classpath $cp "$0" "$@"
!#

import scala.collection.JavaConverters._
import net.sf.samtools._
import net.sf.picard.filter._
import net.sf.picard.io.IoUtil
import net.sf.samtools.util.CloserUtil
import net.sf.samtools.SAMFileReader.ValidationStringency
import java.io.File

class SamFilteredCopy(inName: String, outName: String, filterOut: SAMRecord => Boolean) {
     val input = new SAMFileReader(new File(inName))
     val output = if(outName == "stdout"){
         new SAMFileWriterFactory().makeSAMWriter(input.getFileHeader, false, System.out)
     }else{
         val outFile = new java.io.File(outName)
         IoUtil.assertFileIsWritable(outFile)
         new SAMFileWriterFactory().makeSAMOrBAMWriter(input.getFileHeader, false, outFile)
     }
     input.setValidationStringency(ValidationStringency.SILENT)
     for(record <- input.iterator.asScala if filterOut(record)){
         output.addAlignment(record)            
     }  
     CloserUtil.close(input)
     CloserUtil.close(output)
}

def filter(length: Int)(record: SAMRecord): Boolean = {
    record.getAlignmentEnd - record.getAlignmentStart + 1 > length 
}

if(args.length < 3){
    error("need three arguments: inName, outName (stdout), size")
}
new SamFilteredCopy(args(0),args(1),filter(args(2).toInt) _)

score 1 · Answer 5 · 2011-09-22

1

Entering edit mode

13.1 years ago

Ketil 4.1k

I think this depends on the aligner you use. E.g. bwa will report (AFAICS) the full sequence, and use soft clipping at the ends, while gsMapper will use hard clipping, and report only the matched sequence.

I think you will have to parse the CIGAR field and count (sum up) the number of M's.

ADD COMMENT • link 13.1 years ago by Ketil 4.1k

0

Entering edit mode

I used bowtie with -S option and then samtools to convert sam to bam

ADD REPLY • link 13.1 years ago by Nicolas Rosewick 11k

0

Entering edit mode

You can't just count the M operators - at very least you must also include the X and = characters (mismatch and match), which are a more explicit alternative to M (match or mismatch). In fact you probably should count the I operations too.

It might be simpler to take the read's sequence and deduct any soft trimming (the number of S operations in the CIGAR string)?

ADD REPLY • link 13.1 years ago by Peter 6.0k