Hi,
Is it possible to extract from a BAM file, only alignments with a specifi lentgh (ex: length > 60 ) ?
Thanks,
N.
Hi,
Is it possible to extract from a BAM file, only alignments with a specifi lentgh (ex: length > 60 ) ?
Thanks,
N.
Assuming "length of alignment" is measured on the reference, this will do it for SAM:
perl -lane '$l = 0; $F[5] =~ s/(\d+)[MX=DN]/$l+=$1/eg; print if $l > 60'
Combine with samtools as needed, e.g:
samtools view -h foo.bam | perl -lane '$l = 0; $F[5] =~ s/(\d+)[MX=DN]/$l+=$1/eg; print if $l > 60 or /^@/' | samtools view -bS - > bar.bam
Here is a solution using the picard API:
import java.io.File;
import net.sf.samtools.SAMFileReader;
import net.sf.samtools.SAMFileWriter;
import net.sf.samtools.SAMFileWriterFactory;
import net.sf.samtools.SAMRecord;
import net.sf.samtools.SAMRecordIterator;
import net.sf.samtools.SAMFileReader.ValidationStringency;
public class Biostar12433 {
public static void main(String[] args) throws Exception
{
if(args.length!=1) return;
final int LENGTH=60;
SAMFileReader inputSam=new SAMFileReader(new File(args[0]));
inputSam.setValidationStringency(ValidationStringency.SILENT);
SAMRecordIterator iter=inputSam.iterator();
SAMFileWriterFactory sf=new SAMFileWriterFactory();
sf.setCreateIndex(false);
SAMFileWriter sw=sf.makeSAMWriter(inputSam.getFileHeader(),false,System.out);
while(iter.hasNext())
{
SAMRecord rec=iter.next();
if(rec.getReadUnmappedFlag()) continue;
int length=rec.getAlignmentEnd()-rec.getAlignmentStart();
if(length<LENGTH)
{
continue;
}
sw.addAlignment(rec);
}
iter.close();
inputSam.close();
sw.close();
}
}
Compilation:
javac -cp path/to/sam-1.35.jar:path/to/picard-tools-1.35/picard-1.35.jar Biostar12433.java
Execution:
java -cp path/to/sam-1.35.jar:path/to/picard-tools-1.35/picard-1.35.jar Biostar12433 file.bam > result.sam
Here I've used rec.getAlignmentEnd()-rec.getAlignmentStart();
but you could also use getUnclippedStart/getUnclippedEnd. See http://picard.sourceforge.net/javadoc/net/sf/samtools/SAMRecord.html
Here is yet another version, in C, for a change. It implements different length functions and it should be fairly easy to accommodate to whatever your needs are.
#include <stdio.h>
#include <samtools/sam.h>
int len_query(const bam1_t *b) {
// This one includes soft-clipped nucleotides
return bam_cigar2qlen(&b->core, bam1_cigar(b));
}
int len_genome(const bam1_t *b) {
// length in genome-cooredinates
return bam_calend(&b->core, bam1_cigar(b)) - b->core.pos;
}
int len_align(const bam1_t *b) {
// pedestrian way, fill in whatever choice you like, currently caculates number of (non-insert, non-delete)
// aligned (match & missmatch) nt's of the query
int i;
uint32_t *cigar = bam1_cigar(b);
int l = 0;
for(i=0; i < b->core.n_cigar; ++i) {
const int opl = cigar[i] >> BAM_CIGAR_SHIFT;
const int op = cigar[i] & BAM_CIGAR_MASK;
if(op == BAM_CMATCH) l+= opl;
}
return l;
}
int main(int argc, char **argv) {
int (*len_func)(const bam1_t *);
if(argc < 4) {
fprintf(stderr, "Insuficient arguments, Usage: %s len mode in.bam out.bam\n", argv[0]);
exit(-1);
}
switch(*argv[2]) {
case 'Q':
len_func = len_query;
break;
case 'A':
len_func = len_align;
break;
case 'G':
len_func = len_genome;
break;
default:
fprintf(stderr, "unsupported mode, known are _Q_uery-length, _A_ligned length and _G_enomic length\n");
exit(-1);
break;
}
int len=atoi(argv[1]);
samfile_t *I = samopen(argv[3], "rb", 0);
if(!I) {
printf("Can not open sam input %s\n", argv[3]);
exit(-1);
}
samfile_t *O = samopen(argv[4], "wb", I->header);
if(!O) {
printf("Can not open sam output %s\n", argv[4]);
exit(-1);
}
bam1_t *b = bam_init1();
while (samread(I, b) >= 0) {
int l = len_func(b);
if(len != len_func(b)) continue;
samwrite(O, b);
}
samclose(I);
samclose(O);
bam_destroy1(b);
exit(0);
}
Same as Pierres in Scala.
save as filter.scala
start with:
sh filter.scala in.bam out.bam 60
Because the problem is not really complex, it does not show off scalas benefits, apart from being able to call it as a script.
#!/bin/sh
cp='sam-1.50.jar;picard-1.50.jar'
exec scala -classpath $cp "$0" "$@"
!#
import scala.collection.JavaConverters._
import net.sf.samtools._
import net.sf.picard.filter._
import net.sf.picard.io.IoUtil
import net.sf.samtools.util.CloserUtil
import net.sf.samtools.SAMFileReader.ValidationStringency
import java.io.File
class SamFilteredCopy(inName: String, outName: String, filterOut: SAMRecord => Boolean) {
val input = new SAMFileReader(new File(inName))
val output = if(outName == "stdout"){
new SAMFileWriterFactory().makeSAMWriter(input.getFileHeader, false, System.out)
}else{
val outFile = new java.io.File(outName)
IoUtil.assertFileIsWritable(outFile)
new SAMFileWriterFactory().makeSAMOrBAMWriter(input.getFileHeader, false, outFile)
}
input.setValidationStringency(ValidationStringency.SILENT)
for(record <- input.iterator.asScala if filterOut(record)){
output.addAlignment(record)
}
CloserUtil.close(input)
CloserUtil.close(output)
}
def filter(length: Int)(record: SAMRecord): Boolean = {
record.getAlignmentEnd - record.getAlignmentStart + 1 > length
}
if(args.length < 3){
error("need three arguments: inName, outName (stdout), size")
}
new SamFilteredCopy(args(0),args(1),filter(args(2).toInt) _)
I think this depends on the aligner you use. E.g. bwa will report (AFAICS) the full sequence, and use soft clipping at the ends, while gsMapper will use hard clipping, and report only the matched sequence.
I think you will have to parse the CIGAR field and count (sum up) the number of M's.
You can't just count the M operators - at very least you must also include the X and = characters (mismatch and match), which are a more explicit alternative to M (match or mismatch). In fact you probably should count the I operations too.
It might be simpler to take the read's sequence and deduct any soft trimming (the number of S operations in the CIGAR string)?
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
nice! I am still in the world of parsing out the CIGAR strings, need to change that
6.5 years later: use samjdk : http://lindenb.github.io/jvarkit/SamJdk.html . Something like: