Question

Filter Bam Or Sam Files / Only Allow A Specific Mutation

1

Entering edit mode

12.7 years ago

Philipp ▴ 20

I will process a bam or sam file that it only contains reads with a specific type of Mutations (for example the mismatch is T on reference Genome and C on the Reads). Can someone help me please?

bwa bam samtools filter • 3.9k views

ADD COMMENT • link updated 12.7 years ago by Pierre Lindenbaum 164k • written 12.7 years ago by Philipp ▴ 20

score 3 · Answer 1 · 2012-02-28

Here is a java program printing the alignments for each read. It should be easy to transform it to test about the wanted base/mutation. Note:I'm not sure about the right way to handle the cigar operators [^CDI] but it worked with samtools/examples , feel free to correct my code:

import java.io.File;

import net.sf.picard.reference.IndexedFastaSequenceFile;
import net.sf.samtools.Cigar;
import net.sf.samtools.CigarElement;
import net.sf.samtools.SAMFileReader;
import net.sf.samtools.SAMFileWriter;
import net.sf.samtools.SAMFileWriterFactory;
import net.sf.samtools.SAMRecord;
import net.sf.samtools.SAMRecordIterator;
import net.sf.samtools.SAMFileReader.ValidationStringency;

public class Biostar17888
    {
    public static void main(String[] args) throws Exception
        {
        if(args.length!=2)
            {
            System.err.println("ref bam");
            return;
            }
        final IndexedFastaSequenceFile faidx=new IndexedFastaSequenceFile(new File(args[0]));
        SAMFileReader inputSam=new SAMFileReader(new File(args[1]));
        inputSam.setValidationStringency(ValidationStringency.SILENT);
        SAMRecordIterator iter=inputSam.iterator();
        while(iter.hasNext())
            {
            SAMRecord rec=iter.next();
            if(rec.getReadUnmappedFlag()) continue;
            byte ref_sequence[]=faidx.getSubsequenceAt(
                rec.getReferenceName(), 
                rec.getAlignmentStart(),
                rec.getAlignmentEnd()
                ).getBases();
            byte read_sequence[]=rec.getReadBases();
            int ref_pos=0;
            int read_pos=0;
            Cigar cigar=rec.getCigar();


            StringBuilder refseq=new StringBuilder();
            StringBuilder readseq=new StringBuilder();
            for(CigarElement c:cigar.getCigarElements())
                {
                switch(c.getOperator())
                    {
                    case EQ:
                    case M:
                        {
                        for(int i=0;i< c.getLength();++i)
                            {
                            refseq.append((char)ref_sequence[ref_pos++]);
                            readseq.append((char)read_sequence[read_pos++]);
                            }
                        break;
                        }
                    case I:
                    case S:
                        {
                        for(int i=0;i< c.getLength();++i)
                            {
                            refseq.append("-");
                            readseq.append((char)read_sequence[read_pos++]);
                            }
                        break;
                        }
                    case P:break;
                    case H:break;
                    case D:
                        {
                        for(int i=0;i< c.getLength();++i)
                            {
                            refseq.append((char)ref_sequence[ref_pos++]);
                            readseq.append("-");
                            }
                        break;
                        }
                    case N:
                        {
                        for(int i=0;i< c.getLength();++i)
                            {
                            readseq.append("-");
                            }
                        break;
                        }
                    default:
                        {
                        throw new IllegalArgumentException(
                                c.getOperator()+"\n"+

                                        refseq+"\n"+
                                        readseq+"\n"+
                                new String(ref_sequence)+"\n"+
                                new String(read_sequence)+"\n"+
                                cigar
                                );
                        }
                    }
                //System.err.println(c.getOperator()+" "+c.getLength()+"\n["+refseq+"]\n["+readseq+"]");

                }
            System.out.println(">"+rec.getReadName()+" "+rec.getCigarString());
            System.out.println(refseq);
            System.out.println(readseq);
            System.out.println();
            }
        iter.close();
        inputSam.close();
        }

    }

Compilation:

javac -cp picard-1.62.jar:sam-1.62.jar Biostar17888.java

Execution:

java -cp picard-1.62.jar:sam-1.62.jar:. Biostar17888   samtools-0.1.18/examples/toy.fa samtools-0.1.18/examples/toy.bam

>r001 8M4I4M1D3M
TTAGATAA----GATAGCTG
TTAGATAAAGAGGATA-CTG

>r002 1S2I6M1P1I1P1I4M2I
---AGATAA--GATA--
AAAAGATAAGGGATAAA

>r003 5H6M
AGATAA
AGCTAA

>r004 6M14N1I5M
ATAGCT-GTGCT
ATAGCT--------------CTCAGC

>r003 6H5M
TAGGC
TAGGC

>r001 9M
CAGCGCCAT
CAGCGCCAT

>x1 20M
aggttttataaaacaattaa
AGGTTTTATAAAACAAATAA

>x2 21M
ggttttataaaacaattaagt
GGTTTTATAAAACAAATAATT

>x3 9M4I13M
ttataaaac----aattaagtctaca
TTATAAAACAAATAATTAAGTCTACA

>x4 25M
aaaacaattaagtctacagagcaac
CAAATAATTAAGTCTACAGAGCAAC

>x5 24M
aacaattaagtctacagagcaact
AATAATTAAGTCTACAGAGCAACT

>x6 23M
caattaagtctacagagcaacta
TAATTAAGTCTACAGAGCAACTA

score 2 · Answer 2 · 2012-02-28

The MD tag of an aligned read contains a string encoding all mismatches along with their position in the read. You could scan the MD tags for T's and then check if the read sequence has a C at that position.

To do this in Python, install pysam (http://code.google.com/p/pysam/) and start with:

import pysam

nuclRef = "T"
nuclRead = "C"

bam = pysam.Samfile("your.bam")

for read in bam.fetch():
    if read.is_unmapped:
        continue
    md = read.opt("MD")
    if nuclRef in md:
        print md, read.seq

You only have to add a function that parses the MD string and returns the positions of all mismatching T's which you then can map to read.seq to see if a C is there.