Filter Bam Or Sam Files / Only Allow A Specific Mutation
2
1
Entering edit mode
12.9 years ago
Philipp ▴ 20

I will process a bam or sam file that it only contains reads with a specific type of Mutations (for example the mismatch is T on reference Genome and C on the Reads). Can someone help me please?

bwa bam samtools filter • 3.9k views
ADD COMMENT
3
Entering edit mode
12.9 years ago

Here is a java program printing the alignments for each read. It should be easy to transform it to test about the wanted base/mutation. Note:I'm not sure about the right way to handle the cigar operators [^CDI] but it worked with samtools/examples , feel free to correct my code:

import java.io.File;

import net.sf.picard.reference.IndexedFastaSequenceFile;
import net.sf.samtools.Cigar;
import net.sf.samtools.CigarElement;
import net.sf.samtools.SAMFileReader;
import net.sf.samtools.SAMFileWriter;
import net.sf.samtools.SAMFileWriterFactory;
import net.sf.samtools.SAMRecord;
import net.sf.samtools.SAMRecordIterator;
import net.sf.samtools.SAMFileReader.ValidationStringency;

public class Biostar17888
    {
    public static void main(String[] args) throws Exception
        {
        if(args.length!=2)
            {
            System.err.println("ref bam");
            return;
            }
        final IndexedFastaSequenceFile faidx=new IndexedFastaSequenceFile(new File(args[0]));
        SAMFileReader inputSam=new SAMFileReader(new File(args[1]));
        inputSam.setValidationStringency(ValidationStringency.SILENT);
        SAMRecordIterator iter=inputSam.iterator();
        while(iter.hasNext())
            {
            SAMRecord rec=iter.next();
            if(rec.getReadUnmappedFlag()) continue;
            byte ref_sequence[]=faidx.getSubsequenceAt(
                rec.getReferenceName(), 
                rec.getAlignmentStart(),
                rec.getAlignmentEnd()
                ).getBases();
            byte read_sequence[]=rec.getReadBases();
            int ref_pos=0;
            int read_pos=0;
            Cigar cigar=rec.getCigar();


            StringBuilder refseq=new StringBuilder();
            StringBuilder readseq=new StringBuilder();
            for(CigarElement c:cigar.getCigarElements())
                {
                switch(c.getOperator())
                    {
                    case EQ:
                    case M:
                        {
                        for(int i=0;i< c.getLength();++i)
                            {
                            refseq.append((char)ref_sequence[ref_pos++]);
                            readseq.append((char)read_sequence[read_pos++]);
                            }
                        break;
                        }
                    case I:
                    case S:
                        {
                        for(int i=0;i< c.getLength();++i)
                            {
                            refseq.append("-");
                            readseq.append((char)read_sequence[read_pos++]);
                            }
                        break;
                        }
                    case P:break;
                    case H:break;
                    case D:
                        {
                        for(int i=0;i< c.getLength();++i)
                            {
                            refseq.append((char)ref_sequence[ref_pos++]);
                            readseq.append("-");
                            }
                        break;
                        }
                    case N:
                        {
                        for(int i=0;i< c.getLength();++i)
                            {
                            readseq.append("-");
                            }
                        break;
                        }
                    default:
                        {
                        throw new IllegalArgumentException(
                                c.getOperator()+"\n"+

                                        refseq+"\n"+
                                        readseq+"\n"+
                                new String(ref_sequence)+"\n"+
                                new String(read_sequence)+"\n"+
                                cigar
                                );
                        }
                    }
                //System.err.println(c.getOperator()+" "+c.getLength()+"\n["+refseq+"]\n["+readseq+"]");

                }
            System.out.println(">"+rec.getReadName()+" "+rec.getCigarString());
            System.out.println(refseq);
            System.out.println(readseq);
            System.out.println();
            }
        iter.close();
        inputSam.close();
        }

    }

Compilation:

javac -cp picard-1.62.jar:sam-1.62.jar Biostar17888.java

Execution:

java -cp picard-1.62.jar:sam-1.62.jar:. Biostar17888   samtools-0.1.18/examples/toy.fa samtools-0.1.18/examples/toy.bam

>r001 8M4I4M1D3M
TTAGATAA----GATAGCTG
TTAGATAAAGAGGATA-CTG

>r002 1S2I6M1P1I1P1I4M2I
---AGATAA--GATA--
AAAAGATAAGGGATAAA

>r003 5H6M
AGATAA
AGCTAA

>r004 6M14N1I5M
ATAGCT-GTGCT
ATAGCT--------------CTCAGC

>r003 6H5M
TAGGC
TAGGC

>r001 9M
CAGCGCCAT
CAGCGCCAT

>x1 20M
aggttttataaaacaattaa
AGGTTTTATAAAACAAATAA

>x2 21M
ggttttataaaacaattaagt
GGTTTTATAAAACAAATAATT

>x3 9M4I13M
ttataaaac----aattaagtctaca
TTATAAAACAAATAATTAAGTCTACA

>x4 25M
aaaacaattaagtctacagagcaac
CAAATAATTAAGTCTACAGAGCAAC

>x5 24M
aacaattaagtctacagagcaact
AATAATTAAGTCTACAGAGCAACT

>x6 23M
caattaagtctacagagcaacta
TAATTAAGTCTACAGAGCAACTA
ADD COMMENT
2
Entering edit mode
12.9 years ago
Christof Winter ★ 1.1k

The MD tag of an aligned read contains a string encoding all mismatches along with their position in the read. You could scan the MD tags for T's and then check if the read sequence has a C at that position.

To do this in Python, install pysam (http://code.google.com/p/pysam/) and start with:

import pysam

nuclRef = "T"
nuclRead = "C"

bam = pysam.Samfile("your.bam")

for read in bam.fetch():
    if read.is_unmapped:
        continue
    md = read.opt("MD")
    if nuclRef in md:
        print md, read.seq

You only have to add a function that parses the MD string and returns the positions of all mismatching T's which you then can map to read.seq to see if a C is there.

ADD COMMENT

Login before adding your answer.

Traffic: 2134 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6