I will process a bam or sam file that it only contains reads with a specific type of Mutations (for example the mismatch is T on reference Genome and C on the Reads). Can someone help me please?
I will process a bam or sam file that it only contains reads with a specific type of Mutations (for example the mismatch is T on reference Genome and C on the Reads). Can someone help me please?
Here is a java program printing the alignments for each read. It should be easy to transform it to test about the wanted base/mutation. Note:I'm not sure about the right way to handle the cigar operators [^CDI] but it worked with samtools/examples , feel free to correct my code:
import java.io.File;
import net.sf.picard.reference.IndexedFastaSequenceFile;
import net.sf.samtools.Cigar;
import net.sf.samtools.CigarElement;
import net.sf.samtools.SAMFileReader;
import net.sf.samtools.SAMFileWriter;
import net.sf.samtools.SAMFileWriterFactory;
import net.sf.samtools.SAMRecord;
import net.sf.samtools.SAMRecordIterator;
import net.sf.samtools.SAMFileReader.ValidationStringency;
public class Biostar17888
{
public static void main(String[] args) throws Exception
{
if(args.length!=2)
{
System.err.println("ref bam");
return;
}
final IndexedFastaSequenceFile faidx=new IndexedFastaSequenceFile(new File(args[0]));
SAMFileReader inputSam=new SAMFileReader(new File(args[1]));
inputSam.setValidationStringency(ValidationStringency.SILENT);
SAMRecordIterator iter=inputSam.iterator();
while(iter.hasNext())
{
SAMRecord rec=iter.next();
if(rec.getReadUnmappedFlag()) continue;
byte ref_sequence[]=faidx.getSubsequenceAt(
rec.getReferenceName(),
rec.getAlignmentStart(),
rec.getAlignmentEnd()
).getBases();
byte read_sequence[]=rec.getReadBases();
int ref_pos=0;
int read_pos=0;
Cigar cigar=rec.getCigar();
StringBuilder refseq=new StringBuilder();
StringBuilder readseq=new StringBuilder();
for(CigarElement c:cigar.getCigarElements())
{
switch(c.getOperator())
{
case EQ:
case M:
{
for(int i=0;i< c.getLength();++i)
{
refseq.append((char)ref_sequence[ref_pos++]);
readseq.append((char)read_sequence[read_pos++]);
}
break;
}
case I:
case S:
{
for(int i=0;i< c.getLength();++i)
{
refseq.append("-");
readseq.append((char)read_sequence[read_pos++]);
}
break;
}
case P:break;
case H:break;
case D:
{
for(int i=0;i< c.getLength();++i)
{
refseq.append((char)ref_sequence[ref_pos++]);
readseq.append("-");
}
break;
}
case N:
{
for(int i=0;i< c.getLength();++i)
{
readseq.append("-");
}
break;
}
default:
{
throw new IllegalArgumentException(
c.getOperator()+"\n"+
refseq+"\n"+
readseq+"\n"+
new String(ref_sequence)+"\n"+
new String(read_sequence)+"\n"+
cigar
);
}
}
//System.err.println(c.getOperator()+" "+c.getLength()+"\n["+refseq+"]\n["+readseq+"]");
}
System.out.println(">"+rec.getReadName()+" "+rec.getCigarString());
System.out.println(refseq);
System.out.println(readseq);
System.out.println();
}
iter.close();
inputSam.close();
}
}
Compilation:
javac -cp picard-1.62.jar:sam-1.62.jar Biostar17888.java
Execution:
java -cp picard-1.62.jar:sam-1.62.jar:. Biostar17888 samtools-0.1.18/examples/toy.fa samtools-0.1.18/examples/toy.bam
>r001 8M4I4M1D3M
TTAGATAA----GATAGCTG
TTAGATAAAGAGGATA-CTG
>r002 1S2I6M1P1I1P1I4M2I
---AGATAA--GATA--
AAAAGATAAGGGATAAA
>r003 5H6M
AGATAA
AGCTAA
>r004 6M14N1I5M
ATAGCT-GTGCT
ATAGCT--------------CTCAGC
>r003 6H5M
TAGGC
TAGGC
>r001 9M
CAGCGCCAT
CAGCGCCAT
>x1 20M
aggttttataaaacaattaa
AGGTTTTATAAAACAAATAA
>x2 21M
ggttttataaaacaattaagt
GGTTTTATAAAACAAATAATT
>x3 9M4I13M
ttataaaac----aattaagtctaca
TTATAAAACAAATAATTAAGTCTACA
>x4 25M
aaaacaattaagtctacagagcaac
CAAATAATTAAGTCTACAGAGCAAC
>x5 24M
aacaattaagtctacagagcaact
AATAATTAAGTCTACAGAGCAACT
>x6 23M
caattaagtctacagagcaacta
TAATTAAGTCTACAGAGCAACTA
The MD tag of an aligned read contains a string encoding all mismatches along with their position in the read. You could scan the MD tags for T's and then check if the read sequence has a C at that position.
To do this in Python, install pysam (http://code.google.com/p/pysam/) and start with:
import pysam
nuclRef = "T"
nuclRead = "C"
bam = pysam.Samfile("your.bam")
for read in bam.fetch():
if read.is_unmapped:
continue
md = read.opt("MD")
if nuclRef in md:
print md, read.seq
You only have to add a function that parses the MD string and returns the positions of all mismatching T's which you then can map to read.seq to see if a C is there.
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.