create a track with mismatch density from a SAM/bam file
3
I want to identify regions where a larger than average number of apparent varioants suggest an assembly error or a read alignment error.
Any idea how I could parse a SAM/BAM and count differences from the reference for each read which I then save in BED format?
I guess the solution in in the cigar but if someone has solved this before and could share some code, I would be very grateful.
Best
S
relates to Getting number of mismatches from bam record where no solution was posted so far
sam
bam
mismatch
• 2.1k views
Brilliant; I finally learned what NM was and it is exactly what I needed. Thanks Pierre
Hoping that the read is always right to the start position, I made of it the next command. Does this make sense?
samtools view -f 0x2 mappings.bam | gawk -v rlen=${read_len} 'BEGIN{FS="\t"; OFS="\t"}{($16~/NM/)?split($16,nm,":"):split($17,nm,":"); print $3,$4,$4+rlen,"NM",nm[3]}' > NM_counts.bed
The NM seems to move around in my BAM between 16 and 17th columns, hence some edits
To be then merged / grouped-by and summarised
Best
Stephane
NM is not always in the $12 , the END position of a read on the REF is not START+read_len (indels, clipping......)
here is a solution using bioalcidaejdk: http://lindenb.github.io/jvarkit/BioAlcidaeJdk.html
java -jar dist/bioalcidaejdk.jar -e 'stream().filter(R->!R.getReadUnmappedFlag() && R.hasAttribute("NM")).forEach(R->println(R.getContig()+"\t"+(R.getStart()-1)+"\t"+R.getEnd()+"\t"+R.getIntegerAttribute("NM")));' in.bam
rotavirus 5 75 7
rotavirus 5 75 3
rotavirus 5 75 2
rotavirus 6 74 6
rotavirus 6 76 6
rotavirus 6 71 5
rotavirus 6 76 7
rotavirus 6 76 6
rotavirus 6 76 6
rotavirus 6 76 7
rotavirus 6 76 5
meanwhile rediscovered bedtools bamtobed which seems to do precisely what I want (-ed)
when I can compile the java above, I will compare.
S
Login before adding your answer.
Traffic: 2229 users visited in the last hour
what about Zev's solution: the NM tag ?