How to get percent identity from bam file ?
3
2
Entering edit mode
6.7 years ago
David ▴ 240

I have aligned raw 1D nanopore reads to a bacterial reference assembly using BWA –MEM and got the sam and bam file.

Is there a quick way to get the of percentage identity for each read from the sam/bam file ???

I would like to generate an histogram with reads count and the percent identity to the reference.

Thanks,

bwa bam samtools • 6.8k views
ADD COMMENT
0
Entering edit mode

Here is the output ??

bioalcidaejdk   -e 'stream().map(R->R.getReadUnmappedFlag()?0:(int)(100.0*(R.getReadLength()-R.getIntegerAttribute("NM"))/(double)R.getReadLength())).collect(Collectors.groupingBy(Function.identity(), Collectors.counting())).forEach((K,V)->{println(K+"\t"+V);});' 1.bam

[DEBUG][BioAlcidaeJdk] Compiling :
         1  import java.util.*;
         2  import java.util.stream.*;
         3  import java.util.regex.*;
         4  import java.util.function.*;
         5  import htsjdk.samtools.*;
         6  import htsjdk.samtools.util.*;
         7  import htsjdk.variant.variantcontext.*;
         8  import htsjdk.variant.vcf.*;
         9  import com.github.lindenb.jvarkit.util.bio.fasta.FastaSequence;
        10  import javax.annotation.Generated;
        11  import htsjdk.variant.vcf.*;
        12  /** begin user's packages */
        13  /** end user's packages */
        14  @Generated(value="BioAlcidaeJdk",date="2018-03-22T12:40:58+0100")
        15  public class BioAlcidaeJdkCustom1825748580 extends com.github.lindenb.jvarkit.tools.bioalcidae.BioAlcidaeJdk.SAMHandler {
        16    public BioAlcidaeJdkCustom1825748580() {
        17    }
        18    @Override
        19    public void execute() throws Exception {
        20     // user's code starts here 
        21     stream().map(R->R.getReadUnmappedFlag()?0:(int)(100.0*(R.getReadLength()-R.getIntegerAttribute("NM"))/(double)R.getReadLength())).collect(Collectors.groupingBy(Function.identity(), Collectors.counting())).forEach((K,V)->{println(K+"\t"+V);});
        22      //user's code ends here 
        23     }
        24  }

I don´t get anything else ???

ADD REPLY
1
Entering edit mode

Here is the output, redirecting stderr to /dev/null, using a BAM processed with bwa/samtools: (https://github.com/lindenb/jvarkit/blob/master/src/test/resources/S1.bam )

$ java -jar dist/bioalcidaejdk.jar -e 'stream().map(R->R.getReadUnmappedFlag()?0:(int)(100.0*(R.getReadLength()-R.getIntegerAttribute("NM"))/(double)R.getReadLength())).collect(Collectors.groupingBy(Function.identity(), Collectors.counting())).forEach((K,V)->{println(K+"\t"+V);});' 2> /dev/null src/test/resources/S1.bam

97  479
98  696
100 476
90  1
91  4
92  28
94  73
95  241
ADD REPLY
0
Entering edit mode

Please use ADD COMMENT/ADD REPLY when responding to existing posts to keep threads logically organized. This comment belongs under @Pierre's answer.

ADD REPLY
1
Entering edit mode
6.7 years ago

If you would use google you might find some hits. For me the top hit was... my own blog, containing a python solution based on either the NM or MD tag.

Shameless advertisement: plots like that (similar) are available in NanoPlot.

ADD COMMENT
0
Entering edit mode

I´ll check Nanoplot. thanks

ADD REPLY
1
Entering edit mode
2.2 years ago

Very old thread, but still few good answers for per read identity. Thought I would mention a bedtools alternative (so not purely bam based, and not so performant, but maybe still interesting)

# Step 1 Create a bed file with the start position, stop position, and edit distance based on the NM tag
bedtools bamtobed -ed -i x.bam > ed.bed &

The format of the bed file looks like this.
chr                                  start     stop      readname                              number_mismatches  read_orientation
chr                                  601       55050     8a6c37a9-44b6-416d-bf88-504e064013b7  2453  -

# Step2: Now calculate read length and then percent mismatches per read (or whatever you like in R/Python/Awk etc)
# Example only without checking the usual bed off-by-one start/stop coordinate nightmare
awk -F"\t" '{print $1"\t"$3-$2"\t"$5"\t"($5/($3-$2)*100)}'  ed.bed | head

# output format
chr   read_len NM     %identity based on NM
chr   18275   1420    7.77
chr   20187   737     3.65

Of course, a purely BAM based solution would be preferable but I haven't run across one yet (at least not in c, c++, rust etc, I don't want to use a python implementation).

Remember to think about removing supplementary alignments etc from the bam before you convert it to bed format.

ADD COMMENT
0
Entering edit mode
6.7 years ago

use the edit distance 'NM' ? Here is a solution using bioalcidaejdk: http://lindenb.github.io/jvarkit/BioAlcidaeJdk.html

$ java -jar dist/bioalcidaejdk.jar -e 'stream().map(R->R.getReadUnmappedFlag()?0:(int)(100.0*(R.getReadLength()-R.getIntegerAttribute("NM"))/(double)R.getReadLength())).collect(Collectors.groupingBy(Function.identity(), Collectors.counting())).forEach((K,V)->{println(K+"\t"+V);});' input.bam
ADD COMMENT
0
Entering edit mode

My bam file was generated as follows:

minimap2 -ax asm10 -t 8 canuAssembly.contigs.mmi demultiplex/BC01.fastq.gz | samtools sort -@8 -o 1.sorted.bam

Ignoring SAM validation error: ERROR: Record 15966, Read name 1b0bcee6-ba9e-409d-9172-4817a6e6fc4a, CIGAR covers 5190 bases but the sequence is 0 read bases Ignoring SAM validation error: ERROR: Record 15970, Read name d47ea4a6-ac3a-491d-ad02-fca9ed47a2fd, CIGAR covers 4084 bases but the sequence is 0 read bases Ignoring SAM validation error: ERROR: Record 15971, Read name 1b0bcee6-ba9e-409d-9172-4817a6e6fc4a, CIGAR covers 5190 bases but the sequence is 0 read bases Ignoring SAM validation error: ERROR: Record 15973, Read name 80fa6511-2211-43fd-a6cf-fc487c110a54, CIGAR covers 2272 bases but the sequence is 0 read bases Ignoring SAM validation error: ERROR: Record 16109, Read name 94561bf6-e9dd-4751-acd7-3593ac6f84e5, CIGAR covers 12412 bases but the sequence is 0 read bases Ignoring SAM validation error: ERROR: Record 16113, Read name 52159ffb-ab8a-409f-add4-142e065e463a, CIGAR covers 5135 bases but the sequence is 0 read bases Ignoring SAM validation error: ERROR: Record 16114, Read name 530f195d-2f2b-4c06-a2c9-f2e565ae779c, CIGAR covers 19737 bases but the sequence is 0 read bases 0 37748 96 1207

0 37748

96 1207

-2147483648 180

97 1054

98 837

99 266

70 1

74 2

80 1

82 1

83 4

84 1

85 11

86 16

87 73

88 389

89 1427

90 2762

91 2513

92 1589

93 1181

94 1275

95 1331

ADD REPLY
0
Entering edit mode

so this is your histogram : 1st column is read-length - NM, second column is occurence. negative number i (2147483648) might be due to some error in your bams (there is a CIGAR string but the SEQ is empty)

ADD REPLY
0
Entering edit mode

Sorry Pierre but I think the original question asked a read by read assessment of percent identity, rather than histogram. I was looking for that as well. Is there any way to tailor this so that one column is read name and second column is percent identity to ref? Thanks

ADD REPLY

Login before adding your answer.

Traffic: 1744 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6