Substitution error rate and indel error rate from BAM file
1
2
Entering edit mode
9.4 years ago
Andrew ▴ 60

I want to determine the substitution error rate and the indel error rate for a given BAM file.

I've been reading over the following article.

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3278762/#B1

First off I just want to confirm my definitions of a substitution error and indel error are correct.

A substitution error would be when the sequencer substitutes a different base than the actual base in the sequence being sequenced. So if the actual sequence was ..ATGG.. but the sequencer read ..ACGG.., then the reading of C instead of a T would be a substitution error, since one of the reads has a C which shouldn't be there.

An indel error would be when the sequencer deletes or inserts a base that is different from the actual sequence being looked at. So if the actual sequence was ..GATG.. and the sequencer read ..GTG.. then the A being deleted would be an indel error., since the read is missing an A that should be there.

Assuming those are right, I am also confused with how to determine them from a BAM file. If a tool exists then that would be great

For substitution I know it would involve the quality score for a given base and the number of mismatches. By mismatches I just mean if a given base has a coverage of 100X and 98 are T but the other two aren't, then there are 2 mismatches for that position. I'm just not sure how exactly I would combine these to find a rate.

For indel errors I know it would involve homopolymers like mentioned in the paper but I have no idea how to find the rate.

Indel Substitution Error-rate BAM • 6.9k views
ADD COMMENT
3
Entering edit mode
9.4 years ago

BBMap gives very complete statistics for substitution, insertion, deletion, and N (no-call) rates; they get printed to the screen after mapping. Also, it can generate a histogram with the mhist flag, showing the match/sub/ins/del/N rate by position in the read; ehist shows how many reads have a specific number of substitutions; and indelhist gives the counts of indels of specific lengths.

bbmap.sh ref=reference.fa in=reads.fq out=mapped.sam mhist=mhist.txt ehist=ehist.txt indelhist=indelhist.txt

These histograms can also be generated by Reformat from a sam or bam file, but only if the cigar strings are in sam 1.4+ format (with X and = in the cigar strings instead of M). Most aligners do not generate that kind of cigar string, but you can easily check by looking at one of the mapped reads in the bam file.

If you want to exclude actual polymorphisms in the genome that do not come from sequencing, it's probably best to call variations and mutate the reference according to the calls, then map again. Alternately, for SNPs at least, you can get an estimate of the number of substitutions by running an error-correction program which will generally tell you how many errors it finds and corrects.

ADD COMMENT
4
Entering edit mode

Brian, Is there anything that BBMap doesn't do? :-)

ADD REPLY
1
Entering edit mode

There are a few, but I'm working as fast as I can to fix that ;)

ADD REPLY
1
Entering edit mode

Just as additional clarification, I presume that the statistics that bbmap outputs are based on the CIGAR elements in the mappings only (and this is also how the default statistics accumulated by RTG during mapping or calibration work), so the output "error" profile will be a composite of both the sequencer error profile and any genomic variation present. If this is an issue, you would also need to account for this, either by excluding sites of known variation during statistic calculation (rtg calibrate lets you specify a BED file of regions to compute statistics from, GATK BaseRecalibrator takes a list of known sites to exclude), or you could simply subtract the rates of genomic variation, either from those computed from a VCF for the genome, or some prior rates for an average human genome). The degree to which this matters depends on the level of sequencing error -- for sequencers with low error rates, genomic variation may be the dominating factor.

ADD REPLY
0
Entering edit mode

That's correct. For Illumina, though, it's also possible to evaluate the reads by mapping them to phix if that was spiked in, which would remove virtually all noise caused by genomic differences. It might give a somewhat different answer than mapping to the main genome if the GC differs substantially.

ADD REPLY
0
Entering edit mode

Hi Brian!

I'm actually trying to use reformat.sh from bbmap suit to calculate the indel rate from a bam file.

I've tried to launch it this way:

./reformat.sh in='assemblyVSIlluna-sorted.bam' out='assemblyVSIllumina-indels.txt'

But the output generated is in fastq format. There is a lot of options, and I'm not really sure about the right one to use.

Any advice about it ?

Thanks in advance for your help,

Roxane

ADD REPLY
0
Entering edit mode

Actually, I was thinking about indelhist=<file>, but I think I should convert my bam file inton a sam 1.4+, I guess that the actual version of samtools is outputting in this format !

ADD REPLY
1
Entering edit mode

Hi Roxane,

The current version of BBMap supports old-style cigar strings as long as the reads have MD tags, so Reformat no longer requires sam 1.4. Sam is unfortunately a poorly-designed format with half of the alignment information in the cigar string and the other half in the (optional) MD tag, which makes it extremely convoluted to parse, and impossible in the general case to interpret a sam file without the reference, since the MD tag is optional. But fortunately many aligners that use old-style cigar strings include the MD tag by default.

Samtools does not convert between old and new (1.4+) cigar strings as far as I know. By default, at least, even the latest version of samtools will output old-style cigar strings if they were present in the original bam file.

ADD REPLY
0
Entering edit mode

Hi Brian!

I could not find it, but maybe I just missed it. Is there something similar to "mhist", but to get number of match/sub/indel per position of the reference? I am interested in number of errors, both biological and technological, per nucleotide of reference, not finding SNPs etc.

Cheers!

ADD REPLY

Login before adding your answer.

Traffic: 1920 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6