Hello,
Is there a long range aligner that could make a statistics outlook listing how many SNP, indels etc are observed in the alignment itself?
Thank you
Hello,
Is there a long range aligner that could make a statistics outlook listing how many SNP, indels etc are observed in the alignment itself?
Thank you
BBTools' callvariants.sh calls variants only from the alignments. So:
callvariants.sh in=nxyxx.bam ref=ref.fa
yields:
50 of 342488 variants passed primary filters (0.0146%).
Type Count Rate AD Depth AF Score Qual
Substitutions: 39 78.0% 2.5 3.3 0.807 22.6 36.4
Deletions: 5 10.0% 3.2 5.2 0.693 22.0 34.6
Insertions: 6 12.0% 3.0 5.2 0.694 22.8 37.0
Variation Rate: 1/87048
Homozygous: 50 100.0%
Does that look like what you wanted? You can add the clearfilters
flag to get stats on ALL observed variants (including the ones that are likely sequencing error). Alternatively, during alignment, BBMap provides some of these statistics:
Read 1 data: pct reads num reads pct bases num bases
mapped: 83.3029% 3316002 83.6085% 488906265
unambiguous: 82.2136% 3272637 82.5206% 482544605
ambiguous: 1.0894% 43365 1.0879% 6361660
low-Q discards: 0.0000% 0 0.0000% 0
perfect best site: 68.1714% 2713667 68.3977% 399960043
semiperfect site: 74.7275% 2974642 74.9602% 438334497
Match Rate: NA NA 98.2992% 480890022
Error Rate: 10.0748% 340829 0.2197% 1074954
Sub Rate: 9.5862% 324301 0.1463% 715579
Del Rate: 0.4477% 15145 0.0622% 304340
Ins Rate: 0.4401% 14890 0.0112% 55035
N Rate: 8.4204% 284863 1.4811% 7245629
It also produces a file via the "mhist" flag that you can use to visualize the distribution of the different variant types across the read length, like this:
I mainly use that when evaluating new Illumina sequencing machines.
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
What is a "long range" aligner?
something like MAFFT, Clustal, Muscle etc and that work on fasta files. Not Bowtie or BWA, which are needed for the short-read fastqfiles
FYI aligners take reads and align to a reference genome, returning the position of the read and the a few other statistics. They are not interested in variant calling per se but how many differences between the read and the reference. If you are interested in SNPs, INDELS, etc you need to use a variant caller. Variant callers look at the differences between the reads and the reference to call variants.
I think by 'long-range' you're referring to long-reads?
Yes, long-reads. I understand the point, but I would like to know what is the identity between two or more sequences, how many indels or snp are presents.