Hi,
I'm trying to detect indels between pairs of homeologous scaffolds in a genome assembly.
My method so far is as follows:
1, Aligned pairs with nucmer
"nucmer -p " + mapping + row[0] + "." + row[1] + " " + seq1 + " " + seq2
2, Generated variant information with mummer show-snps
"show-snps -T " + delta + " > " + snps
3, Converted this output to vcf
"mumSNPs2vcf.py --input-header --output-header --snps " + snps + " --reference " + ref + " --type INDEL > " + snps.replace(".svs",".vcf").replace("alignments","vcfs")
4, Ran the output into bcftools stats
"bcftools stats " + vcf + " > " + stats + pfx
5, Then generated plots
"plot-vcfstats " + vcf + " -p " + stats + "plots/" + pfx
My issue comes from that bcftools stats and the following plots only show indels up to 60 bp in size, when i can see clearly in synteny alignments that there are indels larger than this between each pair.
An example bcftools stats output looks like this:
# This file was produced by bcftools stats (1.14+htslib-1.14) and can be plotted using plot-vcfstats.
# The command line was: bcftools stats ../FINAL_ASSEMBLY/pairs/indels/vcfs/Scaffold_7.Scaffold_10.vcf
#
# Definition of sets:
# ID [2]id [3]tab-separated file names
ID 0 ../FINAL_ASSEMBLY/pairs/indels/vcfs/Scaffold_7.Scaffold_10.vcf
# SN, Summary numbers:
# number of records .. number of data rows in the VCF
# number of no-ALTs .. reference-only sites, ALT is either "." or identical to REF
# number of SNPs .. number of rows with a SNP
# number of MNPs .. number of rows with a MNP, such as CC>TT
# number of indels .. number of rows with an indel
# number of others .. number of rows with other type, for example a symbolic allele or
# a complex substitution, such as ACT>TCGA
# number of multiallelic sites .. number of rows with multiple alternate alleles
# number of multiallelic SNP sites .. number of rows with multiple alternate alleles, all SNPs
#
# Note that rows containing multiple types will be counted multiple times, in each
# counter. For example, a row with a SNP and an indel increments both the SNP and
# the indel counter.
#
# SN [2]id [3]key [4]value
SN 0 number of samples: 0
SN 0 number of records: 191914
SN 0 number of no-ALTs: 0
SN 0 number of SNPs: 141344
SN 0 number of MNPs: 0
SN 0 number of indels: 50570
SN 0 number of others: 0
SN 0 number of multiallelic sites: 1238
SN 0 number of multiallelic SNP sites: 1238
# TSTV, transitions/transversions:
# TSTV [2]id [3]ts [4]tv [5]ts/tv [6]ts (1st ALT) [7]tv (1st ALT) [8]ts/tv (1st ALT)
TSTV 0 75388 67211 1.12 74979 66365 1.13
# SiS, Singleton stats:
# SiS [2]id [3]allele count [4]number of SNPs [5]number of transitions [6]number of transversions [7]number of indels [8]repeat-consistent [9]repeat-inconsistent [10]not applicable
SiS 0 1 142599 75388 67211 50570 0 0 50570
# AF, Stats by non-reference allele frequency:
# AF [2]id [3]allele frequency [4]number of SNPs [5]number of transitions [6]number of transversions [7]number of indels [8]repeat-consistent [9]repeat-inconsistent [10]not applicable
AF 0 0.000000 142599 75388 67211 50570 0 0 50570
# QUAL, Stats by quality
# QUAL [2]id [3]Quality [4]number of SNPs [5]number of transitions (1st ALT) [6]number of transversions (1st ALT) [7]number of indels
QUAL 0 . 141344 74979 66365 50570
# IDD, InDel distribution:
# IDD [2]id [3]length (deletions negative) [4]number of sites [5]number of genotypes [6]mean VAF
IDD 0 -60 5 0 .
IDD 0 -59 1 0 .
IDD 0 -55 1 0 .
IDD 0 -54 2 0 .
IDD 0 -53 1 0 .
IDD 0 -52 3 0 .
IDD 0 -50 1 0 .
IDD 0 -49 1 0 .
IDD 0 -47 1 0 .
IDD 0 -45 2 0 .
IDD 0 -44 2 0 .
IDD 0 -43 4 0 .
IDD 0 -42 1 0 .
IDD 0 -41 1 0 .
IDD 0 -39 7 0 .
IDD 0 -38 1 0 .
IDD 0 -37 3 0 .
IDD 0 -36 10 0 .
IDD 0 -35 2 0 .
IDD 0 -34 2 0 .
IDD 0 -33 1 0 .
IDD 0 -32 3 0 .
IDD 0 -31 6 0 .
IDD 0 -30 5 0 .
IDD 0 -29 8 0 .
IDD 0 -28 3 0 .
IDD 0 -27 8 0 .
IDD 0 -25 9 0 .
IDD 0 -24 10 0 .
IDD 0 -23 12 0 .
IDD 0 -22 8 0 .
IDD 0 -21 23 0 .
IDD 0 -20 24 0 .
IDD 0 -19 20 0 .
IDD 0 -18 16 0 .
IDD 0 -17 28 0 .
IDD 0 -16 26 0 .
IDD 0 -15 42 0 .
IDD 0 -14 53 0 .
IDD 0 -13 66 0 .
IDD 0 -12 86 0 .
IDD 0 -11 108 0 .
IDD 0 -10 122 0 .
IDD 0 -9 176 0 .
IDD 0 -8 251 0 .
IDD 0 -7 341 0 .
IDD 0 -6 518 0 .
IDD 0 -5 707 0 .
IDD 0 -4 1330 0 .
IDD 0 -3 2005 0 .
IDD 0 -2 3692 0 .
IDD 0 -1 17970 0 .
IDD 0 1 12090 0 .
IDD 0 2 4065 0 .
IDD 0 3 2273 0 .
IDD 0 4 1363 0 .
IDD 0 5 725 0 .
IDD 0 6 592 0 .
IDD 0 7 367 0 .
IDD 0 8 252 0 .
IDD 0 9 216 0 .
IDD 0 10 163 0 .
IDD 0 11 128 0 .
IDD 0 12 105 0 .
IDD 0 13 75 0 .
IDD 0 14 67 0 .
IDD 0 15 67 0 .
IDD 0 16 37 0 .
IDD 0 17 19 0 .
IDD 0 18 28 0 .
IDD 0 19 21 0 .
IDD 0 20 20 0 .
IDD 0 21 18 0 .
IDD 0 22 19 0 .
IDD 0 23 10 0 .
IDD 0 24 9 0 .
IDD 0 25 12 0 .
IDD 0 26 5 0 .
IDD 0 27 5 0 .
IDD 0 28 5 0 .
IDD 0 29 11 0 .
IDD 0 30 3 0 .
IDD 0 31 4 0 .
IDD 0 32 4 0 .
IDD 0 33 5 0 .
IDD 0 34 6 0 .
IDD 0 35 2 0 .
IDD 0 36 2 0 .
IDD 0 37 1 0 .
IDD 0 38 2 0 .
IDD 0 39 1 0 .
IDD 0 40 4 0 .
IDD 0 41 3 0 .
IDD 0 42 2 0 .
IDD 0 44 2 0 .
IDD 0 46 2 0 .
IDD 0 47 3 0 .
IDD 0 48 1 0 .
IDD 0 49 1 0 .
IDD 0 50 1 0 .
IDD 0 51 3 0 .
IDD 0 55 1 0 .
IDD 0 56 1 0 .
IDD 0 57 1 0 .
IDD 0 58 1 0 .
IDD 0 59 1 0 .
IDD 0 60 18 0 .
# ST, Substitution types:
# ST [2]id [3]type [4]count
ST 0 A>C 7647
ST 0 A>G 19372
ST 0 A>T 15832
ST 0 C>A 7470
ST 0 C>G 2887
ST 0 C>T 18093
ST 0 G>A 18511
ST 0 G>C 2800
ST 0 G>T 7380
ST 0 T>A 15648
ST 0 T>C 19412
ST 0 T>G 7547
# DP, Depth distribution
# DP [2]id [3]bin [4]number of genotypes [5]fraction of genotypes (%) [6]number of sites [7]fraction of sites (%)
What am i missing here? Are there settings in either nucmer of bcftools to output/detect indels larger than 60bp? I am expecting indels ranging up to and beyond tens of thousands of bp.
please excuse the python formatting; all of the flags are still present.
I have now repeated this with a few different aligners and the 60 bp indel cap is persisting. Reading bcftools stats documentation and there is mention of it missing larger indels due to a length cap being applied. There has been a flag added,
--indel-size
, to bcftools stats development version. I will try increasing the threshold and see if this calls larger indels.