I am a recent graduate of Bioinformatics. As a research intern, I have been tasked with creating a pangenome and then call variants using short read data. I came across vg and vg giraffe and found it useful but I have a few doubts regarding the workflow.
Basically, I find it confusing that there is one pipeline using manual indexing, and another using autoindexing. That is, I'm unsure whether to follow the classical manual indexing+gbwt+giraffe+call or follow autoindex+giraffe+call. I gave both a try, but need clarity or confirmation on which to follow. pipeline 1:
delly call -o samples_delly.bcf -g teak_tectona_grandis_26Jun2018_7GlFM_fmt_tp.fa ANP1.sorted.bam ANP4.sorted.bam PB27.sorted.bam TNT17.sorted.bam
bcftools view samples_delly.bcf > samples_delly.vcf
vg construct -r teak_tectona_grandis_26Jun2018_7GlFM_fmt_tp.fa -v samples_delly.vcf.gz -p -a -S > pangenome_graph.vg
vg autoindex --workflow giraffe --prefix pangenome_graph_ai --ref-fasta teak_tectona_grandis_26Jun2018_7GlFM_fmt_tp.fa --vcf samples_delly.vcf.gz --target-mem 12GB --threads 8
vg giraffe -Z pangenome_graph_ai.giraffe.gbz -m pangenome_graph_ai.min -d pangenome_graph_ai.dist -f ANP1_R1_paired_trimmed.fq.gz -f ANP1_R2_paired_trimmed.fq.gz > samples_ai.gam -p
vg convert -p -t 8 pangenome_graph_ai.giraffe.gbz > pangenome_graph_gbz_ai.pg
vg validate pangenome_graph_ai.giraffe.gbz -a samples_ai.gam
vg validate pangenome_graph_gbz_ai.pg -a samples_ai.gam
vg augment pangenome_graph_gbz_ai.pg samples_ai.gam -A samples_ai.aug.gam -m 2 -q 10 -v -t 16 > samples_gbz_ai.aug.pg -p
vg snarls -t 8 samples_gbz_ai.aug.pg > samples_gbz_ai.aug.snarls
vg pack -x samples_gbz_ai.aug.pg -g samples_ai.aug.gam -Q 5 -t 16 -o samples_gbz_ai.aug.pack
vg call samples_gbz_ai.aug.pg -r samples_gbz_ai.aug.snarls -k samples_gbz_ai.aug.pack -t 8 > samples_variants_gbz_ai.aug.vcf --progress
Output: snippet
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SAMPLE
Pseudomolecule_01 16422 >2459171>2459175 AGTGTTTGAGCGGTGTGTTCGATGCTGCACCCGGGGCAACTGGACCACACAAATCACACACACCGACCG A 9.54243 lowad AT=>2459171>2459172>2459173>2459174>2459175,>2459171>2459175;DP=0 GT:DP:AD:GL:GQ:GP:XD:MAD 1/1:0:0,0:-0.057016,-0.057016,-0.057016:0:-2.197225:0.131285:0
Pseudomolecule_01 21134 >2459320>2459322 TTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCT T 9.54243 lowad AT=>2459320>2459321>2459322,>2459320>2459322;DP=0 GT:DP:AD:GL:GQ:GP:XD:MAD 1/1:0:0,0:-0.029361,-0.029361,-0.029361:0:-2.197225:0.067606:0
Pseudomolecule_01 49789 >2460217>2460220 GTGTTGCAAACTTAAGCTTAACTTGGACTTCAATTT G 9.54243 lowad AT=>2460217>2460218>2460219>2460220,>2460217>2460220;DP=0 GT:DP:AD:GL:GQ:GP:XD:MAD 1/1:0:0,0:-0.094744,-0.094744,-0.094744:0:-2.197225:0.218156:0
Pseudomolecule_01 62342 >2460611>2460640 TCATCATGTTGGAAAATTGGTCTTGAGAAGTTGAGTTTGATTCAATTGTGACCAAGTTTAAGTGGCATACACTTGATTAATATGGAATCAAGTAGTGTTATTATTATCTACATTTTATGTAGATTGTTGTAGTGTATTTATATTCTCAAAACCAATAATATGAACAAATTAATGCACAAATTTGGTCACTTAAGTATTAAATGGGAGATATATATGACATCATTCCATGTAGAAAAAAAAATGTATACAACAGCCAATAATGGGCACATTAATATTCAATGTGTATGTTATAAACATACGGATGCCCATGTGTCTGGAAAAGAGTCCAGATACGTCATAAGATGTCTAGATTCGATGAACACCTAAATGGCAGTTTCTTTTTCCGTTTAATGGCATTCATGAGGGCCATTCATGACTGGGCCATTAGGGCCCAGTAACCCCCATAAGGAAAGGCCTATATAAGGCCTTCTCCTCTCATTTGCATTTCACACCGACATCGAAGTTTCAAAGTTTCATAACATTCTTTTCTCTAAAATGTTCTCTCATGGCAGCATAATCTGCAAAGTTCTGTGTTTCTGAAATTCGTTGGGATTTGAGATTACAGTGCTCGTATCTCAAATCGAGAGTTGTTGTATCTTGAGAGACAAATTGCCGTTAAACCGTGAGCACTAGTACGGGACAATATTGTCTTAAGGGTAGAGTGTTTCACTCGCCTCAACTCATTAACGTCGGTTAAGTTTGTGCTGACTGTACTCACTGTGTCGGCTAAGTTTGTGCTGACTGTACTTATTGTTGGCTAAGTTTGTGCTAACTGTACTTATTGTATTGGCTAAATTGTGCTAACAATTTGGTGGATACGCCAAGTGCTCAAAATTATTGTTGTTTATCCAACAC T 9.54243 lowad AT=>2460611>2460612>2460613>2460614>2460615>2460616>2460617>2460618>2460619>2460620>2460621>2460622>2460623>2460624>2460625>2460626>2460627>2460628>2460629>2460630>2460631>2460632>2460633>2460634>2460635>2460636>2460637>2460638>2460639>2460640,>2460611>2460640;DP=0 GT:DP:AD:GL:GQ:GP:XD:MAD 1/1:0:0,0:-0.090614,-0.090614,-0.090614:0:-2.197225:0.208646:0
Pseudomolecule_01 91276 >2461517>2461520 CACCCAGACGCACGCATAACAAATCACACGCGCGCGCACAGACACACACAGATGCAGCAACTA C 9.54243 lowad AT=>2461517>2461518>2461519>2461520,>2461517>2461520;DP=0 GT:DP:AD:GL:GQ:GP:XD:MAD 1/1:0:0,0:-1.130604,-1.130604,-1.130604:0:-2.197225:2.603306:0
(I tried using delly to check if variants are indeed present in my file, because the same workflow with the original vcf gave empty file with no results like case 2)
pipeline 2:
samtools faidx teak_tectona_grandis_26Jun2018_7GlFM_fmt_tp.fa
vg construct -r teak_tectona_grandis_26Jun2018_7GlFM_fmt_tp.fa -v maf135.vcf.gz -a -S -p > graph.vg
vg convert -f graph.vg > graph.gfa
vg index -x graph.xg -g graph.gcsa -t 8 -p graph.vg
vg gbwt -x graph.xg -E -o ref_paths.gbwt -p
vg gbwt -x graph.xg ref_paths.gbwt --gbz-format -g graph.gbz -p
vg index graph.xg -j graph.dist
vg minimizer -t 16 -d graph.dist graph.gbz -o graph.min
vg giraffe -Z graph.gbz -m graph.min -d graph.dist -f ANP1_R1_paired_trimmed.fq.gz -f ANP1_R2_paired_trimmed.fq.gz -p > ANP1_sample_giraffe.gam
vg pack -x graph.gbz -t 8 -Q 5 -g ANP1_sample_giraffe.gam -o ANP1_sample.pack
vg snarls -t 8 graph.gbz > ANP1_sample.snarls
vg call graph.gbz -r ANP1_sample.snarls -k ANP1_sample.pack -z -t 8 > ANP1_variants_vg_call.vcf
Output: empty file
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SAMPLE
I am also attaching samples of my vcf and fasta VCF:
##fileformat=VCFv4.2
##fileDate=20231208
##source=PLINKv1.90
##contig=<ID=Pseudomolecule_01,length=20658029>
##contig=<ID=Pseudomolecule_02,length=19432826>
##contig=<ID=Pseudomolecule_03,length=18399125>
##INFO=<ID=PR,Number=0,Type=Flag,Description="Provisional reference allele, may not be based on real reference genome">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 1_1 2_1 3_1 4_1 5_1 6_1 7_1 8_1 9_1 10_1 11_1 12_1 13_1 14_1 15_1 16_1 17_1 19_1 20_1 21_1 22_1 23_1 24_1 25_1 26_1 27_1 28_1 29_1 30_1 31_1 32_1 33_1 34_1 35_1 38_1 40_1 41_1 42_1 43_1 44_1 45_1 46_1 47_1 49_1 50_1 51_1 52_1 54_1 55_1 56_1 57_1 58_1 59_1 60_1 61_1 62_1 63_1 64_1 65_1 66_1 67_1 68_1 69_1 70_1 71_1 72_1 74_1 75_1 76_1 77_1 78_1 79_1 81_1 82_1 83_1 84_1 85_1 86_1 87_1 88_1 90_1 91_1 92_1 93_1 94_1 95_1 97_1 98_1 99_1 100_1 101_1 102_1 103_1 104_1 105_1 107_1 110_1 111_1 112_1 113_1 114_1 115_1 117_1 118_1 119_1 120_1 121_1 122_1 123_1 124_1 125_1 127_1 128_1 129_1 130_1 131_1 132_1 133_1 134_1 135_1 136_1 137_1 138_1 139_1 140_1 141_1 142_1 143_1 144_1 145_1 146_1 147_1 148_1 149_1 150_1
Pseudomolecule_01 6175 Pseudomolecule_01_6175 C T 0 . PR GT 0/0 0/0 0/0 0/1 0/0 1/1 0/0 0/0 1/1 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 1/1 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/1 1/1 0/0 0/0 0/0 0/0 1/1 1/1 0/0 0/0 0/1 1/1 0/0 0/1 1/1 0/0 1/1 ./. 0/0 0/0 0/0 ./. 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 1/1 0/1 1/1 ./. 0/0 0/0 0/0 0/0 0/1 1/1 0/1 1/1 ./. 1/1 1/1 1/1 1/1 0/0 0/0 0/0 1/1 0/0 0/0 ./. 1/1 0/0 0/0 0/0 0/0 0/1 1/1 0/0 ./. ./. 0/1 1/1 0/0 0/0 0/1 ./. 0/0 1/1 0/0 0/1 0/1 0/0 1/1 0/0 0/0 0/0 0/0 0/0 1/1 0/0 ./. 0/0 0/0 0/0 0/0 0/0 0/0 1/1 1/1 0/0 0/0 0/1 0/0 0/0
Pseudomolecule_01 6205 Pseudomolecule_01_6205 G A 0 . PR GT 0/0 0/0 0/1 0/0 0/0 0/0 0/0 0/0 0/0 0/1 0/1 0/0 0/1 0/0 0/1 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/1 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/1 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 ./. 0/0 0/0 0/0 ./. 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/1 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 ./. 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/1 0/0 ./. 0/0 0/1 0/0 0/0 0/0 0/1 0/0 0/1 ./. ./. 0/0 0/0 0/1 0/0 0/0 ./. 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 ./. 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/1
Pseudomolecule_01 6229 Pseudomolecule_01_6229 C T 0 . PR GT 0/0 0/0 0/1 0/0 0/0 0/0 0/0 0/0 0/0 0/0 1/1 0/0 0/1 0/0 0/1 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/1 0/0 0/0 0/1 0/0 0/1 0/0 0/1 0/0 0/1 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/1 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/1 0/0 0/0 0/0 ./. 0/1 0/0 0/0 0/0 0/0 0/0 0/0 0/1 0/0 0/0 0/0 0/0 0/0 ./. 0/1 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/1 0/0 0/0 0/0 0/1 0/1 0/0 0/0 0/1 0/0 0/1 ./. ./. 0/0 0/0 0/1 0/0 0/0 ./. 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 ./. 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/1 0/0 0/0 0/1
FASTA:
>Pseudomolecule_01
gacaaaggggaagtttgagatattttgattacacaagaacaatgaaacattaactgcatgatattagctacggcaataagaaataataaaggactgactg
ttgaagaacttcctggtcaacttgaggtattttatatgttaaacaaagtcaactttttctgttaactcagattgaacaaacacacacatgaaccatccaa
acgttggcttattaatctttggattattgcgtttacaattcttgcacttcttttcttttcttcaaaaactcattcaaaactgacatagcttcagtgaact
gagcagaatcccaagatgcctgcttattgcgctcatagtgattttcaccatcagagaggactgcagaatgtgatgtatactgactccctcgactatcctc
tggatcaattatatcagttggtaaagcccactccatctttttcttatctactaacttctgtcgcaaattcagctttttagcctcaacaatatcaccttct
tttatcagttccaattgagaaacacaattctccacctaaatcaaacacatgataTtttagctccaataacacactaccagtatgatgagtagaagTccaa
aatgagcacaataagcttatacaaactcacctctgccctgcttgctcgaaactgaaaacaatAaaaacacgttttgttcagaaggctattcagtgtatta
acaagcaatggattgtatgctggCgaaacaaggtcaatgtgaccacaatgacctgtgcaatgataAgcacgttggccacatgattgacatctgcatataa
caagatAagcatgaatacacattaatacactgcatgagatttttatccagattatatatacgtacttggATgcatcatcaaggggtcccatagccgaatc
ataaagtccgccagggattggcttgtccagtatgtcaagcaaattcggattcgtgatcttcacaacgctctggcgcctcacttctacatccgtcatgaac
ccaaaatgcacggcctccacaacctcagtcgcctaaataaaaagtgccaaacatcagaGaatagctagcaaaaaaatttattaccgcatgaaaagaatac
attaatagctaaatttgtttccaattcataatttagggatttaattccaaaaactcattcaagacaataacaaaataaacaatatcatgatgttttaaag
caacaacagcaacaacaactaaagggaaaaATACAATGAACATATTGTTTTACCATGGTTCACCAGTTGGCCCAAGGTCGGAATGGCCTAAGTATCAATA
ACTTCGAATTGCTTAGAAATAAATGACAAAAATTCAACGAGAAAAATATCAATTTTCTGAAAACTGAAGGCCAATTCGGTGcttccagCttcttgaagag
aagcttggttcccagctggtgcggtttgtaaactatacaaactggggTtttggaaaaccctatcagacaAaggtatatttGagaCagagagggggGGagg
cgcagatgggtcgggcccaccataaattagaacagtcattacctgaaaaaatcagcattataataacaaAtattctctttaaactttcctccaaattaaa
caaaaaaaaaaaaaaaatctttaaactttgcttcattttttattcttgaattacctcaaaatacacaatatatattaaataattttacaaATatatatat
atgacatatattaaaaaatattaaaataaaagatgaaatattatttgtaatagatagatggatccactgtttttcaattgtttcttgtttatttaattta
aaattagatttcacattctcgtcgaggagtgtaagagaattttaagagaattttattgattcctggaaacatatatattatgtttggtggAaggaaaaga
aatgtatattgtaaatattttAgaaaaattgtcttttttctctcaactacagtgatggATATTGTAacttctagCaatttcttataagaatatttggcca
aacagagatatatttaccacaaatttaaatattaagggacgtcaatttaaatacacatttatcgagttattaaatttcaattgtaatatcttGgtgtaca
ttggttgattaTaatatcaaatTaatatatatcccGaaacttaggggtgcacattgacacaagacaacttgTttGagggacgttgtattaattgactctt
gatgaactatGacgacttatttaataccaaatgtTaagGgagaatttcatttgactttatgctatataaaattttAatatgtttggatatgaattgttta
agagaaaatttgtagaaaagtgTaatgTattaggttggAAaaaaaatagaaaatataaatgaaaaatgtaatgTaTTaaaaaaaaaTacacacAcacaca
cacaaattaGaaaacataaatgtatttgtttgtgttatgaagtaagaatttggtgggaggaaaaaaCTaaCaTacatggcagtaatacatatgattttta
gtgagatcaacccttgagaaaaattaaataaaaatagttgaagggAaagaggtccaggaccatctgtgctatattatgccctagtattAgggttatttta
gcaCtgcacattacattctaatacttcattattatgtatataaatacattggtgaaataaaatacagaattaaaacaacataaaataaaatatgttcctt
gaagaagaccagaaatcaactggagaaaatatggattttcttggataaatttctccaactacaattatattatacttagttcccccccccccctcacgtt
tattgcagctcttccagagtttaatatccaatttctcattcattcgctattttttcaatttgttttcatacttgtgatcacaattcatcaacttctttag
agacttttcattaaaaatttggcaaattgtgaagcaagattcatgaaattcacacataatttattatttaaatgaaacgtttgtaaatttgttatttgta
ggaagagaagtgcaggtcattgcaatcatataagaaatatcatacttttatcacgaaaacttttgcaaagatgtcatcaacaggttagagatcaatgtac
ctttcagaagtcagaacaactcatacatgcagaaaacaaagcgagagcactgaaaacgtaattctcgttgagcaaagtgaccttcacaaaagtttaatgg
agtaccgagcataatttgcattttccatctactcaaaacattaacagtagttgcaattacagccgttggaaaaacaatcaccccccgttatctcaagtaa
ttgctggaagcaagatatcaattccacactagaaaccttactaccgctggtaaacaaagaaaccgaaatcgagaatcatatcaaggatgattaatacgaa
aatacacgcactacagaatagagcaactgttatgcctatcacaaggtatgatcacctccgaacagggtggcaccacacccctaccaggtgcagctccacg
gccggcagcctgagcctgccaaagagaaattcaagaatcagaagaatcaaccagataaaagttaattccatgagtaaccttctcaaatgcagtactcaga
caaaacatgttattcagcacaagagcaagaacaacttactctagctcgcatggcaacagcacgtccacgtccaactccaagtgccgaacccttgcccttt
aaggtggaaataaaacggatcagcaatatttgagcaggagaagccaaaatgaataaatgggtagcgacagaatgtaccttgatcctagcttccaagcgct
taaacattggagcattcttaagcatatctggaattaccatgaacctggtagaattgagaaattaataacagtccacataaaaattgatgcaaataagcac
tcccaatagggaatgagtatacacctactgactttgctgcctctaataaagacatgctcaagttgcgagaccttcccatcctgtaataaagttgaccaga
ttagcactccaaaatagataagataatgagctttaccattattcactaaaacctgcaaaaaaaactcaaccaccacattgcatctttttttttttttttt
aaatttaattactagatataattatatagcatacagtctgcatttctgataatcactatttgagtcagaacttaacagtatatttacaGCATCAAATTTC
TCATGCCCTATTACTCATCTTCCTTCATGAATAAAGGCTCTCATAACCACTTCCTTTGCCAACACTAAAGAGTTCTATATATATATTTTTGATTACTTTG
ACTCCTGTGCTTATCTTGATTAAGAGAGTGGGCCGAGTGGCAACACTTGCAAAACTAGAACAGTTTTCAGAAATCAGCATTACAATGCCCATGTTCTATG
edit: have attached outputs of these workflows
I proceeded with the
vg autoindex
workflow as follows:I used a different VCF input from delly tool since the older VCF was not having any SVs. Its snippet is given below:
But the final step of calling variants using vg call is giving vcf files where the variations are same throughout difference samples (ANP1, ANP4 etc). The variant VCFs are having different file size but same variation data?
Why is this so? What I actually need is a variant file with all the SVs in single VCF. Since none of the "merging" of intermediary files works, I though of calling individually and then concat or merge that.