I am seeing a 5-10% reduction in reads mapped to a genome graph compared to a linear reference of the same genome sequence. I have two graphs: one is the reference genome decorated with 5 million SNPs and the other is an empty graph created with vg using the reference genome sequence alone. Regardless of which graph or graph aligner I use (details below), I'm seeing similar results.
1) Is it unexpected to see fewer reads map to a genome graph than the linear reference, or am I misunderstanding? For example, is it known that vg map
is less powerful than a linear aligner like bwa-mem
for read mapping?
2) Are there useful parameters to adjust in vg's mapping tools to increase the # of mapped reads?
Species: Corn (Zea mays)
Genome: B73
SNPs: 5 million high quality SNPs (i.e. filtered to low false positive) from 100+ individuals
Read datasets: 150 bp, paired end, n=200 read datasets, low coverage Illumina DNA sequencing (0.1X, i.e. skim-seq)
Linear aligner: BWA
bwa index -p index_bwa genome.fa
bwa mem index_bwa reads_pe1.fq reads_pe2.fq > linear_align.sam
Genome graph 1: B73 decorated with 5 million SNPs
vg construct -r genome.fa -v snps_5M.vcf > graph_snp.vg
vg view -g graph_snp.vg > graph_snp.gfa #convert to gfa for indexing
Genome graph 2: B73, empty graph
vg construct -r genome.fa > graph_empty.vg
vg view -g graph_empty.vg > graph_empty.gfa #convert to gfa for indexing
Graph aligner 1: vg map
vg autoindex --prefix index_map -w map -g graph.gfa
vg map -x index_map.xg -g index_map.gcsa -f reads_pe1.fq -f reads_pe2.fq --gaf > graph_align_map.gaf
Graph aligner 2: vg giraffe
vg autoindex --prefix index_giraffe -w giraffe -g graph.gfa
vg giraffe -Z index_giraffe.giraffe.gbz -m index_giraffe.min -d index_giraffe.dist \
-f reads_pe1.fq -f reads_pe2.fq --gaf > graph_align_giraffe.gaf
In this case filtering for quality might help correct for this overattributing feature?