Question

vg: Mapping fewer reads to a genome graph than a linear reference

3

Entering edit mode

23 months ago

John Lloyd ▴ 30

I am seeing a 5-10% reduction in reads mapped to a genome graph compared to a linear reference of the same genome sequence. I have two graphs: one is the reference genome decorated with 5 million SNPs and the other is an empty graph created with vg using the reference genome sequence alone. Regardless of which graph or graph aligner I use (details below), I'm seeing similar results.

1) Is it unexpected to see fewer reads map to a genome graph than the linear reference, or am I misunderstanding? For example, is it known that vg map is less powerful than a linear aligner like bwa-mem for read mapping?

2) Are there useful parameters to adjust in vg's mapping tools to increase the # of mapped reads?

Species: Corn (Zea mays)

Genome: B73

SNPs: 5 million high quality SNPs (i.e. filtered to low false positive) from 100+ individuals

Read datasets: 150 bp, paired end, n=200 read datasets, low coverage Illumina DNA sequencing (0.1X, i.e. skim-seq)

Linear aligner: BWA

bwa index -p index_bwa genome.fa
bwa mem index_bwa reads_pe1.fq reads_pe2.fq > linear_align.sam

Genome graph 1: B73 decorated with 5 million SNPs

vg construct -r genome.fa -v snps_5M.vcf > graph_snp.vg
vg view -g graph_snp.vg > graph_snp.gfa #convert to gfa for indexing

Genome graph 2: B73, empty graph

vg construct -r genome.fa > graph_empty.vg
vg view -g graph_empty.vg > graph_empty.gfa #convert to gfa for indexing

Graph aligner 1: vg map

vg autoindex --prefix index_map -w map -g graph.gfa 
vg map -x index_map.xg -g index_map.gcsa -f reads_pe1.fq -f reads_pe2.fq --gaf > graph_align_map.gaf

Graph aligner 2: vg giraffe

vg autoindex --prefix index_giraffe -w giraffe -g graph.gfa
vg giraffe -Z index_giraffe.giraffe.gbz -m index_giraffe.min -d index_giraffe.dist \
    -f reads_pe1.fq -f reads_pe2.fq --gaf > graph_align_giraffe.gaf

giraffe mapping pangenome vg • 1.3k views

ADD COMMENT • link updated 22 months ago by samuel.a.odonnell ▴ 590 • written 23 months ago by John Lloyd ▴ 30

score 1 · Answer 1 · 2023-05-26

1

Entering edit mode

23 months ago

colindaven 7.4k

I learned recently from others in pangenomics who have done mapping with bwa that bwa tends to overattribute reads to a reference sequence. Bowtie2 does not, apparently, and this might be a better comparison

So your results are not unexpected for me.

That said, I find pangenome mappers to be in their infancy (at least in terms of speed in my hands), which has been admitted by pangenome tool devs, with the exception of GraphAligner for long reads.

ADD COMMENT • link 23 months ago by colindaven 7.4k

0

Entering edit mode

In this case filtering for quality might help correct for this overattributing feature?

ADD REPLY • link 22 months ago by samuel.a.odonnell ▴ 590