Question

Reduce set of chromosomes in Pangenome graph

0

Entering edit mode

18 months ago

anivlete • 0

Hi,

I was wondering if it is possible with vg to reduce the set of chromosomes in the pangenome graph. It would be analogous to reducing the set of chromosomes in your conventional linear genome reference to be able to do less computationally intensive iterations on research and development.

Thank you for any insights regarding this

pangenome vg • 4.8k views

ADD COMMENT • link updated 17 months ago by Jordan M Eizenga ▴ 650 • written 18 months ago by anivlete • 0

1

Entering edit mode

You might want to look at vg chunk -C

ADD REPLY • link 18 months ago by Jordan M Eizenga ▴ 650

0

Entering edit mode

Thank you!

I want to align short reads to the pangenome and surject back to hg38. I was wondering if you could confirm that the most suitable option for this would be to use the graph file referenced in the draft pangenome paper in the following location:

https://s3-us-west-2.amazonaws.com/human-pangenomics/index.html?prefix=pangenomes/freeze/freeze1/minigraph-cactus/hprc-v1.0-mc-grch38.xg

ADD REPLY • link 18 months ago by anivlete • 0

2

Entering edit mode

It depends which mapping tool you want to use. If you plan to use vg giraffe, the graph should be encoded as a .gbz, or as .gbwt + .gg. I believe the graphs used for the HPRC short read mapping/genotyping experiments were the ones prefixed by hprc-v1.0-mc-grch38-minaf.0.1 here:

https://s3-us-west-2.amazonaws.com/human-pangenomics/index.html?prefix=pangenomes/freeze/freeze1/minigraph-cactus/filtered/

ADD REPLY • link 18 months ago by Jordan M Eizenga ▴ 650

0

Entering edit mode

From the command used in the HPRC short read mapping/genotyping experiments (extracted from the WDL workflow) I can gather that the necessary files for alignment are: -g (GBWT graph), -H (GBWT index), -x (xg index or graph), -d (distance index), and -m (minimizer index)

In this case which one is the graph? -g or -x? (it seems to me is -x)

I chunked the main pangenome graph used in this study to hold only 2 chromosomes and produced a .xg graph. Do I need to recreate all the indices for this subset or can I use the ones from the parent (full set of chromosomes) graph? And if so what is the best way to go about it?

Thank you

ADD REPLY • link 18 months ago by anivlete • 0

0

Entering edit mode

If -x is the graph file in the alignment command, do I need to create an equivalent -g (GBWT graph) file? So you need two equivalent but differently formatted graph files?

ADD REPLY • link 18 months ago by anivlete • 0

0

Entering edit mode

For vg giraffe you don't need -x. The required input is .gbz or .gbwt + .gg. I don't know of any command to split up a GBZ/GBWT+GG by chromosome though.

ADD REPLY • link 18 months ago by Jordan M Eizenga ▴ 650

0

Entering edit mode

If I create a .vg graph file with a subset of chromosomes from the .xg pangenome file (using vg chunk) how can I obtain the minimal subset of files necessary to run vg giraffe? I know that from the .vg file I can create the .gbwt index BUT how can I create the corresponding .gg, .dist and .min files?

I don't see a way to vg convert .vg --> .gg

And the creation of index files with vg autoindex seems to be based on a .gfa file

Thank you

ADD REPLY • link 18 months ago by anivlete • 0

1

Entering edit mode

That's because they encode somewhat different information:

.xg and .vg: graph topology using nodes and edges
- .gbwt: a collection of haplotypes expressed as walks through a graph, which can be taken to define the edges of the graph (with an edge between each pair of subsequent steps in the walk)
- .gg: the nodes of a graph
- .gbz: essentially a .gbwt and .gg packaged together in one file

The .xg and .vg can be interconverted freely. A .gbz/.gbwt+.gg can be converted into an .xg or .vg, but the reverse is not possible. .xg and .vg lack the haplotype walks that you need to make the .gbwt.

As far as I know, nobody has written a tool to subset a .gbz/.gbwt+.gg by chromosome.

vg giraffe requires a .gbz/.gbwt+.gg (the graph/haplotypes), a .dist (distance index), and .min (minimizer index). However, the minimal input is just the .gbz/.gbwt+.gg. The other indexes are derived from the graph, and they can be constructed on-the-fly if you don't supply them.

You are also correct that vg autoindex takes GFA input. Its design is intentionally limited to widely-supported, text-based interchange formats like GFA, FASTA, VCF, etc. If you want to modify the graph input to vg autoindex, the way to do it would be to modify the GFA.

ADD REPLY • link 18 months ago by Jordan M Eizenga ▴ 650

0

Entering edit mode

Really useful information. Thank you

How about this approach for spliting into a few chroms:

vg chunk the pangenome .xg file w/ .gfa output format per chr
vg combine .gfa chrom files into a single stripped down .gfa
vg autoindex the stripped down .gfa to obtain the input giraffe files

Does this make sense? Do you think this would work or is it some piece of information missing? Thanks

ADD REPLY • link 18 months ago by anivlete • 0

0

Entering edit mode

I believe that would work, as long as the .xg and .gbz match.

ADD REPLY • link 18 months ago by Jordan M Eizenga ▴ 650

0

Entering edit mode

The .xg and .gbz files are the ones available for the HPRC pangenome short-reads experiments at: https://s3-us-west-2.amazonaws.com/human-pangenomics/index.html?prefix=pangenomes/freeze/freeze1/minigraph-cactus/filtered/

I was assuming they match. Do you happen to know if this is the case?

ADD REPLY • link 18 months ago by anivlete • 0

0

Entering edit mode

The above approach seems to work for me. After stripping down the graph file, I am able to align (vg giraffe) and vg surject to produce a .bam file

I get the following warnings:

# when aligning:
warning[vg::giraffe]: Encountered 100000 ambiguously-paired reads before finding enough unambiguously-paired reads to learn fragment length distribution. Are you sure your reads are paired and your graph is not a hairball?
warning[vg::giraffe]: Finalizing fragment length distribution before reaching maximum sample size mapped 215 reads single ended with 100000 pairs of reads left unmapped mean: 0, stdev: 1
warning[vg::giraffe]: Cannot cluster reads with a fragment distance smaller than read distance
Fragment length distribution: mean=0, stdev=1
Fragment distance limit: 2, read distance limit: 201
warning[vg::giraffe]: Falling back on single-end mapping

# when surjecting
warning:[vg::get_sequence_dictionary] No reference-sense paths available in the graph; falling back to generic paths.

Since I am not too familiar with this alignment style, I was wondering if you could comment if there are any catastrophic problems I should be aware of. It's a bit odd that it reverts to single-end mapping. Is this because there are too few reads? Does this have important consequences?

One last question is: I noticed that vg giraffe can produce .bam output directly. If this option is used is this equivalent to first aligning to .gam format and then doing vg surject?

Thank you

ADD REPLY • link updated 18 months ago by GenoMax 148k • written 18 months ago by anivlete • 0

1

Entering edit mode

My guess is that you're trying to align whole genome data to a subsetted graph. When you do that, the algorithms tend to perform poorly, because they are designed with the assumption that most reads have a good match somewhere in the graph. You'll have to find some way to subset the read data as well to use this graph. Alternatively, you could subset the read data (e.g. to some coverage level) and then map to the full graph.

For surject, that's probably okay. It's referring to this path metadata model. You can check that it's choosing the right paths (that is, the reference sequence) using paths -G -L on the graph.

ADD REPLY • link 18 months ago by Jordan M Eizenga ▴ 650

0

Entering edit mode

I subsetted the PG graph to chr22 only and as you suggested I simultaneously subsetted the input set of seqs only to chr22

Curiously I get the same warning when aligning and surjecting as before. It is interesting to me that the numbers mentioned in the warnings for both 'Fragment distance limit' and 'read distance limit' are identical indepently of which subset I use.I checked the bam file statistics and all reads seem to be mapped.

I checked the paths used from my subsetted graph w/ paths -G -L and they are the correct ones :>

My question is: Given the above warnings and the fact that all reads seem to be properly mapped, in your experience, can I rely in the surjected alignment or do you think that I should better use and alignment to the original pangenome graph with the full set of chromosomes that is more computationally expensive?

Thank you

ADD REPLY • link 18 months ago by anivlete • 0

0

Entering edit mode

That warning is actually reasonably serious. It's falling back on single-end mapping despite having paired-end reads. There's a significant fraction of reads that will map unambiguously with paired-end mapping but incorrectly/ambiguously with single-end mapping. You'll lose out on those mappings, which often means losing out on specific regions of the genome.

In my experience, the cause for that warning is essentially always a data issue. Some other explanations are that the reads aren't actually paired, or maybe the read pairs are lined up incorrectly. The latter problem is common when converting from a sorted BAM to a FASTQ.

ADD REPLY • link 18 months ago by Jordan M Eizenga ▴ 650

0

Entering edit mode

I checked and reads are actually paired (if aligned w/ a linear ref) >96% are properly paired. They read pairs seem to be lined up correctly (I checked the lists of read ids for R1 and R2 fastq files and they are identical). I assume this is what you mean by lined up correctly?

Any other possibilities you can think of?

A different issue: If I compare the BAM files produced by a linear ref alignment and the full-set-of-chroms pangenome reference, the 'linear' bam has 11% more paired end reads than the 'graph' bam. There were no warnings when aligning w/ the full-set-of-chroms pangenome reference. The vg giraffe alignment command provided in the output a fragment length of 418 bps. Any suggestions on why would the rate of properly paired reads decreased so significantly between the 'linear' and full-set-of-chroms 'pangenome' alignment?

ADD REPLY • link 18 months ago by anivlete • 0

0

Entering edit mode

I am trying to reproduce the increase in sensitivity of the HPRC short-read alignment experiments and it seems to me that if you loose from the get go all those paired reads (even when using the full-set-of-chroms pangenome reference) that might compromise possible sensitivity gains?

I am executing the same alignment command in that study (from the WDL worflows provided) so I don't think I am introducing any sort of noise that would account for the loss in number of paired-end reads

ADD REPLY • link 18 months ago by anivlete • 0

0

Entering edit mode

You were absolutely right about the data issue. After much trial and error I ran an alignment with the full set of input sequences (not subsetted) and the subsetted pangenome reference and it worked.

That means that as you mentioned the input sequences were not correctly lined up. As I mentioned before I thought I had checked for this by checking the read ids lining up. But I think I am missing something here. Do you know if there is any utility that properly subets and input .fastq file (in certain regions or chrs) by using a corresponding alignment .bam file?

Thank you

ADD REPLY • link 18 months ago by anivlete • 0

0

Entering edit mode

If you have a subsetted BAM, you can use samtools sort -n to sort the BAM by read name and then seqtk to de-interleave the FASTQ into separate R1 and R2 files. Alternatively, you can provide interleaved paired-end reads to most vg commands with a -i flag.

I suspect the issue is actually that a few read pairs are mapped to different chromosomes, and then subsetting by chromosome removes one read end. I don't know a command off the top of my head that will subset based on whether either read end is aligned to a given chromosome.

ADD REPLY • link 17 months ago by Jordan M Eizenga ▴ 650

0

Entering edit mode

Could you please elaborate more on option1 (samtools sort -n BAM, seqtk)? I am not sure what would you do after sorting the BAM by read name? seqtk acts on fastq files and it does not have a de-interleave option

In option2 using the vg -i flag, it seems vg expects a single file. I tried to create an interleaved version of my R1 and R2 subsetted fastq files (using seqtk) but using this interleaved version also produces the warnings. It seems to me that perhaps the original R1 and R2 subsetted fastq files are not properly interleaved. So that goes back to the original problem of how to properly subset the full chroms R1 and R2 fastq files into a few chromosomes. How would you create the interleaved file to feed to vg giraffe?

Thank you

ADD REPLY • link 17 months ago by anivlete • 0

0

Entering edit mode

I still think the problem is reads that discordantly map across chromosomes. If you have that situation, subsetting by chromosome can remove one read in the pair. After that, there's no hope of properly matched reads. Depending how faithfully the read mapper respects the SAM standard, you might be able to get away with filtering out all discordantly mapped read pairs with samtools view -F.

Re: samtools/seqtk. I forgot to mention another step where you would use samtools fastq to convert the BAM. However, looking at it now, I see that it has an option to split the read pairs over 2 files, which would make seqtk unnecessary.

ADD REPLY • link 17 months ago by Jordan M Eizenga ▴ 650

0

Entering edit mode

I have tried all sorts of things to create a properly lined up subset of input reads for vg giraffe alignment (filtering by paired map status, by quality, by TLEN, etc) and nothing seems to work.

I guess my question is do you know what exactly is vg giraffe looking for in it's input in terms of lined up reads to try to replicate this?

Thank you

ADD REPLY • link 17 months ago by anivlete • 0

0

Entering edit mode

It expects every read to have a pair, and for the corresponding read-ends to be in exactly the same order across the R1 and R2 FASTQ files. Alternatively, the pairs can be interleaved (R1 followed by R2) within the same FASTQ, but every read still needs to have both pairs.

ADD REPLY • link 17 months ago by Jordan M Eizenga ▴ 650

0

Entering edit mode

This is an unrelated question but you have been extremely helpful and I was wondering if you could point me to the best forum to ask questions about DeepVariant. I am trying to replicate the improved results for the hprc short reads experiments and need to run DeepVariant from binaries and I am having difficulties.

I tried biostarts but not sure if there's a better place for that purpose.

Thank you

ADD REPLY • link 17 months ago by anivlete • 0

0

Entering edit mode

Sorry, I don't have much direct experience with DeepVariant.

ADD REPLY • link 17 months ago by Jordan M Eizenga ▴ 650

0

Entering edit mode

Also wanted to ask you for a recommendation on the best paper, in your opinion, that reviews at a higher level the different concepts for pangenome graphs (not graphs themselves but related to pangenomes). Thank you

ADD REPLY • link 18 months ago by anivlete • 0

0

Entering edit mode

It's getting a bit dated, but this review is a good intro: https://academic.oup.com/bib/article/19/1/118/2566735

This review is a bit more technical, but also more up-to-date: https://www.annualreviews.org/doi/10.1146/annurev-genom-120219-080406?url_ver=Z39.88-2003&rfr_id=ori%3Arid%3Acrossref.org&rfr_dat=cr_pub++0pubmed

ADD REPLY • link 18 months ago by Jordan M Eizenga ▴ 650

0

Entering edit mode

Thank you for these references

ADD REPLY • link 18 months ago by anivlete • 0

0

Entering edit mode

If it is of any use to help illuminate things here are the stats for both the .bam and the .gam files obtained using this approach:

# samtools flagstat NA12877.chr20_21.bam
575430 + 0 in total (QC-passed reads + QC-failed reads)  
0 + 0 secondary  
0 + 0 supplementary  
0 + 0 duplicates  
42189 + 0 mapped (7.33% : N/A).  
575430 + 0 paired in sequencing  
287715 + 0 read1.  
287715 + 0 read2.  
4684 + 0 properly paired (0.81% : N/A).  
12898 + 0 with itself and mate mapped.  
29291 + 0 singletons (5.09% : N/A).  
3008 + 0 with mate mapped to a different chr.  
2309 + 0 with mate mapped to a different chr (mapQ>=5). 

# ../vg stats -a NA12877.chr20_21.gam
Total alignments: 575430.   
Total primary: 575430.   
Total secondary: 0. 
Total aligned: 42459  
Total perfect: 255. 
Total gapless (softclips allowed): 32484.  
Total paired: 575430.  
Total properly paired: 430.  
Alignment score: mean 74.8277, median 68, stdev 33.2027, max 161 (255 reads)  
Mapping quality: mean 21.4316, median 17, stdev 18.4215, max 60 (3698 reads).  
Insertions: 14601 bp in 6487 read events  
Deletions: 18661 bp in 8030 read events.  
Substitutions: 203171 bp in 203171 read events.  
Softclips: 2302971 bp in 44739 read events

ADD REPLY • link updated 18 months ago by GenoMax 148k • written 18 months ago by anivlete • 0

0

Entering edit mode

I am trying to split the reference graph by chromosome using the following commands: vg chunk -x hprc-v1.0-mc-grch38.xg -M -O pg # One chunk per chr vg chunk -x hprc-v1.0-mc-grch38.xg -p chr21 -O pg. # One chunk for chr21

But both commands end with 'Killed' without any output to stderror of log. Any suggestions on how to proceed? Thank you

ADD REPLY • link 18 months ago by anivlete • 0

0

Entering edit mode

Are you running in a resource-constrained compute environment (e.g. a laptop)? Could be you are running out of memory, or maybe running up against a resource-use policy on a shared server.

ADD REPLY • link 18 months ago by Jordan M Eizenga ▴ 650

0

Entering edit mode

Let's say if I just want to extract one or two chromosomes from the main graph with vg chunk how can I gauge the amount of memory needed? Is there any recommendation of minimal resources to be able to handle the pangenome graph for alignment and calling? Thank you

ADD REPLY • link 18 months ago by anivlete • 0

0

Entering edit mode

An .xg format graph takes roughly the same amount of space in RAM as it does on disk, so you should be able to check the file size to get a sense of it.

ADD REPLY • link 18 months ago by Jordan M Eizenga ▴ 650

0

Entering edit mode

That is helpful. Thank you

ADD REPLY • link 18 months ago by anivlete • 0

0

Entering edit mode

Actual commands where:

vg chunk -x hprc-v1.0-mc-grch38.xg -M -O pg

vg chunk -x hprc-v1.0-mc-grch38.xg -p chr21 -O pg

Can I assume that each component is a chromosome for this specific file?

ADD REPLY • link updated 18 months ago by GenoMax 148k • written 18 months ago by anivlete • 0