Entering edit mode
17 months ago
anivlete
•
0
Hi,
I was wondering if it is possible with vg to reduce the set of chromosomes in the pangenome graph. It would be analogous to reducing the set of chromosomes in your conventional linear genome reference to be able to do less computationally intensive iterations on research and development.
Thank you for any insights regarding this
You might want to look at
vg chunk -C
Thank you!
I want to align short reads to the pangenome and surject back to hg38. I was wondering if you could confirm that the most suitable option for this would be to use the graph file referenced in the draft pangenome paper in the following location:
https://s3-us-west-2.amazonaws.com/human-pangenomics/index.html?prefix=pangenomes/freeze/freeze1/minigraph-cactus/hprc-v1.0-mc-grch38.xg
It depends which mapping tool you want to use. If you plan to use
vg giraffe
, the graph should be encoded as a.gbz
, or as.gbwt
+.gg
. I believe the graphs used for the HPRC short read mapping/genotyping experiments were the ones prefixed byhprc-v1.0-mc-grch38-minaf.0.1
here:https://s3-us-west-2.amazonaws.com/human-pangenomics/index.html?prefix=pangenomes/freeze/freeze1/minigraph-cactus/filtered/
From the command used in the HPRC short read mapping/genotyping experiments (extracted from the WDL workflow) I can gather that the necessary files for alignment are: -g (GBWT graph), -H (GBWT index), -x (xg index or graph), -d (distance index), and -m (minimizer index)
In this case which one is the graph? -g or -x? (it seems to me is -x)
I chunked the main pangenome graph used in this study to hold only 2 chromosomes and produced a .xg graph. Do I need to recreate all the indices for this subset or can I use the ones from the parent (full set of chromosomes) graph? And if so what is the best way to go about it?
Thank you
If -x is the graph file in the alignment command, do I need to create an equivalent -g (GBWT graph) file? So you need two equivalent but differently formatted graph files?
For
vg giraffe
you don't need-x
. The required input is.gbz
or.gbwt
+.gg
. I don't know of any command to split up a GBZ/GBWT+GG by chromosome though.If I create a .vg graph file with a subset of chromosomes from the .xg pangenome file (using vg chunk) how can I obtain the minimal subset of files necessary to run vg giraffe? I know that from the .vg file I can create the .gbwt index BUT how can I create the corresponding .gg, .dist and .min files?
I don't see a way to vg convert .vg --> .gg
And the creation of index files with vg autoindex seems to be based on a .gfa file
Thank you
That's because they encode somewhat different information:
.xg
and.vg
: graph topology using nodes and edges.gbwt
: a collection of haplotypes expressed as walks through a graph, which can be taken to define the edges of the graph (with an edge between each pair of subsequent steps in the walk).gg
: the nodes of a graph.gbz
: essentially a.gbwt
and.gg
packaged together in one fileThe
.xg
and.vg
can be interconverted freely. A.gbz
/.gbwt
+.gg
can be converted into an.xg
or.vg
, but the reverse is not possible..xg
and.vg
lack the haplotype walks that you need to make the.gbwt
.As far as I know, nobody has written a tool to subset a
.gbz
/.gbwt
+.gg
by chromosome.vg giraffe
requires a.gbz
/.gbwt
+.gg
(the graph/haplotypes), a.dist
(distance index), and.min
(minimizer index). However, the minimal input is just the.gbz
/.gbwt
+.gg
. The other indexes are derived from the graph, and they can be constructed on-the-fly if you don't supply them.You are also correct that
vg autoindex
takes GFA input. Its design is intentionally limited to widely-supported, text-based interchange formats like GFA, FASTA, VCF, etc. If you want to modify the graph input tovg autoindex
, the way to do it would be to modify the GFA.Really useful information. Thank you
How about this approach for spliting into a few chroms:
Does this make sense? Do you think this would work or is it some piece of information missing? Thanks
I believe that would work, as long as the
.xg
and.gbz
match.The .xg and .gbz files are the ones available for the HPRC pangenome short-reads experiments at: https://s3-us-west-2.amazonaws.com/human-pangenomics/index.html?prefix=pangenomes/freeze/freeze1/minigraph-cactus/filtered/
I was assuming they match. Do you happen to know if this is the case?
The above approach seems to work for me. After stripping down the graph file, I am able to align (vg giraffe) and vg surject to produce a .bam file
I get the following warnings:
Since I am not too familiar with this alignment style, I was wondering if you could comment if there are any catastrophic problems I should be aware of. It's a bit odd that it reverts to single-end mapping. Is this because there are too few reads? Does this have important consequences?
One last question is: I noticed that vg giraffe can produce .bam output directly. If this option is used is this equivalent to first aligning to .gam format and then doing vg surject?
Thank you
My guess is that you're trying to align whole genome data to a subsetted graph. When you do that, the algorithms tend to perform poorly, because they are designed with the assumption that most reads have a good match somewhere in the graph. You'll have to find some way to subset the read data as well to use this graph. Alternatively, you could subset the read data (e.g. to some coverage level) and then map to the full graph.
For surject, that's probably okay. It's referring to this path metadata model. You can check that it's choosing the right paths (that is, the reference sequence) using
paths -G -L
on the graph.I subsetted the PG graph to chr22 only and as you suggested I simultaneously subsetted the input set of seqs only to chr22
Curiously I get the same warning when aligning and surjecting as before. It is interesting to me that the numbers mentioned in the warnings for both 'Fragment distance limit' and 'read distance limit' are identical indepently of which subset I use.I checked the bam file statistics and all reads seem to be mapped.
I checked the paths used from my subsetted graph w/ paths -G -L and they are the correct ones :>
My question is: Given the above warnings and the fact that all reads seem to be properly mapped, in your experience, can I rely in the surjected alignment or do you think that I should better use and alignment to the original pangenome graph with the full set of chromosomes that is more computationally expensive?
Thank you
That warning is actually reasonably serious. It's falling back on single-end mapping despite having paired-end reads. There's a significant fraction of reads that will map unambiguously with paired-end mapping but incorrectly/ambiguously with single-end mapping. You'll lose out on those mappings, which often means losing out on specific regions of the genome.
In my experience, the cause for that warning is essentially always a data issue. Some other explanations are that the reads aren't actually paired, or maybe the read pairs are lined up incorrectly. The latter problem is common when converting from a sorted BAM to a FASTQ.
I checked and reads are actually paired (if aligned w/ a linear ref) >96% are properly paired. They read pairs seem to be lined up correctly (I checked the lists of read ids for R1 and R2 fastq files and they are identical). I assume this is what you mean by lined up correctly?
Any other possibilities you can think of?
A different issue: If I compare the BAM files produced by a linear ref alignment and the full-set-of-chroms pangenome reference, the 'linear' bam has 11% more paired end reads than the 'graph' bam. There were no warnings when aligning w/ the full-set-of-chroms pangenome reference. The vg giraffe alignment command provided in the output a fragment length of 418 bps. Any suggestions on why would the rate of properly paired reads decreased so significantly between the 'linear' and full-set-of-chroms 'pangenome' alignment?
I am trying to reproduce the increase in sensitivity of the HPRC short-read alignment experiments and it seems to me that if you loose from the get go all those paired reads (even when using the full-set-of-chroms pangenome reference) that might compromise possible sensitivity gains?
I am executing the same alignment command in that study (from the WDL worflows provided) so I don't think I am introducing any sort of noise that would account for the loss in number of paired-end reads
You were absolutely right about the data issue. After much trial and error I ran an alignment with the full set of input sequences (not subsetted) and the subsetted pangenome reference and it worked.
That means that as you mentioned the input sequences were not correctly lined up. As I mentioned before I thought I had checked for this by checking the read ids lining up. But I think I am missing something here. Do you know if there is any utility that properly subets and input .fastq file (in certain regions or chrs) by using a corresponding alignment .bam file?
Thank you
If you have a subsetted BAM, you can use
samtools sort -n
to sort the BAM by read name and thenseqtk
to de-interleave the FASTQ into separate R1 and R2 files. Alternatively, you can provide interleaved paired-end reads to mostvg
commands with a-i
flag.I suspect the issue is actually that a few read pairs are mapped to different chromosomes, and then subsetting by chromosome removes one read end. I don't know a command off the top of my head that will subset based on whether either read end is aligned to a given chromosome.
Could you please elaborate more on option1 (samtools sort -n BAM, seqtk)? I am not sure what would you do after sorting the BAM by read name? seqtk acts on fastq files and it does not have a de-interleave option
In option2 using the vg -i flag, it seems vg expects a single file. I tried to create an interleaved version of my R1 and R2 subsetted fastq files (using seqtk) but using this interleaved version also produces the warnings. It seems to me that perhaps the original R1 and R2 subsetted fastq files are not properly interleaved. So that goes back to the original problem of how to properly subset the full chroms R1 and R2 fastq files into a few chromosomes. How would you create the interleaved file to feed to vg giraffe?
Thank you
I still think the problem is reads that discordantly map across chromosomes. If you have that situation, subsetting by chromosome can remove one read in the pair. After that, there's no hope of properly matched reads. Depending how faithfully the read mapper respects the SAM standard, you might be able to get away with filtering out all discordantly mapped read pairs with
samtools view -F
.Re: samtools/seqtk. I forgot to mention another step where you would use
samtools fastq
to convert the BAM. However, looking at it now, I see that it has an option to split the read pairs over 2 files, which would makeseqtk
unnecessary.I have tried all sorts of things to create a properly lined up subset of input reads for vg giraffe alignment (filtering by paired map status, by quality, by TLEN, etc) and nothing seems to work.
I guess my question is do you know what exactly is vg giraffe looking for in it's input in terms of lined up reads to try to replicate this?
Thank you
It expects every read to have a pair, and for the corresponding read-ends to be in exactly the same order across the R1 and R2 FASTQ files. Alternatively, the pairs can be interleaved (R1 followed by R2) within the same FASTQ, but every read still needs to have both pairs.
This is an unrelated question but you have been extremely helpful and I was wondering if you could point me to the best forum to ask questions about DeepVariant. I am trying to replicate the improved results for the hprc short reads experiments and need to run DeepVariant from binaries and I am having difficulties.
I tried biostarts but not sure if there's a better place for that purpose.
Thank you
Sorry, I don't have much direct experience with DeepVariant.
Also wanted to ask you for a recommendation on the best paper, in your opinion, that reviews at a higher level the different concepts for pangenome graphs (not graphs themselves but related to pangenomes). Thank you
It's getting a bit dated, but this review is a good intro: https://academic.oup.com/bib/article/19/1/118/2566735
This review is a bit more technical, but also more up-to-date: https://www.annualreviews.org/doi/10.1146/annurev-genom-120219-080406?url_ver=Z39.88-2003&rfr_id=ori%3Arid%3Acrossref.org&rfr_dat=cr_pub++0pubmed
Thank you for these references
If it is of any use to help illuminate things here are the stats for both the .bam and the .gam files obtained using this approach:
I am trying to split the reference graph by chromosome using the following commands: vg chunk -x hprc-v1.0-mc-grch38.xg -M -O pg # One chunk per chr vg chunk -x hprc-v1.0-mc-grch38.xg -p chr21 -O pg. # One chunk for chr21
But both commands end with 'Killed' without any output to stderror of log. Any suggestions on how to proceed? Thank you
Are you running in a resource-constrained compute environment (e.g. a laptop)? Could be you are running out of memory, or maybe running up against a resource-use policy on a shared server.
Let's say if I just want to extract one or two chromosomes from the main graph with vg chunk how can I gauge the amount of memory needed? Is there any recommendation of minimal resources to be able to handle the pangenome graph for alignment and calling? Thank you
An
.xg
format graph takes roughly the same amount of space in RAM as it does on disk, so you should be able to check the file size to get a sense of it.That is helpful. Thank you
Actual commands where:
Can I assume that each component is a chromosome for this specific file?