My objective is to identify the structural variants (SVs) on chromosome NC_058090.1. I am focusing on a single chromosome because the vg analysis can freeze if all chromosomes are analyzed at once.
The VCF used to construct the graph is a merged VCF comprising 12 samples. Each of these 12 samples has approximately 22k SVs on chromosome NC_058090.1. However, when I run vg call on a new sample, the resulting SV count is only ~9k. Can you help me identify the underlying causes? Thanks!
Here are the commands:
vg version
vg version v1.50.0 "Monopoli"
vg autoindex
vg autoindex --workflow giraffe -r $ref_genome -v $input_vcf -p div.12bufo.NC_058090.1 -t $SLURM_CPUS_PER_TASK -M ${target_mem}M
vg giraffe
vg giraffe -Z ${gbz_file} -m ${min_file} -d ${dist_file} -f <(cat $read1_file $read2_file) -t $SLURM_CPUS_PER_TASK > $sample_name.NC_058090.1.giraffe.gam
vg filter
vg filter -P -v -t $SLURM_CPUS_PER_TASK $sample_name.NC_058090.1.giraffe.gam > $sample_name.NC_058090.1.giraffe.filtered.gam
vg snarls
vg snarls -t $SLURM_CPUS_PER_TASK div.12bufo.NC_058090.1.giraffe.gbz > div.12bufo.NC_058090.1.giraffe.gbz.snarls
vg pack
vg pack -x ${gbz_file} -g ${gam_file} -o $sample_name.NC_058090.1.gbz.pack -t $SLURM_CPUS_PER_TASK
vg call
vg call ${graph_file} -k ${pack_file} -r ${snarls_file} -t $SLURM_CPUS_PER_TASK -z > ${sample_name}.NC_058090.1.giraffe.vcf
One thing to check is how many SVs are private (observed in only one sample) in the input VCF.
vg call will rarely call SVs that weren't previously observed (i.e. that aren't present in the VCF), so you will probably not identify any of the new sample's private variants.
To Jordan M Eizenga:
Each sample has ~22k SVs in chromosome NC_058090.1.
I think that's not quite what I was getting at. The potential scenario I'm suggesting is that only ~1/2 of each sample's SVs are shared with any of the other 11 samples, which could happen if the AFs are mostly pretty low. In that case, your new sample would have ~1/2 of its SVs missing in the pangenome calling results because they hadn't been observed in the previous 12 samples.
To check, I believe you could add the --private flag to the first bcftools view command.
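For example, something along these lines (a sketch, not a tested command: it assumes the merged VCF is bgzipped and tabix-indexed so that -r works, and reuses the $input_vcf variable from above):
# for each sample, count NC_058090.1 sites where only that sample
# carries a non-reference allele
for s in $(bcftools query -l ${input_vcf}); do
  n=$(bcftools view -H -r NC_058090.1 -s $s --private ${input_vcf} | wc -l)
  echo "$s $n"
done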
I don't get what it is used for, but here are the results you want:
Well, in any case, this rules out the scenario I was thinking of, since it seems private variants are a pretty small proportion of the total.
The next thing I'd look at is the alignments. What do you get from
vg stats --alignments alns.gam
? Also, what coverage are you expecting in the alignments?
Here are the stats for the GAM that went through vg filter:
And the sequencing depth is 14x, so the expected coverage would be more than 10x, I guess?
These alignments have very low mapping qualities, so the low call rate probably has something to do with that. On a typical human sample, I'd expect the mean mapping quality to be > 50. Also, 14x is on the low side, which will also affect the recall.
The low mapping qualities could have different explanations. Is this genome (or even just this contig) notably repetitive? Another explanation could be the sequencing. Maybe these are very short reads? I would also check out FASTQC to make sure nothing looks strange.
The data is paired-end 150bp and passes almost all FASTQC tests, so I guess data quality is not the main cause. I also have no idea about the complexity of that region. In a normal situation, can I say that 19 reads are mapped at a spot if the mapping quality there is 19? In other words, would it help if I increased the sequencing depth to, say, 50x?
Mapping quality is a measure of mapping certainty, not of coverage. Each read alignment gets its own mapping quality. The primary cause of low mapping quality is that there is another high-scoring secondary alignment. When that happens, you can't be too sure that you found the correct alignment just because you found a good alignment. Mapping quality is phred scaled, so 20 corresponds to a 1% chance of error.
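Concretely: P(mapping is wrong) = 10^(-MAPQ/10), so MAPQ 20 means a 10^-2 = 1% chance the read is mispositioned, and MAPQ 19 means roughly 1.3%. It says nothing about how many reads cover a position.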
Do you have repeat annotations for this genome?
I appreciate the explanation. It brings up another matter: my workflow includes a step where I use vg filter to remove unmapped reads, but the number of reads filtered out is smaller than anticipated. NC_058090.1 is merely the shortest of this organism's 16 chromosomes, yet well over half of the total reads mapped to it. The filtering outcome is as follows:
I managed to obtain the repeat annotation file; it contains a total of 243,000 records for NC_058090.1. Would you like me to forward it to you?
Oh, I missed this, but am I understanding correctly that you mapped whole genome sequencing data to a single-chromosome graph? If so, you're bound to see all sorts of weird behavior. You need the other contigs to be present to soak up their reads. Otherwise, many of them will just find their best alignment on the included contig, which will pollute all the results with a bunch of mismapped reads.
A better procedure would be to map to the full graph and then subset the mapped reads. However, I can't think of a perfect way to do that. One option would be to annotate the read position with
vg annotate -p -a
and then filter with vg filter -U -X, but you'll probably lose some reads that map to large insertions.
These results also make me think that this genome is probably quite repetitive. One piece of evidence is that ~1/2 the reads found a mapping despite this contig being < 1/16 of the genome, which suggests that > 1/3 of the genome has similar sequence on this contig. Also, the average score is 116, which is about 75% of the maximum possible score. That means the reads are not only finding an alignment, but a pretty good alignment. If I'm correct, you will probably always get a lower yield from short-read sequencing compared to e.g. humans. The pangenome helps get more out of short-read data, but mappability still puts a ceiling on which regions can be accurately assayed.
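Concretely, the idea would be something like this (a sketch with placeholder filenames, standing in for whatever whole-genome graph and GAM you actually have):
# attach reference-path positions (refpos) to each read
vg annotate -x whole_genome.gbz -p -a whole_genome.gam > whole_genome.annotated.gam
# -X drops reads whose refpos matches the regex; -U takes the complement,
# i.e. keep only the reads positioned on NC_058090.1
vg filter -U -X 'NC_058090.1' whole_genome.annotated.gam > NC_058090.1.gam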
NC_058090.1 - Bufo gargarizans (Asiatic toad), 4.5 Gb genome, 11 chromosomes + MT.
From Paper:
Thanks for your advice! You are correct: what I did was map whole-genome sequencing data to a single-chromosome graph.
Regarding the -X/--exclude-contig flag of vg filter: what does the contig name mean? Is it a chromosome/scaffold name that exists in the reference genome? If so, why would some reads that map to large insertions be lost?
I notice vg annotate can annotate the graph using an annotation file such as a GFF. Would it be helpful to project the repeat annotation onto the graph?
Besides, the reason I split up the graph is that when I ran vg call on the whole graph, it froze.
vg call ran for 7 days without any warning or error messages and without any VCF output. So if the vg team is interested in this bug and plans to fix it, I am more than willing to offer my assistance.
vg filter -X takes a regular expression to match against the contig name. In this case, you're only trying to match one sequence, so the regex can just be the name itself (i.e. NC_058090.1).
I can't speak to whether projected annotations would be useful for your analyses, but I don't think we need them for troubleshooting the current problem. I was only interested to know whether the genome was very repetitive, which GenoMax already confirmed above.
Seven days is a long time for vg call, so it does seem like an issue. I think I found an earlier support request from you about that. If you're able to share the inputs to vg call, I can try to see where it's getting hung up.
vg annotate needs an xg index, but the graph is in GBZ format. How can I get an xg index? Or is vg annotate compatible with GBZ?
Generally speaking, any option in vg that says "xg" or "xg-name" is compatible with GBZ, although sometimes with a difference in speed. The XG-specific option names are a legacy from before we supported multiple graph encodings.
Update on Aug 30:
Your response is appreciated. I have tried every fix I could think of, such as increasing the memory allocation and removing non-chromosome-level fragments during graph construction to reduce the number of contigs, but I still cannot resolve this issue.
vg call got stuck again when I used the NC_058090.1 pack file and the whole reference graph as input, so maybe the freeze of vg call is not due to a huge input file. Could you please spare some time to help me identify the issue? Here is what I did:
I produced the whole-reference-genome graph graph.gbz with vg autoindex, the snarls file graph.gbz.snarls, and the whole-genome alignment file ZYZ288A.gam with vg giraffe.
I annotated ZYZ288A.gam and filtered it.
I checked the quality of ZYZ288A.NC_058090.1.gam and it is good.
I generated the pack file from graph.gbz and the NC_058090.1 GAM file ZYZ288A.NC_058090.1.gam, then ran vg call.
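Reconstructed from the descriptions above, the commands were presumably along these lines (a sketch; thread counts and exact options are my assumptions, patterned on the single-chromosome commands earlier in the thread):
vg annotate -x graph.gbz -p -a ZYZ288A.gam > ZYZ288A.annotated.gam
vg filter -U -X 'NC_058090.1' ZYZ288A.annotated.gam > ZYZ288A.NC_058090.1.gam
vg pack -x graph.gbz -g ZYZ288A.NC_058090.1.gam -o ZYZ288A.NC_058090.1.pack -t $SLURM_CPUS_PER_TASK
vg call graph.gbz -k ZYZ288A.NC_058090.1.pack -r graph.gbz.snarls -t $SLURM_CPUS_PER_TASK -z > ZYZ288A.NC_058090.1.vcf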
vg call has been silent for 4 hours (no results, no warning/error messages). Just for reference, the previous vg call that used a single-chromosome graph as input finished in 8 minutes.
If you can provide the inputs, we can try to see where it's getting hung up.
Certainly, I'd like to provide them! If you need the graph and all the gam files, they amount to approximately 250GB in total, with the graph being 50GB and the gam files being 200GB. Or do you only need a portion of them?
Since you know which path is causing the issue, you could subset the GAM with
vg chunk -C -p
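For instance (a sketch; vg chunk needs the GAM sorted and indexed before it can pull out reads, and the output naming may differ between versions):
# sort and index the whole-genome GAM
vg gamsort -t $SLURM_CPUS_PER_TASK -i ZYZ288A.sorted.gam.gai ZYZ288A.gam > ZYZ288A.sorted.gam
# extract the reads for the component containing path NC_058090.1;
# the subset GAM lands in files starting with the -b prefix
vg chunk -x graph.gbz -a ZYZ288A.sorted.gam -C -p NC_058090.1 -b chunk_NC_058090.1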