The number of SVs called by `vg call` is much smaller than the number of SVs in the VCF used to construct the graph
0
1
Entering edit mode
15 months ago
Maxine ▴ 50

My objective is to identify the structural variations (SVs) located on chromosome NC_058090.1. The reason for solely focusing on a single chromosome is due to the potential for the vg analysis to freeze if all chromosomes were analyzed simultaneously.

The VCF used to construct the graph is a merged VCF comprising 12 samples. Among these 12 samples, there are approximately 22k SVs located on chromosome NC_058090.1 for each sample. However, when performing the vg call analysis on a new sample, the resulting SV count is merely 9k. Can you assist me in identifying the underlying causes that have led to this situation? Thanks!

Here are the commands:

vg version

vg version v1.50.0 "Monopoli"

vg autoindex

vg autoindex --workflow giraffe -r $ref_genome -v $input_vcf -p div.12bufo.NC_058090.1 -t $SLURM_CPUS_PER_TASK -M ${target_mem}M

vg giraffe

vg giraffe -Z ${gbz_file} -m ${min_file} -d ${dist_file} -f <(cat $read1_file $read2_file) -t $SLURM_CPUS_PER_TASK > $sample_name.NC_058090.1.giraffe.gam

vg filter

vg filter -P -v -t $SLURM_CPUS_PER_TASK $sample_name.NC_058090.1.giraffe.gam > $sample_name.NC_058090.1.giraffe.filtered.gam

vg snarls

vg snarls -t $SLURM_CPUS_PER_TASK div.12bufo.NC_058090.1.giraffe.gbz > div.12bufo.NC_058090.1.giraffe.gbz.snarls

vg pack

vg pack -x ${gbz_file} -g ${gam_file} -o $sample_name.NC_058090.1.gbz.pack -t $SLURM_CPUS_PER_TASK

vg call

vg call ${graph_file} -k ${pack_file} -r ${snarls_file} -t $SLURM_CPUS_PER_TASK -z > ${sample_name}.NC_058090.1.giraffe.vcf
vg • 3.2k views
ADD COMMENT
0
Entering edit mode

One thing to check is how many SVs are private (observed in only one sample) in the input VCF. vg call will rarely call SVs that weren't previously observed (i.e. present in the VCF), so you will probably not identify any of its new private variants.

ADD REPLY
0
Entering edit mode

To Jordan M Eizenga:

Each sample has ~22k SVs in chromosome NC_058090.1.

for sname in $(cat sample_list);do echo $sname; bcftools view --samples div.${sname} -r NC_058090.1 div.12bufo.newfiltered.reRef.vcf.gz | bcftools view -H --exclude 'GT="0/0"' - | wc -l; done

cy201704
22888
cy201804
23104
cy201904
23600
cy202304
23663
cy202404
22749
cy202504
23583
cy206504
22420
cy206604
21729
cy206704
24813
cy207104
22272
cy207204
22269
cy207304
22461
ADD REPLY
0
Entering edit mode

I think that's not quite what I was getting at. The potential scenario I'm suggesting is that only ~1/2 of each sample's SVs are shared with any of the other 11 samples, which could happen if the AFs are mostly pretty low. In that case, your new sample would have ~1/2 of its SVs missing in the pangenome calling results because they hadn't been observed in the previous 12 samples.

To check, I believe you could add the --private flag to the first bcftools view command.

ADD REPLY
0
Entering edit mode

I don't get what is it used for. But here's the results you want:

for sname in $(cat sample_list);do echo $sname; bcftools view --samples div.${sname} -r NC_058090.1 --private div.12bufo.newfiltered.reRef.vcf.gz | wc -l; done
cy201704
955
cy201804
915
cy201904
928
cy202304
932
cy202404
915
cy202504
919
cy206504
844
cy206604
842
cy206704
866
cy207104
868
cy207204
869
cy207304
860
ADD REPLY
0
Entering edit mode

Well, in any case, this rules out the scenario I was thinking of, since it seems private variants are a pretty small proportion of the total.

The next thing I'd look at are the alignments. What do you get from vg stats --alignments alns.gam? Also, what coverage are you expecting in the alignments?

ADD REPLY
0
Entering edit mode

Here's the stats for the gam that went through vg filter

Total alignments: 235995888
Total primary: 235995888
Total secondary: 0
Total aligned: 235995888
Total perfect: 30549490
Total gapless (softclips allowed): 180169057
Total paired: 0
Total properly paired: 0
Alignment score: mean 116.074, median 125, stdev 37.7148, max 160 (30549490 reads)
Mapping quality: mean 19.1125, median 12, stdev 20.154, max 60 (20925284 reads)
Insertions: 91099176 bp in 36968857 read events
Deletions: 195675740 bp in 53079795 read events
Substitutions: 969265341 bp in 969265341 read events
Softclips: 4186078810 bp in 106026710 read events
Total time: 305551 seconds
Speed: 772.362 reads/second

And, the sequencing depth is 14x, so the expecting coverage would be more than 10x I guess?

ADD REPLY
0
Entering edit mode

These alignments have very low mapping qualities, so the low call rate probably has something to do with that. On a typical human sample, I'd expect the mean mapping quality to be > 50. Also, 14x is on the low side, which will also affect the recall.

The low mapping qualities could have different explanations. Is this genome (or even just this contig) notably repetitive? Another explanation could be the sequencing. Maybe these are very short reads? I would also check out FASTQC to make sure nothing looks strange.

ADD REPLY
0
Entering edit mode

The data is paired-end 150bp and almost pass all FASTQC test. So I guess the quality of data is not the main cause. I also have no idea about the complexity of that region. In a normal situation, can i say there are 19 reads mapped on a spot if the mapping quality on that is 19? In other words, would it be helpful if I increase the sequencing depth to like 50?

ADD REPLY
0
Entering edit mode

Mapping quality is a measure of mapping certainty, not of coverage. Each read alignment gets its own mapping quality. The primary cause of low mapping quality is that there is another high-scoring secondary alignment. When that happens, you can't be too sure that you found the correct alignment just because you found a good alignment. Mapping quality is phred scaled, so 20 corresponds to a 1% chance of error.

Do you have repeat annotations for this genome?

ADD REPLY
0
Entering edit mode

I appreciate the explanation you provided. It has brought to mind another matter. Within my workflow, there exists a step wherein I employ the "vg filter" command to selectively remove unmapped reads. However, the number of reads filtered out is fewer than anticipated. NC_058090.1 represents merely the shortest among the 16 chromosomes of this particular organism, yet it has mapped to well over half of the total reads. The filtering outcome is as follows:

Total Filtered:                188225860 / 424221748
Read Name Filter:              0
Subsequence Filter:            0
Proper Pair Filter:            0
Unmapped Filter:               188225860
refpos Contig Filter:          0
Feature Filter:                0
Min Identity Filter:           0
Min Secondary Identity Filter: 0
Max Overhang Filter:           0
Min End Match Filter:          0
Split Read Filter:             0
Repeat Ends Filter:            0
All Defrayed Filter:           0
Min Quality Filter:            0
Min Base Quality Filter:       0
Random Filter:                 0

I have managed to procure the repeat annotations file, which contains a total of 243,000 lines dedicated to the records pertaining to NC_058090.1. Would you like me to forward it to you?

ADD REPLY
0
Entering edit mode

Oh, I missed this, but am I understanding correctly that you mapped whole genome sequencing data to a single-chromosome graph? If so, you're bound to see all sorts of weird behavior. You need the other contigs to be present to soak up their reads. Otherwise, many of them will just find their best alignment on the included contig, which will pollute all the results with a bunch of mismapped reads.

A better procedure would be to map to the full graph and then subset the mapped reads. However, I can't think of a perfect way to do that. One option would be to annotate the read position with vg annotate -p -a and then filter with vg filter -U -X, but you'll probably lose some amount of reads that map to large insertions.

These results also make me think that this genome probably is quite repetitive. One piece of evidence is that ~1/2 the reads found a mapping despite this contig being < 1/16 of the genome, which suggests that > 1/3 of the genome has a similar sequence on this contig. Also, the average score is 116, which is about 75% of the maximum possible score. That means that the reads are not only finding any alignment, but actually a pretty good alignment. If I'm correct, you will probably always have lower payout from short-read sequencing compared to e.g. humans. The pangenome helps get more out of short-read data, but mappability still puts a ceiling on which regions can be accurately assayed.

ADD REPLY
1
Entering edit mode

These results also make me think that this genome probably is quite repetitive.

NC_058090 - Bufo gargarizans Asiatic toad 4.5 Gb genome. 11 chr + MT

From Paper:

Long terminal repeats (LTRs) constitute a high proportion of the genome and their expansion is a key contributor to the inflated genome size in this species. The genome retains a large number of duplicated genes, with tandem (TD) and proximal duplications (PD) the predominant mode of duplication.

ADD REPLY
0
Entering edit mode

Thanks for your advice! You are correct, what I'm doing is to map whole genome sequencing to a chromosome graph.

For the flag -X/--exclude-contig from vg filter, What is the contigs name means? Is it chromsome/scaffold name that exists in refgenome? If so, why some reads mapped to large insertions may lose?

I notice vg annotate is able to annotate graph using annotation file like gff. Would it be helpful to annotate the repeat annontation to graph?

Besides, the reason I spilled up the graph is when I ran vg call on the whole graph, it froze. vg call took 7 days without any waining or error messages and also without any vcf output. So if the VG team is interested in this bug and plans to rectify it, I am more than willing to offer my assistance.

ADD REPLY
0
Entering edit mode

vg annotate -X takes a regular expression to match to the contig name. In this case, you're only trying to match one sequence, so the regex can just be the name itself (i.e. NC_058090.1).

I can't speak to whether projected annotations would be useful for your analyses, but I don't think we need them for troubleshooting the current problem. I was only interested to know if the genome was very repetitive, which GenoMax already confirmed above.

Seven days is a long time for vg call, so it does seem like an issue. I think I found an earlier support request from you about that. If you're able to share the inputs to vg call, I can try to see where it's getting hung up.

ADD REPLY
0
Entering edit mode

vg annotate need xg index, but the graph is in gbz format. How can I get an xg index? Or it is able to compatible with gbz?

ADD REPLY
0
Entering edit mode

Generally speaking, any option in vg that says "xg" or "xg-name" is compatible with GBZ, although sometimes with a difference in speed. The XG-specific option names are legacy from before we supported multiple graph encodings.

ADD REPLY
0
Entering edit mode

update on Aug 30,

Your response is appreciated. I have tried all the possible reasons I could think of, such as increasing memory allocation, removing non-chromosomal level fragments during graph construction to reduce the number of contigs, and so on, but I still cannot resolve this issue.


vg call stucked again when i using NC_058090.1 pack file and the whole reference graph as input. Maybe the frozon of vg call is not due to huge input file. Could you please spare some time to assist me in identifying the issue? I will share the command I used:

  1. I gained whole ref genome graph graph.gbz using vg auindex, snarls file graph.gbz.snarls and whole genome align file ZYZ288A.gam using vg giraffe.

  2. I annotated ZYZ288A.gam and filtered it

    vg annotate -p -t $SLURM_CPUS_PER_TASK -a ZYZ288A.gam -x graph.gbz > ZYZ288A.annotated.gam
    vg filter -X NC_058090.1 -U ZYZ288A.annotated.gam -t 2 > ZYZ288A.NC_058090.1.gam
    

    I checked the quality of ZYZ288A.NC_058090.1.gam and it is good.

    Total alignments: 8321952
    Total primary: 8321952
    Total secondary: 0
    Total aligned: 8321952
    Total perfect: 1242274
    Total gapless (softclips allowed): 6561503
    Total paired: 0
    Total properly paired: 0
    Alignment score: mean 135.154, median 145, stdev 25.771, max 160 (1242274 reads)
    Mapping quality: mean 35.4313, median 44, stdev 25.3901, max 60 (3698817 reads)
    Insertions: 3513817 bp in 1028877 read events
    Deletions: 5769797 bp in 1439898 read events
    Substitutions: 24008381 bp in 24008381 read events
    Softclips: 53953561 bp in 1681591 read events
    Total time: 10375.1 seconds
    Speed: 802.105 reads/second
    
  3. I used vg pack to pack the whole ref graph graph.gbz and the NC_058090.1 gam file ZYZ288A.NC_058090.1.gam
    vg pack -x graph.gbz -g ZYZ288A.NC_058090.1.gam -o ZYZ288A.NC_058090.1.gam.pack -t $SLURM_CPUS_PER_TASK -Q 5 -s 5
    
  4. I used vg call to call variants
    vg call graph.gbz -k ZYZ288A.NC_058090.1.gam.pack -r graph.gbz.snarls -t $SLURM_CPUS_PER_TASK -z > ${vcf_file}
    
    vg call has silenced for 4 hours (no results out, no warning/error msgs out). Just for reference, the previous vg call that utilized a single chromosome graph as input finished in 8 minutes.
ADD REPLY
0
Entering edit mode

If you can provide inputs, we can try to see where it's getting hung up.

ADD REPLY
0
Entering edit mode

Certainly, I'd like to provide them! If you need the graph and all the gam files, they amount to approximately 250GB in total, with the graph being 50GB and the gam files being 200GB. Or do you only need a portion of them?

ADD REPLY
0
Entering edit mode

Since you know which path is causing the issue, you could subset the GAM with vg chunk -C -p

ADD REPLY

Login before adding your answer.

Traffic: 2356 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6