17 months ago
hxt
**I am trying to use Giraffe to map paired-end short reads against my pan-genome graph, and I get the warnings below.
Could you tell me what I can do about them?**
warning[vg::Watchdog]: Thread 16 has been checked in for 10 seconds processing: ERR626576.4149, ERR626576.4149
warning[vg::Watchdog]: Thread 17 has been checked in for 10 seconds processing: ERR626576.7602, ERR626576.7602
warning[vg::Watchdog]: Thread 18 has been checked in for 10 seconds processing: ERR626576.6088, ERR626576.6088
warning[vg::Watchdog]: Thread 20 has been checked in for 10 seconds processing: ERR626576.12421, ERR626576.12421
warning[vg::Watchdog]: Thread 8 has been checked in for 10 seconds processing: ERR626576.29760, ERR626576.29760
warning[vg::Watchdog]: Thread 13 has been checked in for 10 seconds processing: ERR626576.27768, ERR626576.27768
warning[vg::Watchdog]: Thread 0 has been checked in for 10 seconds processing: ERR626576.332252, ERR626576.332252
warning[vg::Watchdog]: Thread 4 has been checked in for 10 seconds processing: ERR626576.59052, ERR626576.59052
warning[vg::Watchdog]: Thread 21 has been checked in for 10 seconds processing: ERR626576.70060, ERR626576.70060
warning[vg::Watchdog]: Thread 1 has been checked in for 10 seconds processing: ERR626576.75464, ERR626576.75464
warning[vg::Watchdog]: Thread 3 has been checked in for 10 seconds processing: ERR626576.86319, ERR626576.86319
warning[vg::Watchdog]: Thread 19 has been checked in for 10 seconds processing: ERR626576.85706, ERR626576.85706
warning[vg::Watchdog]: Thread 22 has been checked in for 10 seconds processing: ERR626576.108047, ERR626576.108047
warning[vg::Watchdog]: Thread 13 finally checked out after 21 seconds and 56384 kb memory growth processing: ERR626576.27768, ERR626576.27768
warning[vg::Watchdog]: Thread 14 has been checked in for 10 seconds processing: ERR626576.126421, ERR626576.126421
This is just a warning, not an error. Did you get an error in the end?
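To tell slow-but-finishing reads apart from reads that never finish, the watchdog lines can be summarized. The following is a minimal sketch that assumes the log looks exactly like the lines quoted above (other vg versions may word the messages differently); the function name `summarize_watchdog` is made up for illustration.

```python
import re

# Patterns follow the warning lines quoted in this thread (an assumption about the log format).
CHECKED_IN = re.compile(
    r"warning\[vg::Watchdog\]: Thread (\d+) has been checked in for "
    r"(\d+) seconds processing: (\S+?),"
)
CHECKED_OUT = re.compile(
    r"warning\[vg::Watchdog\]: Thread (\d+) finally checked out after "
    r"(\d+) seconds and (\d+) kb memory growth processing: (\S+?),"
)


def summarize_watchdog(lines):
    """Return (finished, stuck).

    finished maps read name -> seconds reported at check-out.
    stuck is a sorted list of reads that were flagged but never checked out.
    """
    slow = {}  # read name -> seconds, or None if no check-out was seen
    for line in lines:
        match = CHECKED_OUT.search(line)
        if match:
            slow[match.group(4)] = int(match.group(2))
            continue
        match = CHECKED_IN.search(line)
        if match:
            slow.setdefault(match.group(3), None)
    finished = {read: secs for read, secs in slow.items() if secs is not None}
    stuck = sorted(read for read, secs in slow.items() if secs is None)
    return finished, stuck
```

Feeding it the saved stderr of the run (for example `summarize_watchdog(open("giraffe.log"))`, where `giraffe.log` is a hypothetical file name) gives a list of read names that could be pulled out and mapped on their own.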
Thank you for your reply!
No error was reported in the end, because the command never ran to completion.
I ran this command: vg giraffe -t 24 -Z 63-pg.d2.gbz -d 63-pg.d2.dist -m 63-pg.d2.min -f IRIS_313-7684_1.QC.fastq.gz -f IRIS_313-7684_2.QC.fastq.gz > IRIS_313-7684_mapped.gam
When I first started the command, the GAM file kept growing, but after the warning messages appeared its size stopped changing. It stayed that way all night, so I terminated the command. I don't think vg giraffe should take that long.
I would expect vg giraffe to be slow. For me, it's almost impossible to use. In my experience, the only short-read aligner for pangenomes that is efficient, and therefore fast, is minigraph, and to my knowledge it sadly works only on rGFA pangenomes, not GFA ones.
Speed depends on the following
head -n 4000000 R1.fastq > small_R1.fastq
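The `head` command above works on an uncompressed FASTQ, while the reads in this thread are gzipped and paired. As a minimal sketch of the same idea for that case, the following takes the first N records from both files of a pair. It assumes standard 4-line FASTQ records and that both files list mates in the same order; the function names are made up for illustration.

```python
import gzip
from itertools import islice


def take_first_reads(src_path, dst_path, n_reads):
    """Copy the first n_reads 4-line FASTQ records from a gzipped file.

    Returns the number of complete records written.
    """
    lines_written = 0
    with gzip.open(src_path, "rt") as src, gzip.open(dst_path, "wt") as dst:
        for line in islice(src, 4 * n_reads):
            dst.write(line)
            lines_written += 1
    return lines_written // 4


def subset_pair(r1, r2, out1, out2, n_reads=1_000_000):
    """Subset both mates of a paired-end run to the same number of reads."""
    n1 = take_first_reads(r1, out1, n_reads)
    n2 = take_first_reads(r2, out2, n_reads)
    if n1 != n2:
        # Only record counts are compared here, not read names.
        raise ValueError(f"mate files differ in length: {n1} vs {n2} reads")
    return n1
```

The default of 1,000,000 reads matches the 4,000,000 lines taken by `head`. The two small output files can then be passed to the same vg giraffe command with `-f` to check whether a small run finishes before committing to the full data set.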
Your reply is very useful! I think the problem is most likely the third reason above.
My pan-genome was built using the Minigraph-Cactus pipeline. I have two pan-genomes, one built from 63 genomes and the other from 5. When I run the vg giraffe command on the pan-genome built from 5 genomes, it finishes in a relatively short time without any warning messages. But with the pan-genome built from 63 genomes, the warning messages above appear. I want to run vg giraffe on the pan-genome built from 63 genomes. What should I do?
Thanks!
Often the hard part is graph construction. It's easy to build a graph that represents some alignment between the sequences, but what you really want is a graph (an alignment) that is useful in the intended application.
For Giraffe, that means avoiding complex structures at every level of detail while simultaneously trying to minimize sequence duplication. The Minigraph–Cactus pipeline achieves that, at least if the sequences resemble human genomes and there are not too many of them. For a new species, graph construction can be a major project until you find the right way to do it.
Minigraph–Cactus produces several versions of the graph. With a few haplotypes (like 5), you want to use the default (`clip`) graph for mapping. With more haplotypes (like 63), it's better to use the `filter` graph that drops every node visited by only one haplotype.
There is also an experimental option of using a sample-specific subgraph that consists of a few synthetic haplotypes. You start from the `clip` graph, count kmers in the reads, and generate a few synthetic haplotypes that should be similar to the sequenced sample. With human genomes, this approach is faster and produces better variant calling results than the `filter` graph. Nobody has tried it with other species yet, and we are not confident that the current approach handles structural variants properly.
I don't think it's going to work to be honest with a large complex pangenome. Like I say, I have had big problems with performance trying to map short reads to plant pangenomes.
I don't think short-read mapping is a solved problem yet for pangenomics; others are looking at non-mapping tools like PanGenie.