Hi everyone,
I'm trying to map my data to the genome graph provided by the Human Pangenome Project on GitHub (https://github.com/human-pangenomics/hpp_pangenome_resources). While doing that I've run into several problems, which ultimately led me here.
I want to do transcriptomics with my data, so from my understanding I need to use vg mpmap, and for that I need these files: graph.vg (splice-aware), graph.gcsa, and graph.dist.
Before mapping I need the indexes, and I need to make sure my graph is "splice-aware". On GitHub I can get a graph.gfa (I used the Minigraph-Cactus graph with GRCh38 as the reference), a variants.vcf, and a graph.dist. The other files I have to create myself, which according to the documentation on GitHub vg autoindex should do, and it also returns a splice-aware graph. I ran vg autoindex with the following command:
vg autoindex --workflow mpmap --prefix hprc-v1.1-mc-grch38_new. -g hprc-v1.1-mc-grch38.gfa --vcf hprc-v1.1-mc-grch38.vcfbub.a100k.wave.vcf --tx-gff hg38-annotation.gtf
I used the annotation from NCBI (https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000001405.40/).
This is where I run into the error: ERROR: Chromosome path "NC_000001.11" not found in graph or haplotype index (line 6)
The problem appears to be the annotation used. I also tried the annotations from GENCODE (https://www.gencodegenes.org/human/) and UCSC (https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/genes/), which resulted in the same error (instead of "NC_000001.11" the error said "chr1"). This reaffirms my belief that the contig names used in the NCBI/GENCODE/UCSC annotations do not match the path names in the graph.
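One way to confirm a naming mismatch like this is to list the contig names on both sides and compare them. The two helper functions here are a minimal sketch and are not part of vg. The names gtf_contigs and gfa_paths are made up for illustration, and the GFA helper assumes the graph stores its paths as W (walk) lines with sample, haplotype, and contig in columns 2 to 4, or as P lines with the full name in column 2.

```shell
# Contig names used by an annotation: column 1 of every non-comment line.
gtf_contigs() {
  grep -v '^#' "$1" | cut -f1 | sort -u
}

# Path names exposed by a GFA, written as sample#haplotype#contig.
# Assumption: W lines are "W <sample> <haplotype> <contig> <start> <end> <walk>";
# P lines, if any, already carry a complete name in column 2.
gfa_paths() {
  awk -F'\t' '
    $1 == "W" { print $2 "#" $3 "#" $4 }
    $1 == "P" { print $2 }
  ' "$1" | sort -u
}
```

Running gtf_contigs hg38-annotation.gtf next to gfa_paths hprc-v1.1-mc-grch38.gfa should show at a glance whether the two files use the same names. On a full-size pangenome GFA the second call has to read every walk line, so it can take a while.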
My problem basically boils down to two questions, and answering either should allow me to map:
1) What annotation do I need to autoindex the provided graph so I can get all the files (graph.vg, graph.gcsa, graph.dist) needed for mapping?
2) Is the graph provided on GitHub splice-aware (I can't see anything there indicating that it is or isn't)? And if so, can I just use vg convert to turn the graph.gfa into a graph.vg and then follow the basic steps to get a graph.gcsa to map with?
I'm using vg version 1.56
The GFA you are working with expresses the reference sequences as a haplotype of a specific sample, "GRCh38". In this case, vg autoindex expects the contig names from the GTF to be expressed using the PanSN naming specification. PanSN identifiers consist of three fields: sample, haplotype, and contig. For a reference, there is really only one "haplotype", so we use a placeholder value for the second field. The full sequence names look like GRCh38#0#chr1. If you modify your GTF to match this format, it should be accepted by vg autoindex.
A further note: the VCF that you are providing will not be used in this indexing pipeline. vg does not currently have a fully developed method to add variants from a VCF into a general graph.
Thank you for the reply. Changing the contig names did in fact solve the error. For anyone having the same problem in the future, the exact code I used to fix it was:
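The command itself is missing from the post. Going by the description that follows, it was presumably the sed one-liner shown here. This is a reconstruction, not the poster's verbatim command. The small file written at the top of the block is made-up demo data so the sketch runs on its own, and -i without an argument is GNU sed syntax (BSD/macOS sed expects -i '').

```shell
# Made-up stand-in for a GENCODE GTF: 5 header lines, then two records.
# WARNING: this overwrites any existing example.gtf in the current directory.
printf '##description: demo\n##provider: demo\n##contact: demo\n##format: gtf\n##date: demo\n' > example.gtf
printf 'chr1\tHAVANA\tgene\t11869\t14409\t.\t+\t.\tgene_id "demo1";\n' >> example.gtf
printf 'chr2\tHAVANA\tgene\t100\t200\t.\t-\t.\tgene_id "demo2";\n' >> example.gtf

# Prepend the PanSN sample and haplotype fields to every line from line 6 onward.
sed -i '6,$ s/^/GRCh38#0#/' example.gtf

# Show the contig column of the two records.
cut -f1 example.gtf | tail -n 2
# prints:
# GRCh38#0#chr1
# GRCh38#0#chr2
```

If the number of header lines is not known in advance, addressing every line that does not start with "#" (for example with the sed address /^#/!) avoids hard-coding the 6.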
This command prepends "GRCh38#0#" to every line of the file "example.gtf", from line 6 to the end of the file (everything before line 6 is header information in the GENCODE annotation; if you are using another annotation, please check the file). 6,$ specifies the range of lines to operate on, with 6 being the first line and $ the end of the file. The ^ character matches the beginning of a line. The -i flag tells sed to apply the changes to the input file directly.
Regarding the VCF file, I'm a bit confused. I took it from the hpp_pangenome_resources GitHub repo, and I thought I needed a VCF to run autoindex. If the VCF is not used in the indexing pipeline, why is it required for indexing?
The VCF should not be required. vg autoindex supports creating indexes from several different data sources. In practice, most graphs are constructed either by 1) adding variants from a VCF onto a reference sequence, or 2) specialized graph construction tools that generate a GFA. Accordingly, the input formulas are generally either GFA or VCF+FASTA. The HPRC VCFs were generated by decomposing a GFA graph into variants.
As it's currently implemented, vg autoindex prefers to use the GFA input, and GFA+VCF still satisfies the minimum data requirements to make a GFA-based index. However, the VCF will just be ignored. We should probably add a warning when some of the inputs aren't being used.
Hi, thank you for your help again. The error no longer occurs, and I could remove the VCF part from the command, making it easier to read. However, the function still does not work entirely (apparently due to memory issues). Since these two errors are not related, I have opened a new issue here.
Any further help would be greatly appreciated.
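For reference, the autoindex call from the question with the --vcf argument dropped would look roughly like this. It is a sketch that only assembles and prints the command. The file names are the ones from the question, the GTF is assumed to already carry the GRCh38#0# prefix, and the variable names are made up for illustration.

```shell
# File names taken from the question; adjust to your own data.
GFA=hprc-v1.1-mc-grch38.gfa
GTF=hg38-annotation.gtf        # assumed: contig names already rewritten to GRCh38#0#chrN
PREFIX=hprc-v1.1-mc-grch38_new

# Same flags as the original command, minus --vcf (it is ignored for GFA input).
autoindex_cmd="vg autoindex --workflow mpmap --prefix $PREFIX -g $GFA --tx-gff $GTF"
echo "$autoindex_cmd"

# To actually run it (requires vg on PATH; memory problems were reported at this step):
# $autoindex_cmd
```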