Dear All, I have been struggling with PopGenome for a while and I ran out of ideas. I have data in VCF format with SNPs from several fragments of a few genes, a reference fasta with the same names as in VCF, and GTF like:
1 VCF CDS 68 124 . + 2 gene_id "Gene.100265";
2 VCF CDS 126 405 . + 1 gene_id "Gene.100265";
3 VCF CDS 447 820 . + 1 gene_id "Gene.100265";
4 VCF CDS 864 1078 . + 1 gene_id "Gene.100265";
The genes are sequenced in a few hundred individuals. Each gene was sequenced in a few fragments and the fragments are the same for all individuals.
I do not know to make PopGenome include information on coding sequence. I used:
PGfile <- PopGenome::readData("./vcf/",gffpath = "./gtf/", format = "VCF", include.unknown = TRUE)
PGfile <- set.synnonsyn(PGfile, ref.chr=paste0("./FASTA/references_uncoded_246_.fasta"))
PGfile@region.data@CodingSNPS are all TRUE, but PGfile@region.data@ExonSNPS are all FALSE. For some reason I get only a fraction of information from PGfile@region.data@codons and PGfile@region.data@n.nucleotides which anyway is NULL.
I guess this might be a problem about gtf file format but found no clue on how it should be formatted for PopGenome. I used "Gene", "Exon" and "CDS" (as above) for the third column, but nothing has changed.
Does anyone have idea what I did wrong?
Cheers,
Maciek KonopiĆski
The session info is as follows (all necessary packages are loaded):
R version 4.0.3 (2020-10-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.2 LTS
PopGenome_2.7.5
EDIT I have added a lot of redundant information to gff file and now it looks as follows:
Gene.138885 . Region 1 369 . + . ID=Gene.138885;Name=Gene.138885;chromosome=Gene.138885;genome=chromosome;mol_type=genomic DNA
Gene.138885 . mRNA 1 369 . + . ID=Gene.138885_mRNA;Name=Gene.138885;chromosome=Gene.138885;genome=chromosome;mol_type=genomic DNA
Gene.138885 . gene 1 369 . + . ID=Gene.138885_gene;Name=Gene.138885;chromosome=Gene.138885;genome=chromosome;mol_type=genomic DNA
Gene.138885 . exon 264 343 . + . ID=Gene.1388851exon
Gene.138885 . CDS 264 343 . + 1 ID=Gene.1388851CDS
I reached another level of problems - set.synnonsyn
returns error:
Error in START[!REV, 3] : incorrect number of dimensions
This error affects regions where only one CDS is present. In set.synnonsyn
START contains a vector that has no dimensions while the function requires two-dimensional table. I will try to fix it manually, but I am a bit affraid it will have further consequences...