I found that the virus genomes distributed in GenBank format by the PaVE site are not parsed correctly by the genbankr
Bioconductor package, because there are no spaces after the "ORIGIN" field. Either the package is too strict, or the files are missing the spaces, but I could not find a specification of the GenBank format that would tell me which is right and which is wrong. Is that part standardised ?
Edited (26/10/2017):
Here is a minimal example:
LOCUS TEST 240 bp DNA linear VRL 13-Sep-2017 DEFINITION Human papillomavirus 16 (HPV16), complete genome ACCESSION TEST VERSION TEST.1 GI:123456 KEYWORDS . SOURCE TEST ORGANISM TEST . FEATURES Location/Qualifiers source 1..240 gene 42..56 /gene="L2" ORIGIN 1 actacaataa ttcatgtata aaactaaggg cgtaaccgaa atcggttgaa ccgaaaccgg 61 ttagtataaa agcagacatt ttatgcacca aaagagaact gcaatgtttc aggacccaca 121 ggagcgaccc agaaagttac cacagttatg cacagagctg caaacaacta tacatgatat 181 aatattagaa tgtgtgtact gcaagcaaca gttactgcga cgtgaggtat atgactttgc
Here is what happens when importing it (after loading the rtracklayer
and genbankr
packages).
> genes(import("test.gbk")) Annotations don't have 'locus_tag' label, using 'gene' as gene_id column No exons read from genbank file. Assuming sections of CDS are full exons GRanges object with 1 range and 4 metadata columns: seqnames ranges strand | type <rle> <iranges> <rle> | <character> [1] unk [42, 56] + | gene gene <character> [1] L2ORIGIN1 actacaataa ttcatgtata aaactaaggg cgtaaccgaa atcggttgaa ccgaaaccgg61 ttagtataaa agcagacatt ttatgcacca aaagagaact gcaatgtttc aggacccaca121 ggagcgaccc agaaagttac cacagttatg cacagagctg caaacaacta tacatgatat181 aatattagaa tgtgtgtact gcaagcaaca gttactgcga cgtgaggtat atgactttgc loctype <character> [1] normal gene_id <character> [1] L2ORIGIN1 actacaataa ttcatgtata aaactaaggg cgtaaccgaa atcggttgaa ccgaaaccgg61 ttagtataaa agcagacatt ttatgcacca aaagagaact gcaatgtttc aggacccaca121 ggagcgaccc agaaagttac cacagttatg cacagagctg caaacaacta tacatgatat181 aatattagaa tgtgtgtact gcaagcaaca gttactgcga cgtgaggtat atgactttgc ------- seqinfo: 1 sequence from TEST.1 genome
Just sed 's/ORIGIN.*/ORIGIN/'
and the problem will be solved.
> genes(import("test2.gbk")) Annotations don't have 'locus_tag' label, using 'gene' as gene_id column No exons read from genbank file. Assuming sections of CDS are full exons GRanges object with 1 range and 4 metadata columns: seqnames ranges strand | type gene loctype <rle> <iranges> <rle> | <character> <character> <character> [1] unk [42, 56] + | gene L2 normal gene_id <character> [1] L2 ------- seqinfo: 1 sequence from TEST.1 genome
Not shown above, but it looks like the symptoms only appear when the last feature before ORIGIN is a gene.
Hi Charles,
Can you provide a working example? Perhaps we can edit the
genbankr
function so that it can read these files correctly.Kevin
Thanks Kevin for your kind offer (my gut feeling is that PaVE is wrong and genbankr is right) I added example data to the original post.
Yes, it looks like /gene="L2" is causing the issue, perhaps the forward-slash in particular.
I reckon that it will still be easier to modify the function
genbankr
and create a workaround, as opposed to getting PaVE to change their submissions. I typically report these things to the developers too (in this case, whoever wrotegenbankr
). As systems get updated, programs accessing them typically have to update too!