Question

Lacking exons in gtf file of a virus' genome

3 months ago

ZuelTech • 0

Hi! I downloaded a viral genome (Hazara virus) from NCBI Virus. The download folder contained no GTF file, only a GBFF file, so I converted the GBFF file to GFF and then the GFF to GTF. The resulting GTF has no exon lines in the third column, only CDS and transcript lines.

Therefore, when I performed indexing and mapping with STAR, I used --sjdbGTFfeatureExon CDS in my scripts. Indexing finished without errors, and all output files have a non-zero size except the TAB files (particularly sjdbList.out.tab and sjdbList.fromGTF.out.tab), which are 0 bytes. When I then ran mapping, it seemed that the genome index was incompatible because those TAB files are empty.

Could you help me run STAR indexing and mapping successfully? Does the viral genome annotation lack exons?

gtf exon Mapping Hazara VirusGenome • 1.3k views

ADD COMMENT • link updated 3 months ago by colindaven 7.7k • written 3 months ago by ZuelTech • 0

If no proper annotation is available as a GTF, is there a transcriptome FASTA available? If so, you could quantify with something like salmon, maybe even using the genome as a decoy; see the salmon docs for details. At minimum, salmon needs a transcriptome FASTA. With the genome as a decoy, it checks whether any potential transcript alignment has a better match in the genome, so that reads from DNA contamination are not incorrectly quantified.

ADD REPLY • link 3 months ago by ATpoint 88k

No transcriptome is available; I'm working with a virus genome specifically. I need to perform indexing and mapping using STAR. Is there a way to obtain a GTF file with exon information that STAR can read when mapping?

ADD REPLY • link 3 months ago by ZuelTech • 0

You can find the GTF file here --> https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/002/831/085/GCF_002831085.1_ASM283108v1/

Be sure to get the corresponding genome sequence file as well (so everything matches).

ADD REPLY • link 3 months ago by GenoMax 152k

Thank you. But I think the one you provided is from RefSeq (as evident from the file name “GCF…”), whereas I used the genome from GenBank, with names like “GCA…”.

ADD REPLY • link 3 months ago by ZuelTech • 0

score 1 · Answer 1 · 2025-03-31

3 months ago

Michael 56k

If there are no exons, that means that the annotators didn't consider splicing relevant. This is common for bacteria and viruses. Therefore, you don't need to take exons into account. You don't even need to use a splicing-aware aligner or if you do, you don't need to provide the GTF/GFF file when building the genome. The tab file is empty because there are no junctions to be had (sjdb = splice junction db). Anyway, this is a very small genome with only 3 genes placed on 3 different scaffolds with minimal flanking sequence, so STAR is complete overkill here.

Btw.: This is not supposed to mean that viruses or bacteria don't have introns, however rare. Just that there are none in this virus. Here's an exception: https://doi.org/10.1371/journal.ppat.1004164
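
If STAR is used anyway, the index can be built from the FASTA alone, as described above. A minimal sketch of that, assuming hypothetical file names and a placeholder genome length of 19,000 bases (substitute the real total length of your FASTA): the STAR manual advises scaling --genomeSAindexNbases down for small genomes to min(14, log2(GenomeLength)/2 - 1), which the helper below computes before assembling the command.

```python
import math


def sa_index_nbases(genome_length: int) -> int:
    """Value for --genomeSAindexNbases: min(14, log2(GenomeLength)/2 - 1).

    The result is truncated to an integer and kept at 1 or above as a
    defensive guard for extremely short inputs (the guard is an assumption,
    not something taken from the STAR manual).
    """
    return max(1, min(14, int(math.log2(genome_length) / 2 - 1)))


def star_index_command(genome_fasta, genome_dir, genome_length, threads=4):
    """Build the argument list for STAR genome generation without a GTF."""
    return [
        "STAR",
        "--runMode", "genomeGenerate",
        "--runThreadN", str(threads),
        "--genomeDir", genome_dir,
        "--genomeFastaFiles", genome_fasta,
        "--genomeSAindexNbases", str(sa_index_nbases(genome_length)),
    ]


# Hypothetical file names and genome length, for illustration only.
command = star_index_command("hazv_genome.fna", "star_index_hazv", 19000)
print(" ".join(command))
```

The list can be passed to subprocess.run(command, check=True) once STAR is installed. No --sjdbGTFfile or --sjdbGTFfeatureExon option is included, since there are no junctions to annotate.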

ADD COMMENT • link 3 months ago by Michael 56k

Thank you! I’m new to bioinformatics, particularly to working with viruses. What would you recommend for mapping the reads to the HAZV genome using STAR? Does it mean I don’t need to put a gtf file when I do mapping to HAZV genome?

ADD REPLY • link 3 months ago by ZuelTech • 0

Does it mean I don’t need to put a gtf file when I do mapping to HAZV genome?

Indeed, unless you want to generate the count file in STAR. Or you can simply use minimap2. When counting, you can either count genes/CDS or even hits to the contigs. This is an RNA virus, and the contigs likely represent mature vRNA transcripts. Therefore, I believe quantifying based on the contig may be easiest and most correct: simply count the number of reads mapping to each contig and compare. This is of course specific to this virus, so if you want to make a more general pipeline, you should quantify on the gene level.
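
Counting reads per contig can be sketched in a few lines of Python over SAM text output, whichever aligner produced it. This is only an illustration under some assumptions: the contig and read names in the example are made up, each primary alignment record is counted once (so the two mates of a pair count separately), and unmapped, secondary, and supplementary records are skipped using the standard SAM flag bits.

```python
from collections import Counter

# Standard SAM FLAG bits.
UNMAPPED = 0x4
SECONDARY = 0x100
SUPPLEMENTARY = 0x800


def count_reads_per_contig(sam_lines):
    """Count primary alignments per reference sequence (SAM column 3)."""
    counts = Counter()
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])
        if flag & (UNMAPPED | SECONDARY | SUPPLEMENTARY):
            continue  # keep only primary alignments of mapped reads
        counts[fields[2]] += 1
    return counts


# Made-up records for illustration; real input would be an open SAM file.
example = [
    "@SQ\tSN:segment_S\tLN:1677",
    "read1\t0\tsegment_S\t10\t60\t5M\t*\t0\t0\tACGTA\tIIIII",
    "read2\t16\tsegment_L\t25\t60\t5M\t*\t0\t0\tACGTA\tIIIII",
    "read3\t4\t*\t0\t0\t*\t*\t0\t0\tACGTA\tIIIII",
]
for contig, n in sorted(count_reads_per_contig(example).items()):
    print(contig, n)
```

For BAM files, `samtools idxstats` on a sorted, indexed BAM reports comparable per-contig numbers without any custom code.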

ADD REPLY • link 3 months ago by Michael 56k

I see, thank you so much! I’ll try this method. Can I also use Salmon in quantification?

ADD REPLY • link 3 months ago by ZuelTech • 0

Yes, that you can do as well. Then you don't need to use STAR. But I would add the host transcriptome as a decoy there.

ADD REPLY • link 3 months ago by Michael 56k

Thank you! Btw, going back to mapping: I’m working with a vector infected with HAZV and LGTV. I have libraries from the host infected with HAZV, the host infected with LGTV, and an uninfected control. My end goal is to measure host gene expression under infection with these viruses. So, first, I mapped the libraries to the host genome.

Then, I want to map the libraries to each of the virus genome. Is this possible? Which libraries should I map to the HAZV genome? Or shall I map all the libraries to each virus genome?

ADD REPLY • link 3 months ago by ZuelTech • 0

Then, I want to map the libraries to each of the virus genome. Is this possible?

Possible, but would that be logical? If all three genomes are represented in the RNA, then you should map to all three at the same time. There will be reads that multi-map across genomes because of sequence similarity.

In light of this information, you may want to go the salmon route with the three transcriptomes combined in one set (along with genome decoys).

ADD REPLY • link 3 months ago by GenoMax 152k

I mean, can I map all of the libraries (HAZV-infected, LGTV-infected, uninfected) to the HAZV genome? Can I also map them to the LGTV genome? Or should I map only the HAZV-infected libraries to the HAZV genome, and the LGTV-infected libraries to the LGTV genome?

ADD REPLY • link 3 months ago by ZuelTech • 0

Since the host is always going to be there you can do the mapping to a combined three transcriptome pool, as a test/first-pass. That will give you an idea if there are sequences that are aligning to HAZV or LGTV, even when it was not present in the experiment. Ideally you will see no counts for the transcriptome that is absent and counts for the other two that were present.

If you do see significant counts for a transcriptome that should be absent, then you will need to resort to doing independent host+HAZV and host+LGTV transcriptome alignments. A few counts here and there may be OK; more than that suggests that HAZV and LGTV have cross-reacting sequences.

Aligning to just HAZV or LGTV is not a good idea when you know that the samples contain host derived sequences.
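
The first-pass check described above could be sketched as follows, assuming you already have per-sequence read counts from the combined reference. The sequence names, the sequence-to-organism mapping, and the min_reads threshold are all hypothetical; what counts as "significant" is a judgment call for your data.

```python
def totals_by_organism(sequence_counts, sequence_to_organism):
    """Sum per-sequence read counts into one total per organism."""
    totals = {}
    for sequence, n in sequence_counts.items():
        organism = sequence_to_organism.get(sequence, "unassigned")
        totals[organism] = totals.get(organism, 0) + n
    return totals


def unexpected_organisms(totals, expected, min_reads=10):
    """List organisms with at least min_reads reads that should be absent.

    min_reads is an arbitrary placeholder threshold, not a recommendation.
    """
    return sorted(
        organism
        for organism, n in totals.items()
        if organism not in expected and n >= min_reads
    )


# Made-up counts for one HAZV-infected library mapped to host + HAZV + LGTV.
counts = {"host_tx1": 5000, "HAZV_S": 300, "HAZV_L": 120, "LGTV_genome": 3}
organism_of = {
    "host_tx1": "host",
    "HAZV_S": "HAZV",
    "HAZV_L": "HAZV",
    "LGTV_genome": "LGTV",
}
totals = totals_by_organism(counts, organism_of)
print(totals)
print(unexpected_organisms(totals, expected={"host", "HAZV"}))
```

An empty list for every library would support staying with the combined reference; a non-empty one points toward the separate host+HAZV and host+LGTV alignments.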

ADD REPLY • link 3 months ago by GenoMax 152k

score 0 · Answer 2 · 2025-04-01

I think this is a conversion problem rather than a desire of the annotators to not consider splicing. gbff is not a typically used or nice annotation format (aside from at the NCBI?).

I would add the exons in (just give them the same coordinates as the CDS in your Gff3 or gtf files, and unique names). Check genomes on Ensembl to see what modern gtf or gff3 formats look like with respect to naming, exon and CDS.
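
Adding the exons could look like the sketch below, which copies each CDS line of a GTF into an exon line with the same coordinates. It assumes a tab-separated, nine-column GTF; the exon_id naming scheme (exon_1, exon_2, ...) and the example record are made up for illustration. If the file already contains exon lines, it is returned unchanged.

```python
def _feature(line):
    """Return the feature type (column 3) of a GTF record, else None."""
    if line.startswith("#"):
        return None
    fields = line.split("\t")
    return fields[2] if len(fields) == 9 else None


def add_exons_from_cds(gtf_lines):
    """Return GTF lines with an exon line added after every CDS line."""
    lines = [line.rstrip("\n") for line in gtf_lines]
    if any(_feature(line) == "exon" for line in lines):
        return lines  # exons already present, nothing to add

    out = []
    exon_number = 0
    for line in lines:
        out.append(line)
        if _feature(line) != "CDS":
            continue
        exon_number += 1
        fields = line.split("\t")
        fields[2] = "exon"
        fields[7] = "."  # the frame column only applies to CDS records
        attributes = fields[8].rstrip()
        if not attributes.endswith(";"):
            attributes += ";"
        # Hypothetical naming scheme that gives each exon a unique ID.
        fields[8] = f'{attributes} exon_id "exon_{exon_number}";'
        out.append("\t".join(fields))
    return out


# Made-up record for illustration.
example_gtf = [
    'seg_S\tGenbank\tCDS\t50\t1500\t.\t+\t0\tgene_id "g1"; transcript_id "t1";',
]
print("\n".join(add_exons_from_cds(example_gtf)))
```

Note that single-exon transcripts still define no splice junctions, so this mainly helps with tools that insist on exon features for counting.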