Question

IGV uses gene names as chromosome and the annotations file does not work


3.2 years ago

giammafer ▴ 20

Dear all, I am trying to load the Human Cytomegalovirus (HCMV) genome into IGV.

My purpose is to have all the viral genes in a straight line. The virus has one chromosome and I would like to visualise the entire set of genes in the same IGV screen.

Moreover, I will map proteins against this reference genome using their genomic coordinates. For this reason, my final visualisation would be the HCMV genome on top and all the proteins found in the omics analysis on the track below.

Input files

I used the FASTA and GFF3 files with IGV 2.12.2

I downloaded both from the NCBI Nucleotide page "Human herpesvirus 5 strain Merlin, complete genome", using the “Send to” menu with the settings shown in the attached screenshots:

[Screenshot] “Send to” settings used to download the protein nucleotide sequences in FASTA format

[Screenshot] “Send to” settings used to download the protein annotations in GFF3 format

ISSUE 1

The original input FASTA headers are as follows:

lcl|NC_006273.2_cds_YP_081455.1_1 [gene=RL1] [locus_tag=HHV5wtgp001] [db_xref=GeneID:3077430] [protein=protein RL1] [protein_id=YP_081455.1] [location=1367..2299] [gbkey=CDS]

I loaded the FASTA file as a reference genome through the menu “Genome” > “Load Genome”. This is the outcome:

[Screenshot]

I assumed the main reason for this result was that the viral genome lacks a chromosome tag. As a result, IGV treated each protein as a chromosome, and each one appeared in the chromosome dropdown menu.

I also tried putting the number ‘1’ at the beginning of each FASTA header to place all the protein sequences on a “dummy” chromosome 1. Unfortunately, this does not work, because IGV then no longer recognises the FASTA headers.

ISSUE 2

I provided the gene annotations in GFF3 format through the menu “File” > “Load from File”.

[Screenshot]

The features do not appear in the visualisation.

Moreover, I tried to create a dummy .bed track to check whether anything could be mapped against the loaded FASTA. Nothing happened in this case either.

I think something is missing in the sequence ontology shared by the FASTA and GFF3 files, which breaks the link between the sequences and their features. Furthermore, because the protein names became the chromosomes in this case, I think an element loaded through the .bed track has no position on the genome, which prevents IGV from displaying it.

I tested this case with a BED file that contained only this row:

lcl|NC_006273.2_cds_YP_081455.1_1 1367 2299 YP_081455 1000 + 1367 2299 255,255,0 1 932 0
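For comparison, here is a minimal sketch of how a BED row for the same CDS could be derived from its GFF3 line when the whole-genome FASTA is the reference. It relies on two format facts: the first BED column has to match a sequence name in the loaded genome (here `NC_006273.2`), and BED starts are 0-based while GFF3 starts are 1-based. The function name `gff3_to_bed6` is hypothetical, and the sketch only handles simple single-interval features.

```python
def gff3_to_bed6(gff_line):
    """Convert one tab-separated GFF3 feature line into a 6-column BED row.

    GFF3 coordinates are 1-based and inclusive; BED coordinates are
    0-based and end-exclusive, so only the start is shifted by one.
    """
    cols = gff_line.rstrip("\n").split("\t")
    seqid, _source, _ftype, start, end, _score, strand = cols[:7]
    # Column 9 holds key=value attributes separated by semicolons.
    attrs = dict(field.split("=", 1) for field in cols[8].split(";") if "=" in field)
    name = attrs.get("Name", attrs.get("ID", "."))
    return "\t".join([seqid, str(int(start) - 1), end, name, "0", strand])


# The CDS record for RL1, with attributes shortened for the example.
cds = "NC_006273.2\tRefSeq\tCDS\t1367\t2299\t.\t+\t0\tID=cds-YP_081455.1;Name=YP_081455.1"
bed_row = gff3_to_bed6(cds)
print(bed_row)
```

The resulting row starts with `NC_006273.2` rather than the `lcl|...` CDS identifier, so it can only be placed when the genome FASTA, not the CDS FASTA, is loaded as the reference.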

Could you please suggest where I am wrong and a possible solution?

Best regards.

Giammarco

sequence ontology IGV • 1.5k views

ADD COMMENT • link 3.2 years ago by giammafer ▴ 20


Hi, could you provide a few lines from your GFF3 file? Also, did you make sure that the scaffold/chromosome names are the same in the genome and the GFF3 file? Your FASTA also seems to contain the genes; are you sure you have the whole genome there?

ADD REPLY • link 3.2 years ago by danvoronov ▴ 30


Hi, these are the first lines of the annotations in GFF3 format:

##sequence-region NC_006273.2 1 235646
##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=10359
NC_006273.2 RefSeq region 1 235646 . + . ID=NC_006273.2:1..235646;Dbxref=taxon:10359;acronym=HHV-5%3B HCMV;collected-by=Gavin W.G. Wilkinson;collection-date=1999;country=United Kingdom: Cardiff;culture-collection=ATCC:VR-1590;gbkey=Src;genome=genomic;isolation-source=urine from a congenitally infected child;mol_type=genomic DNA;nat-host=Homo sapiens;note=originally named strain 742%3B passaged 3 times in human fibroblasts%3B gene UL128 is mutated%3B populations of mutants in RL13 predominate%2C but the consensus sequence of this gene is not mutated;old-name=Human herpesvirus 5;strain=Merlin
NC_006273.2 RefSeq inverted_repeat 1 1324 . + . ID=id-NC_006273.2:1..1324;Note=TRL;gbkey=repeat_region;rpt_type=inverted
NC_006273.2 RefSeq repeat_region 1 578 . + . ID=id-NC_006273.2:1..578;Note='a' sequence;gbkey=repeat_region;rpt_type=terminal
NC_006273.2 RefSeq gene 1324 2386 . + . ID=gene-HHV5wtgp001;Dbxref=GeneID:3077430;Name=RL1;gbkey=Gene;gene=RL1;gene_biotype=protein_coding;locus_tag=HHV5wtgp001
NC_006273.2 RefSeq mRNA 1356 2386 . + . ID=rna-HHV5wtgp001;Parent=gene-HHV5wtgp001;Dbxref=GeneID:3077430;experiment=Northern blot,RACE;gbkey=mRNA;gene=RL1;locus_tag=HHV5wtgp001;product=protein RL1
NC_006273.2 RefSeq exon 1356 2386 . + . ID=exon-HHV5wtgp001-1;Parent=rna-HHV5wtgp001;Dbxref=GeneID:3077430;experiment=Northern blot,RACE;gbkey=mRNA;gene=RL1;locus_tag=HHV5wtgp001;product=protein RL1
NC_006273.2 RefSeq CDS 1367 2299 . + 0 ID=cds-YP_081455.1;Parent=rna-HHV5wtgp001;Dbxref=Genbank:YP_081455.1,GeneID:3077430;Name=YP_081455.1;Note=RL1 family;gbkey=CDS;gene=RL1;locus_tag=HHV5wtgp001;product=protein RL1;protein_id=YP_081455.1

Regarding the scaffold: each row in the GFF3 starts with the versioned accession (NC_006273.2) in the scaffold column. In the FASTA file, by contrast, the same accession is merged with the protein codes and the "lcl" tag, which is the NCBI prefix for sequences that do not have a specific database reference. For instance, the header I reported in the original message starts with lcl|NC_006273.2_cds_YP_081455.1_1. I do not think this situation is ideal for IGV, but I assumed that two files downloaded from NCBI would contain mutually consistent data. However, I did not see any indication of a chromosome. If I am right, this is because it is a viral genome, where the DNA is organised as a single molecule (no chromosomes). Maybe this is another problem for IGV.

----------

Yes, you are right, I used the FASTA that contains the gene sequences. In my first attempt I tried another FASTA file that contained the whole genome as a single DNA sequence. These are the first rows:

>NC_006273.2 Human herpesvirus 5 strain Merlin, complete genome
CCATTCCGGGCCGTGTGCTGGGTCCCCGAGGGGCGGGGGGGTGTTTTCTGCGGGGGGGTGAA
ATTTGGAGTTGCGTGTGTGGACGGCGACGGCGACTAGTTGCGTGTGCTGCGGTGGGTACGGCG
ACGGCGAATAAAAGCGACGTGCGGCGCGCACGGCGAAAAGCAGACGCGCGTCTGTGTCTGTTT
GAGTCCCCAGGGGACGGCAGCGCGGGTCCTTGGGGACACACGCAAAACAACGGCCAGACAAG

Nevertheless, when I added the GFF3 with the annotations, nothing happened. The annotations did not appear, and I assumed this was not the right way to combine the sequence with the annotations.

For this reason, I tried with the FASTA gene sequences but was unsuccessful.

ADD REPLY • link 3.2 years ago by giammafer ▴ 20


Hi, I searched for the following on NCBI:

Human herpesvirus 5 strain Merlin, complete genome - Nucleotide - NCBI

and downloaded the genome as FASTA and also the GFF3 with features, as per your screenshots. I loaded them in IGV, the genome using Genomes > Load from File and the GFF3 annotation using File > Load from File. The GFF3 annotations are visible. I am afraid I cannot reproduce your problem.

NC_006273.2 Human herpesvirus 5 strain Merlin, complete genome

This is the whole-genome file, not just the gene sequences.

If you used fasta with the following headers:

lcl|NC_006273.2_cds_YP_081455.1_1

as your genome FASTA in IGV, it will not work, because the first word of each genome FASTA header must be the same chromosome/scaffold ID that the GFF3 file uses. Also keep in mind that the GFF3 file gives locations on the genome. The sequences in a CDS FASTA are much shorter than the genome, so many of the genome-based features in the GFF3 would fall outside the length of the CDS/gene sequences, in addition to the header mismatch.
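A quick way to check this before loading anything into IGV is to compare the first word of each FASTA header with the first column of the GFF3. The sketch below is illustrative only: the function names are made up for this example, and the inline strings are shortened stand-ins for the real NCBI files.

```python
import io


def fasta_ids(handle):
    """Return the first word of every FASTA header line."""
    ids = []
    for line in handle:
        if line.startswith(">"):
            parts = line[1:].split()
            if parts:
                ids.append(parts[0])
    return ids


def gff3_seqids(handle):
    """Return the set of column-1 values in a GFF3, skipping comments and directives."""
    seqids = set()
    for line in handle:
        if line.startswith("#") or not line.strip():
            continue
        seqids.add(line.split("\t", 1)[0])
    return seqids


def unmatched_seqids(fasta_handle, gff_handle):
    """Sequence IDs used in the GFF3 that no FASTA header provides."""
    return gff3_seqids(gff_handle) - set(fasta_ids(fasta_handle))


# Shortened stand-ins for the files discussed in this thread.
genome_fasta = ">NC_006273.2 Human herpesvirus 5 strain Merlin, complete genome\nCCATTCC\n"
cds_fasta = ">lcl|NC_006273.2_cds_YP_081455.1_1 [gene=RL1]\nATG\n"
gff3 = "##sequence-region NC_006273.2 1 235646\nNC_006273.2\tRefSeq\tgene\t1324\t2386\t.\t+\t.\tID=gene-HHV5wtgp001\n"

missing_with_genome = unmatched_seqids(io.StringIO(genome_fasta), io.StringIO(gff3))
missing_with_cds = unmatched_seqids(io.StringIO(cds_fasta), io.StringIO(gff3))
print("genome FASTA, unmatched GFF3 IDs:", missing_with_genome)
print("CDS FASTA, unmatched GFF3 IDs:", missing_with_cds)
```

With the genome FASTA the unmatched set is empty. With the CDS FASTA, `NC_006273.2` is reported as unmatched, which is the situation where the annotations have nowhere to be drawn.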

ADD REPLY • link 3.2 years ago by danvoronov ▴ 30


Thank you very much danvoronov for your support.

Now I think it works:

[Screenshot]

My mistake was loading the protein sequences FASTA file as the reference nucleotide sequence instead of the complete genome FASTA. I had misnamed the two files and mixed them up. Apologies for my mistake! I hope it did not waste much of your time.

ADD REPLY • link 3.2 years ago by giammafer ▴ 20