I have some large (~200kb) contigs which were produced by Illumina sequencing. I want to map these against a human reference genome to identify the genes present on the contigs so their exon sequences can be analysed.
I have exported the region of interest from Ensembl as a .fasta and a .gff file, produced an alignment using BWA, and tried to view the results in IGV.
The alignment of the reference to the contigs is behaving as expected, but I cannot get the gene information in the .gff file to load in IGV. I have tried renaming the sequence ID of every GFF entry to "test" and setting the .fasta header to ">test", as well as other combinations, but nothing seems to work.
Does anyone know what the issue could be? From what I have seen, the .fasta header must match the first column of the GFF exactly, but I have made sure this is the case!
Alternatively, suggestions for other ways to visualise this information would be useful!
Any help is much appreciated :)
##gff-version 3
##sequence-region 14 1 107349540
test Ensembl gene 106303099 106312010 . - . ID=ENSG00000211898;Name=ENSG00000211898;biotype=IG_C_gene
test Ensembl gene 106320349 106322323 . - . ID=ENSG00000211899;Name=ENSG00000211899;biotype=IG_C_gene
test Ensembl gene 106329408 106329468 . - . ID=ENSG00000211900;Name=ENSG00000211900;biotype=IG_J_gene
test Ensembl gene 106329626 106329675 . - . ID=ENSG00000237111;Name=ENSG00000237111;biotype=IG_J_pseudogene
test Ensembl gene 106330024 106330072 . - . ID=ENSG00000242472;Name=ENSG00000242472;biotype=IG_J_gene
test Ensembl gene 106330425 106330470 . - . ID=ENSG00000240041;Name=ENSG00000240041;biotype=IG_J_gene
[EDIT] I think I have found the answer: when you export from Ensembl, the feature coordinates in the GFF file are numbered relative to the whole chromosome, but IGV seems to count from 1 at the start of the FASTA file, so all the coordinates are offset.
Is there a way to get IGV to respect the numbering in the FASTA header? I can't find any information on that in their manual. Otherwise I'll have to write something quickly to subtract the region's start position from each entry in the GFF.
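For reference, the subtraction mentioned above could look roughly like the sketch below. It is only a minimal illustration, not a tested pipeline: it assumes the GFF is tab-separated as GFF3 requires, and `shift_gff` and `region_start` are hypothetical names, where `region_start` is the chromosome coordinate of the first base in the exported FASTA (not given in this thread). Pragma and comment lines are passed through untouched, so a `##sequence-region` line would still describe the whole chromosome afterwards.

```python
def shift_gff(lines, region_start):
    """Yield GFF3 lines with start/end rebased so that region_start becomes 1.

    region_start is an assumption: the 1-based chromosome coordinate of the
    first base of the exported FASTA region.
    """
    offset = region_start - 1
    for line in lines:
        line = line.rstrip("\n")
        # Pass pragmas, comments and blank lines through unchanged.
        if not line or line.startswith("#"):
            yield line
            continue
        cols = line.split("\t")
        # Leave anything that is not a 9-column feature line alone.
        if len(cols) != 9:
            yield line
            continue
        start = int(cols[3]) - offset
        end = int(cols[4]) - offset
        if start < 1:
            raise ValueError(f"feature starts before the region: {line}")
        cols[3], cols[4] = str(start), str(end)
        yield "\t".join(cols)
```

To use it, something like `for out in shift_gff(open("region.gff"), region_start): print(out)` would write the rebased file to standard output; the file name here is a placeholder.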
Can you post a few lines of your gff file?
I've added the first few lines; the .fasta file starts with ">test"\n
I don't see anything wrong with the lines you posted. Try loading just one line of your file and make the start/end span a large distance just so you can obviously visualize it. See if that works.
Very strange, if I just load the top line, with a huge start/end span, it loads perfectly.
I suppose a binary search of the file is in order.
I don't understand what you mean. Are you saying the reference sequence fasta file you loaded in is in pieces? So each chromosome is broken up into multiple fasta entries?
I am not familiar with IGV, but the GFF3 doesn't make much sense. Is there a reference sequence named "test" defined somewhere? The only reference sequence in your snippet is (presumably) for chromosome 14, so I would expect "14" to be in the first column.
There is no binary search for GFF files; the file is loaded as a whole. What often happens is that the seqid columns do not match, so the GFF features cannot be shown. From what I can see, you have renamed the seqid column to "test", which does not seem right.
As the previous poster points out, check the pragmas (the lines starting with ##). Remove the second pragma and try loading the file that way. Then add it back, but make sure it matches. In fact, I am not quite sure what the purpose of this second pragma is.
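A quick way to rule out a seqid mismatch is to compare the names in the two files directly. The sketch below is only an illustration: the function names are made up, it assumes a tab-separated GFF, and it assumes the sequence name is the first whitespace-delimited word of each FASTA header.

```python
def fasta_ids(lines):
    """Collect sequence names: the first word after '>' on each header line."""
    ids = set()
    for line in lines:
        if line.startswith(">") and line[1:].strip():
            ids.add(line[1:].split()[0])
    return ids


def gff_seqids(lines):
    """Collect the first column of every feature line, skipping pragmas."""
    ids = set()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        ids.add(line.split("\t", 1)[0])
    return ids


def unmatched_seqids(fasta_lines, gff_lines):
    """Return GFF seqids that have no matching FASTA sequence name."""
    return gff_seqids(gff_lines) - fasta_ids(fasta_lines)
```

An empty result means every seqid in the GFF has a matching FASTA entry, in which case the names are not the problem and the coordinates are the next thing to check.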