Dear all, I am trying to create an index file using the splice site and exon information from a .gtf file. However, when I run the hisat2 Python scripts on Ubuntu, I get empty output files. I verified the same commands with a Saccharomyces .gtf file, and they produce the expected files. I therefore suspected that my .gtf file differs in format from the Saccharomyces one, but when I compared the two, my file's format looks correct as well.
Here is a preview of my .gff3 file:
chloro . exon 5158 6606 . - . ID=Ljchlorog3v0000040.1.exon.1;Parent=Ljchlorog3v0000040.1;sequencetype=Protein coding
chloro . CDS 5158 6585 . - 0 ID=Ljchlorog3v0000040.1.CDS.1;Parent=Ljchlorog3v0000040.1;sequencetype=Protein coding
I suspect that the description in the last column has something to do with the splice site extraction; perhaps an exon number is needed there. I wonder whether, instead of exon.1 in the last column, I need to state it as "exon number = 1" for the hisat2 Python script to extract the splice site information.
FYI: I extracted the exons file by simply selecting the exon rows and subtracting 1 from the positions in columns 4 and 5.
Please help. Regards.
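For reference, the manual exon extraction described above (keep the exon rows, subtract 1 from the start and end columns) could be sketched in Python roughly as follows. This is only an illustration, assuming the annotation is tab-separated with nine columns; the function name exon_rows is made up here and is not part of hisat2.

```python
def exon_rows(lines):
    """Yield (chrom, start - 1, end - 1, strand) for every exon row.

    Assumes tab-separated annotation lines with nine columns.
    """
    for line in lines:
        # Skip comments and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        # Keep only complete rows whose feature type (column 3) is "exon".
        if len(fields) < 9 or fields[2] != "exon":
            continue
        chrom, strand = fields[0], fields[6]
        start, end = int(fields[3]), int(fields[4])
        yield chrom, start - 1, end - 1, strand


# The two preview rows from the post, assumed here to be tab-separated.
sample = [
    "chloro\t.\texon\t5158\t6606\t.\t-\t.\t"
    "ID=Ljchlorog3v0000040.1.exon.1;Parent=Ljchlorog3v0000040.1;sequencetype=Protein coding",
    "chloro\t.\tCDS\t5158\t6585\t.\t-\t0\t"
    "ID=Ljchlorog3v0000040.1.CDS.1;Parent=Ljchlorog3v0000040.1;sequencetype=Protein coding",
]

for row in exon_rows(sample):
    print("\t".join(str(value) for value in row))
```

Only the exon row survives the filter; the CDS row is dropped.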
Dear Istvan, Sure, the problem is that no .gtf file is available in our organism's database, so I had to convert the .gff3 file to .gtf. I have already tried this .gtf file on the usegalaxy.org server for RNA-Seq and it seems to work properly.
Here is a preview of my .gtf file:
I have also already tried this .gtf file with the Python script, and it does not work either (sorry for not mentioning this in the original post). It still returns a blank output. The same command, however, works properly for the Saccharomyces .gtf file. Does the last column need to mention the exon number? I am lost as to why the Python script cannot extract splice site and exon information from this file while it works nicely for the yeast one. The exon positions are stated properly in this file as well, as they would need to be for the exon file output.
Thank you and regards, Debatosh Das.
The script here
https://github.com/infphilo/hisat2/blob/master/hisat2_extract_splice_sites.py
is a fairly simple line-oriented program; you should be able to read through it and see what it does. It appears that it only requires exon to be in the type column, and transcript_id and gene_id to be listed.
Add a few print() commands here and there and you can follow what the tool does. As far as tool troubleshooting goes, this is an easier case.
It is likely that the junctions variable is empty at line 88 and that is why you don't get any output. Now why that is empty is anybody's guess. No one can really troubleshoot this without seeing your entire file or at least the first few columns that do list a full transcript.
Dear Istvan, 1) Thank you for replying so promptly. I will surely try print commands to see if that yields something. 2) I did not understand your suggestion about checking the junctions variable at line 88. Do you mean splice junctions? 3) As for the file contents, I will try to paste the first few columns of Lotus japonicus.gtf (converted from the .gff3 downloaded from the LOTUS BASE website) and Saccharomyces.gtf (downloaded from the ENSEMBL website).
Preview of Lotus japonicus.gtf (column-wise contents, each row has been separated by space while pasting):
Preview of Saccharomyces cerevisiae .gtf (column-wise contents, each row has been separated by space while pasting):
Both files seem to have the same kind of contents and to be structured in a similar way. What I don't understand is why the Python script you mentioned in your reply above writes a proper output for the yeast .gtf but not for the Lotus one.
Thank you. Regards, Debatosh Das.
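One way to narrow this down without editing the hisat2 script is to count, for each file, how many lines pass the requirements mentioned earlier in the thread (exon as the feature type, with transcript_id and gene_id listed). The sketch below is not the hisat2 code; it rests on a few assumptions that should be checked against the script itself: that lines are split on tabs into nine columns, that attributes use the GTF style of key "value"; pairs, and that a splice site can only come from a transcript with at least two exons. The name diagnose_gtf and the counter labels are made up for illustration.

```python
import re
from collections import Counter, defaultdict

# GTF-style attribute: key "value"; (assumed format, not taken from hisat2)
ATTR_RE = re.compile(r'([^\s;]+)\s+"([^"]*)"')


def diagnose_gtf(lines):
    """Count how many lines fail or pass each assumed requirement."""
    stats = Counter()
    exons_per_transcript = defaultdict(int)
    for line in lines:
        line = line.rstrip("\n")
        # Skip comments and blank lines.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) != 9:
            # For example, a file whose columns are separated by spaces.
            stats["not_9_tab_columns"] += 1
            continue
        if fields[2] != "exon":
            stats["non_exon_feature"] += 1
            continue
        attrs = dict(ATTR_RE.findall(fields[8]))
        if "gene_id" not in attrs or "transcript_id" not in attrs:
            # For example, GFF3-style ID=...;Parent=... attributes.
            stats["exon_missing_ids"] += 1
            continue
        stats["usable_exon"] += 1
        exons_per_transcript[attrs["transcript_id"]] += 1
    # Transcripts with a single exon cannot contribute a junction.
    stats["multi_exon_transcripts"] = sum(
        1 for count in exons_per_transcript.values() if count > 1
    )
    return stats


# In-memory demo with hypothetical lines; g1 and t1 are invented identifiers.
demo = [
    'chr1\tsrc\texon\t100\t200\t.\t+\t.\tgene_id "g1"; transcript_id "t1";',
    'chr1\tsrc\texon\t300\t400\t.\t+\t.\tgene_id "g1"; transcript_id "t1";',
    "chloro\t.\texon\t5158\t6606\t.\t-\t.\tID=x.exon.1;Parent=x",
    'chr1 src exon 100 200 . + . gene_id "g1"; transcript_id "t1";',
]
for reason, count in sorted(diagnose_gtf(demo).items()):
    print(reason, count)
```

To run it on a real file, pass an open file handle, for example diagnose_gtf(open(path)) with path pointing at the Lotus or yeast .gtf. If usable_exon or multi_exon_transcripts is zero for the Lotus file but not for the yeast file, the other counters indicate where the two files differ.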