My GTF file has the following layout: a 'transcript' feature (i.e. the full length of the transcript) followed by the 'exon' features within that transcript. For example:
C7123483 cam transcript 1 8268 . + . gene_id "00001"; transcript_id "00001";
C7123483 cam exon 1 206 . + . gene_id "00001"; transcript_id "00001";
C7123483 cam exon 263 749 . + . gene_id "00001"; transcript_id "00001";
Since this file only contains the coordinates for the exons, I would also like it to include the intron coordinates. Presumably I would have to take the gap between the end coordinate of the previous exon and the start coordinate of the next exon. Does anyone have experience doing this? Are there any tools that do it automatically? I am struggling to write a script.
I need the exon/intron coordinates because I have another BED file whose coordinates I need to match with the exon/intron/transcript_id/gene_id information from the GTF file.
I hope this makes sense - I am very new to bioinformatics, and any help would be very much appreciated.
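Look into bedtools complement. Assuming you have only exons in your file, this may work: it reports the intervals not covered by the input, and it needs sorted input plus a genome file (-g) giving the sequence lengths. Note that the output also includes the stretches before the first exon and after the last exon on each sequence, so you would have to filter those out. If you then need exons and introns in a single file, concatenate the two files and re-sort; bedtools merge combines overlapping intervals within one file rather than joining two files.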
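If you would rather script it yourself, here is a minimal Python sketch of the gap-between-consecutive-exons idea from the question. It is only an illustration under some assumptions: the file is tab-delimited as in standard GTF, every exon line carries gene_id and transcript_id attributes, and coordinates are 1-based inclusive, so an intron runs from the previous exon's end + 1 to the next exon's start - 1. The function names (parse_attributes, introns_from_gtf_lines) are made up for this example and do not come from any library.

```python
from collections import defaultdict


def parse_attributes(field):
    """Turn 'gene_id "00001"; transcript_id "00001";' into a dict."""
    attrs = {}
    for part in field.strip().split(";"):
        part = part.strip()
        if part:
            key, _, value = part.partition(" ")
            attrs[key] = value.strip().strip('"')
    return attrs


def introns_from_gtf_lines(lines):
    """Return GTF-style 'intron' lines for the gaps between consecutive exons.

    Assumes tab-delimited GTF with 1-based inclusive coordinates.
    """
    # Group exon intervals by transcript (keeping contig, source and strand).
    exons = defaultdict(list)
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != "exon":
            continue
        attrs = parse_attributes(cols[8])
        key = (cols[0], cols[1], cols[6], attrs["gene_id"], attrs["transcript_id"])
        exons[key].append((int(cols[3]), int(cols[4])))

    introns = []
    for (contig, source, strand, gene_id, tx_id), spans in exons.items():
        spans.sort()  # order exons by start coordinate
        for (_, prev_end), (next_start, _) in zip(spans, spans[1:]):
            if next_start - prev_end > 1:  # skip adjacent or overlapping exons
                attr = f'gene_id "{gene_id}"; transcript_id "{tx_id}";'
                introns.append("\t".join([
                    contig, source, "intron",
                    str(prev_end + 1), str(next_start - 1),
                    ".", strand, ".", attr,
                ]))
    return introns


# The two exons from the question, written with tabs between columns.
example = [
    'C7123483\tcam\texon\t1\t206\t.\t+\t.\tgene_id "00001"; transcript_id "00001";',
    'C7123483\tcam\texon\t263\t749\t.\t+\t.\tgene_id "00001"; transcript_id "00001";',
]
for intron_line in introns_from_gtf_lines(example):
    print(intron_line)  # one intron spanning 207-262
```

For a real file you would pass the open file handle instead of the example list, e.g. introns_from_gtf_lines(open("annotation.gtf")), where the file name is just a placeholder. Grouping by transcript_id keeps introns from being drawn between exons of different transcripts.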
I would suggest looking into the MISO annotation for all the annotated introns, since the region between two exons is not necessarily an "intron": it could be any potential regulatory sequence (5' UTR, 3' UTR, snRNA, etc.) that has not been annotated yet. To call a region an intron you need evidence based on the intron/exon junctions (i.e. its expression is dependent on the flanking exons). Have a look here:
https://miso.readthedocs.io/en/fastmiso/annotation.html
Furthermore, you can use http://rnaseqlib.readthedocs.org/ to make your own annotation.