I have gff file downloaded from NCBI and it looks like this
I have used the cufflinks binary gffread to covert this into a gtf file using ./gffread Hgenome_test.gff -T -o Hgenome_test.gtf
The generated gtf file looks like the following:
Now, I am interested in selecting the longest of the overlapping genes. I tried to achieve this using cgat gtf2gtf -I ./Hgenome_test.gtf --method=filter --filter-method=longest-gene
. However, when I am trying this, I am getting the following error message:
Could someone please tell me what is going wrong and how can I resolve it?
Thank you Dr. Sudbery for your reply and the suggestion. However, I am still stuck with it. I am sorry to seek advice at every step like this. Below is a screenshot of the current error message:
So the gtf file produced by gffread from the ncbi gff is not a valid GTF file. In particular some of the lines are missing gene_id attributes, which are required by the GTF specification, and obviously are required to sort the file into gene order. I tried this out myself, there are a few hundred entries of something called "curated genomic" exons, whatever they are, that don't contain gene_ids. They seem to be mostly immunogloblin genes, and I know that ensembl treats Ig genes in a special way, I don't know about RefSeq. This kind of makes sense - Ig genes arn't proper genes until they are rearranged during lymphocyte development, but are just isolated exons, but without a gene_id, the record isn't a valid GTF and can't be sorted on gene. The easiest way to deal with this is to filter them out with grep:
I tired this, I was able to successfully then sort and filter the resulting file.
Thank you so much Dr. Sudbery. This time the commands indeed ran flawlessly without any errors.
However, in my case the output thus generated does still contain overlapping genes. (Line 12 onwards)
These records are not overlapping genes - they are all exons from the same gene.
Yes, that's correct. Sorry for the blunder. Thank you so much Dr. Sudbery.
If you want a single line per gene (for easy counting of the number of genes), you could run the merge-transcripts method of gtf2gtf to get a single entry running from the start of the frist exon of each gene to the end of the last one.
Alternatively the gtf2table tool will output a tsv file that contains the name of each gene along with its chromosome and start and end co-ordinates:
Thank you Dr. Sudbery. Your suggestion worked absolutely flawlessly. However, I am facing another problem regrading the chromosome number. This appears on the first line at the beginning of each chromosome in the gff file. What is mentioned at the beginning each row is the RefSeq accession ID.
This information regarding the chromosome number is getting lost while converting the gff file to gtf file.I have corresponded with NCBI regarding this and there seems to be no straightforward way to translate the accession IDs to chromosome numbers. Is there any way to preserve this information while converting the gff file to file, e.g. replace the accession IDs with corresponding chromosome numbers?
So I don't know a good way to do this. But to get a table of assembly to chromosome number you could use the following: