CGAT: Error while running gtf2gtf
1
0
Entering edit mode
8.3 years ago

I have gff file downloaded from NCBI and it looks like this

enter image description here

I have used the cufflinks binary gffread to covert this into a gtf file using ./gffread Hgenome_test.gff -T -o Hgenome_test.gtf

The generated gtf file looks like the following:

enter image description here

Now, I am interested in selecting the longest of the overlapping genes. I tried to achieve this using cgat gtf2gtf -I ./Hgenome_test.gtf --method=filter --filter-method=longest-gene. However, when I am trying this, I am getting the following error message:

enter image description here

Could someone please tell me what is going wrong and how can I resolve it?

cgat gtf2gtf • 3.0k views
ADD COMMENT
2
Entering edit mode
8.3 years ago

This is because not all of the entries of gene10 are listed next to each other in the gtf file - gffread is outputting a different sort order to that expected by gtf2gtf. Amongst the many things that gtf2gtf can do, one is sort in a variety of sort orders, the default being to sort on gene_id (and then genomic position). This can be achieved like so, piping the sorted results staight back into gtf2gtf for the filtering:

cgat gtf2gtf -I ./Hgenome_test.gtf --method=sort | cgat gtf2gtf --method=filter --filter-method=longest-gene
ADD COMMENT
0
Entering edit mode

Thank you Dr. Sudbery for your reply and the suggestion. However, I am still stuck with it. I am sorry to seek advice at every step like this. Below is a screenshot of the current error message:

enter image description here

ADD REPLY
0
Entering edit mode

So the gtf file produced by gffread from the ncbi gff is not a valid GTF file. In particular some of the lines are missing gene_id attributes, which are required by the GTF specification, and obviously are required to sort the file into gene order. I tried this out myself, there are a few hundred entries of something called "curated genomic" exons, whatever they are, that don't contain gene_ids. They seem to be mostly immunogloblin genes, and I know that ensembl treats Ig genes in a special way, I don't know about RefSeq. This kind of makes sense - Ig genes arn't proper genes until they are rearranged during lymphocyte development, but are just isolated exons, but without a gene_id, the record isn't a valid GTF and can't be sorted on gene. The easiest way to deal with this is to filter them out with grep:

 grep gene_id Hgenome_test.gtf > Hgenome_filtered.gtf

I tired this, I was able to successfully then sort and filter the resulting file.

ADD REPLY
0
Entering edit mode

Thank you so much Dr. Sudbery. This time the commands indeed ran flawlessly without any errors.

ADD REPLY
0
Entering edit mode

However, in my case the output thus generated does still contain overlapping genes. (Line 12 onwards)

Line 12 onwards

ADD REPLY
0
Entering edit mode

These records are not overlapping genes - they are all exons from the same gene.

ADD REPLY
0
Entering edit mode

Yes, that's correct. Sorry for the blunder. Thank you so much Dr. Sudbery.

ADD REPLY
0
Entering edit mode

If you want a single line per gene (for easy counting of the number of genes), you could run the merge-transcripts method of gtf2gtf to get a single entry running from the start of the frist exon of each gene to the end of the last one.

Alternatively the gtf2table tool will output a tsv file that contains the name of each gene along with its chromosome and start and end co-ordinates:

cgat gtf2table -I Hgenome_non-overlapping.gtf --counter=position
ADD REPLY
0
Entering edit mode

Thank you Dr. Sudbery. Your suggestion worked absolutely flawlessly. However, I am facing another problem regrading the chromosome number. This appears on the first line at the beginning of each chromosome in the gff file. What is mentioned at the beginning each row is the RefSeq accession ID. enter image description here

This information regarding the chromosome number is getting lost while converting the gff file to gtf file.I have corresponded with NCBI regarding this and there seems to be no straightforward way to translate the accession IDs to chromosome numbers. Is there any way to preserve this information while converting the gff file to file, e.g. replace the accession IDs with corresponding chromosome numbers?

ADD REPLY
0
Entering edit mode

So I don't know a good way to do this. But to get a table of assembly to chromosome number you could use the following:

awk '$3=="region"' Hgenome_test.gff  |  sed -E 's/(N._[0-9]+\.[0-9]+).+chromosome=([^;]+);.+/\1\tchr\2/g' | sort -u > refseq2chr.tsv
ADD REPLY

Login before adding your answer.

Traffic: 2206 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6