I downloaded a GTF file from UCSC and noticed that every gene_id is identical to its gene_name. Since some genes share a name but not an ID, and this causes trouble with some tools, what is the best way to fix this? I know how to solve the reverse problem (where gene names are gene IDs) using a Biomart correspondence table and some awk statements, but this case seems trickier because we need to take the gene coordinates into account to tell apart multiple genes with the same name.
# extract of 2 lines of the file galGal6.ncbiRefSeq.gtf.gz - 2 genes with same name (different orientation and coordinates)
chr19 ncbiRefSeq transcript 569498 570362 . - . gene_id "CCL4"; transcript_id "NM_001030360.2"; gene_name "CCL4"; ref_gene_id "CCL4";
chr19 ncbiRefSeq transcript 576493 578185 . + . gene_id "CCL4"; transcript_id "XM_015295666.2"; gene_name "CCL4"; ref_gene_id "CCL4";
When we download the correspondence table from Biomart, we have:
ENSGALG00000032717 CCL4
ENSGALG00000034478 CCL4
When looking in the UCSC genome browser, I know that ENSGALG00000032717 is the first one (coord: 569498-570362) and ENSGALG00000034478 is the second one (576493-578185), so I can correct it to this:
# corrected
chr19 ncbiRefSeq transcript 569498 570362 . - . gene_id "ENSGALG00000032717"; transcript_id "NM_001030360.2"; gene_name "CCL4"; ref_gene_id "CCL4";
chr19 ncbiRefSeq transcript 576493 578185 . + . gene_id "ENSGALG00000034478"; transcript_id "XM_015295666.2"; gene_name "CCL4"; ref_gene_id "CCL4";
Now my question is: is there an automatic way of doing this?
Thanks for the help!
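For illustration, the manual correction above (same gene name, matching coordinates) could be automated roughly as follows. This is only a sketch under several assumptions: that a table of Ensembl genes with chromosome, start, end, strand, gene ID and gene name is available (for example from a Biomart export), that chromosome names have already been harmonised to the UCSC style ("chr19" rather than "19"), and that both annotations are on the same assembly. The function names `overlap` and `assign_ensembl_ids` are hypothetical.

```python
from collections import defaultdict


def overlap(a_start, a_end, b_start, b_end):
    """Number of bases shared by two closed intervals (0 if they are disjoint)."""
    return max(0, min(a_end, b_end) - max(a_start, b_start) + 1)


def assign_ensembl_ids(transcripts, ensembl_genes):
    """Pick, for each transcript, the same-name Ensembl gene with the largest overlap.

    transcripts:   iterable of (chrom, start, end, strand, gene_name)
    ensembl_genes: iterable of (chrom, start, end, strand, gene_id, gene_name)
    Returns a list of Ensembl gene IDs (None where nothing overlaps),
    in the same order as the input transcripts.
    """
    # Index Ensembl genes by chromosome, strand and name (assumed to be comparable).
    by_key = defaultdict(list)
    for chrom, start, end, strand, gene_id, name in ensembl_genes:
        by_key[(chrom, strand, name)].append((start, end, gene_id))

    result = []
    for chrom, start, end, strand, name in transcripts:
        best_id, best_ov = None, 0
        for g_start, g_end, gene_id in by_key.get((chrom, strand, name), []):
            ov = overlap(start, end, g_start, g_end)
            if ov > best_ov:
                best_id, best_ov = gene_id, ov
        result.append(best_id)
    return result
```

Transcripts that come back as None (no same-name Ensembl gene overlapping on the same strand) would still need to be checked by hand, and gene names that differ between the two annotations are not handled at all.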
Is there a reason why you definitely want to use the UCSC annotations? Seems like this would be most easily solved by using the Ensembl annotations.
If you definitely want to use the RefSeq annotations, I can't see any automatic way of doing this, since there is no other gene between the two CCL4 genes, short of writing code that separates adjacent transcripts into different genes by looking for clusters of overlapping transcripts.
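A minimal sketch of that clustering idea, in Python: transcripts sharing a gene_id are grouped per chromosome and strand, merged into clusters of overlapping intervals, and each cluster gets its own gene_id when there is more than one. The function name `split_by_overlap` and the `_1`, `_2` suffix scheme are assumptions made for illustration; the resulting IDs are synthetic and are not Ensembl or Entrez IDs.

```python
from collections import defaultdict


def split_by_overlap(transcripts):
    """Map each transcript_id to a gene_id that is unique per overlap cluster.

    transcripts: iterable of (chrom, start, end, strand, gene_id, transcript_id)
    Returns {transcript_id: new_gene_id}.
    """
    # Collect all transcripts that currently share the same gene_id.
    groups = defaultdict(list)
    for chrom, start, end, strand, gene_id, tx_id in transcripts:
        groups[gene_id].append((chrom, strand, start, end, tx_id))

    new_ids = {}
    for gene_id, records in groups.items():
        records.sort()  # by chromosome, strand, then start coordinate
        clusters = []
        prev_key, cluster_end = None, None
        for chrom, strand, start, end, tx_id in records:
            # Start a new cluster on a new chromosome/strand or after a gap.
            if (chrom, strand) != prev_key or start > cluster_end:
                clusters.append([])
                prev_key, cluster_end = (chrom, strand), end
            else:
                cluster_end = max(cluster_end, end)
            clusters[-1].append(tx_id)

        # Only add a suffix when the name really covers several loci.
        for n, cluster in enumerate(clusters, start=1):
            suffix = "" if len(clusters) == 1 else f"_{n}"
            for tx_id in cluster:
                new_ids[tx_id] = gene_id + suffix
    return new_ids
```

The returned mapping is keyed by transcript_id, so it could then be used to rewrite the gene_id attribute on every line of the GTF that carries that transcript_id. Note that two isoforms of one gene that do not overlap each other would be split apart by this approach, so the output would need checking.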
I am using both annotations because they show some differences and I wanted to compare them. So I counted how many gene names have multiple gene IDs, and 56 genes are in this situation. For reproducibility, here is how I found these 56 genes:
Data:
Script (cmd.awk):
Output:
where colGeneNameBiomart is the column number containing the gene name.
But still, changing these 56 genes by hand is not really a solution.
There is also the original NCBI GTF file that I could use. It looks like this (same CCL4 example):
And then I can find the correspondences in the Biomart table...
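As a rough Python equivalent of the duplicate count mentioned earlier in this post (gene names that map to more than one gene ID in the Biomart table), something like the sketch below could be used. It is not the original cmd.awk script; the function names and the default column positions (gene ID in the first column, gene name in the second, tab-separated) are assumptions.

```python
import csv
from collections import defaultdict


def names_with_multiple_ids(rows, col_id=0, col_name=1):
    """Return {gene_name: sorted list of gene IDs} for names with more than one ID.

    rows: iterable of lists of strings (one list per table row).
    col_id / col_name: 0-based column positions (assumed layout).
    """
    ids_by_name = defaultdict(set)
    for row in rows:
        if len(row) <= max(col_id, col_name):
            continue  # skip short or empty rows
        name = row[col_name].strip()
        if name:  # Biomart leaves the name empty for unnamed genes
            ids_by_name[name].add(row[col_id].strip())
    return {name: sorted(ids) for name, ids in ids_by_name.items() if len(ids) > 1}


def duplicates_from_file(path, col_id=0, col_name=1):
    """Same as above, reading a tab-separated table from a file path."""
    with open(path, newline="") as handle:
        return names_with_multiple_ids(csv.reader(handle, delimiter="\t"), col_id, col_name)
```

Calling `len(duplicates_from_file(path))` on the Biomart export would give the number of ambiguous gene names; a header line, if present, is simply treated as one more row.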
That's the only thing that would make sense to me, but I thought it was a more common problem. Still, this only works if we use the original NCBI GTF instead of the UCSC ncbiRefSeq GTF, so we didn't really answer the question in the end. I can't understand why UCSC doesn't use a gene ID (Ensembl, Entrez, whichever) though...
I don't think the UCSC datasets were ever really designed to be used in analyses, just for visualization.
It is sort of a problem in human as well, in that there are genes with the same name in different locations, but at least in that case they are on different chromosomes.
Do you draw a similar distinction between datasets from NCBI and Ensembl? Which database do you consider best?
The databases are moving towards each other, at least for human. But as a rule of thumb, Ensembl is more comprehensive and RefSeq is more selective. We usually use Ensembl, but occasionally use RefSeq for particular things where the presence of suspect transcripts in Ensembl is a problem.
?
It doesn't answer the question, I'll edit it to make it clearer