I am trying to count reads using HT-Seq. In my gtf file, there are 60603 genes when I use the gene_id option I get all the genes in the count file with reads, but when I use the gene_name option I get only 59055 genes reads in the count file. Please look into this. Thank you
Generally a gene ID will uniquely map to only one gene name, but a gene name can map to multiple gene IDs. You can also have genes that have only a gene ID and no gene name. Both of these means there are generally less gene names than gene IDs in a GTF/annotation file.
Here's an example from the ensembl v110 human GTF file.
curl -O 'https://ftp.ensembl.org/pub/release-110/gtf/homo_sapiens/Homo_sapiens.GRCh38.110.chr.gtf.gz'
# The number of unique gene names.
gzip -dc Homo_sapiens.GRCh38.110.chr.gtf.gz | grep -o 'gene_name \"[^\"]*\"' | sort -u | wc -l
40871
# The number of unique gene IDs.
gzip -dc Homo_sapiens.GRCh38.110.chr.gtf.gz | grep -o 'gene_id \"[^\"]*\"' | sort -u | wc -l
62700
I usually keep them as gene IDs until some visualization or analysis requires a gene name. When this occurs I will either: 1) discard genes without a gene name, or 2) use the gene name unless the gene name is missing or there are more than one gene IDs mapping to the same gene name, in which case I will use the gene ID instead.
what should be the correct way as mapping to gene ID will require to change it to gene name?
I usually keep them as gene IDs until some visualization or analysis requires a gene name. When this occurs I will either: 1) discard genes without a gene name, or 2) use the gene name unless the gene name is missing or there are more than one gene IDs mapping to the same gene name, in which case I will use the gene ID instead.
I usually do
geneID_geneName
. If then for visualization I needgeneName
, it is a simplegsub
regex that trims the "_" and everything before it.