Question

HT-Seq and count matrix

0

Entering edit mode

13 months ago

qudrat.nii ▴ 40

Hello,

I am trying to count reads using HT-Seq. In my gtf file, there are 60603 genes when I use the gene_id option I get all the genes in the count file with reads, but when I use the gene_name option I get only 59055 genes reads in the count file. Please look into this. Thank you

HT-Seq BAM • 798 views

ADD COMMENT • link updated 13 months ago by rpolicastro 13k • written 13 months ago by qudrat.nii ▴ 40

score 2 · Answer 1 · 2023-11-02

2

Entering edit mode

13 months ago

rpolicastro 13k

Generally a gene ID will uniquely map to only one gene name, but a gene name can map to multiple gene IDs. You can also have genes that have only a gene ID and no gene name. Both of these means there are generally less gene names than gene IDs in a GTF/annotation file.

Here's an example from the ensembl v110 human GTF file.

curl -O 'https://ftp.ensembl.org/pub/release-110/gtf/homo_sapiens/Homo_sapiens.GRCh38.110.chr.gtf.gz'

# The number of unique gene names.
gzip -dc Homo_sapiens.GRCh38.110.chr.gtf.gz | grep -o 'gene_name \"[^\"]*\"' | sort -u | wc -l
40871

# The number of unique gene IDs.
gzip -dc Homo_sapiens.GRCh38.110.chr.gtf.gz | grep -o 'gene_id \"[^\"]*\"' | sort -u | wc -l
62700

ADD COMMENT • link 13 months ago by rpolicastro 13k

0

Entering edit mode

what should be the correct way as mapping to gene ID will require to change it to gene name?

ADD REPLY • link 13 months ago by qudrat.nii ▴ 40

1

Entering edit mode

I usually keep them as gene IDs until some visualization or analysis requires a gene name. When this occurs I will either: 1) discard genes without a gene name, or 2) use the gene name unless the gene name is missing or there are more than one gene IDs mapping to the same gene name, in which case I will use the gene ID instead.

ADD REPLY • link 13 months ago by rpolicastro 13k

1

Entering edit mode

I usually do geneID_geneName. If then for visualization I need geneName, it is a simple gsub regex that trims the "_" and everything before it.

ADD REPLY • link 13 months ago by ATpoint 86k