Question

List of human protein coding genes with given name (known function?)

5.6 years ago

Adrian Pelin ★ 2.7k

Hello,

To put it simply, I am doing differential expression analysis on human RNA-seq data and I want to focus my analysis on genes that are: 1) protein coding, so no SNOR or MIR; 2) genes with a given name (I see genes like AC006064.5 and I am not sure what they are).

I am hoping that by refocusing my analysis on a smaller set of genes I would get more hits, as the FDR seems to be affected by the number of genes tested.

Any advice on subsetting a .gtf file to protein-coding genes only?

Thanks

RNA-Seq Annotation Human • 3.2k views

updated 3.6 years ago by Shicheng Guo ★ 9.6k • written 5.6 years ago by Adrian Pelin ★ 2.7k

score 2 · Answer 1 · 2019-08-24

The GTF file should typically contain annotation information on the gene type. Select, for example, only protein_coding. That is what it is called in the GENCODE files; I am not sure what the exact term would be in a RefSeq GTF.

I typically remove a couple of RNA species before doing DEG analysis: all small RNAs (micro, sno, sn..., because they are not well captured in standard RNA-seq and therefore not reliable) and TEC (to be experimentally confirmed) genes. What is left is (at least imho) the meaningful part of a standard RNA-seq experiment, which then goes into dispersion estimation and model fitting.

What you could probably do is remove further gene types after the main DEG analysis but prior to multiple testing correction, if you have the feeling that you are losing too many significant genes to the FDR correction. The FDR is definitely affected by the number of tested genes, and reducing the number of tests might make sense. From a strategic standpoint, you could probably remove everything for which you know a priori that you would not benefit from finding it significant. That would be genes such as non-protein-coding transcripts of unknown function. Finding these significant might point to an interesting biological finding, but if that is not the focus of your study, or you are not willing to chase something like lincRNAs, you could probably filter them out to reduce the multiple testing burden. I am by no means a statistician, so feel free to comment.
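The filtering described above can be sketched roughly as follows. This is only an illustration: the four sample records, the file names, and the exact list of excluded gene types are assumptions, and it relies on the GENCODE-style gene_type attribute, so check which biotypes your own annotation actually uses.

```shell
set -eu

# Tiny GENCODE-style sample (hypothetical records) so the sketch runs on its own.
workdir=$(mktemp -d)
gtf="$workdir/sample.gtf"
printf 'chr1\tHAVANA\tgene\t100\t900\t.\t+\t.\tgene_id "G1"; gene_type "protein_coding"; gene_name "GENE_A";\n' > "$gtf"
printf 'chr1\tENSEMBL\tgene\t1000\t1100\t.\t+\t.\tgene_id "G2"; gene_type "miRNA"; gene_name "MIR_B";\n' >> "$gtf"
printf 'chr1\tHAVANA\tgene\t2000\t2900\t.\t-\t.\tgene_id "G3"; gene_type "TEC"; gene_name "GENE_C";\n' >> "$gtf"
printf 'chr1\tHAVANA\tgene\t3000\t3900\t.\t-\t.\tgene_id "G4"; gene_type "lncRNA"; gene_name "GENE_D";\n' >> "$gtf"

# Gene types to drop before the DE analysis (an assumed list; adjust as needed).
exclude='miRNA|snoRNA|snRNA|scaRNA|TEC'

# Keep every line whose gene_type is not in the exclude list.
grep -Ev "gene_type \"($exclude)\";" "$gtf" > "$workdir/filtered.gtf"

# List the gene IDs that survive the filter.
kept=$(awk -F'\t' '$3=="gene"' "$workdir/filtered.gtf" | sed -E 's/.*gene_id "([^"]+)".*/\1/' | paste -sd' ' -)
echo "$kept"
```

Filtering on a deny-list rather than on protein_coding alone keeps lncRNAs and other long transcripts in the analysis; swap in a plain grep for protein_coding if you want the stricter subset.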

score 0 · Answer 2 · 2021-09-02

# Download the Ensembl release 104 human annotation
wget http://ftp.ensembl.org/pub/release-104/gtf/homo_sapiens/Homo_sapiens.GRCh38.104.chr.gtf.gz
# Keep only records annotated as protein_coding
zcat Homo_sapiens.GRCh38.104.chr.gtf.gz | grep 'gene_biotype "protein_coding"' > Homo_sapiens.GRCh38.104.protein_coding.gtf
# Print the gene name of each gene record, skipping genes that have no gene_name attribute
awk '$3=="gene" && $13=="gene_name"{print $14}' Homo_sapiens.GRCh38.104.protein_coding.gtf | tr -d '";' | sort -u > Homo_sapiens.GRCh38.104.protein_coding.txt

Here is the beginning of the output:

A1BG
A1CF
A2M
A2ML1
A3GALT2
A4GALT
A4GNT
AAAS
AACS
AADAC
AADACL2
AADACL3
AADACL4
AADAT
AAGAB
AAK1
AAMDC
AAMP
AANAT
AAR2
AARD
AARS1
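If your annotation still gives unnamed genes placeholder identifiers such as the AC006064.5 mentioned in the question, a pattern filter on the name list can drop them. A minimal sketch, assuming such placeholders look like an accession (one or two capital letters plus digits) followed by a dot and a version number; the sample names and the regular expression are illustrative assumptions, not an official naming rule.

```shell
set -eu

# Hypothetical input: one gene name per line, mixing symbols and placeholder identifiers.
names=$(printf '%s\n' A1BG AC006064.5 AARS1 AL590822.1)

# Heuristic (assumption): placeholders are 1-2 capital letters, 5-8 digits, a dot, and a version.
named=$(printf '%s\n' "$names" | grep -Ev '^[A-Z]{1,2}[0-9]{5,8}\.[0-9]+$')

printf '%s\n' "$named"
```

It is worth checking the names that get removed before trusting the pattern, since other placeholder styles may exist in your annotation.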