To put it simply, I am doing differential expression analysis on human RNA-seq data and I want to focus my analysis of genes that are:
1) Protein coding, so no SNOR or MIR
2) Genes with a given name (I see genes like AC006064.5 and I am not sure what they are)
I am hoping that by refocusing my analysis on a smaller set of genes I would get more hits as FDR seems to be impacted by sample size (amount of genes tested).
Any advice on sub-sampling a .gtf file to protein coding genes only?
The GTF file should typically contain annotation information on the gene type. Select for example only protein_coding. That is how it is called in the GENCODE files, not sure what the exact term would be in RefSeq GTF. I typically remove a couple of RNA species before doing DEG, which is all smallRNAs (micro, sno, sn..., because they are not well-captured in standard RNA-seq and therefore not reliable) and TECs (to be experimentally confirmed) genes. What is left is (at least imho) the meaningful part of a standard RNA-seq experiment which then goes into dispersion estimation and model fitting. What you could probably do is to remove further gene types after the main DEG but prior to multiple testing correction if you have the feeling that you lose too many significant genes being eliminated by FDR. The FDR is definitely affected by the number of tested genes, and reducing the number of tests might make sense. From a "strategical" standpoint you could probably remove everything from which you a priori know that you will not benefit from finding these genes significant. That would be genes such as non-protein coding transcripts of (unknown) function. Finding these as significant might potentially harbour an interesting biological finding but if it is not the focus of your study or you are not willing to chase something like lincRNAs one could probably filter them out to reduce multiple testing burden. I am by no means a statistician so feel free to comment.
Makes a lot of sense, thanks. So I just did "cat Homo_sapiens.GRCh38.95.gtf | grep "gene_biotype \"protein_coding\"" > Homo_sapiens.GRCh38.95.protein_coding.gtf", is that ok in terms of creating a new gtf file?
Yeah, I guess that would do it. Basically reducing genes to protein-coding is what many microarray platforms do from what I understand, so I guess there is not really an argument against it.
Makes a lot of sense, thanks. So I just did "
cat Homo_sapiens.GRCh38.95.gtf | grep "gene_biotype \"protein_coding\"" > Homo_sapiens.GRCh38.95.protein_coding.gtf
", is that ok in terms of creating a new gtf file?Yeah, I guess that would do it. Basically reducing genes to protein-coding is what many microarray platforms do from what I understand, so I guess there is not really an argument against it.