Question

differential RNA seq analysis

0

Entering edit mode

7.0 years ago

chrisclarkson100 ▴ 160

I am having trouble selecting a proper GTF file to perform differential RNA-seq analysis (in mm9), using annotation for protein coding genes (or at least all genes, but not pseudogenes). Importantly, I want at the end the quantification of gene expression per gene (not per transcript). Example command:

htseq-count -f bam -t CDS ${name}_mapped_reads/accepted_hits.bam Mus_musculus/Ensembl/GRCm38/Annotation/Archives/archive-2014-05-23-16-04-56/Genes/genes.gtf

I have tried using two different files, both downloaded from the following link: http://cole-trapnell-lab.github.io/cufflinks/igenome_table. First I downloaded the ensembl file and then the UCSC one but, having inspected the contents- neither satisfies the above criteria.

Does anyone know of a file/ how to filter in order to have a GTF file that will fulfil the above described criteria?

rna-seq • 1.8k views

ADD COMMENT • link updated 7.0 years ago by Friederike 9.0k • written 7.0 years ago by chrisclarkson100 ▴ 160

score 0 · Answer 1 · 2018-05-15

The -t CDS will direct htseq-count to only take the rows with "CDS" in the third column (this is my assumption, which is true for featureCounts, and IMHO has a high probability of being true for htseq-counts, too). To check which other entries are possible, you could use the following command:

$ cut -f3 gencode.vM16.annotation.gtf | sort | uniq
CDS
##contact: gencode-help@sanger.ac.uk
##date: 2017-12-01
##description: evidence-based annotation of the mouse genome (GRCm38), version M16 (Ensembl 91)
exon
##format: gtf
gene
##provider: GENCODE
Selenocysteine
start_codon
stop_codon
transcript
UTR

In my example file, you probably would want to go with "gene", so -t gene should do the trick.

Removing pseudogenes is a bit more annoying, that information tends to be stuck in the last column of a GTF file, in my file this looks like this:

$ egrep pseudogene gencode.vM16.annotation.gtf | head -n 1
chr10   HAVANA  transcript  129376961   129377401   .   +   .   gene_id "ENSMUSG00000107790.1"; transcript_id "ENSMUST00000205105.1"; gene_type "unprocessed_pseudogene"; gene_name "Olfr783-ps1"; transcript_type "unprocessed_pseudogene"; transcript_name "Olfr783-ps1-201"; level 2; transcript_support_level "NA"; ont "PGO:0000005"; tag "basic"; havana_gene "OTTMUSG00000056564.1"; havana_transcript "OTTMUST00000139699.1"

# find out your options for gene type
$ cut -f9 gencode.vM16.annotation.gtf |  sed 's/.*gene_type//' | sed 's/;.*//' | sed 's/.\"\(.*\)\"/\1/' | sort | uniq
3prime_overlapping_ncRNA
antisense_RNA
bidirectional_promoter_lncRNA
##contact: gencode-help@sanger.ac.uk
##date: 2017-12-01
##description: evidence-based annotation of the mouse genome (GRCm38), version M16 (Ensembl 91)
##format: gtf
IG_C_gene
IG_C_pseudogene
IG_D_gene
IG_D_pseudogene
IG_J_gene
IG_LV_gene
IG_pseudogene
IG_V_gene
IG_V_pseudogene
lincRNA
macro_lncRNA
miRNA
misc_RNA
Mt_rRNA
Mt_tRNA
polymorphic_pseudogene
processed_pseudogene
processed_transcript
protein_coding
##provider: GENCODE
pseudogene
ribozyme
rRNA
scaRNA
scRNA
sense_intronic
sense_overlapping
snoRNA
snRNA
sRNA
TEC
transcribed_processed_pseudogene
transcribed_unitary_pseudogene
transcribed_unprocessed_pseudogene
TR_C_gene
TR_D_gene
TR_J_gene
TR_J_pseudogene
TR_V_gene
TR_V_pseudogene
unitary_pseudogene
unprocessed_pseudogene

You could then use egrep -v to extract only those rows of the gtf that do not match the gene types you don't like, or, alternatively, egrep 'gene_type "protein_coding"' gencode.vM16.annotation.gtf might give you what you want.