The -t CDS
will direct htseq-count
to only take the rows with "CDS" in the third column (this is my assumption, which is true for featureCounts
, and IMHO has a high probability of being true for htseq-counts
, too).
To check which other entries are possible, you could use the following command:
$ cut -f3 gencode.vM16.annotation.gtf | sort | uniq
CDS
exon
gene
Selenocysteine
start_codon
stop_codon
transcript
UTR
In my example file, you probably would want to go with "gene", so -t gene
should do the trick.
Removing pseudogenes is a bit more annoying, that information tends to be stuck in the last column of a GTF file, in my file this looks like this:
$ egrep pseudogene gencode.vM16.annotation.gtf | head -n 1
chr10 HAVANA transcript 129376961 129377401 . + . gene_id "ENSMUSG00000107790.1"; transcript_id "ENSMUST00000205105.1"; gene_type "unprocessed_pseudogene"; gene_name "Olfr783-ps1"; transcript_type "unprocessed_pseudogene"; transcript_name "Olfr783-ps1-201"; level 2; transcript_support_level "NA"; ont "PGO:0000005"; tag "basic"; havana_gene "OTTMUSG00000056564.1"; havana_transcript "OTTMUST00000139699.1"
$ cut -f9 gencode.vM16.annotation.gtf | sed 's/.*gene_type//' | sed 's/;.*//' | sed 's/.\"\(.*\)\"/\1/' | sort | uniq
3prime_overlapping_ncRNA
antisense_RNA
bidirectional_promoter_lncRNA
IG_C_gene
IG_C_pseudogene
IG_D_gene
IG_D_pseudogene
IG_J_gene
IG_LV_gene
IG_pseudogene
IG_V_gene
IG_V_pseudogene
lincRNA
macro_lncRNA
miRNA
misc_RNA
Mt_rRNA
Mt_tRNA
polymorphic_pseudogene
processed_pseudogene
processed_transcript
protein_coding
pseudogene
ribozyme
rRNA
scaRNA
scRNA
sense_intronic
sense_overlapping
snoRNA
snRNA
sRNA
TEC
transcribed_processed_pseudogene
transcribed_unitary_pseudogene
transcribed_unprocessed_pseudogene
TR_C_gene
TR_D_gene
TR_J_gene
TR_J_pseudogene
TR_V_gene
TR_V_pseudogene
unitary_pseudogene
unprocessed_pseudogene
You could then use egrep -v
to extract only those rows of the gtf that do not match the gene types you don't like, or, alternatively, egrep 'gene_type "protein_coding"' gencode.vM16.annotation.gtf
might give you what you want.