Hello everyone!
I'm trying to apply the following filtering to the Human gencode annotation
library(rtracklayer)
library(GenomicFeatures)
path_to_gtf = "gencode.v41lift37.annotation.gtf"
gtf = rtracklayer::import(path_to_gtf)
gtf_filtered = gtf[which(
gtf$type %in% c('gene', 'transcript','exon','CDS','UTR') &
gtf$level %in% c('1','2') &
gtf$gene_status=="KNOWN" &
gtf$transcript_status=="KNOWN" &
gtf$transcript_type=="protein_coding" &
!(gtf$tag %in% c('cds_start_NF','cds_end_NF','mRNA_start_NF','mRNA_end_NF'))
)]
length(gtf_filtered)
So I want only 'gene', 'transcript','exon','CDS','UTR' records, where genes and transcripts are known, and transcripts code for proteins, with full ORFs
In the beginning I had 3380508 records, 252785 of which were transcripts After filtering I have only 4201 records, 242 are transcripts I was expecting 10-20 times more of transcripts
When I checked gene_type column I found out that 3335463 records have NA as gene_type KNOWN NOVEL PUTATIVE <NA> 30963 13715 367 3335463
Do you have any ideas why I have so little transcripts left after filtering?
If you are willing to try alternate software then try https://github.com/NBISweden/AGAT