Question

GENCODE annotation filtering problems

2.2 years ago

Sarah • 0

Hello everyone!

I'm trying to apply the following filtering to the human GENCODE annotation:

library(rtracklayer)
library(GenomicFeatures)

path_to_gtf = "gencode.v41lift37.annotation.gtf"
gtf = rtracklayer::import(path_to_gtf)

gtf_filtered = gtf[which(
  gtf$type %in% c('gene', 'transcript','exon','CDS','UTR') &
  gtf$level %in% c('1','2') &
  gtf$gene_status=="KNOWN" &
  gtf$transcript_status=="KNOWN" &
  gtf$transcript_type=="protein_coding" &
  !(gtf$tag %in% c('cds_start_NF','cds_end_NF','mRNA_start_NF','mRNA_end_NF'))
                           )]
length(gtf_filtered)

So I want only 'gene', 'transcript', 'exon', 'CDS' and 'UTR' records where the genes and transcripts are known and the transcripts are protein-coding with full ORFs.

In the beginning I had 3380508 records, 252785 of which were transcripts. After filtering I have only 4201 records, of which 242 are transcripts. I was expecting 10-20 times more transcripts.

When I checked the gene_status column, I found that 3335463 records have NA as the value:

   KNOWN    NOVEL PUTATIVE     <NA>
   30963    13715      367  3335463

Do you have any ideas why I have so few transcripts left after filtering?

Gencode R filtering • 678 views

ADD COMMENT • link updated 2.2 years ago by Istvan Albert 102k • written 2.2 years ago by Sarah • 0

If you are willing to try alternative software, have a look at AGAT: https://github.com/NBISweden/AGAT

ADD REPLY • link 2.2 years ago by GenoMax 147k

score 0 · Answer 1 · 2022-09-20

2.2 years ago

Istvan Albert 102k

It seems your filtering is too stringent, especially the conditions on tags like cds_start_NF. To be honest, I have no idea what that stands for or why you filter on it.

You also join all conditions with the AND operator (&), so a record is kept only if every single condition is met.
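One way to see which condition removes most records is to evaluate each condition separately before combining them. The sketch below is only an illustration in base R: the `records` data frame is a made-up stand-in for the metadata columns of the imported GTF, and its values are hypothetical. It also shows a behaviour that may matter here, given how many NA values the question reports: in R, a comparison such as `x == "KNOWN"` returns NA when `x` is NA, and `which()` silently drops NA entries, so records with a missing attribute are discarded along with the ones that genuinely fail.

```r
# Toy stand-in for the metadata columns of an imported GTF (hypothetical values).
records <- data.frame(
  type            = c("transcript", "transcript", "exon", "transcript", "gene"),
  level           = c("2", "2", "1", "3", "2"),
  gene_status     = c("KNOWN", NA, NA, "KNOWN", "NOVEL"),
  transcript_type = c("protein_coding", "protein_coding", "protein_coding",
                      "lncRNA", NA),
  stringsAsFactors = FALSE
)

# Evaluate every filter on its own instead of chaining them immediately.
conditions <- list(
  type            = records$type %in% c("gene", "transcript", "exon", "CDS", "UTR"),
  level           = records$level %in% c("1", "2"),
  gene_status     = records$gene_status == "KNOWN",          # NA where status is missing
  transcript_type = records$transcript_type == "protein_coding"
)

# Per condition: records that pass, fail, or cannot be decided because of NA.
summary_table <- t(sapply(conditions, function(x) c(
  pass    = sum(x, na.rm = TRUE),
  fail    = sum(!x, na.rm = TRUE),
  missing = sum(is.na(x))
)))
print(summary_table)

# Combine with AND, as in the question; which() keeps only the TRUE entries,
# so both FALSE and NA rows are removed.
combined <- Reduce(`&`, conditions)
kept <- which(combined)
print(records[kept, ])
```

On the real object, the same idea would apply to the columns of `gtf` (for example `gtf$gene_status`). A condition with a large `missing` count is a likely culprit; whether to treat missing values as passing or failing is then a decision to make explicitly, for instance with `is.na(x) | x == "KNOWN"`.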

ADD COMMENT • link 2.2 years ago by Istvan Albert 102k