Gencode annotation filterring problems
1
0
Entering edit mode
2.2 years ago
Sarah • 0

Hello everyone!

I'm trying to apply the following filtering to the Human gencode annotation

library(rtracklayer)
library(GenomicFeatures)

path_to_gtf = "gencode.v41lift37.annotation.gtf"
gtf = rtracklayer::import(path_to_gtf)

gtf_filtered = gtf[which(
  gtf$type %in% c('gene', 'transcript','exon','CDS','UTR') &
  gtf$level %in% c('1','2') &
  gtf$gene_status=="KNOWN" &
  gtf$transcript_status=="KNOWN" &
  gtf$transcript_type=="protein_coding" &
  !(gtf$tag %in% c('cds_start_NF','cds_end_NF','mRNA_start_NF','mRNA_end_NF'))
                           )]
length(gtf_filtered)

So I want only 'gene', 'transcript','exon','CDS','UTR' records, where genes and transcripts are known, and transcripts code for proteins, with full ORFs

In the beginning I had 3380508 records, 252785 of which were transcripts After filtering I have only 4201 records, 242 are transcripts I was expecting 10-20 times more of transcripts

When I checked gene_type column I found out that 3335463 records have NA as gene_type KNOWN NOVEL PUTATIVE <NA> 30963 13715 367 3335463

Do you have any ideas why I have so little transcripts left after filtering?

Gencode R filtering • 678 views
ADD COMMENT
0
Entering edit mode

If you are willing to try alternate software then try https://github.com/NBISweden/AGAT

ADD REPLY
0
Entering edit mode
2.2 years ago

It seems your filtering is too stringent. Especially when you require the presence of those tags like cds_start_NF. To be honest I have no idea what that stands for or why you select for it.

And then, you join all conditions with AND operator, thus all of the conditions have to be met.

ADD COMMENT

Login before adding your answer.

Traffic: 1248 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6