Which kind of gene_biotype should we usually remove?
2
0
10.4 years ago
R ▴ 10

Hi

The RNA-seq data I work with come from ribosomal RNA-depleted libraries, meaning they contain ncRNAs, snRNAs, etc. in addition to mRNAs. To filter the Ensembl GTF file before counting, which gene_biotype values should I remove?

High-abundance RNAs including mt-RNA, rRNA, snRNA, snoRNA, tRNA, histone RNAs ...?

Pseudogenes?

rna-seq • 6.5k views
ADD COMMENT
3
10.4 years ago
pld 5.1k

I'd filter on length rather than biotype: you can't really detect things smaller than your read size. By definition this will (usually) exclude things like miRNAs, tRNAs, snoRNAs, etc. Other than that, you might not want to include pseudogenes.
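A minimal Python sketch of a length filter on a GTF, for illustration only. It assumes a standard 9-column GTF with a `gene_id` attribute, and it measures each gene by its genomic span (smallest start to largest end across its features), which includes introns and is only a rough proxy for transcript length. The function names, the example records, and the 42 bp cutoff are hypothetical choices, not something taken from a specific tool.

```python
import re

GENE_ID = re.compile(r'gene_id "([^"]+)"')

def gene_spans(gtf_lines):
    """Return {gene_id: genomic span in bp} from min start / max end per gene."""
    bounds = {}
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # skip malformed records
        match = GENE_ID.search(fields[8])
        if match is None:
            continue
        start, end = int(fields[3]), int(fields[4])
        lo, hi = bounds.get(match.group(1), (start, end))
        bounds[match.group(1)] = (min(lo, start), max(hi, end))
    # GTF coordinates are 1-based and inclusive, hence the +1
    return {gene: hi - lo + 1 for gene, (lo, hi) in bounds.items()}

def filter_by_length(gtf_lines, min_length):
    """Keep header lines and records of genes spanning at least min_length bp."""
    lines = list(gtf_lines)
    keep = {g for g, span in gene_spans(lines).items() if span >= min_length}
    kept = []
    for line in lines:
        if line.startswith("#"):
            kept.append(line)
            continue
        match = GENE_ID.search(line)
        if match and match.group(1) in keep:
            kept.append(line)
    return kept

# Made-up records: one 31 bp gene and one gene with two exons
example = [
    'chr1\tensembl\texon\t100\t130\t.\t+\t.\tgene_id "G_short";\n',
    'chr1\tensembl\texon\t1000\t1500\t.\t+\t.\tgene_id "G_long";\n',
    'chr1\tensembl\texon\t2000\t2400\t.\t+\t.\tgene_id "G_long";\n',
]

if __name__ == "__main__":
    for record in filter_by_length(example, min_length=42):
        print(record, end="")
```

With a real annotation you would read the lines from the GTF file and write the kept records to a new file before counting. Summing exon lengths per transcript would be a stricter definition of length than the genomic span used here.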

ADD COMMENT
0

Thanks. Which length do you usually use as a cutoff, or which criterion is it based on? If I have 50 bp reads, should I use that as the cutoff?

ADD REPLY
0

How long are your reads?

ADD REPLY
0

Single-end, all reads between 39 and 42 bp.

And how about histone RNAs?

ADD REPLY
0

If you can capture them with your sequencing, why not detect them? I realize the analysis can become more complex if you expand beyond the typical pool of mRNA. However, the goal of RNA-Seq is to characterize the approximate fold change of the RNA species present in your biological source (cells, tissue, etc.).

I think narrowing the classes of RNA you are considering a priori is bad science. If you throw out a class of RNA you are effectively saying that the class has no biological role in what you are studying. There's no good reason that I can see for filtering biotypes for anything other than size.

ADD REPLY
1
10.4 years ago
komal.rathi ★ 4.1k

In the Ensembl gtf file, there are many types of genes:

  • 3prime_overlapping_ncrna
  • antisense
  • IG_C_gene
  • IG_D_gene
  • IG_J_gene
  • IG_LV_gene
  • IG_V_gene
  • IG_V_pseudogene
  • lincRNA
  • miRNA
  • misc_RNA
  • Mt_rRNA
  • Mt_tRNA
  • polymorphic_pseudogene
  • processed_transcript
  • protein_coding
  • pseudogene
  • rRNA
  • sense_intronic
  • sense_overlapping
  • snoRNA
  • snRNA
  • TR_V_gene
  • TR_V_pseudogene

Out of these, we usually keep protein_coding & lincRNA because we are interested in identifying differentially expressed and novel lincRNAs. Once we also kept pseudogene, antisense & miRNA because our aim was to identify whether such genes are differentially expressed or not, and if that's the case then find whether they are near any of the differentially expressed protein-coding genes (to correlate whether a pseudogene, antisense or miRNA is regulating a protein-coding gene). So depending on what your aim is, you may filter out different gene types. We usually apply a secondary filter depending on the "expected" length of the gene (filtering out lincRNAs that are <200 bp long and so on).
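As a rough sketch of the biotype filter described above, the Python snippet below keeps only records whose `gene_biotype` attribute is in a chosen set. It assumes a 9-column GTF with a `gene_biotype "..."` attribute in the last column; the function name, the `KEEP_BIOTYPES` set, and the example records are hypothetical, and the set should be adjusted to your own aim.

```python
import re

# Biotypes to keep; protein_coding and lincRNA follow the text above
KEEP_BIOTYPES = {"protein_coding", "lincRNA"}

BIOTYPE = re.compile(r'gene_biotype "([^"]+)"')

def filter_by_biotype(gtf_lines, keep=KEEP_BIOTYPES):
    """Yield header lines plus records whose gene_biotype is in `keep`."""
    for line in gtf_lines:
        if line.startswith("#"):
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # skip malformed records
        match = BIOTYPE.search(fields[8])
        if match and match.group(1) in keep:
            yield line

# Made-up records for illustration
example = [
    'chr1\tensembl\tgene\t100\t900\t.\t+\t.\tgene_id "G1"; gene_biotype "protein_coding";\n',
    'chr1\tensembl\tgene\t2000\t2100\t.\t+\t.\tgene_id "G2"; gene_biotype "snoRNA";\n',
    'chr1\tensembl\tgene\t5000\t7000\t.\t-\t.\tgene_id "G3"; gene_biotype "lincRNA";\n',
]

if __name__ == "__main__":
    for record in filter_by_biotype(example):
        print(record, end="")
```

Passing a different `keep` set (for example, adding `pseudogene`, `antisense`, and `miRNA`) reproduces the broader selection mentioned above.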

ADD COMMENT
0

Thank you very much. Very helpful.

ADD REPLY
0

And how about histone RNAs?

ADD REPLY
0

Does your Ensembl GTF have a value like that in the gene_biotype field? I have never come across it (or may have missed it).

ADD REPLY
0

No, not in the GTF file. I meant removing the RNAs that come from histone genes. My final list of differentially expressed genes contains a lot of Hist3.., Hist4..., Hist1..., ....

ADD REPLY
0

It depends on what you are trying to achieve. What's your final goal?

ADD REPLY
0

I did not expect so many of them to be differentially expressed, so I thought maybe my normalization was not correct. That seems to be the case: the RPKM calculation shows no change, but DESeq shows a 100-fold change!
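For reference, RPKM scales a gene's read count by gene length and by the total number of mapped reads in the library. The Python sketch below uses made-up numbers to show how a gene whose raw count doubles can still have an unchanged RPKM when the library is twice as deep. DESeq normalizes with per-sample size factors rather than total mapped reads, so the two approaches can disagree; this example does not establish which one is right for the data discussed here.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads Per Kilobase of transcript per Million mapped reads."""
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)

# Hypothetical numbers: same gene, two libraries of different depth
sample_a = rpkm(read_count=100, gene_length_bp=500, total_mapped_reads=10_000_000)
sample_b = rpkm(read_count=200, gene_length_bp=500, total_mapped_reads=20_000_000)

if __name__ == "__main__":
    print(sample_a, sample_b)  # both are 20.0 RPKM despite a 2x raw count
```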

ADD REPLY
0

Depending on the cells you have and what you're studying, it might make sense that there is differential expression of histones. As always, qPCR is a great way to double-check.

ADD REPLY
0

Thanks, I will check them by qPCR.

ADD REPLY
