In this paper:
https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-015-1308-8
"A comprehensive evaluation of ensembl, RefSeq, and UCSC annotations in the context of RNA-seq read mapping and gene quantification"
the authors compare different annotations effect on RNA-seq analysis, and they "[...] demonstrated that the choice of a gene model has an effect on the quantification results. [...] When choosing an annotation database, researchers should keep in mind that no database is perfect and some gene annotations might be inaccurate or entirely wrong. Wu et al. suggested that when conducting research that emphasizes reproducible and robust gene expression estimates, a less complex genome annotation, such as RefGene, might be preferred. When conducting more exploratory research, a more complex genome annotation, such as Ensembl, should be chosen. Based upon our experience of RNA-Seq data analysis, we recommend using RefGene annotation if RNA-Seq is used as a replacement for a microarray in transcriptome profiling."
This let me think about filter the annotation. I am interested in a transcriptome profiling, so I am really not interested in a gene differential expressed with unknown function (for example). For consistency I would like to continue to use ensembl, in mouse annotation gtf file there are 55k genes, I wrote a small script to filter the gtf file only for genes annotated as havana_ensembl, this reduced the n of genes to 22K. Then I've done the RNAseq analysis with featurecounts/Deseq2 using the full_annontation and the filtered_annotation. The pvalue plots seems better using the filtered annotation (here I loaded just one example, left full annotation, right filtered annotation):
Do you think this approach is valid? I checked also on the biostar handbook but I didn't find anything about the annotation size and filtering.
Do you know a package to filter gtf file? I didn't find anything and I wrote mine, but I would feel more confident to use something already well tested.
Thanks.
I suggest you fit the normalisation in DESeq2 before you filter the genes as the normalisation will be more accurate if all the genes are there. After that, if you want to filter things you don't think are real, go ahead. Personally I would filter on TSL (transcript support level), rather than on havana or not.
A friend of mine has a post on this as regards quantification with Kalisto.
That's interesting! For what I understood there are two big differences (please tell me if I am wrong):
1) There Tom used as "unfiltered" only protein_coding transcript (what I call "protein_coding filtered"), while for me unfiltered is everything. Actually this is what I would like to use: the protein_coding filtered genes. Here I generated some venn diagram about it:
So what Tom calls "filtered" is the intersection between protein_coding and TSL1-3.
2) (less relevant) kallisto works with transcripts, here I set up featurecounts to count exon of gene_id, so doesn't matter how many transcripts belong to a gene (as long as they use the same exons)
Thanks, I'll try it when I have a moment...
I would avoid filtering genes prior to running
DESeq
. The gene-wise dispersion estimates might benefit from having more rather than fewer genes to take information from given the filtered genes are not all low-count genes. What you can (probably) do is to remove those genes from thedds
object after runningDESeq
but prior to runningresults()
as this decreases the multiple testing burden, which is probably why you get different p-values (I think/hope these are the FDR-adjusted p-values?).So, this is using Galaxy/Deseq2. But I can try what you suggested with R/Deseq2
Mmm...no, they are p-values, if I understood this plot properly make no sense do it on FDR-adjusted p-values (but I might be wrong)
HAVANA genes are manually curated so it is not surprising that they are reliably detected in your data.
Are you filtering annotations before counting or are you filtering count matrix? If latter is it before it goes into DESeq2 or after the analysis?
I am filtering before counting. So I gave to FeatureCounts the filtered gtf (the filtered gtf for havana_ensembl genes is almost equivalent to only protein coding genes). As you can see from the summary plots if I am using data from tot RNA, I lose some reads as "nofeature" (but again, I don't think I care because they belong to not well characterize genes, so they are not informative to my project). If I use data from polyA RNA there is no much difference (bit less "ambiguity", bit more "nofeature")
sito hosting immagini
Are these two different sets of data?
Yes, downloaded from GEO, I wanted just check the different behaviour with different kind of dataset...
EDIT: sorry, I just realized there is a mistake in the first 2 plots, I uploaded it again (it doesn't change too much)