Hi.
I downloaded the fastq files from ebi.ac.uk Then I aligned them to hg38 on galaxy. And then I downloaded the latest version of gtf (release version 27) from http://www.gencodegenes.org/stats/archive.html#a27 to obtain read counts. As you see, the gtf file has 58288 genes of which 19836 genes are protein coding.
The problem is that when you obtain the count matrix from your bam files using this gtf, you will be given a matrix with 58288 rows displaying the total number of genes in the gtf file. And thousands of these genes are counted zero for all samples. How should I filter my count data to a reasonable number of genes? 58288 genes is way too much and weird. Especially, when many of them are zero.
In what way? There is only
19836
protein coding genes out of that lot.Also note that some of those genes have zero counts because they may be covered by multi-mapped reads, which are not counted by default by both fearureCounts/htseq-count.
Hey genomax. How are you? I want to estimate the parameters of NB distribution from a real count data. And I think that I should filter some of those genes because they may not represent a real datset and may negatively affect the NB distribution parameters.
And yes. 19836 genes are protein coding. Should I filter all of those 58288 genes that have zero counts?
Yes, you should filter out zero counts, and also low-expressed genes which might be observed due to noise. You can calculate FPKM or TPM Click here for simple explanation for each gene and plot the distribution and decide an arbitrary cut-off. If the distribution follows a normal distribution, you can use standard deviation approach to filter, if not use some non-parametric methods like lower-quartile.
Thank you very much arta. Yeah I think I will prefilter zero count transcripts first for simulation purpose. Then I will filter low expressed ones for analysis.
You should pre-filter based on the biotypes rather than the zero counts. As we have discussed in your other threads zero counts for some genes is a real world result. Best course here would be to use the protein coding genes for the simulation.
Are you planning to use DESeq2/edgeR for the DE analysis? You could use only the protein coding genes if that is all you are interested in.
Yes, I have done that in the past. In fact, one thing that I did was analyse different biotypes separately, such as protein coding, antisense RNA, linc-RNA, etc. One other thing I did was analyse all transcripts annotated in RefSeq (~22,000, covering various transcripts).
I think that it's also quite standard to remove transcripts that have low mean raw counts, such as those with mean < 5 or < 10.
Aha, Thank you so much Kevin. I wanted to use the count table for simulation purpose.
No, I was going to simulate data using that count table.