how to select genes after obtaining read counts
0
0
Entering edit mode
7.0 years ago
statfa ▴ 790

Hi.

I downloaded the fastq files from ebi.ac.uk Then I aligned them to hg38 on galaxy. And then I downloaded the latest version of gtf (release version 27) from http://www.gencodegenes.org/stats/archive.html#a27 to obtain read counts. As you see, the gtf file has 58288 genes of which 19836 genes are protein coding.

The problem is that when you obtain the count matrix from your bam files using this gtf, you will be given a matrix with 58288 rows displaying the total number of genes in the gtf file. And thousands of these genes are counted zero for all samples. How should I filter my count data to a reasonable number of genes? 58288 genes is way too much and weird. Especially, when many of them are zero.

gtf RNA-Seq read counts genes • 2.7k views
ADD COMMENT
0
Entering edit mode

58288 genes is way too much and weird.

In what way? There is only 19836 protein coding genes out of that lot.

Also note that some of those genes have zero counts because they may be covered by multi-mapped reads, which are not counted by default by both fearureCounts/htseq-count.

ADD REPLY
0
Entering edit mode

Hey genomax. How are you? I want to estimate the parameters of NB distribution from a real count data. And I think that I should filter some of those genes because they may not represent a real datset and may negatively affect the NB distribution parameters.

And yes. 19836 genes are protein coding. Should I filter all of those 58288 genes that have zero counts?

ADD REPLY
0
Entering edit mode

Yes, you should filter out zero counts, and also low-expressed genes which might be observed due to noise. You can calculate FPKM or TPM Click here for simple explanation for each gene and plot the distribution and decide an arbitrary cut-off. If the distribution follows a normal distribution, you can use standard deviation approach to filter, if not use some non-parametric methods like lower-quartile.

ADD REPLY
0
Entering edit mode

Thank you very much arta. Yeah I think I will prefilter zero count transcripts first for simulation purpose. Then I will filter low expressed ones for analysis.

ADD REPLY
0
Entering edit mode

You should pre-filter based on the biotypes rather than the zero counts. As we have discussed in your other threads zero counts for some genes is a real world result. Best course here would be to use the protein coding genes for the simulation.

ADD REPLY
0
Entering edit mode

Are you planning to use DESeq2/edgeR for the DE analysis? You could use only the protein coding genes if that is all you are interested in.

ADD REPLY
0
Entering edit mode

Yes, I have done that in the past. In fact, one thing that I did was analyse different biotypes separately, such as protein coding, antisense RNA, linc-RNA, etc. One other thing I did was analyse all transcripts annotated in RefSeq (~22,000, covering various transcripts).

I think that it's also quite standard to remove transcripts that have low mean raw counts, such as those with mean < 5 or < 10.

ADD REPLY
0
Entering edit mode

Aha, Thank you so much Kevin. I wanted to use the count table for simulation purpose.

ADD REPLY
0
Entering edit mode

No, I was going to simulate data using that count table.

ADD REPLY

Login before adding your answer.

Traffic: 2082 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6