Question

Filtering out genes in an RNA-seq experiment

1

10.1 years ago

ashkan ▴ 160

Hi Guys

I have a set of RNA-seq data. So far I have calculated the raw read counts for each gene in each sample, so I have a matrix in which the columns are samples and the rows are genes. Now I want to filter out some of the genes to reduce the false positive rate. Could you please let me know how I should do the filtering?

I have tried counts per million (CPM), calculated for every gene in every sample, but I don't know how to determine the best cutoff value. For example, can I say that a gene must be removed if its read count is 2 or less in at least 10 samples?
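To make the kind of rule being asked about concrete, here is a minimal Python sketch of a CPM-based filter. It uses the commonly seen form "keep a gene if its CPM exceeds a cutoff in at least N samples". The function names, the toy counts, and the values `min_cpm=1.0` and `min_samples=2` are illustrative assumptions, not recommended settings; in R, edgeR's `cpm()` computes the same quantity.

```python
def counts_per_million(counts):
    """Scale each sample's counts by its library size (total counts), times 1e6."""
    n_samples = len(next(iter(counts.values())))
    lib_sizes = [sum(row[j] for row in counts.values()) for j in range(n_samples)]
    return {
        gene: [1e6 * c / lib_sizes[j] for j, c in enumerate(row)]
        for gene, row in counts.items()
    }


def filter_low_expression(counts, min_cpm=1.0, min_samples=2):
    """Keep genes with CPM above min_cpm in at least min_samples samples."""
    cpm = counts_per_million(counts)
    return {
        gene: counts[gene]
        for gene, row in cpm.items()
        if sum(value > min_cpm for value in row) >= min_samples
    }


# Toy matrix (hypothetical data): rows are genes, columns are 4 samples.
# "other_genes" pools the rest of the transcriptome so that every library
# totals exactly 1,000,000 reads, which makes CPM equal to the raw count.
raw_counts = {
    "geneA": [120, 95, 130, 110],
    "geneB": [0, 1, 0, 2],
    "geneC": [5, 0, 0, 1],
    "other_genes": [999875, 999904, 999870, 999887],
}

filtered = filter_low_expression(raw_counts, min_cpm=1.0, min_samples=2)
print(sorted(filtered))
```

With these toy numbers, geneB and geneC are dropped because each exceeds 1 CPM in only one sample. A common choice for `min_samples` is the size of the smallest experimental group, so that a gene expressed in only one condition is not thrown away.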

Thanks,
Behzad

RNA-Seq • 5.6k views

updated 2.6 years ago by Ram 45k • written 10.1 years ago by ashkan ▴ 160

0

Filtering is generally performed on the adjusted p-values and fold-changes. Have you used edgeR/DESeq2/etc. to calculate that yet?

updated 2.6 years ago by Ram 45k • written 10.1 years ago by Devon Ryan 105k

0

@Devon: I have not done the DE analysis yet. Before that, I want to remove the genes that are not expressed. As you know, even genes that are not expressed pick up a few reads, so I want to filter these genes out.

updated 2.6 years ago by Ram 45k • written 10.1 years ago by ashkan ▴ 160

2

Just do independent filtering after the fact (if you use DESeq2, this is automatic).
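As a rough illustration of why filtering after the fact helps, here is a Python sketch of independent filtering: genes are dropped on a criterion that does not depend on the test result (here, mean count), and the Benjamini-Hochberg adjustment is applied only to the genes that remain. The gene names, mean counts, p-values, and the fixed cutoff `min_mean=10` are all made-up assumptions; DESeq2 chooses its own threshold from the data rather than using a fixed one.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def count_significant(genes, alpha=0.05):
    """Number of genes with BH-adjusted p-value below alpha."""
    padj = bh_adjust([p for _, p in genes.values()])
    return sum(q < alpha for q in padj)


# Hypothetical results: gene -> (mean normalized count, raw p-value).
results = {
    "g1": (250.0, 0.001),
    "g2": (180.0, 0.012),
    "g3": (90.0, 0.020),
    "g4": (1.5, 0.60),
    "g5": (0.8, 0.85),
    "g6": (0.3, 0.45),
    "g7": (2.0, 0.95),
    "g8": (0.5, 0.70),
}

min_mean = 10.0  # assumed cutoff, for illustration only
retained = {g: v for g, v in results.items() if v[0] >= min_mean}

n_sig_unfiltered = count_significant(results)
n_sig_filtered = count_significant(retained)
print(n_sig_unfiltered, n_sig_filtered)
```

In this toy case, removing the five low-count genes shrinks the multiple-testing burden from 8 tests to 3, and g3 moves from just above to below the 0.05 threshold. The filter uses only the mean count and never the p-values, which is what keeps the procedure valid.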

updated 2.6 years ago by Ram 45k • written 10.1 years ago by Devon Ryan 105k

Ram · Answer 1 · 2015-06-23

You can use the R function varFilter() from the genefilter package to remove genes that show little variation across samples. With its default settings it drops the half of the genes with the lowest IQR (var.cutoff = 0.5), which is why it usually cuts my list by half; this typically takes out the non-expressed genes, along with any expressed genes that barely vary. If you are using a package like DESeq2, I think it does comparable filtering of low-count genes for you, so there is no need to run varFilter() beforehand. DESeq2 can also shrink the estimated fold changes of genes with low read counts, since low counts tend to inflate fold-change estimates, so you shouldn't have to worry about low counts when using DESeq2.
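A minimal Python sketch of a variance filter of this kind is shown below: genes are ranked by interquartile range and the lowest-ranked fraction is dropped. It is written in the spirit of varFilter() and is not a reimplementation of it; the function names, `drop_fraction`, and the toy expression values are assumptions made for illustration.

```python
import statistics


def iqr(values):
    """Interquartile range of one gene's expression values across samples."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return q3 - q1


def variance_filter(expr, drop_fraction=0.5):
    """Drop the given fraction of genes with the smallest IQR."""
    ranked = sorted(expr, key=lambda gene: iqr(expr[gene]))
    n_drop = int(len(ranked) * drop_fraction)
    kept = set(ranked[n_drop:])
    return {gene: values for gene, values in expr.items() if gene in kept}


# Hypothetical log-scale expression values: rows are genes, 6 samples each.
expression = {
    "flat_high": [5.0, 5.1, 5.0, 4.9, 5.0, 5.1],  # expressed, barely varies
    "flat_zero": [0.0, 0.0, 0.1, 0.0, 0.0, 0.0],  # essentially not expressed
    "var1": [2.0, 2.5, 2.2, 6.0, 6.3, 5.9],
    "var2": [8.0, 8.2, 7.9, 3.0, 3.1, 2.8],
}

kept_genes = variance_filter(expression, drop_fraction=0.5)
print(sorted(kept_genes))
```

Note that the filter removes `flat_high` as well as `flat_zero`: it selects on variability, not on expression level, so steadily expressed genes are dropped together with the non-expressed ones.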