Filter out data for coding genes

6.3 years ago

shivangi.agarwal800 ▴ 120

Hello all

I have obtained RNA-seq data from TCGA. The data covers about 56,000 genes (gene symbols and Ensembl IDs), including pseudogenes and uncharacterized, non-coding, and coding genes. I am looking for an efficient way to filter the data down to only the coding genes. Can anyone suggest one?

Regards, Shivangi Agarwal

coding RNA-Seq • 2.2k views

In what format do you have the data?

ADD REPLY • link 6.3 years ago by ATpoint 85k

I have per-patient data: one FPKM.txt file per patient, containing an expression value (FPKM) for each gene.

ADD REPLY • link 6.3 years ago by shivangi.agarwal800 ▴ 120

Then I would get a human GTF file for the correct assembly, e.g. from GENCODE (hg19 or hg38; I don't know which one TCGA used), and filter for genes that are coding, i.e. those with CDS features in the GTF. From this, extract the gene names (or IDs) and use them to subset your data. All of this can be done with Unix tools such as awk.
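The steps above can be sketched in Python instead of awk; this is only an illustration under stated assumptions, not a tested pipeline. It assumes a GENCODE-style GTF (tab-separated, feature type in column 3, a `gene_id "..."` attribute in column 9) and FPKM files that are tab-separated with the Ensembl gene ID in the first column. The function names `coding_gene_ids` and `filter_fpkm` are hypothetical helpers introduced here.

```python
import re

# Matches the gene_id attribute in the 9th GTF column, e.g. gene_id "ENSG00000141510.17";
GENE_ID_RE = re.compile(r'gene_id "([^"]+)"')


def coding_gene_ids(gtf_lines):
    """Return the set of gene IDs that have at least one CDS feature.

    Version suffixes (".17") are stripped so IDs can be compared across
    annotation releases. Assumes GENCODE-style GTF lines.
    """
    ids = set()
    for line in gtf_lines:
        if line.startswith("#"):  # skip header/comment lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "CDS":
            continue
        match = GENE_ID_RE.search(fields[8])
        if match:
            ids.add(match.group(1).split(".")[0])
    return ids


def filter_fpkm(fpkm_lines, keep_ids):
    """Yield only the FPKM rows whose first column is a gene ID in keep_ids.

    Assumes tab-separated rows with the Ensembl gene ID first; rows that do
    not match (including any header line) are dropped.
    """
    for line in fpkm_lines:
        gene = line.split("\t", 1)[0].strip().split(".")[0]
        if gene in keep_ids:
            yield line
```

To use it, open the GTF and pass the file handle to `coding_gene_ids`, then pass each patient's FPKM file handle to `filter_fpkm` and write out the rows it yields. If the annotation is from GENCODE, filtering gene records on the `gene_type "protein_coding"` attribute is an alternative to looking for CDS features; the two criteria may not select exactly the same genes.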

ADD REPLY • link 6.3 years ago by ATpoint 85k
