6.3 years ago
shivangi.agarwal800
Hello all
I have obtained RNA-seq data from TCGA. The data contain about 56,000 genes (gene symbols and Ensembl IDs), including pseudogenes, uncharacterized genes, non-coding genes and coding genes. I am looking for an efficient way to filter the data down to only the coding genes. Can anyone suggest one?
Regards Shivangi Agarwal
In what format do you have the data?
I have patient-wise data containing an expression value (FPKM) for each gene in each patient (FPKM.txt files).
Then I would get a GTF file for human of the correct assembly (hg19 or hg38; I don't know which one TCGA used), e.g. from GENCODE, and then filter for genes that are coding (those with CDS entries in the GTF). From this, extract the gene names and use these to subset your data. This can all be done with Unix tools like awk.
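For anyone who prefers a script to awk, here is a minimal Python sketch of the same idea: collect the gene names of all CDS rows in the GTF, then keep only the matching rows of the expression table. It assumes a GENCODE-style GTF with a gene_name attribute in column 9, and a tab-separated FPKM file with a header line and the gene symbol in the first column. The function names and the toy data are made up for illustration, and your FPKM.txt layout may differ.

```python
import re

# GENCODE-style attribute, e.g.: gene_name "TP53";
GENE_NAME_RE = re.compile(r'gene_name "([^"]+)"')


def coding_gene_names(gtf_lines):
    """Collect gene_name values from every CDS row of a GTF."""
    names = set()
    for line in gtf_lines:
        if line.startswith("#"):  # skip header/comment lines
            continue
        fields = line.rstrip("\n").split("\t")
        # Column 3 (index 2) is the feature type; column 9 holds the attributes.
        if len(fields) < 9 or fields[2] != "CDS":
            continue
        match = GENE_NAME_RE.search(fields[8])
        if match:
            names.add(match.group(1))
    return names


def filter_expression(lines, coding, gene_col=0):
    """Keep the header line plus the rows whose gene symbol is in `coding`.

    Assumption: tab-separated input with one header line.
    """
    lines = iter(lines)
    kept = [next(lines)]  # header
    for line in lines:
        if line.rstrip("\n").split("\t")[gene_col] in coding:
            kept.append(line)
    return kept


# Toy inputs (invented) standing in for the real GTF and FPKM.txt files.
demo_gtf = [
    "##description: toy example\n",
    'chr1\tHAVANA\tCDS\t100\t200\t.\t+\t0\tgene_id "G1"; gene_name "GENEA";\n',
    'chr1\tHAVANA\texon\t300\t400\t.\t+\t.\tgene_id "G2"; gene_name "LNCB";\n',
]
demo_fpkm = ["gene\tFPKM\n", "GENEA\t12.5\n", "LNCB\t3.1\n"]

coding = coding_gene_names(demo_gtf)
print("".join(filter_expression(demo_fpkm, coding)), end="")
```

With real data you would pass open file handles instead of the toy lists (use gzip.open(path, "rt") if the GTF is gzipped). If your table also carries Ensembl IDs, matching on gene_id instead of gene_name is probably more robust, since gene symbols can differ between annotation releases.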