Question

Could a "complete" RNA-Seq database feasibly exclude some protein-coding transcripts?

10.5 years ago

Kristin Muench ▴ 640

Thanks for taking the time to answer a newcomer's question.

I am trying to find out whether the genes encoding certain proteins are expressed in some RNA-Seq data I was given. Unfortunately, the data is not raw: it is a list of RPKMs associated with RefSeq genes, and it was the only 'raw' dataset provided on GEO.

Over half of these 37,000 genes have an RPKM of 0; the others show a wide range of expression. However, I cannot find any RefSeq IDs corresponding to the genes that encode my proteins of interest. I wonder if this is my fault, or the database's.

Here's what I did:

1. Take a list of the proteins of interest and find the HUGO genes encoding those proteins > feed the HUGO gene names into BioMart and get a list of associated RefSeq IDs > search my database for those IDs (none!)
2. Input the 37,000 RefSeq IDs I have into DAVID, generate an annotation report, and search for mentions of my proteins of interest (none!)

What other sanity checks should I do before I claim that this RNA-Seq database does not, in fact, contain data for the genes encoding the proteins we're interested in? I don't know whether my proteins are actually expressed in this tissue, but it seems too unlikely that they would be missing from the database entirely for me to jump to that conclusion. I've always thought of RNA-Seq as being 'comprehensive', and that every protein-coding gene would have at least a tiny number of reads. It would be weird if this database listed so many genes with an RPKM of 0, and yet left other genes out completely.

RNA-Seq • 3.1k views

ADD COMMENT • link updated 3.3 years ago by Ram 45k • written 10.5 years ago by Kristin Muench ▴ 640

Why are you putting 37,000 RefSeq IDs into DAVID?

An annotation report on 37,000 genes is not going to be informative.

ADD REPLY • link 10.5 years ago by jeales ▴ 130

I was trying (in a very dodgy way) to find out if there was *anything* in that list of genes that had to do with my proteins of interest, since I simply couldn't believe that none of those 37,000 genes encoded them.

ADD REPLY • link 10.5 years ago by Kristin Muench ▴ 640

Could UniProt provide you with the relevant RefSeq IDs directly?

Also, you're going from protein to gene to transcript ID, so the specific transcript that is translated into your protein may not be coming up (there could be many different isoforms).

Your RPKM file will likely contain RefSeq mRNA ("NM_*") and non-coding RNA ("NR_*") IDs, as well as some predicted RNA IDs ("XM_*", "XR_*"). These are the kinds of IDs you need. Note that "NC_*" accessions are genomic (chromosome) records, not transcripts.
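To see which kinds of accessions the RPKM file actually contains, one could tally them by prefix. The sketch below is only illustrative: the prefix table lists the common RefSeq accession classes, the function names are made up for this example, and the accessions in `example_ids` are placeholders, not IDs from the dataset in question.

```python
from collections import Counter

# Common RefSeq accession prefixes and what they denote.
REFSEQ_PREFIXES = {
    "NM_": "mRNA (curated)",
    "NR_": "non-coding RNA (curated)",
    "XM_": "mRNA (predicted)",
    "XR_": "non-coding RNA (predicted)",
    "NP_": "protein (curated)",
    "XP_": "protein (predicted)",
    "NC_": "genomic (chromosome)",
    "NG_": "genomic (gene region)",
}


def classify_refseq(accession):
    """Return the accession class based on its three-character prefix."""
    prefix = accession.strip()[:3].upper()
    return REFSEQ_PREFIXES.get(prefix, "unknown")


def summarize(accessions):
    """Count how many accessions fall into each class."""
    return Counter(classify_refseq(acc) for acc in accessions)


# Placeholder accessions (hypothetical), plus one gene symbol that is not
# a RefSeq accession at all.
example_ids = ["NM_000001.1", "NR_000002", "XM_000003.2", "NP_000004.1", "SOME_GENE"]

for kind, count in summarize(example_ids).items():
    print(kind, count)
```

If most IDs in the file come back as "unknown", the file is probably keyed by something other than RefSeq accessions (gene symbols, for instance), and searching it for "NM_*" IDs would find nothing regardless of expression.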

ADD REPLY • link 10.5 years ago by jeales ▴ 130

Hmm, okay. Will UniProt accept protein names and return the related transcript IDs? I can't quite figure out which tool lists all the transcripts related to a protein.

ADD REPLY • link 10.5 years ago by Kristin Muench ▴ 640

Ram · Answer 1 · 2015-02-03

I would say it's normal that a lot of genes have 0 counts and therefore RPKM = 0. You're right in saying that RNA-Seq is "comprehensive", but you have to remember that the underlying signal is a digital one: read counts are whole numbers. So if a gene's true expression is very low, it may receive no reads at all at a given sequencing depth, and its RPKM is then exactly 0.
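To make the depth argument concrete, here is a rough sketch using the standard RPKM definition (reads per kilobase of transcript per million mapped reads). The transcript length, library size and "true" RPKM are invented numbers, and treating read counts as Poisson-distributed is a simplifying assumption, not something known about this dataset.

```python
import math


def rpkm(read_count, length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    return read_count * 1e9 / (length_bp * total_mapped_reads)


def expected_reads(true_rpkm, length_bp, total_mapped_reads):
    """Average number of reads a transcript with this RPKM would receive."""
    return true_rpkm * (length_bp / 1e3) * (total_mapped_reads / 1e6)


def prob_zero_reads(true_rpkm, length_bp, total_mapped_reads):
    """Chance of observing no reads, ASSUMING Poisson-distributed counts."""
    return math.exp(-expected_reads(true_rpkm, length_bp, total_mapped_reads))


# Invented example: a 2 kb transcript in a library of 20 million mapped reads.
length_bp = 2000
depth = 20_000_000

print("RPKM with 0 reads:", rpkm(0, length_bp, depth))
print("RPKM with 1 read:", rpkm(1, length_bp, depth))
print("Expected reads at true RPKM 0.01:", expected_reads(0.01, length_bp, depth))
print("P(no reads) at true RPKM 0.01:", prob_zero_reads(0.01, length_bp, depth))
```

With these made-up numbers a single read already corresponds to the smallest non-zero RPKM the transcript can show, so anything expressed well below that level will usually be reported as 0.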

That said, it's of course a pain in the neck to deal with processed RPKMs instead of raw data, because whether your genes are present in the analysis at all depends on the annotation that was used for counting. If you had the actual reads, you could get a value for any gene in any annotation, even if that value were 0.

Which IDs did they use in the "raw" data? I wouldn't use DAVID for ID conversion; it was last updated 5 years ago. I usually use BioMart to get a table with external IDs, which you can then join with your ID list.
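The join itself could look something like the sketch below. Everything specific in it is an assumption: the column names (`refseq_id`, `rpkm`, `hgnc_symbol`, `refseq_mrna`), the tab-separated layout, and the toy rows with placeholder gene names. It strips version suffixes (".2") before matching, since a version mismatch is an easy way to get zero hits, and it keeps "present with RPKM 0" separate from "absent from the table".

```python
import csv
import io


def strip_version(accession):
    """'NM_000001.2' -> 'NM_000001'; also normalizes case and whitespace."""
    return accession.strip().split(".")[0].upper()


def load_rpkm(handle, id_col="refseq_id", value_col="rpkm"):
    """Read the processed table into {accession without version: RPKM}."""
    table = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        table[strip_version(row[id_col])] = float(row[value_col])
    return table


def join_symbols(mapping_handle, rpkm_table,
                 symbol_col="hgnc_symbol", id_col="refseq_mrna"):
    """For each gene symbol, look up every mapped accession in the RPKM table.

    A value of None means the accession is absent from the table, which is
    different from being present with an RPKM of 0.
    """
    result = {}
    for row in csv.DictReader(mapping_handle, delimiter="\t"):
        hits = result.setdefault(row[symbol_col], {})
        accession = strip_version(row[id_col] or "")
        if accession:
            hits[accession] = rpkm_table.get(accession)
    return result


# Toy stand-ins for the GEO table and a BioMart export (hypothetical data).
rpkm_text = "refseq_id\trpkm\nNM_000001.2\t0.0\nNM_000002\t12.5\n"
mapping_text = (
    "hgnc_symbol\trefseq_mrna\n"
    "GENE_A\tNM_000001\n"
    "GENE_B\tNM_000002.1\n"
    "GENE_B\tNM_000003\n"
    "GENE_C\t\n"
)

rpkm_table = load_rpkm(io.StringIO(rpkm_text))
joined = join_symbols(io.StringIO(mapping_text), rpkm_table)

for symbol, hits in joined.items():
    print(symbol, hits)
```

In this toy case GENE_A is in the table with RPKM 0, GENE_B has one isoform present and one missing, and GENE_C has no RefSeq mRNA mapped at all, which are three different situations that a plain text search for IDs would lump together as "not found".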