Thanks for taking the time to answer a newcomer's question.
I am looking for whether some proteins are expressed in some RNA-Seq data I was given. Unfortunately, the RNA-Seq data is not raw - it's a list of RPKMs associated with RefSeq genes. It was the only 'raw' dataset provided on GEO.
Over half of these 37,000 genes have an RPKM of 0; the others have a wide range of expression. However, I cannot find any RefSeq IDs corresponding to the genes that encode my proteins of interest. I wonder if this is my fault, or the database's.
Here's what I did:
- Take a list of the proteins of interest, find HUGO genes encoding those proteins > feed HUGO gene names into BioMart and get a list of associated RefSeq IDs > search my database for those IDs (none!)
- Input the 37,000 RefSeq IDs I have into DAVID, generate an annotation report, and search for mentions of my protein of interest (none!)
What other sanity checks should I do before I claim that this RNA-Seq dataset does not, in fact, contain data for the genes encoding the proteins we're interested in? I don't know if my proteins are actually expressed in this tissue, but it seems too unlikely that they wouldn't appear anywhere in the dataset for me to jump to that conclusion. I've always thought of RNA-Seq as being 'comprehensive', and assumed that every protein-encoding gene would have at least a tiny number of reads. It would be weird if this dataset listed so many genes with an RPKM of 0, and yet excluded other genes completely.
Why are you putting 37,000 RefSeq IDs into DAVID?
An annotation report on 37,000 genes is not going to be informative.
I was trying (in a very dodgy way) to find out if there was *anything* in that list of genes that had to do with my proteins of interest, since I simply couldn't believe that none of those 37,000 genes encoded them.
Could UniProt provide you with the relevant RefSeq IDs directly?
Also, you're going from protein to gene to transcript ID, so the specific transcript that is translated into your protein may not be coming up (there could be many different isoforms).
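A minimal sketch of how one might check for this: collect every RefSeq transcript ID mapped to each gene and ask whether any of them appears in the RPKM table. The function names and the example IDs are hypothetical, and the version-suffix handling assumes that one of your two ID lists may carry suffixes like ".3" while the other does not.

```python
def strip_version(refseq_id):
    """Drop a trailing version suffix, e.g. 'NM_000001.2' -> 'NM_000001'."""
    return refseq_id.strip().split(".")[0]


def transcripts_found(gene_to_transcripts, rpkm_ids):
    """For each gene, list the mapped transcript IDs present in the RPKM table."""
    table = {strip_version(i) for i in rpkm_ids}
    return {
        gene: sorted({strip_version(t) for t in ids} & table)
        for gene, ids in gene_to_transcripts.items()
    }


# Hypothetical inputs for illustration only
gene_to_transcripts = {
    "GENE_A": ["NM_000001.2", "NM_000002.1"],  # two isoforms
    "GENE_B": ["NM_000003.4"],
}
rpkm_ids = ["NM_000002", "NR_000010", "NM_000099"]

hits = transcripts_found(gene_to_transcripts, rpkm_ids)
print(hits)
```

A gene with an empty list here has no mapped transcript in the table at all, which is a stronger statement than "the one ID I looked up was missing".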
Your RPKM file will likely contain RefSeq mRNA ("NM_*") and non-coding RNA ("NR_*") IDs, as well as some predicted RNA IDs ("XM_*", "XR_*"). These are the kinds of IDs you need. ("NC_*" accessions are chromosome records, not transcripts.)
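To see which kinds of IDs the file actually contains, a quick tally of accession prefixes can help. This is only a sketch: `prefix_counts` is a made-up helper and the example IDs are placeholders for the ID column of your RPKM file.

```python
from collections import Counter


def prefix_counts(ids):
    """Count IDs by the text before the first underscore ('(none)' if absent)."""
    return Counter(i.split("_", 1)[0] if "_" in i else "(none)" for i in ids)


# Placeholder IDs for illustration; replace with the IDs from your file
example_ids = ["NM_000002", "NR_000010.1", "XM_000123", "NM_000099", "ENSG00000141510"]

counts = prefix_counts(example_ids)
print(counts.most_common())
```

If most IDs fall under a prefix you did not expect, the BioMart query is probably returning a different ID type from the one used in the file.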
Hmm, okay. Will UniProt accept protein names and provide the related transcript IDs? I can't quite figure out which software package can list all transcripts related to a protein.