I've spent several days trying to do something so fundamental that I am embarrassed, humiliated and extremely baffled. None of this is well documented, and I feel embarrassed and idiotic asking for help; it is a shocking feeling as a postgraduate. Practical answers would be greatly appreciated.
I feel like this is a failure by biologists to document knowledge, and an assumption that people will show their colleagues how to use the UCSC Table Browser since it's such a "straightforward" and "fundamental" part of bioinformatics.
Before you assume I am not suited to the field: I am a good coder, and I have a disability which makes this tool harder for me to use, and I feel ashamed asking for help. I would simply like help and can't get it from my associates or administrators, partly because it is humiliating to admit I can't work out how to do this basic thing, and partly because I feel overwhelmed when processing tasks such as downloading tables to look through and then not finding what was promised.
I am trying to get the predicted genes in the RefSeq Genes set from the UCSC Table Browser. I read in a paper that they are in the UCSC Table Browser; however, none of the 13 tables I downloaded contain XM or XR accessions, only NM and NR. I have read that XM and XR are the prefixes for predicted genes, yet now I don't know whether a predicted gene is instead indicated by a different field containing PREDICTED, with XM or XR being for a "model" gene.
Please don't be so hard on yourself. We are all somewhere on the don't-know-but-are-willing-to-learn spectrum, so the more you focus on the learning part, the better you'll feel about yourself.
Take a look at the release notes for RefSeq that are located here. You will see that the XM and XR codes are reserved for RNA. If these are for species that are not covered by UCSC, they are not going to be present in the UCSC tables, which may be one of the reasons you are not finding them.
You may want to do this using the following method. I am going to assume that you are after the FASTA sequence for these entries. If it is something else, then give us a shout.
Extract the XM_ and XR_ entries from this file. Put the XM_ and XR_ accessions in a file (one per line). Refine as needed based on the species you are interested in.
Download a copy of the blast+ software package executables appropriate for your OS from NCBI here.
Grab a copy of the pre-made blast indexes for refseq_rna database (copy all .gz files that start with the name refseq_rna from this link and put them in a folder after uncompressing).
Use the blastdbcmd utility (part of the blast package) to extract sequence for the entries you want.
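If it helps, the steps above can be sketched in Python roughly as follows. This is only a sketch under assumptions: I am assuming the file is plain text with the accessions somewhere on each line, the function names and file names (model_ids.txt, model_rna.fasta) are ones I made up, and the sample accessions are placeholders rather than real records. The blastdbcmd options used (-db, -entry_batch, -outfmt, -out) are standard ones, but check `blastdbcmd -help` for your installed version.

```python
import re

# Matches model RNA accessions such as XM_000000001.1 or XR_000000002
# (the ".version" suffix is optional).
MODEL_RNA = re.compile(r"\bX[MR]_\d+(?:\.\d+)?\b")


def extract_model_accessions(lines):
    """Return unique XM_/XR_ accessions in the order they are first seen."""
    seen = {}
    for line in lines:
        for accession in MODEL_RNA.findall(line):
            seen.setdefault(accession, None)
    return list(seen)


def blastdbcmd_command(id_file, db="refseq_rna", out="model_rna.fasta"):
    """Build the blastdbcmd argument list that fetches FASTA for a batch of IDs."""
    return ["blastdbcmd", "-db", db, "-entry_batch", id_file,
            "-outfmt", "%f", "-out", out]


# Placeholder lines standing in for the real file (tab-separated here,
# but the regex does not depend on the column layout).
sample_lines = [
    "0000\tExample species\tXM_000000001.1\tmodel",
    "0000\tExample species\tNM_000000003.2\tcurated",
    "0000\tExample species\tXR_000000002.1\tmodel",
]

accessions = extract_model_accessions(sample_lines)
print("\n".join(accessions))  # one accession per line, as blastdbcmd expects
print(" ".join(blastdbcmd_command("model_ids.txt")))
```

In practice you would read the real file line by line, write the accessions to model_ids.txt, and then run the printed command (or pass the list to subprocess.run) from the folder that holds the uncompressed refseq_rna files.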
PS: A biologist generally becomes more frustrated and baffled after looking at code (s)he does not understand. Your frustration has to rank less than theirs.
You confused me with species there - I take it you mean chemical species rather than biological species: there is a table called xenoRefSeq which isn't prominently explained, and I didn't get around to finding out what it means.
The text you link to [release notes for RefSeq] lists 4 types of RNA accession (quoted below). In the UCSC Table Browser tables I only find the first 2:
rna *R_ and *M_ including: NM_; NR_; XM_; XR_
I was expecting cDNAs and predicted genes, which I thought would be indicated by XM and XR, unless XM and XR are something else (they're called "model" and are uncurated?). I wanted the predicted genes; it was hard to see this in the handbook. Thanks for the link.
I now think that a transcript which is only weakly supported by evidence may be what is called "predicted", rather than one generated by a model (which would then be XR or XM), so I suppose it will be included in the data(?). I should look in what I have, and this may have been a wasted question. It's just very confusing, and I am not finding these resources helpful or well laid out (e.g. none of it is clearly related to the UCSC Table Browser or to any other resources online).
What I read says I just download a set of predicted genes from the Table Browser, not that I have to blast a database or anything.
The question is which species (biologists always refer to species in the organismal sense) you are interested in. UCSC data covers only a small subset of the species present at NCBI (and thus in the BLAST refseq database above), so if the species you are interested in is not present at UCSC, you are not going to find data related to it.
What is it that you are trying to do with the XM_ and XR_ entries and for what biological organism/species?
What is the difference between XM_ and NM_ accessions?
Accession numbers that begin with the prefix XM_ (mRNA), XR_ (non-coding RNA), and XP_ (protein) are model RefSeqs produced either by NCBI’s genome annotation pipeline or copied from computationally annotated submissions to the INSDC. These RefSeq records are derived from the genome sequence and have varying levels of transcript or protein homology support. They represent the predicted transcripts and proteins annotated on the NCBI RefSeq contigs and may differ from INSDC mRNA submissions or from the subsequently curated RefSeq records (with NM_, NR_, or NP_ accession prefixes). These differences may reflect real sequence variation (polymorphism), or errors or gaps in the available genome sequence. The support for model RefSeq records should be further evaluated by comparing them to other sequence information available in Gene, BLink, Related Sequences, and BLAST reports.
The genome annotation pipelines are automated and their predicted products may or may not be subject to manual curation, but the data may be refreshed periodically.
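To check quickly what a downloaded table actually contains, the prefix distinction described in the quoted text can be turned into a small tally. This is a rough sketch, not an official tool: the function names are mine, the accessions in the example are placeholders, and I am assuming you have already pulled the accession column (often called "name" in UCSC gene tables, but check your file) into a list of strings.

```python
from collections import Counter

# Prefix meanings as described in the RefSeq text quoted above:
# N*_ records are curated, X*_ records are model (predicted).
REFSEQ_PREFIXES = {
    "NM_": "curated mRNA",
    "NR_": "curated non-coding RNA",
    "NP_": "curated protein",
    "XM_": "model (predicted) mRNA",
    "XR_": "model (predicted) non-coding RNA",
    "XP_": "model (predicted) protein",
}


def classify_accession(accession):
    """Map an accession to its RefSeq category using its 3-character prefix."""
    return REFSEQ_PREFIXES.get(accession[:3], "other")


def tally_prefixes(accessions):
    """Count how many accessions fall into each category."""
    return Counter(classify_accession(a) for a in accessions)


# Placeholder accessions standing in for one column of a downloaded table.
column = ["NM_000000001.1", "NR_000000002.1", "NM_000000003.4", "XM_000000004.1"]
for category, count in tally_prefixes(column).items():
    print(f"{category}: {count}")
```

If the tally for a table shows no "model" categories at all, that table simply does not carry XM_/XR_ records, which matches what you describe seeing in the 13 tables.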
Thank you - I have read this, but I wonder what their genome annotation pipeline does, and how the RefSeq contigs are annotated by this pipeline. That was my question :)
'Everything is hard before it is easy'