Question

Predicted RefSeq genes

0

8.9 years ago

bios-2016 • 0

Howdy

I've spent numerous days attempting to accomplish something so fundamental that I am embarrassed, humiliated and extremely baffled. This is not at all well documented, and I feel embarrassed and idiotic requesting help; it is a shocking feeling as a postgraduate. Helpful answers would be greatly appreciated.

I feel like this comes from biologists failing to document knowledge, and from an assumption that colleagues will show each other how to use the UCSC Table Browser, since it's such a "straightforward" and "fundamental" part of bioinformatics.

Before you assume I am not suited to the field: I am a good coder, I have a disability which exacerbates my trouble in using this tool, and I feel disgraced in requesting help. I would simply like help and can't get it from my associates or administrators, partly because it is humiliating to admit that I can't understand how to do this basic thing, and partly because I feel overwhelmed when processing tasks such as downloading tables to look through and then not finding what was promised.

I am attempting to get the predicted genes in the RefSeq genes set on the UCSC Table Browser. I have read in a paper that they are in the UCSC Table Browser; however, of the 13 tables I downloaded, none have XM or XR, only NM and NR. I have read that XM and XR are the codes for predicted genes, yet now I don't know whether a different field containing PREDICTED is what indicates a predicted gene, and XM or XR is for a "model" gene.

gene genome • 3.6k views

ADD COMMENT • link updated 13 months ago by Ram 45k • written 8.9 years ago by bios-2016 • 0

2

Please don't be so hard on yourself. We are all somewhere on the don't-know-but-are-willing-to-learn spectrum, so the more you focus on the learning part, the better you'll feel about yourself.

ADD REPLY • link 8.9 years ago by Ram 45k

0

'Everything is hard before it is easy'

ADD REPLY • link 8.9 years ago by WouterDeCoster 48k

GenoMax · Answer 1 · 2016-09-04

0

8.9 years ago

GenoMax 153k

Take a look at the release notes for RefSeq that are located here. You will see that the XM and XR codes are reserved for RNA. If these are for species that are not covered by UCSC, they are not going to be present in the UCSC tables, which may be one of the reasons you are not finding them.

You may want to do this using the following method. I am going to assume that you are after the FASTA sequence for these entries. If it is something else, then give us a shout.

1. Get this file from NCBI.
2. Extract the XM_ and XR_ entries from this file. Put the XM_ and XR_ accessions in a file (one per line). Refine as needed based on the species you are interested in.
3. Download a copy of the blast+ software package executables appropriate for your OS from NCBI here.
4. Grab a copy of the pre-made blast indexes for the refseq_rna database (copy all .gz files whose names start with refseq_rna from this link and put them in a folder after uncompressing).
5. Use the blastdbcmd utility (part of the blast package) to extract the sequences for the entries you want.

blastdbcmd -db /path_to_dir_with/refseq_rna -entry_batch your_accession_file -outfmt '%f' -out sequence.fa
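
The extraction step above (putting the XM_ and XR_ accessions in a file, one per line) could be sketched in Python roughly as follows. This is only a sketch under assumptions: the layout of the NCBI file is not specified in this thread, so the code simply scans every line for anything that looks like an XM_ or XR_ accession. The function names and the input file name are made up for illustration, and filtering by species is left out because it depends on the file's columns.

```python
import re
from typing import Iterable, List

# Matches accessions such as XM_001349953.1 or XR_106408 (version suffix optional).
MODEL_RNA = re.compile(r"\b(?:XM|XR)_\d+(?:\.\d+)?\b")


def extract_model_accessions(lines: Iterable[str]) -> List[str]:
    """Return the unique XM_/XR_ accessions found, in first-seen order."""
    seen = set()
    ordered = []
    for line in lines:
        for acc in MODEL_RNA.findall(line):
            if acc not in seen:
                seen.add(acc)
                ordered.append(acc)
    return ordered


def write_accession_file(source_path: str, out_path: str) -> int:
    """Write one accession per line (the shape -entry_batch expects); return the count."""
    with open(source_path, encoding="utf-8", errors="replace") as src:
        accessions = extract_model_accessions(src)
    with open(out_path, "w", encoding="utf-8") as out:
        for acc in accessions:
            out.write(acc + "\n")
    return len(accessions)
```

Calling `write_accession_file("ncbi_file.txt", "your_accession_file")` would then produce the input for the blastdbcmd command above; `ncbi_file.txt` is a placeholder for whatever the downloaded file is called.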

PS: A biologist generally becomes more frustrated and baffled after looking at code (s)he does not understand. Your frustration has to rank less than theirs.

ADD COMMENT • link 8.9 years ago by GenoMax 153k

0

You confused me with species there - I take it you mean as in chemical species rather than biological species: there is a table called xenoRefSeq which isn't prominently explained and I didn't get around to finding out the meaning of.

The text you link to [release notes for RefSeq] lists 4 types of RNA. In the UCSC Table Browser tables I only find the first 2:

rna         *R_ and *M_ including: NM_; NR_; XM_; XR_
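
One way to double-check which of these four prefixes actually occur in a set of downloaded tables is to tally them. The sketch below assumes only that the tables are plain text with accessions somewhere on each line; it does not rely on any particular UCSC column layout, and `tally_prefixes` is a name invented here.

```python
import re
from collections import Counter
from typing import Iterable

# The four RNA accession prefixes from the release-notes line above.
RNA_ACCESSION = re.compile(r"\b(NM|NR|XM|XR)_\d+")


def tally_prefixes(lines: Iterable[str]) -> Counter:
    """Count how many times each RNA accession prefix appears."""
    counts = Counter()
    for line in lines:
        counts.update(match.group(1) for match in RNA_ACCESSION.finditer(line))
    return counts
```

Passing an open file handle for each downloaded table to `tally_prefixes` and printing the result shows at a glance whether any XM or XR rows are present.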

I was expecting cDNAs and predicted genes, which I thought would be indicated by XM and XR, unless XM and XR are something else (they're called "model" and are uncurated?). I wanted the predicted genes; it was hard to see this in the handbook. Thanks for the link.

I now think that a transcript which is only weakly supported by evidence may be what is called "predicted", rather than one generated by a model (which would then be XR or XM), so I suppose it will be included in the data(?). I should look in what I have, and this may have been a wasted question. It's just very confusing, and I am not finding these resources helpful or well laid out (e.g. none of it is clearly related to the UCSC Table Browser or to any other resources online).

What I read says I just download a set of predicted genes from the Table Browser, not that I have to blast a database or anything.

Thank you anyway

ADD REPLY • link 8.9 years ago by bios-2016 • 0

1

XM_ and XR_ are going to be predicted RNAs. Here is a selective set of examples from the refseq_rna database.

>gi|124513265|ref|XM_001349953.1| Plasmodium falciparum 3D7 L-lactate dehydrogenase (PfLDH) mRNA, complete cds
>gi|159491662|ref|XM_001703727.1| Chlamydomonas reinhardtii zygote-specific protein (ZYS1b) mRNA, complete cds
>gi|922328574|ref|XM_013588226.1| Medicago truncatula maturase K domain protein partial mRNA
>gi|984142898|ref|XM_015476344.1| PREDICTED: Marmota marmota marmota NAP1-binding protein (LOC107134222), mRNA

>gi|309272549|ref|XR_106408.1| PREDICTED: Mus musculus predicted gene 11555 (Gm11555), misc_RNA
>gi|569009450|ref|XR_001556.3| PREDICTED: Mus musculus predicted gene 5636 (Gm5636), misc_RNA
>gi|161076441|ref|NR_003788.1| Drosophila melanogaster snoRNA:Me28S-A982a (snoRNA:Me28S-A982a), snoRNA
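
For anyone who wants to work with headers like the ones above programmatically, here is a minimal Python sketch that splits one header into its parts. It assumes the `>gi|<number>|ref|<accession>| <description>` layout seen in these examples; `RefSeqHeader` and `parse_header` are names invented for this illustration. Note from the examples that not every XM_ title starts with "PREDICTED:", so the flag reports only what the title text says.

```python
from typing import NamedTuple, Optional


class RefSeqHeader(NamedTuple):
    gi: Optional[str]
    accession: str
    description: str
    flagged_predicted: bool  # True only if the title starts with "PREDICTED:"


def parse_header(header: str) -> RefSeqHeader:
    """Split a '>gi|<number>|ref|<accession>| <description>' line into fields."""
    fields = header.lstrip(">").strip().split("|")
    if "ref" not in fields or fields.index("ref") + 1 >= len(fields):
        raise ValueError(f"no 'ref|<accession>|' part in: {header!r}")
    ref_at = fields.index("ref")
    gi = fields[1] if fields[0] == "gi" and len(fields) > 1 else None
    accession = fields[ref_at + 1]
    # Re-join the remainder in case the description itself contains a '|'.
    description = "|".join(fields[ref_at + 2:]).strip()
    return RefSeqHeader(gi, accession, description, description.startswith("PREDICTED:"))
```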

The question is what species (biologists always refer to species in the organismal context) you are interested in. UCSC data covers only a small subset of the species present at NCBI (and thus in the blast refseq_rna database above), so if the species you are interested in is not present at UCSC, you are not going to find data related to it there.

What is it that you are trying to do with the XM_ and XR_ entries and for what biological organism/species?

ADD REPLY • link 8.9 years ago by GenoMax 153k

0

Thank you. As an aside, where can I learn more about how XM and XR sequences are created in detail?

ADD REPLY • link 8.9 years ago by Ram 45k

1

From RefSeq Manual:

What is the difference between XM_ and NM_ accessions?

Accession numbers that begin with the prefix XM_ (mRNA), XR_ (non-coding RNA), and XP_ (protein) are model RefSeqs produced either by NCBI’s genome annotation pipeline or copied from computationally annotated submissions to the INSDC. These RefSeq records are derived from the genome sequence and have varying levels of transcript or protein homology support. They represent the predicted transcripts and proteins annotated on the NCBI RefSeq contigs and may differ from INSDC mRNA submissions or from the subsequently curated RefSeq records (with NM_, NR_, or NP_ accession prefixes). These differences may reflect real sequence variation (polymorphism), or errors or gaps in the available genome sequence. The support for model RefSeq records should be further evaluated by comparing them to other sequence information available in Gene, BLink, Related Sequences, and BLAST reports.

The genome annotation pipelines are automated and their predicted products may or may not be subject to manual curation, but the data may be refreshed periodically.
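
The distinction drawn in the quoted passage can be written down as a small lookup. This is a sketch, not an official NCBI mapping: it covers only the six prefixes named in the quote, and the labels "curated" and "model" as well as the function name are choices made for this illustration.

```python
from typing import Optional, Tuple

# Prefix -> (molecule, status), limited to the six prefixes named in the quote.
REFSEQ_PREFIXES = {
    "NM_": ("mRNA", "curated"),
    "NR_": ("non-coding RNA", "curated"),
    "NP_": ("protein", "curated"),
    "XM_": ("mRNA", "model"),
    "XR_": ("non-coding RNA", "model"),
    "XP_": ("protein", "model"),
}


def classify_accession(accession: str) -> Optional[Tuple[str, str]]:
    """Return (molecule, status) for a known prefix, or None for anything else."""
    return REFSEQ_PREFIXES.get(accession[:3].upper())
```

For example, `classify_accession("XR_106408.1")` gives `("non-coding RNA", "model")`, while an accession with a prefix outside this table gives `None`.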

ADD REPLY • link 8.9 years ago by GenoMax 153k

0

Thank you - I have read this, but I wonder what their genome annotation pipeline does, and how the RefSeq contigs are annotated by this pipeline. That was my question :)

ADD REPLY • link 8.9 years ago by Ram 45k

1

Algorithmic details for NCBI's eukaryotic annotation pipeline are here, and those for the prokaryotic one are here.

ADD REPLY • link 8.9 years ago by GenoMax 153k

0

Thank you - you're awesome!

ADD REPLY • link 8.9 years ago by Ram 45k

0

Hello,

Where can I download an RNA database with headers in this format: >gi|159491662|ref|XM_001703727.1| Chlamydomonas reinhardtii zygote-specific protein (ZYS1b) mRNA?

Thank you!

ADD REPLY • link updated 13 months ago by Ram 45k • written 13 months ago by demolidd77 ▴ 60

0

That's a generic FASTA format header. Please provide as much detail as you can.

ADD REPLY • link 13 months ago by Ram 45k