Question

Why Do Some RefSeq Genes Have Several Transcripts on Different Strands?

3

13.6 years ago

tflutre ▴ 590

Hello,

among RefSeq annotations (hg19, from UCSC), one can find genes with several transcripts on different strands. One also finds genes with the same identifier but on different chromosomes. Here is an example combining the two (gene "OR4F29"):

wget --timestamping ftp://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/refGene.txt.gz
zcat refGene.txt.gz | grep "OR4F29"
587    NM_001005221    chr1    +    367658    368597    367658    368597    1    367658,    368597,    0    OR4F29    cmpl    cmpl    0,
589    NM_001005221    chr1    -    621095    622034    621095    622034    1    621095,    622034,    0    OR4F29    cmpl    cmpl    0,
1964    NM_001005221    chr5    +    180794287    180795226    180794287    180795226    1    180794287,    180795226,    0    OR4F29    cmpl    cmpl    0,

Are these errors, knowing that Ensembl has only one transcript for this gene?
What do you do in such a case? Discard the genes? Use Ensembl instead?

refseq strand gene • 8.4k views

updated 8.1 years ago by Biostar 20 • written 13.6 years ago by tflutre ▴ 590

0

Ensembl has listed those three positions as separate genes - http://asia.ensembl.org/Homo_sapiens/Search/Details?species=Homo_sapiens;idx=Gene;end=3;q=OR4F29 - each with a single transcript of the exact same sequence. Personally I find RefSeq more... useful in this case

updated 5.6 years ago by Ram 45k • written 13.6 years ago by Aaron Statham ★ 1.1k

score 5 · Answer 1 · 2011-09-26

Grab the sequence from NCBI. BLAT the sequence. Note the results:

   ACTIONS      QUERY           SCORE START  END QSIZE IDENTITY CHRO STRAND  START    END SPAN
   ---------------------------------------------------------------------------------------------------
   browser details NM_001005221.2   939     1   939   939 100.0%     1   -     621096    622034    939
   browser details NM_001005221.2   939     1   939   939 100.0%     5   +  180794288 180795226    939
   browser details NM_001005221.2   939     1   939   939 100.0%     1   +     367659    368597    939

(There are many more matches, but these are the three with 100% identity over the full length of the sequence.)

There are multiple locations listed for this sequence because it appears at several loci in the genome.
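To see how many other RefSeq accessions behave like NM_001005221, one could group the rows of refGene.txt.gz by accession and list those placed at more than one locus. Here is a minimal Python sketch of that idea. It assumes the usual UCSC refGene column order (accession in column 2, chrom in 3, strand in 4, txStart in 5, txEnd in 6, gene name in column 13) and a tab-separated file; the function names `parse_refgene_lines` and `multi_locus_accessions` are made up for illustration.

```python
from collections import defaultdict


def parse_refgene_lines(lines):
    """Yield (accession, gene, chrom, strand, tx_start, tx_end) per refGene row.

    Assumption: tab-separated UCSC refGene layout, with the accession in
    column 2, chrom in 3, strand in 4, txStart in 5, txEnd in 6 and the
    gene name (name2) in column 13.
    """
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 13:
            continue  # skip blank or truncated lines
        yield (fields[1], fields[12], fields[2], fields[3],
               int(fields[4]), int(fields[5]))


def multi_locus_accessions(rows):
    """Return {accession: sorted list of loci} for accessions seen at >1 locus.

    A locus here is simply a distinct (chrom, strand, start, end) tuple.
    """
    loci = defaultdict(set)
    for acc, _gene, chrom, strand, start, end in rows:
        loci[acc].add((chrom, strand, start, end))
    return {acc: sorted(places) for acc, places in loci.items()
            if len(places) > 1}
```

To run it on the downloaded file, something like `with gzip.open("refGene.txt.gz", "rt") as handle: hits = multi_locus_accessions(parse_refgene_lines(handle))` should work (after `import gzip`). Note that refGene starts are 0-based while the BLAT output above is 1-based, which is why 621095 in the table corresponds to 621096 in the BLAT hit.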

score 4 · Answer 2 · 2011-09-26

4

13.6 years ago

Travis ★ 2.9k

I personally prefer the Ensembl approach in cases like this, i.e., treating them as separate genes. Another option is to treat them as one gene with sense and antisense representations. It depends a lot on preference and on your own way of conceptualizing the genome/transcriptome. Either way, I would not discard any of the transcripts.
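If a downstream tool insists on one strand per gene, one way to follow the "separate genes" approach without discarding anything is to give each locus its own identifier. Below is a minimal Python sketch of that idea. The suffix scheme (`OR4F29_locus1`, ...) and the function name `assign_locus_ids` are hypothetical, not a RefSeq or Ensembl convention; the sketch assumes 0-based, half-open coordinates as in the UCSC tables, and treats transcripts of the same gene as one locus only when they overlap on the same chromosome and strand.

```python
from collections import defaultdict


def assign_locus_ids(records):
    """Map each (gene, chrom, strand, start, end) tuple to a locus-specific ID.

    Transcripts of the same gene that overlap on the same chromosome and
    strand share a locus; any other placement starts a new locus.
    The "_locusN" suffix is an illustrative naming choice only.
    """
    by_gene = defaultdict(list)
    for rec in records:
        by_gene[rec[0]].append(rec)

    ids = {}
    for gene, recs in by_gene.items():
        # Sort so overlapping transcripts on one chrom/strand are adjacent.
        recs.sort(key=lambda r: (r[1], r[2], r[3], r[4]))
        loci = []  # each entry: [chrom, strand, current_end, members]
        for rec in recs:
            _, chrom, strand, start, end = rec
            last = loci[-1] if loci else None
            if (last and last[0] == chrom and last[1] == strand
                    and start < last[2]):
                last[2] = max(last[2], end)  # extend the current locus
                last[3].append(rec)
            else:
                loci.append([chrom, strand, end, [rec]])
        for n, locus in enumerate(loci, start=1):
            # Genes with a single locus keep their original name.
            label = gene if len(loci) == 1 else f"{gene}_locus{n}"
            for rec in locus[3]:
                ids[rec] = label
    return ids
```

With the three OR4F29 rows from the question, each placement gets its own label, while a gene whose isoforms overlap at a single locus keeps its plain name.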

13.6 years ago by Travis ★ 2.9k

0

@Travis I chose your answer because I will also use Ensembl for the moment, mainly because the tool I am using doesn't allow several transcripts on different strands for the same gene.

13.6 years ago by tflutre ▴ 590

score 3 · Answer 3 · 2011-09-26

3

13.6 years ago

Casey Bergman 18k

Trans-splicing can also cause some gene models to (legitimately) be annotated to both strands, such as the Drosophila Mod(mdg4) locus: http://genome.cshlp.org/content/13/10/2220.full

"The first indication of a requirement for trans-splicing in the generation of Mod(mdg4) proteins came after the realization that the two DNA strands of the gene have coding capabilities and contain coding sequences present in mature mRNAs that are translated into functional proteins."

13.6 years ago by Casey Bergman 18k

0

@Casey thanks for this good reference! For other users, the initial paper (Labrador et al., Nature 2001) can be found here: http://www.nature.com/nature/journal/v409/n6823/full/4091000a0.html

13.6 years ago by tflutre ▴ 590