Why Do Some RefSeq Genes Have Several Transcripts On Different Strands?
13.2 years ago
tflutre ▴ 580

Hello,

Among the RefSeq annotations (hg19, from UCSC), one can find genes with several transcripts on different strands. One also finds genes with the same identifier on different chromosomes. Here is an example combining the two (gene "OR4F29"):

wget --timestamping ftp://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/refGene.txt.gz
zcat refGene.txt.gz | grep "OR4F29"
587    NM_001005221    chr1    +    367658    368597    367658    368597    1    367658,    368597,    0    OR4F29    cmpl    cmpl    0,
589    NM_001005221    chr1    -    621095    622034    621095    622034    1    621095,    622034,    0    OR4F29    cmpl    cmpl    0,
1964    NM_001005221    chr5    +    180794287    180795226    180794287    180795226    1    180794287,    180795226,    0    OR4F29    cmpl    cmpl    0,
  • Are these errors, knowing that Ensembl has only one transcript for this gene?

  • What do you do in such a case? Discard the genes? Use Ensembl instead?
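
To see how widespread this is before deciding what to do, the same check can be run over the whole table rather than one gene at a time. The following is only a minimal sketch: it assumes the standard UCSC refGene column order (transcript ID in column 2, chromosome in 3, strand in 4, gene symbol in 13), and the function names `find_multi_locus` and `read_refgene` are made up for illustration. The example rows are the three lines shown above.

```python
import csv
import gzip
from collections import defaultdict

# Assumed refGene.txt layout (0-based field index):
# 1 = transcript ID, 2 = chrom, 3 = strand, 4 = txStart, 5 = txEnd, 12 = gene symbol


def find_multi_locus(rows):
    """Return {(gene, transcript ID): [placements]} for IDs placed more than once."""
    placements = defaultdict(list)
    for row in rows:
        key = (row[12], row[1])
        placements[key].append((row[2], row[3], int(row[4]), int(row[5])))
    return {key: locs for key, locs in placements.items() if len(locs) > 1}


def read_refgene(path):
    """Yield rows from a local, gzipped, tab-separated refGene file (hypothetical helper)."""
    with gzip.open(path, "rt") as handle:
        yield from csv.reader(handle, delimiter="\t")


# The three OR4F29 lines from the question, split on whitespace.
EXAMPLE_ROWS = [
    "587 NM_001005221 chr1 + 367658 368597 367658 368597 1 367658, 368597, 0 OR4F29 cmpl cmpl 0,",
    "589 NM_001005221 chr1 - 621095 622034 621095 622034 1 621095, 622034, 0 OR4F29 cmpl cmpl 0,",
    "1964 NM_001005221 chr5 + 180794287 180795226 180794287 180795226 1 180794287, 180795226, 0 OR4F29 cmpl cmpl 0,",
]

hits = find_multi_locus(row.split() for row in EXAMPLE_ROWS)
for (gene, tx_id), locs in hits.items():
    strands = sorted({strand for _, strand, _, _ in locs})
    chroms = sorted({chrom for chrom, _, _, _ in locs})
    print(gene, tx_id, len(locs), "placements", chroms, strands)
```

For the full table, `find_multi_locus(read_refgene("refGene.txt.gz"))` should work the same way, provided the file was downloaded as above.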

refseq strand gene • 8.0k views

Ensembl lists those three positions as separate genes (http://asia.ensembl.org/Homo_sapiens/Search/Details?species=Homo_sapiens;idx=Gene;end=3;q=OR4F29), each with a single transcript of exactly the same sequence. Personally I find RefSeq more... useful in this case.

13.2 years ago

Grab the sequence from NCBI. BLAT the sequence. Note the results:

   ACTIONS      QUERY           SCORE START  END QSIZE IDENTITY CHRO STRAND  START    END SPAN
   ---------------------------------------------------------------------------------------------------
   browser details NM_001005221.2   939     1   939   939 100.0%     1   -     621096    622034    939
   browser details NM_001005221.2   939     1   939   939 100.0%     5   +  180794288 180795226    939
   browser details NM_001005221.2   939     1   939   939 100.0%     1   +     367659    368597    939

(With lots more matches, but those are the three with 100% identity for the full length of the sequence).

There are multiple locations listed for this sequence because it appears at several loci in the genome.
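
The BLAT start positions are each one higher than the txStart values in refGene.txt (for example 367659 versus 367658), while the end positions are identical. As far as I know this is only a coordinate convention: UCSC tables store 0-based, half-open intervals, and the BLAT results page shows 1-based starts. A small sketch of the conversion, with `to_one_based` being a name chosen here for illustration:

```python
def to_one_based(start0, end0):
    """Convert a 0-based, half-open interval to a 1-based, inclusive one."""
    return start0 + 1, end0


# (chrom, strand, txStart, txEnd) as listed in refGene.txt for NM_001005221
REFGENE_PLACEMENTS = [
    ("chr1", "+", 367658, 368597),
    ("chr1", "-", 621095, 622034),
    ("chr5", "+", 180794287, 180795226),
]

for chrom, strand, start0, end0 in REFGENE_PLACEMENTS:
    start1, end1 = to_one_based(start0, end0)
    # In 0-based, half-open coordinates the span is simply end - start.
    print(chrom, strand, start1, end1, end0 - start0)
```

After conversion the three refGene placements line up with the three 100% identity BLAT hits, each spanning 939 bases.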


David's last sentence is telling: this gene/transcript would be labeled as a likely repeat element, most probably a transcribed portion of an LTR or retroelement. Thus, I would do as Travis suggests and give such a gene some kind of alternate consideration/label.

13.2 years ago
Travis ★ 2.8k

I personally prefer the Ensembl approach in cases like this, i.e. treating them as separate genes. Another option is to treat them as one gene with sense and antisense representations. It depends a lot on preference and on your own way of conceptualizing the genome/transcriptome. I would not discard any of the transcripts.
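
If you want to stay with the RefSeq table but still treat each copy as its own gene, one way is to give every placement a locus-specific identifier before passing the annotation to downstream tools. This is only a sketch: the `_locusN` suffix is a naming scheme invented here, not a RefSeq or Ensembl convention, and `locus_specific_ids` is a hypothetical helper.

```python
def locus_specific_ids(symbol, placements):
    """Map an invented per-locus ID to each (chrom, start, strand) placement.

    Placements are numbered in sorted order. Chromosome names sort as plain
    strings here, so "chr10" would come before "chr2".
    """
    ordered = sorted(placements)
    return {f"{symbol}_locus{i}": locus for i, locus in enumerate(ordered, start=1)}


# The three OR4F29 placements from refGene.txt: (chrom, txStart, strand)
OR4F29_PLACEMENTS = [
    ("chr5", 180794287, "+"),
    ("chr1", 621095, "-"),
    ("chr1", 367658, "+"),
]

for locus_id, (chrom, start, strand) in locus_specific_ids("OR4F29", OR4F29_PLACEMENTS).items():
    print(locus_id, chrom, start, strand)
```

Each copy then has a single strand and a single chromosome, which also suits tools that cannot handle one gene spanning several strands.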


@Travis I chose your answer as I will also use Ensembl for the moment, but mainly because the tool I am using does not allow several transcripts on different strands for the same gene.

13.2 years ago

Trans-splicing can also cause some gene models to (legitimately) be annotated to both strands, such as the Drosophila Mod(mdg4) locus: http://genome.cshlp.org/content/13/10/2220.full

The first indication of a requirement for trans-splicing in the generation of Mod(mdg4) proteins came after the realization that the two DNA strands of the gene have coding capabilities and contain coding sequences present in mature mRNAs that are translated into functional proteins


@Casey thanks for this good reference! For the other users, the initial paper (Labrador et al, Nature 2001) can be found here http://www.nature.com/nature/journal/v409/n6823/full/4091000a0.html
