Question

How To Merge Isoforms For A Gene

11.6 years ago

camelbbs ▴ 710

Hi All,

I have a technical question: does anybody have a script or software that can merge multiple isoforms of a gene into a single reference transcript?

Ideally, the script would merge the overlapping exons between isoforms and produce a reference transcript that includes the longest exons.

Thanks,

Che

rnaseq • 14k views

ADD COMMENT • link updated 2.8 years ago by Ram 44k • written 11.6 years ago by camelbbs ▴ 710

Can you provide a snippet of input, or what you expect to feed into a script? Just a few lines would help.

ADD REPLY • link 11.6 years ago by Alex Reynolds 36k

I just want to handle the refflat.gtf file:

chrX hg19_refFlat exon 120073834 120073989 0.000000 - . gene_id "CT47A3"; transcript_id "CT47A3_dup2";
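For anyone who wants to work on the GTF directly, here is a minimal Python sketch of pulling the gene and transcript IDs out of a record like the one above. It assumes the real file is tab-separated with nine columns (the tabs are lost in the post), and `parse_gtf_line` is a made-up helper name, not part of any library.

```python
import re

def parse_gtf_line(line):
    """Split one GTF record into its fixed columns plus selected attributes."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError("expected 9 tab-separated GTF columns")
    chrom, source, feature, start, end, score, strand, frame, attrs = fields
    # The attribute column looks like: key "value"; key "value";
    attributes = dict(re.findall(r'(\S+) "([^"]*)";', attrs))
    return {
        "chrom": chrom,
        "feature": feature,
        "start": int(start),  # GTF coordinates are 1-based, inclusive
        "end": int(end),
        "strand": strand,
        "gene_id": attributes.get("gene_id"),
        "transcript_id": attributes.get("transcript_id"),
    }

# The record from the post, rebuilt with tabs between the nine columns.
example = "\t".join([
    "chrX", "hg19_refFlat", "exon", "120073834", "120073989",
    "0.000000", "-", ".",
    'gene_id "CT47A3"; transcript_id "CT47A3_dup2";',
])
record = parse_gtf_line(example)
print(record["gene_id"], record["transcript_id"], record["start"], record["end"])
```

Grouping on `gene_id` rather than `transcript_id` is what lets isoforms of the same gene be handled together.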

ADD REPLY • link 11.6 years ago by camelbbs ▴ 710

Hi camelbbs, were you able to solve this problem? I also want to merge the overlapping exons of each gene in a GTF file.

ADD REPLY • link 10.7 years ago by Ron ★ 1.2k

Hi Camelbbs. I'm adding this comment to all your questions: Please take some time, before you ask a question, to think more about your problems and most likely sources of answers (manuals, FAQs, Google!, etc.). When you ask a question, include some context, tell us why you ask that question, what result you need, etc. Most of your questions are vague, impossible to answer or you changed them following an answer because it became evident that it was not clear. Cheers.

ADD REPLY • link 11.6 years ago by Eric Normandeau 11k

Are you familiar with this area? I don't know why you say this.

ADD REPLY • link 11.6 years ago by camelbbs ▴ 710

score 5 · Answer 1 · 2013-05-07

Perhaps the following might get you started.

For exploration purposes, I exported the refFlat table from the UCSC Genome Browser as a GTF file and saved it somewhere I can find it:

$ wget https://dl.dropboxusercontent.com/u/31495717/refFlat.hg19.gtf.gz

I then extracted exons and converted the result to a BED file with BEDOPS gtf2bed‡, and passed it to an awk script that uses an associative array (hash table) to store results based on gene name:

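$ gunzip -c refFlat.hg19.gtf.gz \
    | awk '$3 == "exon"' \
    | gtf2bed - \
    | awk '
    {
        name[$4]++;
        if (name[$4] == 1) {
            # first exon seen for this gene: record its coordinates
            chr[$4] = $1;
            start[$4] = $2;
            stop[$4] = $3;
            remainder[$4] = substr($0, index($0, $5));
        }
        else {
            # later exons: widen the span if they extend past it
            if ($2 + 0 < start[$4] + 0) { start[$4] = $2; }
            if ($3 + 0 > stop[$4] + 0) { stop[$4] = $3; }
        }
    }
    END {
        for (id in name) {
            printf("%s\t%s\t%s\t%s\t%s\n", chr[id], start[id], stop[id], id, remainder[id]);
        }
    }' - \
    > mergedRefFlatExons.hg19.bed

At the first instance of an exon for a gene name, we assign values to elements of associative arrays keyed on the gene name. For every further exon of that gene, the else branch of the if-else block widens the stored span: the start moves down or the stop moves up whenever the exon extends past the current values. The output is therefore one interval per gene name, running from its lowest exon start to its highest exon stop, rather than a list of merged exons. Note that the arrays are keyed on gene name alone, so a name that occurs on more than one chromosome is collapsed into a single record.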

If this works for you, then perhaps you can extend it to meet the other condition (reference transcript with the longest exon) by keeping track of the longest current exon, which you might mark in the END block (perhaps with a custom GTF attribute printed at the end of the line).
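As a rough illustration of the "keep track of the longest current exon" idea, here is a small Python sketch. It assumes the exons have already been reduced to (gene, start, end) tuples in BED-style 0-based, half-open coordinates; the function name and the toy gene names are invented for the example.

```python
def longest_exon_per_gene(exons):
    """Return {gene: (start, end, length)} for the longest exon of each gene.

    exons: iterable of (gene, start, end), 0-based half-open coordinates.
    """
    longest = {}
    for gene, start, end in exons:
        length = end - start
        # Keep the first exon seen, then replace it only by a strictly longer one.
        if gene not in longest or length > longest[gene][2]:
            longest[gene] = (start, end, length)
    return longest

exons = [("GENE_A", 100, 200), ("GENE_A", 150, 400), ("GENE_B", 10, 20)]
result = longest_exon_per_gene(exons)
print(result)
```

The same bookkeeping could be done with one more associative array in the awk script; the dictionary here plays the role of that array.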

‡ : Conversion to BED is not necessary. I did this because I am more familiar with handling BED data than GTF. If you are more familiar with GTF files and the order in which attributes are stored, you can change the field assignments to the elements of each associative array accordingly.

score 2 · Answer 2 · 2013-05-06

11.6 years ago

Abhi ★ 1.6k

Bedtools is a good tool for working with genomic features. It is fast and efficient. In your case bedtools merge should work. Look at the examples in the manual; they are pretty helpful for getting a visual picture.

hth, -Abhi

ADD COMMENT • link 11.6 years ago by Abhi ★ 1.6k

Thanks, but I think bedtools merge will merge all overlapping exons regardless of which gene they belong to. I want to merge the isoforms that belong to the same gene, not exons from different genes.
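The difference can be sketched in plain Python: group the exons by gene first, then merge overlaps inside each group. This is only an illustration under stated assumptions, namely that exons are (chrom, strand, gene, start, end) tuples in 0-based half-open coordinates; the function names and toy genes are invented and this is not how bedtools itself is implemented.

```python
from collections import defaultdict

def merge_intervals(intervals):
    """Merge overlapping (or touching) half-open intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps or touches the previous block: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(block) for block in merged]

def merge_exons_by_gene(exons):
    """Merge exons separately for each (chrom, strand, gene) group."""
    groups = defaultdict(list)
    for chrom, strand, gene, start, end in exons:
        groups[(chrom, strand, gene)].append((start, end))
    return {key: merge_intervals(ivs) for key, ivs in groups.items()}

exons = [
    ("chr1", "+", "GENE_A", 100, 200),  # isoform 1
    ("chr1", "+", "GENE_A", 150, 300),  # isoform 2, overlaps the exon above
    ("chr1", "+", "GENE_A", 500, 600),
    ("chr1", "+", "GENE_B", 250, 550),  # different gene overlapping GENE_A
]
per_gene = merge_exons_by_gene(exons)
all_together = merge_intervals([(s, e) for _, _, _, s, e in exons])
print(per_gene)
print(all_together)
```

In this toy input, merging everything at once fuses the two genes into one block, while the grouped version keeps GENE_A as two merged exons and GENE_B as its own exon.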

ADD REPLY • link 11.6 years ago by camelbbs ▴ 710

score 1 · Answer 3 · 2013-05-08

R's Bioconductor gives you:

an object-oriented repackaging of the UCSC gene models (named after Entrez IDs)
the function reduce, to 'merge' the exons of the isoforms into a merged 'maximal' gene
a tool to map between Entrez IDs and gene names (providing names for the merged genes)
the rtracklayer package, to write the result back out as a GFF3 file

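Run R and, the first time only, install the needed packages:

if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install(c("TxDb.Hsapiens.UCSC.hg19.knownGene","org.Hs.eg.db","rtracklayer","GenomicFeatures"))
# Older Bioconductor releases used source("http://bioconductor.org/biocLite.R") and biocLite() instead.
# Say 'yes' to all questions.  You may need a few more packages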

Now load the gene models, 'merge' the exons within each gene, give each gene a name, and write a GFF3 file of the results:

library(TxDb.Hsapiens.UCSC.hg19.knownGene)
library(rtracklayer)
library(org.Hs.eg.db)
# group exons by Entrez gene ID, then collapse overlapping exons within each gene
exonsByGene <- exonsBy(TxDb.Hsapiens.UCSC.hg19.knownGene,'gene')
mergedGene <- reduce(exonsByGene)
mergedGene <-
  ## Remove a few hundred 'tricky cases' like:
  ##  * trans-spliced genes (?? are there any in hg19.knownGene ??)
  ##  * genes on multiple chromosomes (eg: alt. haplotypes "chr_ctg9_hap1" )
  ##    Arguably(?) these should not be considered the 'same gene'.
  ## (elementNROWS was called elementLengths in older Bioconductor releases)
  mergedGene[1==elementNROWS(runValue(strand(mergedGene))) &
                     1==elementNROWS(runValue(seqnames(mergedGene)))]
`]]1`<-function(i,o)o[[i]][[1]] # utility: first value that o maps to key i
names(mergedGene)<- # assign gene symbol for export to GFF
     sapply(names(mergedGene),`]]1`,org.Hs.egSYMBOL) # TODO: faster?
export(mergedGene,'Hsapiens.UCSC.hg19.knownGene.merged.gff3')

caveats:

3' and 5' UTR features are lost
the 'tricky cases'

score 1 · Answer 4 · 2013-05-08

11.6 years ago

cdsouthan ★ 1.9k

This is what Swiss-Prot has been doing for years. By default, the canonical ORF is the longest one with the most exons. Alternative splice forms supported by the literature and TrEMBL are annotated in the feature lines.

ADD COMMENT • link 11.6 years ago by cdsouthan ★ 1.9k

Ram · Answer 5 · 2013-09-21

11.3 years ago

Puriney ▴ 330

You could take a look at this gist, though a better move would be to end up with BED12 rather than BED6.
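To show what the BED12 form would look like, here is a hedged Python sketch that packs one gene's merged exon blocks into a single BED12 line. It is not taken from the gist; `to_bed12` is a made-up helper, the blocks are assumed to be already merged and in 0-based half-open coordinates, and thickStart/thickEnd are simply set to the whole span because no CDS information is available.

```python
def to_bed12(chrom, name, strand, blocks, score=0):
    """Format merged exon blocks of one gene as a BED12 line.

    blocks: non-overlapping (start, end) pairs, 0-based half-open.
    """
    blocks = sorted(blocks)
    chrom_start = blocks[0][0]
    chrom_end = blocks[-1][1]
    # BED12 stores block sizes and block starts relative to chromStart.
    sizes = ",".join(str(end - start) for start, end in blocks)
    starts = ",".join(str(start - chrom_start) for start, _ in blocks)
    fields = [
        chrom, chrom_start, chrom_end, name, score, strand,
        chrom_start, chrom_end,  # thickStart, thickEnd (assumption: whole span)
        "0",                     # itemRgb
        len(blocks), sizes, starts,
    ]
    return "\t".join(str(field) for field in fields)

line = to_bed12("chr1", "GENE_A", "+", [(100, 300), (500, 600)])
print(line)
```

One line per gene keeps the exon structure together, which BED6 cannot do without repeating the gene name on every exon row.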

ADD COMMENT • link updated 3.1 years ago by Ram 44k • written 11.3 years ago by Puriney ▴ 330

Ram · Answer 6 · 2014-12-29

Hi,

I'm the developer of pyGeno. Here's a little script that does just that for the gene TPST2, using segment trees:

from pyGeno.Genome import *
from pyGeno.tools.SegmentTree import *

ref = Genome(name = "GRCh37.75")
gene = ref.get(Gene, name = "TPST2")[0]
seg = SegmentTree()

for trans in gene.get(Transcript) :
    for exon in trans.exons :
        # add the current exon position to the tree
        seg.insert(exon.start, exon.end)

# merge all exon positions
seg.flatten()

# concatenate the sequence under each merged segment
consensusSequence = ""
for exon in seg.children :
    consensusSequence += gene.chromosome.sequence[exon.x1:exon.x2]

print(consensusSequence)

Cheers