Updated
I want to calculate the distance to the polyA site for each exon of each gene. The final data.frame should look something like this:
GeneA.name Exon1 Start End Distance
GeneA.name Exon2 Start End Distance
GeneB.name Exon1 Start End Distance
GeneB.name Exon2 Start End Distance
Each gene has several isoforms (e.g. NM1234, NM12345, NM_123456). If I don't collapse the isoforms into one combined set of exons per gene, exons shared between isoforms will be duplicated.
My idea is to get all the exon locations for a given gene, but I'm not sure how to handle the isoform information.
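To make the target table concrete, here is a minimal Python sketch of how its rows could be filled in once each gene has a single, non-overlapping exon set. Everything in it is an assumption for illustration: the function name `exon_polya_table` is hypothetical, each gene is given one polyA coordinate (`polya_pos`), coordinates are BED-style (0-based, half-open), and "distance" is taken to be the genomic distance (introns included) from each exon's 3' end to the polyA site. If you mean a spliced, transcript-level distance, the calculation would differ.

```python
def exon_polya_table(gene, strand, exons, polya_pos):
    """Build (gene, exon label, start, end, distance) rows for one gene.

    exons     : non-overlapping (start, end) pairs, 0-based half-open
    polya_pos : a single polyA coordinate for the gene (assumed input)
    Distance is measured along the genome from the exon's 3' end to
    polya_pos; this definition is an assumption, not taken from the post.
    """
    # Number exons in transcript order: ascending on "+", descending on "-".
    ordered = sorted(exons, reverse=(strand == "-"))
    rows = []
    for i, (start, end) in enumerate(ordered, start=1):
        if strand == "+":
            distance = polya_pos - end    # 3' end of the exon is `end`
        else:
            distance = start - polya_pos  # 3' end of the exon is `start`
        rows.append((gene, f"Exon{i}", start, end, distance))
    return rows


# Toy coordinates, not real annotation.
rows = exon_polya_table("GeneA", "+", [(100, 220), (300, 400), (500, 600)], 600)
for row in rows:
    print(*row, sep="\t")
```

The resulting tab-separated rows can be written to a file and read into R with `read.delim` to get the data.frame.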
Original post
For a given gene (say HIPK1), I want all of its exons assembled into a single line in BED12 format.
The problem is that each gene may have several isoforms (different combinations of exons). For example, NM198268 and NM152696 are two isoforms of HIPK1, and they share some exons.
Is there a way to get such a combined data set? Maybe the UCSC Genome Browser already has a tool for this?
I have a backup plan: use mergeBed (from the BEDTools suite) to merge the exons of all isoforms. Because mergeBed drops the ID information and keeps only the coordinates, I would need a home-made Perl script with a hash mapping gene names to "NM" accessions, which is a waste of time if there is already an easy way to export this.
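For reference, the backup plan (merge the exons of all isoforms while keeping track of the gene) can be sketched in a few lines of plain Python instead of Perl. This is only an illustration under stated assumptions: the function name `merge_exons_by_gene` is hypothetical, the input is assumed to be one `(gene, chrom, start, end, strand)` tuple per exon per isoform with BED-style coordinates, and the gene-to-isoform mapping is assumed to have been resolved already when building that input. Overlapping or directly adjacent exons are merged.

```python
from collections import defaultdict


def merge_exons_by_gene(exons):
    """Collapse the exons of all isoforms into one exon set per gene.

    exons: iterable of (gene, chrom, start, end, strand) tuples,
           0-based half-open coordinates (assumed input layout).
    Returns {(gene, chrom, strand): [(start, end), ...]} where
    overlapping or touching intervals have been merged.
    """
    grouped = defaultdict(list)
    for gene, chrom, start, end, strand in exons:
        grouped[(gene, chrom, strand)].append((start, end))

    merged = {}
    for key, intervals in grouped.items():
        intervals.sort()
        out = [list(intervals[0])]
        for start, end in intervals[1:]:
            if start <= out[-1][1]:
                # Overlaps or touches the previous interval: extend it.
                out[-1][1] = max(out[-1][1], end)
            else:
                out.append([start, end])
        merged[key] = [tuple(iv) for iv in out]
    return merged


# Toy coordinates for two made-up isoforms; not real HIPK1 annotation.
toy_exons = [
    ("HIPK1", "chr1", 100, 200, "+"),  # isoform A, exon 1
    ("HIPK1", "chr1", 300, 400, "+"),  # isoform A, exon 2
    ("HIPK1", "chr1", 150, 220, "+"),  # isoform B, exon 1 (overlaps A)
    ("HIPK1", "chr1", 300, 400, "+"),  # isoform B, exon 2 (same as A)
    ("HIPK1", "chr1", 500, 600, "+"),  # isoform B, exon 3
]
merged_toy = merge_exons_by_gene(toy_exons)
print(merged_toy)
```

Because the gene name is part of the grouping key, it survives the merge, so no separate lookup from coordinates back to gene names is needed afterwards.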
Thank you.
"But the problem is that each gene probably has different isoforms (different exons assembly). "... This is not a problem -- this is a feature of the transcriptomic plasticity :). If you can explain the exact context of your analysis to perform this task, someone may provide you a detailed answer.
Thanks for your advice. I've updated my post; could you provide some hints? Thank you.