I have a list of about 3,000 genes and I want to retrieve the genomic locus of each gene's 3' UTR. My goal is to screen all the sequence that could be part of a gene's 3' UTR, even if some of that sequence isn't always in the 3' UTR of every transcript.
I've tried the methods in a few previous answers and, for about 80% of the genes in my list, I can find the predicted start and end of the 3' UTR using Biomart or the UCSC table browser. The problem is that for each gene I get multiple results (for all the alternative transcripts of that gene), and in each result the 3' UTR starts and ends at a different place. What I would like is the site of the most upstream 3' UTR start and the most downstream 3' UTR end that have been predicted for a given gene.
Does anyone know a straightforward way to get these from UCSC or Biomart? Can I perhaps get the shortest predicted CDSend and longest predicted transcription end?
Thanks for your help!
Can you post a snippet (`head`) of your genes and UTR datasets? I have a solution in mind, but it would depend on the format of your datasets. Feel free to let us know if you'd like a hand. Cheers.
Thanks for your help! Here's a snippet of the output from Biomart.
I select Dataset > 'Human genes (GRCh37.p13)'; Filters > 'Gene Names' (here I paste a list of gene names, but I have RefSeq mRNA IDs too); Attributes > 'Gene stable ID, Transcript stable ID, Gene name, 3' UTR start, 3' UTR end'; Results > 'Unique results only'.
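For an export with those attributes, one possible way to collapse the per-transcript rows into a single span per gene is to take the smallest 3' UTR start and the largest 3' UTR end for each gene ID. The sketch below is only an illustration: it assumes a tab-separated export whose column headers match the attribute names listed above, the function name `utr_span_per_gene` is hypothetical, and the example rows are made up rather than real Biomart output.

```python
import csv
import io


def utr_span_per_gene(rows, gene_col="Gene stable ID",
                      start_col="3' UTR start", end_col="3' UTR end"):
    """Return {gene_id: (min_start, max_end)} over all transcripts of each gene.

    The column names are assumptions based on the attribute names chosen in
    the export; adjust them if the real header differs.
    """
    spans = {}
    for row in rows:
        start, end = row.get(start_col), row.get(end_col)
        # Rows without a 3' UTR annotation have empty fields; skip them.
        if not start or not end:
            continue
        start, end = int(start), int(end)
        lo, hi = min(start, end), max(start, end)
        gene = row[gene_col]
        if gene in spans:
            old_lo, old_hi = spans[gene]
            spans[gene] = (min(old_lo, lo), max(old_hi, hi))
        else:
            spans[gene] = (lo, hi)
    return spans


# Made-up rows in the assumed layout (not real Biomart output).
example_tsv = (
    "Gene stable ID\tTranscript stable ID\tGene name\t3' UTR start\t3' UTR end\n"
    "GENE_A\tTX_A1\tgeneA\t1000\t1500\n"
    "GENE_A\tTX_A2\tgeneA\t1200\t2100\n"
    "GENE_A\tTX_A3\tgeneA\t\t\n"
    "GENE_B\tTX_B1\tgeneB\t500\t800\n"
)

reader = csv.DictReader(io.StringIO(example_tsv), delimiter="\t")
spans = utr_span_per_gene(reader)
for gene, (lo, hi) in sorted(spans.items()):
    print(gene, lo, hi)
```

To run it on a real export, `csv.DictReader(open("mart_export.txt"), delimiter="\t")` could be passed in instead (the file name here is a placeholder). Two caveats: the coordinates are genomic, so for minus-strand genes the largest coordinate is the transcript-wise UTR start, although the min/max span covers the same region either way; and the span can include introns or sequence that is coding in some transcripts.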