Hi, as the title says really, I'm wondering if there is any tool available that would allow me to drop in a list of say entrez gene ids and get their corresponding 3' UTR lenghts?
Thanks for any suggestions.
Hi, as the title says really, I'm wondering if there is any tool available that would allow me to drop in a list of say entrez gene ids and get their corresponding 3' UTR lenghts?
Thanks for any suggestions.
As an alternative web-based solution the following will give you 3' UTR information for each transcript of a gene, but will require a little subtraction, try the Ensembl Biomart, e.g.
This will give you a set of fasta files of 3' UTRs for all transcripts for your set of Entrez gene IDs, which contain the start and stop of each 3' UTR on genome coordinates. I believe this solution has the same problem of not accounting for introns in 3' UTRs, but because of the gene<->transcript<->UTR mapping, it will account for alternative 3' UTRs.
This can be done using the GenomicFeatures library from Bioconductor (and dplyr)
I will use the refSeq transcripts ("refGene") from mouse ("mm10")
library(GenomicFeatures)
library(dplyr)
refSeq <- makeTxDbFromUCSC(genom="mm10",tablename="refGene")
threeUTRs <- threeUTRsByTranscript(refseq, use.names=TRUE)
length_threeUTRs <- width(ranges(threeUTRs))
the_lengths <- as.data.frame(length_threeUTRs)
the_lengths <- the_lengths %>% group_by(group, group_name) %>% summarise(sum(value))
the_lengths <- unique(the_lengths[,c("group_name", "sum(value)")])
colnames(the_lengths) <- c("RefSeq Transcript", "3' UTR Length")
The dataframe "the_lengths" has what you need.
This was great, thanks. I needed to plot a histogram of 3'UTR lengths for a whole genome and this was a neat and accurate solution. I tried lots of other ways and none of them were successful. As an additional note I discovered that "refGene" is no longer a supported table so used the supportedUCSCtables() function to determine that "knownGene" is the currently supported table.
> supportedUCSCtables(genome = "mm10", url="http://genome.ucsc.edu/cgi-bin/")
tablename track subtrack
1 knownGene UCSC Genes <NA>
truncated
I also noted that the gene with longest 3'UTR is an outlier and may actually be erroneous as it is a noncoding gene but for some reason there is a short coding track in its UCSC annotation which results init being annotated as having a long 3'UTR. The upper limit for 3'UTR lengths seems to be about 16k.
> the_lengths<- the_lengths[order(the_lengths$`3' UTR Length`,decreasing=TRUE),]
> head(the_lengths)
# A tibble: 6 x 2
`RefSeq Transcript` `3' UTR Length`
<chr> <int>
1 uc012fxx.1 82649
2 uc009css.1 15929
3 uc009cst.1 15929
4 uc008drr.3 15586
5 uc008drs.2 15586
6 uc008dfj.2 14319
Hello everyone,
Similar to the @marcosmorgan job, you can follow your analysis to end up with a table with more information:
# get 3' UTR lengths and transcripts
# load libraries
library(GenomicFeatures)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
library(magrittr)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
# extract all 3' UTR regions by transcript
threeUTR <- threeUTRsByTranscript(txdb, use.name=TRUE)
# extract the names of the transcripts
transcripts_three_UTR <- names(threeUTR)
# keys and column for the query
keys <- transcripts_three_UTR
columns <- c("GENEID", "TXEND", "TXSTART", "TXSTRAND", "TXCHROM")
# query
df_three_utr <- select(txdb,
keys=transcripts_three_UTR,
columns=c(columns),
keytype="TXNAME")
# computing the width of the 3' UTR regions
threeUTR_df <- as.data.frame(threeUTR)
threeUTR_df <- threeUTR_df %>%
dplyr::group_by(group, group_name) %>%
dplyr::summarize(sum(width))
# adding the information
df_three_utr$width_3_UTR <- threeUTR_df$`sum(width)`
# ordering
df_three_utr_sorted <- df_three_utr[order(as.numeric(df_three_utr$width_3_UTR),
decreasing=TRUE), ]
> head(df_three_utr_sorted)
TXNAME GENEID TXCHROM TXSTRAND TXSTART TXEND width_3_UTR
17445 uc010jgw.3 79628 chr5 - 148361713 148411250 22561
17446 uc003lpt.3 79628 chr5 - 148361713 148442737 22561
17447 uc003lpu.3 79628 chr5 - 148361713 148442737 22561
17448 uc010jgx.3 79628 chr5 - 148361713 148442737 22561
22344 uc021zzk.1 100131257 chr7 - 7115401 7136417 20462
8531 uc002sfp.2 22848 chr2 - 69685127 69870977 17874
Of course, you can order the table by the field of your interest.
Similar to @vperreau, you can do that from the command-line, too. For example with a script like this:
# Extract the features 3' UTR
bioawk -c gff '$feature ~ /3UTR/ {print $0}' hg19-knownGene.gtf > three_utr.gff
# Cut fields of interest
cut -f 1,4,5,7,9 three_utr.gff > three_utr_regions.txt
# Compute the width for each 3' UTR
awk '{print $3-$2+1}' three_utr_regions.txt > width.txt
# Add width column
paste three_utr_regions.txt width.txt > three_utr_width.txt
# Reorder columns
awk -F"\t" -v OFS="\t" '{print $1,$2,$3,$6,$4,$5}' three_utr_width.txt > three_utr_info.txt
# Sorting by width
sort -k4,4nr three_utr_info.txt > three_utr_info_sorted.txt
# Extract all gene ID
cut -f6 three_utr_info_sorted.txt | awk -F" " '{print $4}' | sed 's/"//g' | sed 's/;//g' > transcript_id.txt
# Replacing metadata for transcript ID
cut -f1,2,3,4,5 three_utr_info_sorted.txt > three_utr_ranges.txt
paste three_utr_ranges.txt transcript_id.txt > three_utr_width_genes.txt
# Keeping the transcript ID and the width and sorting
cut -f4,6 three_utr_width_genes.txt | sort -k2,2 > three_utr_width_sorted.txt
# Computing the total width of 3' UTR of all transcripts
awk -v OFS="\t" '{a[$2] += $1} END{for (i in a) print i,a[i]}' three_utr_width_sorted.txt | sort -k2,2nr > three_utr_widths.txt
The finished product is a tab delimited file with each 3' UTR and its length sorted by decreasing width:
$ head -n3 three_utr_widths.txt
uc003lpt.3 22561
uc003lpu.3 22561
uc010jgw.3 22561
The table KnownGene
in the UCSC database contains all the information you want about the structure of the gene (the positions of the introns, exons, cdsStart/end , txStart/end).
The table kgXref
contains the NCBI id and is linked to KnownGene
.
for the genes on the '+' strand the query would be (for rapidity, I won't take in account any splicing between the last codon and the end of the transcription, it would need more code than a simple SQL query ):
mysql -h genome-mysql.cse.ucsc.edu -A -u genome -D hg18
mysql>select distinct X.geneSymbol, K.txEnd-K.cdsEnd
from
kgXref as X,
knownGene as K
where
X.geneSymbol!=""
and K.name=X.kgId and
K.strand="+" ;
+---------------+------------------+
| geneSymbol | K.txEnd-K.cdsEnd |
+---------------+------------------+
| BC032353 | 3006 |
| AX748260 | 3157 |
| BC048429 | 1540 |
| OR4F5 | 0 |
| OR4F5 | 1 |
| DQ599874 | 31 |
| DQ599768 | 78 |
(..)
Hmmm, there is not typically alternative splicing of 3'-UTRs, but it can happen. There certainly are lots of examples of alternate terminal exons. So, I would not want to link gene symbol to 3'-UTR length, but rather gene symbol to mRNA identifier to its 3'-UTR length. Perhaps Pierre's table above shows that for gene OR4F5, but a length of zero is not a good test for one gene with 2 mRNA isoforms and hence two different, or not, 3'-UTR lengths.
Just something to consider...
The gene I worked on in grad school had alternate 3'-UTRs. http://www.ncbi.nlm.nih.gov/pubmed/1487151 Differed by over 700 bp. That said, another caveat of pulling this from dbs is that I know for UCSC that they don't make the call on presence of a UTR, they rely on the genbank record. Absence of a UTR does not necessarily mean there isn't one.
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Paul, you can actually automate the query which Casey described by first doing it manually, then you click on 'Perl' in the upper right-hand corner of the interface and you will get a Perl program that retrieves the same query-results programmatically. You can then tweak the Perl program to your needs.
Or, to save the clicking, you could hack up a little Perl script using the EnsEMBL Perl API ;-)
Regarding to 3' UTR, why i cannot search 3' UTR sequence on the cDNA transcript sequence? 5' UTR is search no problem.