Entering edit mode
7 months ago
kim
▴
70
Hi, I'd like to retrieve 5'UTR sequence of my genes (total around 20,000) and make a text file. Here is my input file, "input.txt”
no geneID symbol
1 387700 SLC16A12
… … …
20000 3075 CFH
Basically I am trying to retrieve for a specific gene the longest isoform and, for that isoform, 5'utr coordinates. I want to make new result text file (**"result.txt"**) like this.
no geneID symbol UCSCid chr start end width strand sequence
1 8813 DPM1 uc001kgm.3 chr10 91295199 91295313 115 - CTGAGCCACCGCGCCCTCTGCACAACCGTGTCAGGGAGAGGGAAGAGGGGGAGCCGAGGAAGCTTCTTGGCCAGGGCGCTGGGGATCCGGGAGCGGGGGCCGCCGGCGAGACTGA
387700 SLC16A12
.......
20000 3075 CFH .........
Currently, I don't know how to do it, so I am just doing one by one (1/20000....).. Looking forward to your help!
Here the code to the results:
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
library(GenomicFeatures)
library(BSgenome)
showMethods("getSeq")
library(BSgenome.Hsapiens.UCSC.hg19)
genome <- BSgenome.Hsapiens.UCSC.hg19
txdb = TxDb.Hsapiens.UCSC.hg19.knownGene
# specify the gene ID
gid ='387700'
# select the longest transcript
cur_tx = data.frame(transcriptsBy(txdb, by="gene")[[as.character(gid)]])
max_tx_id = cur_tx$tx_name[which.max(cur_tx$width)]
# obtain the longest 5'UTR for the selected transcript
fiveUTR = data.frame(fiveUTRsByTranscript(txdb,use.names=TRUE)[max_tx_id])
fiveUTR = fiveUTR[which.max(fiveUTR$end),]
fiveUTR
#group group_name seqnames start end width strand exon_id exon_name exon_rank
# 1 uc001kgm.3 chr10 91295199 91295313 115 - 142243 <NA> 1
getSeq(genome, fiveUTR$seqnames, start = fiveUTR$start, end = fiveUTR$end)
#115-letter DNAString object
#seq: CTGAGCCACCGCGCCCTCTGCACAACCGTGTCAGGGAGAGGGAAGAGGGGGAGCCGAGGAAGCTTCTTGGCCAGGGCGCTGGGGATCCGGGAGCGGGGGCCGCCGGCGAGACTGA
Have you tried to see if you can use
table browser
using those regions of interest?