Hello, I have a list of human genes and I'd like to retrieve the physical coordinates (GRCh37/hg19 assembly) of their 3'UTRs. Are you aware of any software that can do that? Thanks !
For RefSeq annotation, you can use the add_utrs_to_gff Python script to add 5' and 3' UTR features to the GFF3 file, and then use grep to extract your genes of interest. The latest RefSeq annotation of the GRCh37 assembly is here: https://ftp.ncbi.nlm.nih.gov/genomes/all/annotation_releases/9606/105.20190906/
## download annotation in GFF3 format
$ curl -O https://ftp.ncbi.nlm.nih.gov/genomes/all/annotation_releases/9606/105.20190906/GCF_000001405.25_GRCh37.p13/GCF_000001405.25_GRCh37.p13_genomic.gff.gz
## download the add_utrs_to_gff3 python script
$ curl -O https://ftp.ncbi.nlm.nih.gov/genomes/TOOLS/add_utrs_to_gff/add_utrs_to_gff.py
## add utr features to the gff3 file
$ python3 add_utrs_to_gff.py GCF_000001405.25_GRCh37.p13_genomic.gff.gz > GRCh37_with_utrs.gff3
## extract 5' UTRs for GeneID:5768 (for 3' UTRs, grep for 'three_prime_UTR' instead)
$ grep 'five_prime_UTR' GRCh37_with_utrs.gff3 | grep -w 'GeneID:5768'
NC_000001.10 BestRefSeq five_prime_UTR 180123968 180124042 . + . ID=utr00100412821;Parent=rna-NM_001004128.2;transcript_id=NM_001004128.2;Dbxref=GeneID:5768,Genbank:NM_001004128.2,HGNC:HGNC:9756,MIM:603120
NC_000001.10 BestRefSeq five_prime_UTR 180124004 180124042 . + . ID=utr00282651;Parent=rna-NM_002826.5;transcript_id=NM_002826.5;Dbxref=GeneID:5768,Genbank:NM_002826.5,HGNC:HGNC:9756,MIM:603120
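For a longer gene list, the grep step can be scripted. Below is a minimal Python sketch, not part of the NCBI tooling. It assumes the output of add_utrs_to_gff.py is tab-separated GFF3, labels 3' UTR rows as three_prime_UTR (by analogy with the five_prime_UTR rows above), and carries GeneID:<number> in the Dbxref attribute. The function name utrs_for_genes is hypothetical.

```python
import re


def utrs_for_genes(gff_lines, gene_ids, feature="three_prime_UTR"):
    """Yield (gene_id, seqid, start, end, strand, transcript_id) for UTR rows.

    gff_lines: iterable of GFF3 lines (e.g. an open file handle).
    gene_ids:  NCBI GeneIDs to keep, as ints or strings.
    """
    wanted = {str(g) for g in gene_ids}
    for line in gff_lines:
        if line.startswith("#"):
            continue  # skip header and directive lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != feature:
            continue
        # Match the whole number so that GeneID:5768 does not hit GeneID:57680.
        gene = re.search(r"GeneID:(\d+)", cols[8])
        if gene and gene.group(1) in wanted:
            tx = re.search(r"transcript_id=([^;]+)", cols[8])
            yield (gene.group(1), cols[0], int(cols[3]), int(cols[4]),
                   cols[6], tx.group(1) if tx else None)
```

You would call it as utrs_for_genes(open("GRCh37_with_utrs.gff3"), [5768]). Note that GFF3 coordinates are 1-based and inclusive, and that a gene gets one row per UTR exon per transcript.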
Download an annotation file for hg19, e.g. from GENCODE, then extract the UTRs as described in this post:
Extracting 5'UTR and 3'UTR bed files from gtf file
Then subset the result for the genes you are interested in.
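As a rough illustration of that extraction, here is a minimal Python sketch. It assumes a GENCODE-style GTF in which UTR rows are labelled simply UTR (not split into 5' and 3') and every row carries gene_name and transcript_id attributes; check your file, since other GTF flavours differ. A UTR is treated as 3' when it lies downstream of the transcript's CDS on its strand. The function name three_prime_utrs_from_gtf is hypothetical.

```python
import re
from collections import defaultdict


def three_prime_utrs_from_gtf(gtf_lines, gene_names):
    """Return sorted BED6 tuples for the 3' UTR exons of the given genes."""
    wanted = set(gene_names)
    cds = {}                  # transcript_id -> (min CDS start, max CDS end)
    utrs = defaultdict(list)  # transcript_id -> [(chrom, start, end, strand, gene)]
    for line in gtf_lines:
        if line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] not in ("CDS", "UTR"):
            continue
        # Parse quoted attributes such as: gene_name "ABC"; transcript_id "ENST...";
        attrs = dict(re.findall(r'(\S+) "([^"]*)"', cols[8]))
        if attrs.get("gene_name") not in wanted:
            continue
        tx, start, end = attrs["transcript_id"], int(cols[3]), int(cols[4])
        if cols[2] == "CDS":
            lo, hi = cds.get(tx, (start, end))
            cds[tx] = (min(lo, start), max(hi, end))
        else:
            utrs[tx].append((cols[0], start, end, cols[6], attrs["gene_name"]))

    bed = []
    for tx, intervals in utrs.items():
        if tx not in cds:
            continue  # no CDS seen for this transcript, cannot classify
        cds_start, cds_end = cds[tx]
        for chrom, start, end, strand, gene in intervals:
            # 3' means after the CDS on '+', before it (in genome order) on '-'.
            downstream = start > cds_end if strand == "+" else end < cds_start
            if downstream:
                # GTF is 1-based inclusive; BED is 0-based half-open.
                bed.append((chrom, start - 1, end, f"{gene}|{tx}", 0, strand))
    return sorted(bed)
```

The result has one entry per 3' UTR exon per transcript, so genes with several transcripts yield overlapping intervals; merge them afterwards if you need one region per gene.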
Download the bed file from the UCSC Table Browser (https://genome.ucsc.edu/cgi-bin/hgTables).
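Set the parameters: genome Human, assembly GRCh37/hg19, a gene track of your choice, and output format ‘BED - browser extensible data’.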
Click on ‘get output’.
Select ‘Create one BED record per: 3′ UTR Exons’.
Click on ‘get BED’ to download the 3′ UTR annotation (the referenced protocol does this for mouse; the steps are the same for human).
Reference: Sequencing cell-type-specific transcriptomes with SLAM-ITseq (https://www.nature.com/articles/s41596-019-0179-x#Sec32), Procedure 45-49.
Then you can grep your genes of interest from this 3′ UTR BED file.
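One caveat when filtering: as far as I know, the name column of these Table Browser BED files holds a transcript ID followed by a suffix (something like NM_xxxxxx_utr3_0_0_chr1_12345_f), not a gene symbol, so you may need to look up transcript IDs for your genes first. A minimal Python sketch under that assumption; the name format and the function name filter_utr_bed are assumptions, so check a few lines of your own file before relying on it.

```python
def filter_utr_bed(bed_lines, transcript_ids):
    """Yield BED rows (as lists of fields) whose name matches a wanted transcript.

    Assumes the 4th column looks like '<transcript>_utr3_<...>'.
    IDs are matched with or without a version suffix (e.g. '.3').
    """
    wanted = set(transcript_ids)
    for line in bed_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 4:
            continue  # skip track/browser lines and malformed rows
        transcript = fields[3].split("_utr3_")[0]
        base = transcript.split(".")[0]
        if transcript in wanted or base in wanted:
            yield fields
```

Matching on the parsed transcript ID avoids the partial hits that a plain grep can produce when one ID is a prefix of another.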
This answer assumes that you already have a .gff file for the genome of interest:
I would recommend the gffutils package for Python (https://pythonhosted.org/gffutils/). I know you said "software", but this package is well documented for your needs, and you would only need to write 5-10 lines of code (which you can also find in the documentation) to retrieve the start and stop positions by gene ID.
As an example, the following is the code that I wrote:
You can make a list of your IDs and then use a for loop to get the start and stop positions of each gene and its UTRs. Good luck!
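A minimal sketch of this approach (not the original code from this answer) could look like the following. It assumes that gffutils is installed, that your GFF already contains three_prime_UTR features, and that your gene IDs match the ID attribute of the gene features in the file. The helper names build_db and utr_coordinates are hypothetical.

```python
def build_db(gff_path, db_path):
    """Create a gffutils database from a GFF file (slow for a whole genome)."""
    import gffutils  # third-party: pip install gffutils

    return gffutils.create_db(
        gff_path,
        dbfn=db_path,
        force=True,              # overwrite an existing database file
        keep_order=True,
        merge_strategy="merge",  # tolerate repeated feature IDs
    )


def utr_coordinates(db, gene_ids, featuretype="three_prime_UTR"):
    """Map each gene ID to a list of (seqid, start, end, strand) UTR tuples."""
    coords = {}
    for gene_id in gene_ids:
        # children() walks gene -> transcript -> sub-features at all levels.
        utrs = db.children(gene_id, featuretype=featuretype, order_by="start")
        coords[gene_id] = [(u.seqid, u.start, u.end, u.strand) for u in utrs]
    return coords
```

Build the database once with build_db("annotation.gff3", "annotation.db"), then pass it to utr_coordinates along with your list of gene IDs. On later runs you can reopen the same file with gffutils.FeatureDB("annotation.db") instead of rebuilding it.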