Retrieve genomic physical coordinates of 3'UTRs for a set of genes
5.0 years ago
Mr Locuace ▴ 180

Hello, I have a list of human genes and I'd like to retrieve the physical coordinates (GRCh37/hg19 assembly) of their 3'UTRs. Are you aware of any software that can do that? Thanks!


This answer assumes that you already have a GFF file for the genome of interest.

I would recommend the gffutils package for Python (https://pythonhosted.org/gffutils/). I know you asked for "software", but this package is well documented and you would only need to write 5-10 lines of code (examples are in the documentation) to retrieve start and stop positions by gene ID.

As an example, here is the code that I wrote:

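# import the package
import gffutils

# path to your annotation file (placeholder; replace with your own)
gff_file_path = "annotation.gff3"

# create a local SQLite database from your GFF file
db = gffutils.create_db(gff_file_path, dbfn="local_db_1.db", keep_order=True,
                        force=True, sort_attribute_values=True,
                        merge_strategy="merge")

# access a gene by its ID (replace "gene_id" with an ID from your file)
gene = db["gene_id"]

# the gene's start and stop positions
print(gene.start, gene.stop)

# the 3' UTRs of this gene; db.children() only returns features that
# descend from the gene, whereas db.region() would also return UTRs of
# other genes overlapping the same interval
for item in db.children(gene, featuretype="three_prime_UTR"):
    print(item.seqid, item.start, item.stop, item.strand)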

You can make a list of your gene IDs and loop over it to get the start and stop positions of each gene and of its UTRs. Note that this only works if your GFF file actually contains three_prime_UTR features. Good luck!
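The loop described above could be sketched roughly as follows. This is a minimal sketch, not the answerer's code: the helper name `utr_coordinates` and the tuple layout are made up for illustration, and `db` is assumed to be a gffutils `FeatureDB` whose annotation contains `three_prime_UTR` features that are children of the gene.

```python
def utr_coordinates(db, gene_ids, featuretype="three_prime_UTR"):
    """Collect (gene_id, seqid, start, end, strand) for each UTR of each gene.

    `db` is assumed to behave like a gffutils FeatureDB: db[gene_id] returns
    the gene feature and db.children(...) iterates over its descendants.
    Hypothetical helper, for illustration only.
    """
    rows = []
    for gene_id in gene_ids:
        gene = db[gene_id]  # raises if the ID is not in the database
        for utr in db.children(gene, featuretype=featuretype):
            rows.append((gene_id, utr.seqid, utr.start, utr.end, utr.strand))
    return rows


# Possible usage (assumes the database file from the answer above exists):
#   import gffutils
#   db = gffutils.FeatureDB("local_db_1.db")
#   gene_ids = [line.strip() for line in open("genes.txt") if line.strip()]
#   for row in utr_coordinates(db, gene_ids):
#       print(*row, sep="\t")
```

The IDs in the list have to match the feature IDs used in your GFF file, which differ between annotation sources.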

5.0 years ago
vkkodali_ncbi ★ 3.8k

For RefSeq annotation, you can use the add_utrs_to_gff Python script to first add 5' and 3' UTR features and then use Unix grep to extract your genes of interest. The latest RefSeq annotation of the GRCh37 assembly is here: https://ftp.ncbi.nlm.nih.gov/genomes/all/annotation_releases/9606/105.20190906/

## download annotation in GFF3 format
$ curl -O https://ftp.ncbi.nlm.nih.gov/genomes/all/annotation_releases/9606/105.20190906/GCF_000001405.25_GRCh37.p13/GCF_000001405.25_GRCh37.p13_genomic.gff.gz

## download the add_utrs_to_gff python script
$ curl -O https://ftp.ncbi.nlm.nih.gov/genomes/TOOLS/add_utrs_to_gff/add_utrs_to_gff.py

## add utr features to the gff3 file 
$ python3 add_utrs_to_gff.py GCF_000001405.25_GRCh37.p13_genomic.gff.gz > GRCh37_with_utrs.gff3

## extract 5' UTRs for GeneID:5768 (grep for 'three_prime_UTR' instead to get 3' UTRs)
$ grep 'five_prime_UTR' GRCh37_with_utrs.gff3 | grep -w 'GeneID:5768'
NC_000001.10    BestRefSeq      five_prime_UTR  180123968       180124042       .       +       .       ID=utr00100412821;Parent=rna-NM_001004128.2;transcript_id=NM_001004128.2;Dbxref=GeneID:5768,Genbank:NM_001004128.2,HGNC:HGNC:9756,MIM:603120
NC_000001.10    BestRefSeq      five_prime_UTR  180124004       180124042       .       +       .       ID=utr00282651;Parent=rna-NM_002826.5;transcript_id=NM_002826.5;Dbxref=GeneID:5768,Genbank:NM_002826.5,HGNC:HGNC:9756,MIM:603120

Thanks very much @vkkodali! But how can I do it for a large list of GeneIDs?


You can use grep -f as shown below:

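## make a list of all genes you are interested in, one gene ID per line
$ cat genes.txt
81285
4991
143503

## extract 5' UTRs (grep for 'three_prime_UTR' instead to get 3' UTRs)
## note: a bare number can also match a coordinate or another database ID on
## the same line; writing each entry as 'GeneID:81285' makes the match stricter
$ grep 'five_prime_UTR' GRCh37_with_utrs.gff3 | grep -w -f genes.txt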
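
If you would rather match only on the GeneID cross-reference and not on any number in the line, the same filter could be sketched in Python along these lines. This is a rough sketch under assumptions: the function names are made up for illustration, and it assumes the GeneID sits in the Dbxref attribute as in the output shown earlier.

```python
def geneid_from_attributes(attributes):
    """Return the GeneID found in the Dbxref attribute of a GFF3 line, or None."""
    for field in attributes.split(";"):
        key, _, value = field.partition("=")
        if key == "Dbxref":
            for xref in value.split(","):
                database, _, accession = xref.partition(":")
                if database == "GeneID":
                    return accession
    return None


def select_utrs(gff_lines, gene_ids, featuretype="three_prime_UTR"):
    """Yield (gene_id, seqid, start, end, strand) for UTRs of the wanted genes.

    Coordinates are passed through as in GFF3 (1-based, inclusive).
    Hypothetical helper, for illustration only.
    """
    wanted = set(gene_ids)
    for line in gff_lines:
        if line.startswith("#"):
            continue  # skip header and directive lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != featuretype:
            continue
        gene_id = geneid_from_attributes(cols[8])
        if gene_id in wanted:
            yield gene_id, cols[0], int(cols[3]), int(cols[4]), cols[6]


# Possible usage with the files from this thread:
#   gene_ids = [line.strip() for line in open("genes.txt") if line.strip()]
#   with open("GRCh37_with_utrs.gff3") as gff:
#       for row in select_utrs(gff, gene_ids):
#           print(*row, sep="\t")
```

Because the comparison is an exact match on the GeneID value, 4991 will not pick up a line that merely contains HGNC:4991 or a coordinate with the same digits.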

Great, thanks @vkkodali !

5.0 years ago
ATpoint 86k

Download an annotation file for hg19, e.g. from GENCODE, then extract the UTRs as described in this thread:

"Extracting 5'UTR and 3'UTR bed files from gtf file"

Then subset the result for the genes you are interested in.

4.0 years ago
Jingyue ▴ 70

Download the bed file from the UCSC Table Browser (https://genome.ucsc.edu/cgi-bin/hgTables).
Set the parameters as shown below:

  • clade: Mammal
  • genome: Human
  • assembly: Feb. 2009 (GRCh37/hg19)
  • group: Genes and Gene Predictions
  • track: select the one that fits your experimental purpose
  • region: genome
  • output format: BED - browser extensible data

Click on ‘get output’.
Select ‘Create one BED record per: 3′ UTR Exons’.
Click on ‘get BED’ to download the human 3′ UTR annotation.

Reference: Sequencing cell-type-specific transcriptomes with SLAM-ITseq (https://www.nature.com/articles/s41596-019-0179-x#Sec32), Procedure 45-49.

Then you can grep your genes of interest from this 3′ UTR BED file.
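
As an alternative to grep, the subsetting step could be sketched in Python roughly like this. It is only a sketch under assumptions: the function name is made up, and it assumes the name column (column 4) of the downloaded BED file starts with a transcript ID followed by `_utr3`, so check a few lines of your own file first. You would also need the transcript IDs for your genes, since that column holds transcript IDs and not gene symbols under this assumption.

```python
def select_bed_rows(bed_lines, transcript_ids, marker="_utr3"):
    """Yield (chrom, start, end, transcript, strand) for the wanted transcripts.

    Assumes BED names look like '<transcript_id>_utr3_...'; this naming is an
    assumption, not something confirmed in this thread. BED coordinates are
    0-based, half-open and are passed through unchanged.
    """
    wanted = set(transcript_ids)
    for line in bed_lines:
        if not line.strip() or line.startswith(("track", "browser", "#")):
            continue  # skip blank lines and header lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 4:
            continue  # no name column to match on
        transcript, found, _ = cols[3].partition(marker)
        if found and transcript in wanted:
            strand = cols[5] if len(cols) > 5 else "."
            yield cols[0], int(cols[1]), int(cols[2]), transcript, strand


# Possible usage (file names are placeholders):
#   ids = [line.strip() for line in open("transcripts.txt") if line.strip()]
#   with open("hg19_3utr.bed") as bed:
#       for row in select_bed_rows(bed, ids):
#           print(*row, sep="\t")
```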
