Extracting exons and transcripts from gff3/gtf
4
0
Entering edit mode
3.2 years ago
mglasena ▴ 40

I'm hoping to create the following bed files from the purple urchin genome assembly:

(1) One record for each unique, non-overlapping exon

(2) One record for the longest transcript of each protein-coding gene.

I'm also hoping to eventually create sequence alignments for each transcript with the introns removed.

Are there software programs that help with this? The UCSC genome table browser does not allow you to select only one transcript per gene.

The NCBI Annotation report says that there are 245,575 exons in coding transcripts. I am not able to hit the number using simple UNIX commands. For example:

wget -O - -o /dev/null https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/002/235/GCF_000002235.5_Spur_5.0/GCF_000002235.5_Spur_5.0_genomic.gff.gz | gunzip --stdout | awk '$3 == "exon"' | grep -i -e "XM_" -e "NM_" | cut -f 1,4,5 | bedtools sort -i - | uniq | wc -l 
247301

enter image description here

assembly gtf annotation gff3 • 6.7k views
ADD COMMENT
0
Entering edit mode

You may want to check the following files from : https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/002/235/GCF_000002235.5_Spur_5.0/

  *_cds_from_genomic.fna.gz
       FASTA format of the nucleotide sequences corresponding to all CDS 
       features annotated on the assembly, based on the genome sequence. See 
       the "Description of files" section below for details of the file format.

*_rna.fna.gz file
       FASTA format of accessioned RNA products annotated on the genome 
       assembly; Provided for RefSeq assemblies as relevant (Note, RNA and mRNA 
       products are not instantiated as a separate accessioned record in GenBank
       but are provided for some RefSeq genomes, most notably the eukaryotes.)
       The FASTA title is provided as sequence accession.version plus 
       description.

   *_rna_from_genomic.fna.gz
       FASTA format of the nucleotide sequences corresponding to all RNA 
       features annotated on the assembly, based on the genome sequence. See 
       the "Description of files" section below for details of the file format.
ADD REPLY
3
Entering edit mode
ADD COMMENT
0
Entering edit mode

Thanks so much for sharing this! This program (AGAT) is brilliant. It can do everything I asked about in the question posting and avoids the need to write custom code.

ADD REPLY
0
Entering edit mode
3.2 years ago
noodle ▴ 590

Depending on exactly what you want to extract, just import into R/python, pull out whatever you want, and save as bed...just make sure your questions are biological 'legit' because excluding isoforms and/or overlapping genes like you ask seems odd.

For example in R;

library(rtracklayer)

gff.path <- "~/GCF_000002235.5_Spur_5.0_genomic.gff"

gff.data <- rtracklayer::readGFF(gff.path)

gff.by.gene <- split(gff.data, gff.data$gene)
gene.len.long <- lapply(gff.by.gene, function(x) x[order(x$end - x$start),][1,])

#blah...
#blah...
#blah...
ADD COMMENT
0
Entering edit mode
3.2 years ago
Pratik ★ 1.1k

I was just doing something similar about a week ago. You may be able to accomplish this using the GenomicFeatures R package.

First load up the following in R:

library(GenomicFeatures)
library(GenomicRanges)
library(rtracklayer)

Then you will need to get the chromosome sizes file, which you can generate with directions from this post: Get chromosome sizes from fasta file (basically you need the fasta file of the genome, and then you use sam tools to get the chrominfo/chrom sizes file)

then read in that file into R with:

chrominfo <- read.table(file = 'your/file/path/sizes.genome', sep = '\t')
colnames(chrominfo) <- c("chrom", "length")

then you should be able to plug it into GenomicFeatures using (you might have to download the gtf file instead from the NCBI link you provided, because I think GenomicFeatures only supports gff3 and gtf file formats):

purple.urchin.txdb <- GenomicFeatures::makeTxDbFromGFF(organism = "Strongylocentrotus purpuratus", 
                                              format = "gtf",
                                              file =  "~/your/path/here/GCF_000002235.5_Spur_5.0_genomic.gtf", 
                                              chrominfo = chrominfo)

and then you can get exons in bed format using (I am unsure if this follows your criteria for: (1) One record for each unique, non-overlapping exon):

exons <- exonsBy(purple.urchin.txdb, by = c("gene"))
exons <- unlist(exons)
rtracklayer::export(exons,'/your/file/path/here/exons.bed')

as for (2) One record for the longest transcript of each protein-coding gene:

transcripts <- transcriptsBy(purple.urchin.txdb, by = "gene")
transcripts <- unlist(transcripts)
rtracklayer::export(transcripts,'your/file/path/here/transcripts.bed')

Maybe someone could give an answer/comment with details on how to obtain the required criteria you need. But this is a start that maybe you could play around with.

I do have to note that I tried to make a txdb object for mouse using the Gencode vM27 GTF file and I don't think I obtained all the elements when compared to just obtaining the txdb object from ensembl via makeTxDbFromEnsembl

EDIT: Sept. 10 2021 - 17:42EST - Nevermind on the information below: I checked organisms <- GenomeInfoDb::listOrganisms(), and I don't see the Strongylocentrotus purpuratus on the list. Therefore, I think the information below will not work...

With the above being said, there may be a way to make a txdb from ensembl directly...:

It may be something like this:

purple.urchin.txdb <- makeTxDbFromEnsembl(organism = "Strongylocentrotus purpuratus", server = "ensembldb.ensembl.org", username = "anonymous", port = "3337")

and then you could continue with exons <-...

I do see that ensembl does have the information for it, just not sure exactly how to input it into makeTxDbFromEnsembl: https://metazoa.ensembl.org/Strongylocentrotus_purpuratus/Info/Index?db=core

This might help you find the correct server address? https://useast.ensembl.org/info/data/mysql.html

ADD COMMENT
0
Entering edit mode
3.2 years ago
vkkodali_ncbi ★ 3.8k

For a Unix command solution, see: Intron and exon coordinates of all genes from Hg19

ADD COMMENT

Login before adding your answer.

Traffic: 1658 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6