Question

Extracting exons and transcripts from gff3/gtf

0

Entering edit mode

3.6 years ago

mglasena ▴ 40

I'm hoping to create the following bed files from the purple urchin genome assembly:

(1) One record for each unique, non-overlapping exon

(2) One record for the longest transcript of each protein-coding gene.

I'm also hoping to eventually create sequence alignments for each transcript with the introns removed.

Are there software programs that help with this? The UCSC genome table browser does not allow you to select only one transcript per gene.

The NCBI Annotation report says that there are 245,575 exons in coding transcripts. I am not able to hit the number using simple UNIX commands. For example:

wget -O - -o /dev/null https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/002/235/GCF_000002235.5_Spur_5.0/GCF_000002235.5_Spur_5.0_genomic.gff.gz | gunzip --stdout | awk '$3 == "exon"' | grep -i -e "XM_" -e "NM_" | cut -f 1,4,5 | bedtools sort -i - | uniq | wc -l 
247301

enter image description here

assembly gtf annotation gff3 • 7.3k views

ADD COMMENT • link 3.6 years ago by mglasena ▴ 40

0

Entering edit mode

You may want to check the following files from : https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/002/235/GCF_000002235.5_Spur_5.0/

  *_cds_from_genomic.fna.gz
       FASTA format of the nucleotide sequences corresponding to all CDS 
       features annotated on the assembly, based on the genome sequence. See 
       the "Description of files" section below for details of the file format.

*_rna.fna.gz file
       FASTA format of accessioned RNA products annotated on the genome 
       assembly; Provided for RefSeq assemblies as relevant (Note, RNA and mRNA 
       products are not instantiated as a separate accessioned record in GenBank
       but are provided for some RefSeq genomes, most notably the eukaryotes.)
       The FASTA title is provided as sequence accession.version plus 
       description.

   *_rna_from_genomic.fna.gz
       FASTA format of the nucleotide sequences corresponding to all RNA 
       features annotated on the assembly, based on the genome sequence. See 
       the "Description of files" section below for details of the file format.

ADD REPLY • link 3.6 years ago by GenoMax 151k

0

Entering edit mode

3.6 years ago

noodle ▴ 650

Depending on exactly what you want to extract, just import into R/python, pull out whatever you want, and save as bed...just make sure your questions are biological 'legit' because excluding isoforms and/or overlapping genes like you ask seems odd.

For example in R;

library(rtracklayer)

gff.path <- "~/GCF_000002235.5_Spur_5.0_genomic.gff"

gff.data <- rtracklayer::readGFF(gff.path)

gff.by.gene <- split(gff.data, gff.data$gene)
gene.len.long <- lapply(gff.by.gene, function(x) x[order(x$end - x$start),][1,])

#blah...
#blah...
#blah...

ADD COMMENT • link 3.6 years ago by noodle ▴ 650

0

Entering edit mode

3.6 years ago

Pratik ★ 1.1k

I was just doing something similar about a week ago. You may be able to accomplish this using the GenomicFeatures R package.

First load up the following in R:

library(GenomicFeatures)
library(GenomicRanges)
library(rtracklayer)

Then you will need to get the chromosome sizes file, which you can generate with directions from this post: Get chromosome sizes from fasta file (basically you need the fasta file of the genome, and then you use sam tools to get the chrominfo/chrom sizes file)

then read in that file into R with:

chrominfo <- read.table(file = 'your/file/path/sizes.genome', sep = '\t')
colnames(chrominfo) <- c("chrom", "length")

then you should be able to plug it into GenomicFeatures using (you might have to download the gtf file instead from the NCBI link you provided, because I think GenomicFeatures only supports gff3 and gtf file formats):

purple.urchin.txdb <- GenomicFeatures::makeTxDbFromGFF(organism = "Strongylocentrotus purpuratus", 
                                              format = "gtf",
                                              file =  "~/your/path/here/GCF_000002235.5_Spur_5.0_genomic.gtf", 
                                              chrominfo = chrominfo)

and then you can get exons in bed format using (I am unsure if this follows your criteria for: (1) One record for each unique, non-overlapping exon):

exons <- exonsBy(purple.urchin.txdb, by = c("gene"))
exons <- unlist(exons)
rtracklayer::export(exons,'/your/file/path/here/exons.bed')

as for (2) One record for the longest transcript of each protein-coding gene:

transcripts <- transcriptsBy(purple.urchin.txdb, by = "gene")
transcripts <- unlist(transcripts)
rtracklayer::export(transcripts,'your/file/path/here/transcripts.bed')

Maybe someone could give an answer/comment with details on how to obtain the required criteria you need. But this is a start that maybe you could play around with.

I do have to note that I tried to make a txdb object for mouse using the Gencode vM27 GTF file and I don't think I obtained all the elements when compared to just obtaining the txdb object from ensembl via makeTxDbFromEnsembl

EDIT: Sept. 10 2021 - 17:42EST - Nevermind on the information below: I checked organisms <- GenomeInfoDb::listOrganisms(), and I don't see the Strongylocentrotus purpuratus on the list. Therefore, I think the information below will not work...

With the above being said, there may be a way to make a txdb from ensembl directly...:

It may be something like this:

purple.urchin.txdb <- makeTxDbFromEnsembl(organism = "Strongylocentrotus purpuratus", server = "ensembldb.ensembl.org", username = "anonymous", port = "3337")

and then you could continue with exons <-...

I do see that ensembl does have the information for it, just not sure exactly how to input it into makeTxDbFromEnsembl: https://metazoa.ensembl.org/Strongylocentrotus_purpuratus/Info/Index?db=core

This might help you find the correct server address? https://useast.ensembl.org/info/data/mysql.html

ADD COMMENT • link 3.6 years ago by Pratik ★ 1.1k

0

Entering edit mode

3.6 years ago

vkkodali_ncbi ★ 3.8k

For a Unix command solution, see: Intron and exon coordinates of all genes from Hg19

ADD COMMENT • link 3.6 years ago by vkkodali_ncbi ★ 3.8k

score 3 · Accepted Answer · 2021-09-11

3

Entering edit mode

3.6 years ago

Juke34 9.2k

Extracting genomic feature sequences from GTF/GFF files

ADD COMMENT • link 3.6 years ago by Juke34 9.2k

0

Entering edit mode

Thanks so much for sharing this! This program (AGAT) is brilliant. It can do everything I asked about in the question posting and avoids the need to write custom code.

ADD REPLY • link 3.6 years ago by mglasena ▴ 40