I was just doing something similar about a week ago.
You may be able to accomplish this using the GenomicFeatures R package.
First load up the following in R:
library(GenomicFeatures)
library(GenomicRanges)
library(rtracklayer)
Then you will need the chromosome sizes file, which you can generate by following the directions in this post: Get chromosome sizes from fasta file (basically, you need the FASTA file of the genome, and then you use samtools to get the chrominfo/chrom sizes file).
Then read that file into R with:
chrominfo <- read.table(file = 'your/file/path/sizes.genome', sep = '\t')
colnames(chrominfo) <- c("chrom", "length")
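If the sizes came from samtools faidx, the .fai index it writes already holds what is needed: its first two tab-separated columns are the sequence name and its length. A minimal sketch under that assumption; the helper name read_fai_sizes and the demo file contents are made up for illustration:

```r
# Hypothetical helper: build the chrominfo data frame from a samtools .fai index.
# Assumes a tab-separated file whose first two columns are name and length.
read_fai_sizes <- function(fai_path) {
  fai <- read.table(fai_path, sep = "\t", header = FALSE,
                    stringsAsFactors = FALSE, quote = "", comment.char = "")
  chrominfo <- fai[, 1:2]
  colnames(chrominfo) <- c("chrom", "length")
  chrominfo
}

# Demo with a made-up two-sequence index written to a temporary file.
demo_fai <- tempfile(fileext = ".fai")
writeLines(c("scaffold_1\t1000\t12\t60\t61",
             "scaffold_2\t250\t1040\t60\t61"), demo_fai)
chrominfo_demo <- read_fai_sizes(demo_fai)
```

The resulting data frame has the same chrom and length columns as the chrominfo object above, so it could be passed to the chrominfo argument in the same way.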
Then you should be able to plug it into GenomicFeatures using the following (you might have to download the GTF file instead from the NCBI link you provided, because I think GenomicFeatures only supports the gff3 and gtf file formats):
purple.urchin.txdb <- GenomicFeatures::makeTxDbFromGFF(organism = "Strongylocentrotus purpuratus",
format = "gtf",
file = "~/your/path/here/GCF_000002235.5_Spur_5.0_genomic.gtf",
chrominfo = chrominfo)
Then you can get exons in BED format using the following (I am unsure whether this meets your criterion (1), one record for each unique, non-overlapping exon):
exons <- exonsBy(purple.urchin.txdb, by = c("gene"))
exons <- unlist(exons)
rtracklayer::export(exons,'/your/file/path/here/exons.bed')
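On criterion (1): exonsBy() followed by unlist() keeps exons that overlap between transcripts of the same gene. One possible way to collapse them is GenomicRanges::reduce(), which merges overlapping ranges on the same sequence and strand. A toy sketch with made-up coordinates; whether merged exons are what you mean by "unique, non-overlapping" is an assumption on my part:

```r
library(GenomicRanges)

# Made-up exons: the first two overlap, the third stands alone.
toy_exons <- GRanges(
  seqnames = "scaffold_1",
  ranges   = IRanges(start = c(100, 150, 400), end = c(200, 250, 500)),
  strand   = "+"
)

# reduce() merges overlapping ranges per sequence and strand.
merged_exons <- reduce(toy_exons)
```

With the real data, applying reduce() to the list returned by exonsBy() before calling unlist() should do the merging within each gene, but I have not checked that on this annotation.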
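As for (2), one record for the longest transcript of each protein-coding gene: the following exports every transcript of every gene rather than only the longest, so treat it as a starting point: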
transcripts <- transcriptsBy(purple.urchin.txdb, by = "gene")
transcripts <- unlist(transcripts)
rtracklayer::export(transcripts,'your/file/path/here/transcripts.bed')
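To keep only the longest transcript per gene, one option is to rank transcripts by the tx_len column (sum of exon widths) that GenomicFeatures::transcriptLengths(txdb) reports. The sketch below runs on a made-up data frame with the same gene_id, tx_name and tx_len columns; the helper name longest_tx_per_gene is hypothetical, and filtering to protein-coding genes is not covered here:

```r
# Hypothetical helper: keep one row per gene, the one with the largest tx_len.
longest_tx_per_gene <- function(tx_lens) {
  # Drop transcripts that are not assigned to any gene.
  tx_lens <- tx_lens[!is.na(tx_lens$gene_id), ]
  # Sort by gene, then longest first, and keep the first row of each gene.
  ordered <- tx_lens[order(tx_lens$gene_id, -tx_lens$tx_len), ]
  longest <- ordered[!duplicated(ordered$gene_id), ]
  rownames(longest) <- NULL
  longest
}

# Made-up input shaped like the output of transcriptLengths().
toy_tx <- data.frame(
  gene_id = c("geneA", "geneA", "geneB"),
  tx_name = c("txA1", "txA2", "txB1"),
  tx_len  = c(1200, 3400, 900),
  stringsAsFactors = FALSE
)
longest_demo <- longest_tx_per_gene(toy_tx)
```

The tx_name values that survive could then be used to subset the transcripts object before exporting it.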
Maybe someone could give an answer/comment with details on how to obtain the required criteria you need. But this is a start that maybe you could play around with.
I should note that I tried to make a TxDb object for mouse using the GENCODE vM27 GTF file, and I don't think I obtained all the elements compared with obtaining the TxDb object from Ensembl via makeTxDbFromEnsembl.
EDIT: Sept. 10 2021 - 17:42 EST - Never mind the information below: I checked organisms <- GenomeInfoDb::listOrganisms(), and I don't see Strongylocentrotus purpuratus on the list. Therefore, I think the information below will not work...
That said, there may be a way to make a TxDb from Ensembl directly. It may be something like this:
purple.urchin.txdb <- makeTxDbFromEnsembl(organism = "Strongylocentrotus purpuratus", server = "ensembldb.ensembl.org", username = "anonymous", port = "3337")
and then you could continue with exons <- ...
I do see that Ensembl has the information for it; I am just not sure exactly how to input it into makeTxDbFromEnsembl:
https://metazoa.ensembl.org/Strongylocentrotus_purpuratus/Info/Index?db=core
This might help you find the correct server address? https://useast.ensembl.org/info/data/mysql.html
You may want to check the files available from: https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/002/235/GCF_000002235.5_Spur_5.0/