I am working on a non-classical model (a coral species), so the genome I use is not completed. I currently have genome fasta sequence files in chromosome units (i.e. start with a '>' per chromosome) and an annotation file in gff3 format (gene, mRNA, CDS, and exon).
I currently want to get the sequence of each gene (i.e. start with a '>' per gene). I am currently using the following R code, which runs normally without any errors. But I am not sure if my code is flawed, and how to quickly and directly confirm that the file I output is the correct gene sequences.
If you are satisfied with my code, please let me know. If you have any concerns or suggestions, please let me know as well. I will be grateful.
library(GenomicRanges)
library(rtracklayer)
library(Biostrings)
genome <- readDNAStringSet("coral.fasta")
gff_data <- import("coral.gff3", format = "gff3")
genes <- gff_data[gff_data$type == "gene"]
gene_sequences <- lapply(seq_along(genes), function(i) { #extract gene sequence
chr <- as.character(seqnames(genes)[i])
start <- start(genes)[i]
end <- end(genes)[i]
strand <- as.character(strand(genes)[i])
gene_seq <- subseq(genome[[chr]], start = start, end = end)
if (strand == "-") {
gene_seq <- reverseComplement(gene_seq)}
return(gene_seq)})
names(gene_sequences) <- genes$ID #name each gene sequence
output_file <- "coral.gene.fasta"
writeXStringSet(DNAStringSet(gene_sequences), filepath = output_file)
While this looks as if it works fine, you should look into the vectorized getSeq method which allows to specify all coodinates at once as a GRanges or GRangesList object. This will save the apply part and reduce the process to 5 lines of code.