Question

How to use GFF3 annotation to split genome fasta into gene sequence fasta in R

4

Entering edit mode

6 months ago

Chilly ★ 1.3k

I am working on a non-classical model (a coral species), so the genome I use is not completed. I currently have genome fasta sequence files in chromosome units (i.e. start with a '>' per chromosome) and an annotation file in gff3 format (gene, mRNA, CDS, and exon).

I currently want to get the sequence of each gene (i.e. start with a '>' per gene). I am currently using the following R code, which runs normally without any errors. But I am not sure if my code is flawed, and how to quickly and directly confirm that the file I output is the correct gene sequences.

If you are satisfied with my code, please let me know. If you have any concerns or suggestions, please let me know as well. I will be grateful.

library(GenomicRanges)
library(rtracklayer)
library(Biostrings)

genome <- readDNAStringSet("coral.fasta")

gff_data <- import("coral.gff3", format = "gff3")


genes <- gff_data[gff_data$type == "gene"]

gene_sequences <- lapply(seq_along(genes), function(i) { #extract gene sequence
  chr <- as.character(seqnames(genes)[i])
  start <- start(genes)[i]
  end <- end(genes)[i]
  strand <- as.character(strand(genes)[i])

  gene_seq <- subseq(genome[[chr]], start = start, end = end)

  if (strand == "-") {
    gene_seq <- reverseComplement(gene_seq)}

  return(gene_seq)})

names(gene_sequences) <- genes$ID #name each gene sequence

output_file <- "coral.gene.fasta"
writeXStringSet(DNAStringSet(gene_sequences), filepath = output_file)

R GFF3 fasta genome • 762 views

ADD COMMENT • link updated 6 months ago by Ram 44k • written 6 months ago by Chilly ★ 1.3k

1

Entering edit mode

While this looks as if it works fine, you should look into the vectorized getSeq method which allows to specify all coodinates at once as a GRanges or GRangesList object. This will save the apply part and reduce the process to 5 lines of code.

ADD REPLY • link 6 months ago by Michael 55k

1

Entering edit mode

6 months ago

dthorbur ★ 2.8k

Learning how to implement positive controls is extremely useful in bioinformatics. In this instance, I'm sure you can find some important genes where the sequences are on a database like NCBI. You can then compare the output of the known with your output.

Another useful skill is learning how to find existing tools so you don't have to re-invent the wheel with every analysis. There are already existing command line tools available that will extract transcript sequences from a reference genome and gff file. Here is a post on BioStars discussing exactly this.

I also suggest learning how to write minimal reproducible examples in your post on here or forums like stackoverflow. Supplying a small amount of data alongside code helps others quickly and accurately identify errors and solutions.

ADD COMMENT • link 6 months ago by dthorbur ★ 2.8k

score 6 · Accepted Answer · 2024-07-28

6

Entering edit mode

6 months ago

Chilly ★ 1.3k

Update: I found a tool called AGAT that was perfect for my needs.

ADD COMMENT • link 6 months ago by Chilly ★ 1.3k