How to use GFF3 annotation to split genome fasta into gene sequence fasta in R
2
4
Entering edit mode
5 months ago
Chilly ★ 1.3k

I am working on a non-classical model (a coral species), so the genome I use is not completed. I currently have genome fasta sequence files in chromosome units (i.e. start with a '>' per chromosome) and an annotation file in gff3 format (gene, mRNA, CDS, and exon).

I currently want to get the sequence of each gene (i.e. start with a '>' per gene). I am currently using the following R code, which runs normally without any errors. But I am not sure if my code is flawed, and how to quickly and directly confirm that the file I output is the correct gene sequences.

If you are satisfied with my code, please let me know. If you have any concerns or suggestions, please let me know as well. I will be grateful.

library(GenomicRanges)
library(rtracklayer)
library(Biostrings)

genome <- readDNAStringSet("coral.fasta")

gff_data <- import("coral.gff3", format = "gff3")


genes <- gff_data[gff_data$type == "gene"]

gene_sequences <- lapply(seq_along(genes), function(i) { #extract gene sequence
  chr <- as.character(seqnames(genes)[i])
  start <- start(genes)[i]
  end <- end(genes)[i]
  strand <- as.character(strand(genes)[i])

  gene_seq <- subseq(genome[[chr]], start = start, end = end)

  if (strand == "-") {
    gene_seq <- reverseComplement(gene_seq)}

  return(gene_seq)})

names(gene_sequences) <- genes$ID #name each gene sequence

output_file <- "coral.gene.fasta"
writeXStringSet(DNAStringSet(gene_sequences), filepath = output_file)
R GFF3 fasta genome • 670 views
ADD COMMENT
1
Entering edit mode

While this looks as if it works fine, you should look into the vectorized getSeq method which allows to specify all coodinates at once as a GRanges or GRangesList object. This will save the apply part and reduce the process to 5 lines of code.

ADD REPLY
6
Entering edit mode
5 months ago
Chilly ★ 1.3k

Update: I found a tool called AGAT that was perfect for my needs.

ADD COMMENT
1
Entering edit mode
5 months ago
dthorbur ★ 2.6k

Learning how to implement positive controls is extremely useful in bioinformatics. In this instance, I'm sure you can find some important genes where the sequences are on a database like NCBI. You can then compare the output of the known with your output.

Another useful skill is learning how to find existing tools so you don't have to re-invent the wheel with every analysis. There are already existing command line tools available that will extract transcript sequences from a reference genome and gff file. Here is a post on BioStars discussing exactly this.

I also suggest learning how to write minimal reproducible examples in your post on here or forums like stackoverflow. Supplying a small amount of data alongside code helps others quickly and accurately identify errors and solutions.

ADD COMMENT

Login before adding your answer.

Traffic: 3461 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6