Question

FASTA to GTF/GFF using reference genome

0

2.3 years ago

Prangan ▴ 20

Greetings!

I am working on validating a bioinformatics workflow for identification of novel lncRNAs. For my validation procedure, I have thought of deleting a portion of annotated lncRNAs from the (human) reference transcriptome and running my workflow to recover them as novel. However, I would also require a GTF/GFF file containing the annotation information of the reference transcriptome (minus the deleted lncRNA transcript coordinates). I have thought of manually deleting the transcript coordinates from the human annotation (GTF) file but that would prove to be cumbersome. Is there any tool to convert my transcriptome (fasta) into its corresponding annotation (GTF)? If not, are there any other alternatives which can be taken to resolve my issue? As always, any and all help is highly appreciated. Thank you.

validation lncRNA GTF conversion FASTA • 3.0k views

updated 2.3 years ago by Istvan Albert 102k • written 2.3 years ago by Prangan ▴ 20

1

2.3 years ago

liorglic ★ 1.5k

FASTA files cannot simply be "converted" to GTF/GFF, since they do not contain genomic coordinate information. The sequences could be mapped to the reference genome and the mapping results used to create gene annotations. However, I don't think you want to include that in a validation process, since it would add confounding noise and biases.
What you can do is write a script that filters the GTF based on the gene IDs in the FASTA headers. For example, in Python you can use Biopython's SeqIO to read the FASTA and gffutils to read/write the GTF.
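A minimal sketch of that filtering idea, using only the Python standard library rather than Biopython and gffutils, so it is a simplification and not a replacement for them. It assumes GENCODE-style input: each FASTA header starts with the transcript ID (cut at the first `|` or whitespace), and each GTF record carries a `transcript_id "..."` attribute. The function names and file names are hypothetical.

```python
import re

# Assumption: GTF attributes look like  transcript_id "ENST00000456328.2";
TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')


def fasta_ids(fasta_lines):
    """Collect the first token of each FASTA header (cut at '|' or whitespace)."""
    ids = set()
    for line in fasta_lines:
        if line.startswith(">"):
            ids.add(re.split(r"[|\s]", line[1:].strip(), maxsplit=1)[0])
    return ids


def filter_gtf(gtf_lines, keep_ids):
    """Yield GTF lines whose transcript_id is in keep_ids.

    Comment lines and records without a transcript_id (e.g. gene records)
    pass through unchanged; adjust this if emptied genes should be dropped.
    """
    for line in gtf_lines:
        if line.startswith("#"):
            yield line
            continue
        match = TRANSCRIPT_ID.search(line)
        if match is None or match.group(1) in keep_ids:
            yield line


# Example usage (hypothetical file names):
# with open("transcripts.fa") as fa, open("annotation.gtf") as gtf, \
#         open("filtered.gtf", "w") as out:
#     out.writelines(filter_gtf(gtf, fasta_ids(fa)))
```

If the FASTA headers and the GTF use different ID styles (for example, with and without version suffixes), the IDs would need to be normalized before comparing.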

score 2 · Accepted Answer · 2022-11-01

I would say this question depends mainly on whether you have a way to programmatically identify the lncRNAs.

Filtering a GFF file should not be overly cumbersome. Pattern matching on the third column (the feature type) might suffice (use !~ instead of ~ to drop the matching features rather than keep them); something like

(grep '^#' features.gff; awk -F'\t' '$3 ~ /lncRNA/ { print $0 }' features.gff) > filtered.gff

Similarly, extracting sequences from a FASTA file should also be simple with samtools faidx, which accepts a regions file listing the subset to extract.

   samtools faidx input.fasta

Usage: samtools faidx <file.fa|file.fa.gz> [<reg> [...]]
Option: 
 -o, --output FILE        Write FASTA to file.
 -n, --length INT         Length of FASTA sequence line. [60]
 -c, --continue           Continue after trying to retrieve missing region.
 -r, --region-file FILE   File of regions.  Format is chr:from-to. One per line.

In that case, create a file that contains only the desired sequence IDs.

Perhaps extract the sequence IDs from the GFF file with a matching pattern; how to do that depends on how the GFF file is encoded.
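As one possible way to build that ID file, here is a rough Python sketch. It assumes a GENCODE-style GTF in which transcript records have `transcript` in the third column and attributes such as `transcript_type "lncRNA"` and `transcript_id "..."` in the ninth; other annotation sources encode this differently, so the default pattern is only an assumption. The function name and file names are hypothetical.

```python
import re

# Assumption: GTF attributes look like  transcript_id "ENST00000456328.2";
TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')


def matching_transcript_ids(gtf_lines, pattern=r'transcript_type "lncRNA"'):
    """Return IDs of transcript records whose attributes match `pattern`.

    IDs are returned in file order, without duplicates.
    """
    wanted = re.compile(pattern)
    seen = {}  # dict keeps insertion order and drops duplicates
    for line in gtf_lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "transcript":
            continue
        if wanted.search(fields[8]):
            match = TRANSCRIPT_ID.search(fields[8])
            if match:
                seen[match.group(1)] = None
    return list(seen)


# Example usage (hypothetical file names):
# with open("annotation.gtf") as gtf, open("ids.txt", "w") as out:
#     for tid in matching_transcript_ids(gtf):
#         out.write(tid + "\n")
```

The resulting file, with one ID per line, could then be passed to samtools faidx through the -r option shown in the usage text above. A bare sequence name selects the whole sequence, but the names have to match the sequence names in the FASTA index exactly.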