Extracting Sequences from a FASTA File Using Exon Coordinates from a GTF File
3 months ago
Milica ▴ 20

I have a .gtf file containing exon coordinates for all chromosomes of a species and a corresponding .fa file. I need to extract the correct sequences from the FASTA file based on the exon start and end positions provided in the .gtf file.

Can anyone suggest the best way to do this? Are there any existing tools or scripts that can help with this process?

gtf fasta • 535 views
3 months ago
awk -F '\t' '($3=="exon") {printf("%s\t%d\t%s\n",$1,int($4)-1,$5);}' in.gtf | sort | uniq > exons.bed

bedtools getfasta -fi ref.fa -bed exons.bed
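One caveat with the command above: it writes a 3-column BED, so strand is dropped and every exon comes back as forward-strand sequence. If you need minus-strand exons reverse-complemented, a possible variant is to keep the strand from GTF column 7 in a 6-column BED. This is only a sketch: the function name `gtf_exons_to_bed6` and the constant `exon` label in the name column are made up for illustration, and it assumes a tab-delimited GTF with no spaces in the sequence names.

```shell
# Hypothetical helper (not part of bedtools): convert GTF exon rows to BED6.
# Assumes a tab-delimited GTF; the name column is a placeholder "exon".
gtf_exons_to_bed6() {
  awk -F '\t' 'BEGIN { OFS = "\t" }
    $3 == "exon" {
      # GTF is 1-based inclusive, BED is 0-based half-open: only start shifts.
      print $1, $4 - 1, $5, "exon", ".", $7
    }' "$1" |
    # Sort by chromosome, start, end, strand; then drop exact duplicates
    # (the same exon is usually listed once per transcript).
    sort -k1,1 -k2,2n -k3,3n -k6,6 | uniq
}
```

Usage would be something like `gtf_exons_to_bed6 in.gtf > exons.bed6`, followed by `bedtools getfasta -s -fi ref.fa -bed exons.bed6`, where `-s` tells getfasta to reverse-complement features on the minus strand.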
3 months ago

You could also consider the AGAT toolkit (specifically its agat_sp_extract_sequences.pl script).

3 months ago

gffread is a decent option here. I'd choose that over custom awk scripts for speed and consistency. https://github.com/gpertea/gffread
