Gffread_How to obtain fasta files of gene features (ORF, UTR) by transcripts based on gff?
0
0
Entering edit mode
4.5 years ago
tianshenbio ▴ 180

I have a gff like this:

Bany_Scaf24 maker   gene    41357   46444   .   +   .   ID=Bany_09696;Name=Bany_09696;Alias=maker-Bany_Scaf24-exonerate_est2genome-gene-0.0;
Bany_Scaf24 maker   mRNA    41357   46444   .   +   .   ID=Bany_09696-RA;Parent=Bany_09696;Name=Bany_09696-RA;Alias=maker-Bany_Scaf24-exonerate_est2genome-gene-0.0-mRNA-1;_AED=0.00;_QI=277|1|1|1|0|0|2|35|76;_eAED=0.00;score=1829;
Bany_Scaf24 maker   exon    46189   46444   .   +   .   ID=Bany_09696-RA:1;Parent=Bany_09696-RA;
Bany_Scaf24 maker   exon    41357   41643   .   +   .   ID=Bany_09696-RA:2;Parent=Bany_09696-RA;
Bany_Scaf24 maker   five_prime_UTR  41357   41633   .   +   .   ID=Bany_09696-RA:five_prime_utr;Parent=Bany_09696-RA;
Bany_Scaf24 maker   CDS 41634   41643   .   +   0   ID=Bany_09696-RA:cds;Parent=Bany_09696-RA;
Bany_Scaf24 maker   CDS 46189   46409   .   +   2   ID=Bany_09696-RA:cds;Parent=Bany_09696-RA;
Bany_Scaf24 maker   three_prime_UTR 46410   46444   .   +   .   ID=Bany_09696-RA:three_prime_utr;Parent=Bany_09696-RA;
B

I know I could generate fasta files for each transcript using gffread. Here is how the result looks like:

>Bany_18948-RA CDS=213-977
TGTAGGTACCTCGATAAAAATACAATATTTAAAATGACAATACAAAAAATTGTCTATTTATAGTAGGTAT
AGATAGGTAGGTACGTTGAAAATCCAAAGTAAAATTGTACTTAGTACTACGTATTGATAATGAAACAATA
GATTATCATTTGATTAGGTAAATTACAAT

I assume each sequence represents the mature transcript (5'UTR, ORF(CDS),3'UTR), where 1-n represent the whole mature transcript, 1-212 represents the 5'UTR (excluding introns), 213-977 represents ORF (CDS) (excluding introns), and 978-n represents 3'UTR (excluding introns).

Am I right? I am not sure if introns within UTRs are indeed included in this sequence...I do have UTRs fragmented by introns.

If that is the case, how can I extract sequences of those lower level features by transcripts? For example, I hope to get a fasta file like this:

    >Bany_18948-RA 
TGTAGGTACCTCGATAAAAATACAATATTTAAAATGACAATACAAAAAATTGTCTATTTATAGTAGGTAT AGATAGGTAGGTACGTTGAAAATCCAAAGTAAAATTGTACTTAGTACTACGTATTGATAATGAAACAATA GATTATCATTTGATTAGGTAAATTACAAT

where the sequence only represents the 3'UTR region (or ORF region, 5'UTR) of mature transcript Bany_18948-RA, excluding introns.

If it cannot be done using gffread, is there any other tools for this? Thank you.

RNA-Seq Assembly genome sequence gene • 1.8k views
ADD COMMENT
0
Entering edit mode

I don’t know for gffread but same answer as before you can do it with agat_sp_extract_sequences.pl from AGAT
For cds:
agat_sp_extract_sequences.pl --gff input.gff --fasta input.fasta -o cds.fa

for five_prime_UTR features:
agat_sp_extract_sequences.pl --gff input.gff --fasta input.fasta -t five_prime_UTR -o result_five_prime.fa

for three_prime_UTR features:
agat_sp_extract_sequences.pl --gff input.gff --fasta input.fasta -t three_prime_UTR -o result_three_prime.fa

ADD REPLY
0
Entering edit mode

Thank you for your answer Juke34, will give it a try

ADD REPLY

Login before adding your answer.

Traffic: 1496 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6