How to extract fasta sequences from assembled transcripts generated by Stringtie
4
3
Entering edit mode
7.8 years ago
seta ★ 1.9k

Hi all,

I used STAR and stringtie for mapping reads to reference genome and assembly. As you know, the generated assembled transcripts by stringtie are in gtf format. Now, I want to have fasta sequence of assembled transcript. I used gffread, but all sequences had the same header! maybe it's not compatible with stringtie.

Could you please help me out to convert assembled transcripts by stringtie in gtf format to fasta format?

Thanks

fasta gtf stringtie • 12k views
ADD COMMENT
0
Entering edit mode

use gffread, you can find it in cufflink package

ADD REPLY
4
Entering edit mode
5.4 years ago
Juke34 9.1k

I use agat_sp_extract_sequences.pl from AGAT.

agat_sp_extract_sequences.pl --cdna --gff input.gtf --fasta genome.fa -o output.fa

ADD COMMENT
0
Entering edit mode

thanks for your response

ADD REPLY
0
Entering edit mode

Hello Juke34 , is it possible to use output.fa in maker (est= #set of ESTs or assembled mRNA-seq in fasta format). I got my gtf with StringTie merged, I would appreciate your response.

ADD REPLY
1
Entering edit mode

Yes sure, but if it the same genome you can directly use the gtf/GFF as input of est_gff=

ADD REPLY
0
Entering edit mode

Thank you for you answer Juke34 , so it is not necessary convert to the gtf from StringTie to GFF o maybe use agat_convert_sp_gxf2gxf.pl for obtain my gff3? My version is maker/2.31. Thank you again.

ADD REPLY
1
Entering edit mode

Sorry yes if it’s a GTF you have to convert to GFF. But it was long time ago maybe recent version allows to use GTF directly?

ADD REPLY
0
Entering edit mode

Thank you Juke34 , I used agat for convert my gtf. Sorry, I have other doubt, I'm starting with plant annotation and I'm using maker for the first time, however I have some doubts on how to configure my maker_opts.ctl file for my first round. I intend to do this first round with my previously masked genome (RepeatModeler and RepeatMasker), RNA-seq and proteins. My doubt is mainly in the masking section, I don't know what would be the correct way to set these parameters if I'm already providing a softmask genome:

When I set it as follows:

#-----Repeat Masking (leave values blank to skip repeat masking)
model_org=all #select a model organism for RepBase masking in RepeatMasker
rmlib= #provide an organism specific repeat library in fasta format for RepeatMasker
repeat_protein=/NFS/LUSTRE/storage/data/software/maker-2.31/maker/data/te_proteins.fasta #provide a fasta file of transposable eleme$rm_gff= #pre-identified repeat elements from an external GFF3 file
prok_rm=0 #forces MAKER to repeatmask prokaryotes (no reason to change this), 1 = yes, 0 = no
softmask=0 #use soft-masking rather than hard-masking in BLAST (i.e. seg and dust filtering)

or softmask=1

I get this:

Processing run.log file...
MAKER WARNING: The file Plant_aura_var.Colimex.maker.output/Plant_aura_var.Colimex_datastore/B4/1C/scaffold_268//theVoid.scaffold_268/0/scaffold_268.0.all.rb.out
did not finish on the last run and must be erased
examining contents of the fasta file and run log

Can you help me? Thanks.

ADD REPLY
1
Entering edit mode

I do not know if the problem is related but if your genome is already masked you should let all the option empty:

#-----Repeat Masking (leave values blank to skip repeat masking)
model_org= #select a model organism for RepBase masking in RepeatMasker
rmlib= #provide an organism specific repeat library in fasta format for RepeatMasker
repeat_protein= #provide a fasta file of transposable eleme$rm_gff= #pre-identified repeat elements from an external GFF3 file
prok_rm=0 #forces MAKER to repeatmask prokaryotes (no reason to change this), 1 = yes, 0 = no
softmask=0 #use soft-masking rather than hard-masking in BLAST (i.e. seg and dust filtering)
ADD REPLY
3
Entering edit mode
7.4 years ago
zzqr ▴ 50

The stringtie_merged.gtf file have seqname, start, end strand info. So, you can use R GRanges object and getSeq function from GenomicRanges and BSgenome packages to retrive sequences.

ADD COMMENT
1
Entering edit mode
7.4 years ago

You can also use bedtools getfasta to fetch sequences from GTF or BED files.


UPDATE

Here is the perfect solution

ADD COMMENT
0
Entering edit mode

I used this, but I run into the following error

"Error (GFaSeqGet): subsequence cannot be larger than 465 Error getting subseq for gene1 (465..1503)!"

Did you had any issues using gffread?

Thanks

ADD REPLY
0
Entering edit mode

There is a Python script that fixes this error, you can follow A: gffread error when extracting transcript sequences from gtf, coordinates exceed

ADD REPLY
0
Entering edit mode
4.1 years ago
DNAvinci • 0

Per, http://ccb.jhu.edu/software/stringtie/gff.shtml#gffread_ex

gffread \
-w assembled_transcripts.fa \
-g ref_genome.fa \
-E cov_refs.gtf \
./stringtie_output.gtf
ADD COMMENT

Login before adding your answer.

Traffic: 4433 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6