gff3 to CDS fasta
6
3
Entering edit mode
8.2 years ago
n.caillou ▴ 50

Given a genomic fasta file and a GFF3 file like this (this is just one transcript):

groupI  AUGUSTUS        gene    24128   27467   1       +       .       ID=g2
groupI  AUGUSTUS        transcript      24128   27467   .       +       .       ID=g2.t1;Parent=g2
groupI  AUGUSTUS        start_codon     24128   24130   .       +       0       Parent=g2.t1
groupI  AUGUSTUS        CDS     24128   24311   .       +       0       ID=g2.t1.cds;Parent=g2.t1
groupI  AUGUSTUS        CDS     25287   25354   .       +       2       ID=g2.t1.cds;Parent=g2.t1
groupI  AUGUSTUS        CDS     27267   27467   .       +       0       ID=g2.t1.cds;Parent=g2.t1
groupI  AUGUSTUS        stop_codon      27465   27467   .       +       0       Parent=g2.t1

Is there a simple way to extract a fasta of the CDS (or of the translated sequence) for each transcript?

I have already seen:

Isn't there something more simple and reliable than the above custom scripts?

genome sequence gff3 • 19k views
ADD COMMENT
1
Entering edit mode

Did you try bedtools getfasta ?

ADD REPLY
0
Entering edit mode

I did, see my answer to WouterDeCoster below. I cannot get per-transcript CDS sequences, I just get each element (gene, transcript, start/end codons, and each CDS fragment) separately. Is this the expected result?

ADD REPLY
0
Entering edit mode

I see what you mean. This might help you A: Gencode Gtf To Bed12 Or Psl

ADD REPLY
4
Entering edit mode
8.2 years ago
A. Domingues ★ 2.7k

Are you happy with R/Bioconductor? There is a solution here which could solve your problem. Look for item #6 of 'Alignments (and genomic annotations)'. I will transcribe the relevant (but incomplete) bits of code here for convenience, but all credits go to Martin Morgan.

## cds for all transcripts
cdsByTx <- cdsBy(TxDb.Hsapiens.UCSC.hg19.knownGene, "tx")

## reference genome
library(BSgenome.Hsapiens.UCSC.hg19)

dna <- extractTranscriptSeqs(BSgenome.Hsapiens.UCSC.hg19, cdsByTx)
protein <- translate(dna)
ADD COMMENT
0
Entering edit mode

If I am going to do a non-trivial amount of scripting I'd probably go with Biopython. Also, would this work with any fasta/gff3 pair of files?

ADD REPLY
0
Entering edit mode

Assuming the gff3 is well formatted, that is, contains the CDS information, yes. Of course, when it comes to bioinformatics assuming a file is well formatted and that things will go as expected it a big ask so ymmv.

ADD REPLY
3
Entering edit mode
8.2 years ago

I would suggest to try if bedtools getfasta fit your needs.

ADD COMMENT
0
Entering edit mode

Thanks, but actually, I have also tried getfasta. And I can only get the sequences of each element separately.

Would the -split option do what I want? Apparently it will only work if the input is BED. How do I convert my GFF3 to BED? Which other tool do I have to use besides getfasta?

edit: The converter is bedops's gff2bed, I suppose. It didn't help, though (same result).

ADD REPLY
0
Entering edit mode

And can't you pre filter your gff to only contain 'transcript' features, e.g. using grep "transcript" yourfile.gff > onlytranscript.gff?

ADD REPLY
0
Entering edit mode

I can do that to have all the CDS and just them (that's part of the "shell tricks" in my answer below), but then I would have to piece them together. Definitely possible but not simple either.

ADD REPLY
0
Entering edit mode

But the transcript annotation in a gff is the full transcript, entire CDS of a gene, right? I don't understand what you would need to paste together.

ADD REPLY
1
Entering edit mode

As I understand the question of n.caillou, he wants to extract coding sequences, not transcript sequence (i.e. he needs to paste together exons for multi-exon genes)...

ADD REPLY
1
Entering edit mode
8.2 years ago

The xtractore program from the AEGeAn Toolkit should work for this. Set --type=CDS on the command line. For CDSs and other discontiguous features, xtractore pastes the fragments together correctly, taking into account strand, etc.

ADD COMMENT
1
Entering edit mode

Hello Daniel,

I used xtractore to extract CDS from a gff file. It does extract CDS taking into account strand and so on, but it does not paste CDS fragments belonging to the same gene together. Am I missing sth?

Also, the -w option does not seem to work properly:

xtractore -t CDS -w 0 -o timema_cristinae.fa timema_cristinae.gff Timema_cristinae_scaffolds.fas

Setting -w to zero gives me an empty output fasta file. In the documentation it is mentioned: "-w|--width: INT width of each line of sequence in the Fasta output; default is 80; set to 0 for no formatting"

Anyways, thanx for your program, it would take 2-3 days to write a CDS extractor myself :P

ADD REPLY
0
Entering edit mode

This would work, thanks. But isn't there something more standard? It is very impractical for me to install new software.

ADD REPLY
1
Entering edit mode

I'm not sure I understand your constraints. If you are uncomfortable evaluating custom Perl/Python scripts and if installing new software is impractical for you, you may be in the wrong field!

I'm not trying to be witty or smug, I'm being completely frank. The bioinformatics state of the art unfortunately provides very few clean solutions; almost always you must get your hands a bit dirty to piece together a working solution.

ADD REPLY
0
Entering edit mode
8.2 years ago
n.caillou ▴ 50

Using bedops and bedtools and shell tricks I can get a TSV file like:

g2.t1.cds   ATGAAGACGACGCTTGC...
g2.t1.cds   AGCGGAGTACCACGTGT...
g2.t1.cds   CACCACCGCGACCCC...TGA

But it is just a pile of hacks, and it is not even finished.

ADD COMMENT
0
Entering edit mode
3.5 years ago
FatihSarigol ▴ 260

Best solution to this that I know is:

gff3_file_to_proteins.pl File.gff File.fasta CDS > File.CDS.fa

You can also get complete sequences of (converted) proteins using prot instead of using CDS or cDNAs using cDNA or genes using gene.

When you get the sequence for each CDS separately for instance with bedtools getfasta, first of all you get each piece of a CDS separately rather than getting the CDS sequences of a gene combined, and second for genes with isoforms, you can get the same region more than once in overlapping sequences which is not good if you are trying to analyze the coding region of a genome. As far as I managed to understand, this script gets the longest isoform version possible for an exon, and merges them for each gene.

This is the script, but you will need to download the entire program for it to work since it uses (at least) one other script in another folder. https://github.com/EVidenceModeler/EVidenceModeler/blob/master/EvmUtils/gff3_file_to_proteins.pl

The script is designed for the EVidenceModeler pipeline, so it doesn't have to work with the gffs coming from all different annotation pipelines, but so far it was working well for me, see this post if you also run into the error that I just faced with one gff from RefSeq: https://github.com/EVidenceModeler/EVidenceModeler/issues/45

ADD COMMENT
0
Entering edit mode
3.5 years ago
Juke34 9.0k

Digging into old thread @FatihSarigol ^^.
Yes Brian Haas is providing great scripts/tools

Since 2019 there is also agat_sp_extract_sequences.plfrom AGAT.
See this thread for examples: Extracting genomic feature sequences from GTF/GFF files with AGAT

ADD COMMENT
0
Entering edit mode

Yes :D I found this thread via google looking for an alternative when I ran into an error with the EVM script I posted, and noticed no one else posted about it. Very good and useful script, kudos to Brian indeed. I will also try AGAT next time too, thanks!

ADD REPLY

Login before adding your answer.

Traffic: 3970 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6