Extract CDS from maker gff
2
0
Entering edit mode
3.4 years ago
Sarah ▴ 60

Hello,

I have annotated several genomes using the Maker2 pipeline with the goal of estimating dN/dS ratios for many genes. I have the gff files, and I would like to extract just the coding sequences into a fasta file. Previously I have been using the fasta_merge script that comes with maker, but I just noticed that the nucleotide sequences that it outputs includes the 5' and 3' UTRs and they are not always in the correct reading frame, which leads to some alignment issues in downstream steps. Is there a script similar to fasta_merge but which will output CDS sequences in-frame (cutting off incomplete codons) and without UTRs?

Thank you!

CDS maker2 annotation gff • 2.6k views
ADD COMMENT
3
Entering edit mode
3.4 years ago

Have a look at the AGAT package. That will certainly include some script to extract features from a GFF file.

more info here and specifically this one

ADD COMMENT
1
Entering edit mode

Thank you so much! agat_sp_extract_sequences.pl looks like exactly what I need!

ADD REPLY
1
Entering edit mode

Just in case anybody else comes across this post with the same problem: this worked great but the GFF output from Maker2 needed to be filtered first. Parsing the gff output from maker2 was extremely slow (took >53 hours), but then it took only 3 minutes on a filtered gff. I needed to remove unnecessary gff entries that maker2 had included (match, expressed_sequence_match, protein_match, contig).

For example:

grep -Pv "\tmatch_part\t" Maker2_output.gff | grep -Pv "\tprotein_match\t" | grep -Pv "\texpressed_sequence_match\t" | grep -Pv "\tmatch\t" | grep -Pv "\tcontig\t" > streamlined_for_AGAT.gff

And also the genome sequence needed to be folded to avoid errors:

fold genome.fasta > genome_folded.fasta

Then I could run:

 agat_sp_extract_sequences.pl --gff streamlined_for_AGAT.gff -f genome_folded.fasta -t cds --remove_orf_offset -o output_cds.fasta 
ADD REPLY
0
Entering edit mode

This code has been helpful but i have a question. After running these lines I have the output_cds.fasta, but few of the sequences begin with a start codon. Is this normal?

ADD REPLY
0
Entering edit mode

hmmm, no and yes ...

in the perfect (theoretical) case: no all CDS should start with a start codon.

In real life case: possible, Maker is known to predict genes without start codon (due to reasons that will take us too far) , so yes in that case it is possible to end up with CDS without start.

But this is solely determined by the info from the GFF file, that is nothing that AGAT will do/change. So the only way to (re)solve this is to amend the info in the GFF file so that the predictions start at a start codon position

ADD REPLY
1
Entering edit mode
3.4 years ago
prasundutta87 ▴ 670

This tool will be helpful: https://gffutils.readthedocs.io/en/v0.9.1/GTF_extract.html

You can test "--feature=FEATURE_TYPE" parameter to limit yourself to CDS.

ADD COMMENT
0
Entering edit mode

Thank you!

ADD REPLY

Login before adding your answer.

Traffic: 2357 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6