Question

Program that will extract all CDS from protein-coding genes in a genome fasta file (with gff file)

7.1 years ago

molly77 ▴ 10

Hi,

I have a genome fasta file and the corresponding gff annotation file. How can I extract all CDS from all of the protein-coding genes in the genome? I would prefer to use an actual program rather than a script, as I would have a harder time modifying/troubleshooting a script than using a program.

Thanks!

genome gene CDS • 3.1k views

updated 7.1 years ago by Kevin Blighe 88k • written 7.1 years ago by molly77 ▴ 10

score 0 · Answer 1 · 2017-10-10

7.1 years ago

Kevin Blighe 88k

The gffread utility that comes bundled with Cufflinks takes a GTF/GFF transcript file and a genome FASTA file as input, and produces a new FASTA file covering just the regions specified in the input GTF/GFF. For more information, see here: http://ccb.jhu.edu/software/stringtie/gff.shtml

/Programs/cufflinks-2.2.1.Linux_x86_64/gffread -w CDS.fasta -W -O -E -L -F -g ReferenceGenome.fasta MyCDStranscripts.gtf

Other suggestions here: gff3 to CDS fasta
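A note on the flags: in gffread, -w writes the spliced exons of each transcript, while -x writes the spliced CDS, which is closer to what the question asks for. The sketch below shows a minimal -x call. The file names (ReferenceGenome.fasta, annotation.gff, CDS.fasta) are placeholders rather than files from this thread, and the block only runs gffread if it is installed and the inputs exist.

```shell
#!/bin/sh
# Sketch: ask gffread for spliced CDS (-x) instead of spliced exons (-w).
# All three file names are placeholders; substitute your own paths.
genome="ReferenceGenome.fasta"
annotation="annotation.gff"
out="CDS.fasta"

if command -v gffread >/dev/null 2>&1 && [ -f "$genome" ] && [ -f "$annotation" ]; then
    # -x: output FASTA of spliced CDS; -g: genome FASTA to pull sequence from
    gffread -x "$out" -g "$genome" "$annotation"
    status="ran"
    # Count FASTA records; 0 means nothing was extracted.
    echo "CDS records written: $(grep -c '^>' "$out")"
else
    status="skipped"
    echo "gffread or the input files were not found; nothing was run."
fi
```

If the record count comes back as 0, the annotation and the genome probably disagree somewhere (for example on sequence names), which is worth checking before changing any flags.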

Hi Kevin,

Thank you for responding. I tried gffread but didn't have any luck, for some unknown reason. I built an index of my genome with Samtools and made sure that the index was in the same directory as my genome. I gave the full path to the genome (with the -g flag) and to the gff file, but I received "No fasta index found for...", so Cufflinks (v2.2.1) built a new index and continued on with the program. However, the output file was completely empty, so I am unsure what went wrong. I also compared the index I generated with Samtools against the Cufflinks index, and they were essentially identical.

I am confident that my code is correct (because I'm able to at least generate an empty fasta). Could the format of the gff file be incompatible with the genome fasta? My data is from NCBI, and it comes from a recently published paper in Nature, so I'm unsure why there would be any discrepancies.

Reply, 7.1 years ago by molly77 ▴ 10

Hey, I never got that error, but you may also have to index the reference FASTA yourself using Bowtie2 (http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml#the-bowtie2-build-indexer) - I'm not sure though. Different programs look for different indices: SAMtools, picard, bowtie1, bowtie2, bwa, etc. all index differently.

Also check to ensure that your contigs match, i.e. the contig names in your GFF and FASTA reference. They may differ by the 'chr' prefix.
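One way to run that contig-name check is to compare the sequence IDs in the FASTA headers with column 1 of the GFF. The sketch below uses only standard shell tools. The two tiny files it writes (demo_genome.fasta, demo_annotation.gff) are invented stand-ins, so point the fasta and gff variables at your real files instead.

```shell
#!/bin/sh
# Sketch: list contig names used in a GFF that are absent from the FASTA.
# The demo files below are made up for illustration only.
LC_ALL=C; export LC_ALL            # same collation for sort and comm
workdir=$(mktemp -d)
fasta="$workdir/demo_genome.fasta"
gff="$workdir/demo_annotation.gff"

printf '>1 first contig\nACGTACGTAC\n>2\nGGGGCCCCAA\n' > "$fasta"
printf '##gff-version 3\nchr1\tdemo\tCDS\t1\t9\t.\t+\t0\tID=cds1\n2\tdemo\tCDS\t1\t9\t.\t+\t0\tID=cds2\n' > "$gff"

# FASTA IDs: text after '>' up to the first whitespace.
grep '^>' "$fasta" | sed 's/^>//' | awk '{print $1}' | sort -u > "$workdir/fasta_ids.txt"

# GFF IDs: column 1 of every non-comment, non-empty line.
grep -v '^#' "$gff" | awk -F'\t' 'NF {print $1}' | sort -u > "$workdir/gff_ids.txt"

# comm -13 prints lines that appear only in the second file (GFF-only names).
missing=$(comm -13 "$workdir/fasta_ids.txt" "$workdir/gff_ids.txt")
echo "Contigs in GFF but not in FASTA: ${missing:-none}"
```

In the demo data the GFF refers to chr1 while the FASTA header says 1, so chr1 is reported. Any name listed this way has no matching sequence in the genome file, so features on it cannot be extracted.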

Kevin

Reply, 7.1 years ago by Kevin Blighe 88k