Hi,
I have a genome FASTA file and a corresponding GFF annotation file. How can I extract all CDS from all of the protein-coding genes in the genome? I would prefer to use an existing program rather than a script, as I would have a harder time modifying or troubleshooting a script.
Thanks!
Hi Kevin,
Thank you for responding. I tried gffread but didn't have any luck, for some unknown reason. I built an index of my genome with Samtools and made sure that the index was in the same directory as my genome. I passed the genome with the -g flag and gave the full paths to both the genome and the GFF file, but I received "No fasta index found for...". Cufflinks (v2.2.1) then built a new index and carried on; however, the output file was completely empty, so I am unsure what went wrong. Moreover, when I compared the index I generated with Samtools to the Cufflinks index, they were essentially identical.
I am fairly confident that my command is correct, since it runs and at least generates an (empty) FASTA file. Could the format of the GFF file be incompatible with the genome FASTA? My data is from NCBI, and it comes from a paper recently published in Nature, so I'm unsure why there would be any discrepancies.
Hey, I have never had that error, but you may also have to index the reference FASTA yourself using Bowtie2 (http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml#the-bowtie2-build-indexer), though I'm not sure. Different programs look for different indices: SAMtools, Picard, Bowtie1, Bowtie2, BWA, etc. all index differently.
Also check that the contig names in your GFF match those in your FASTA reference. They may differ by the 'chr' prefix.
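If it helps, here is a rough Python sketch of that contig-name check. It is only an illustration under a few assumptions: the FASTA name is taken as the header text up to the first whitespace, the GFF is tab-separated with the feature type in column 3, and only CDS rows are considered. The function names and the toy data are made up for the example and are not from any tool mentioned here.

```python
def fasta_names(lines):
    """Collect sequence names from FASTA headers (text up to the first whitespace)."""
    names = set()
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                names.add(fields[0])
    return names


def gff_seqids(lines):
    """Collect the first-column sequence ID of every CDS row in a GFF."""
    seqids = set()
    for line in lines:
        if line.startswith("##FASTA"):
            break  # some GFF3 files embed sequence after this directive
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, comments and directives
        cols = line.rstrip("\n").split("\t")
        if len(cols) >= 3 and cols[2] == "CDS":
            seqids.add(cols[0])
    return seqids


def missing_from_fasta(fasta_lines, gff_lines):
    """Return the CDS sequence IDs in the GFF that have no matching FASTA record."""
    return sorted(gff_seqids(gff_lines) - fasta_names(fasta_lines))


# Toy in-memory data, invented for illustration only.
toy_fasta = [">chr1 some description\n", "ACGTACGT\n", ">chr2\n", "GGCC\n"]
toy_gff = [
    "##gff-version 3\n",
    "1\tsrc\tgene\t1\t8\t.\t+\t.\tID=g1\n",
    "1\tsrc\tCDS\t1\t6\t.\t+\t0\tID=c1;Parent=g1\n",
    "chr2\tsrc\tCDS\t1\t3\t.\t+\t0\tID=c2\n",
]
print(missing_from_fasta(toy_fasta, toy_gff))  # prints ['1']
```

For real files you could pass open file handles instead of lists, e.g. missing_from_fasta(open("genome.fa"), open("annotation.gff")), where both file names are placeholders. If gff_seqids comes back empty, the GFF has no rows typed as CDS at all, which would be another thing to look at.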
Kevin