Hello!!
I'm trying to get the coding sequences from several reference-genome assemblies. The assemblies were obtained with GATK, samtools mpileup, bcftools, vcfutils.pl and seqtk.
I can extract the CDS regions with bedtools using the GFF file from the reference genome, but I think I could lose some coding regions if I only take the CDS defined by the reference annotation.
I would like to find and extract the coding sequences of each consensus genome without using the genomic information of the reference genome.
I have been trying to get the CDS with ESTScan and Transeq, but I would like to know if there is a better strategy.
Thank you so much
This really doesn't explain what you have done. Judging by the list of tools used, I suspect you have several resequenced genomes. And you suspect that some of these genomes have additional genes relative to the reference annotation?
Are you extracting CDS with Transeq and ESTScan from the whole genome sequence? That is not how they should be used; they are not the appropriate tools for this task.
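For the annotation-based route mentioned in the question (bedtools plus the reference GFF), here is a minimal Python sketch of what that extraction amounts to. It is only an illustration, not a replacement for bedtools. It assumes the consensus FASTA keeps the reference coordinates (no indels applied when the consensus was built) and that the annotation is GFF3 with `Parent` attributes on the CDS rows. The function names are hypothetical. Note that it cannot find genes that are absent from the reference annotation.

```python
from collections import defaultdict

# Complement table covering IUPAC ambiguity codes, which consensus
# sequences often contain at heterozygous sites.
_COMPLEMENT = str.maketrans(
    "ACGTRYSWKMBDHVNacgtryswkmbdhvn",
    "TGCAYRSWMKVHDBNtgcayrswmkvhdbn",
)


def reverse_complement(seq):
    """Return the reverse complement of a nucleotide string."""
    return seq.translate(_COMPLEMENT)[::-1]


def parse_fasta(text):
    """Parse FASTA text into {sequence_id: sequence}."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]  # ID is the first word of the header
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {key: "".join(chunks) for key, chunks in records.items()}


def parse_gff_cds(text):
    """Collect CDS intervals from GFF3 text, grouped by Parent attribute."""
    cds = defaultdict(list)
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 9 or cols[2] != "CDS":
            continue
        attrs = dict(kv.split("=", 1) for kv in cols[8].split(";") if "=" in kv)
        parent = attrs.get("Parent", attrs.get("ID", "unknown"))
        # GFF coordinates are 1-based and inclusive.
        cds[parent].append((cols[0], int(cols[3]), int(cols[4]), cols[6]))
    return cds


def extract_cds(genome, cds):
    """Join the CDS parts of each parent; reverse-complement minus-strand ones."""
    out = {}
    for parent, parts in cds.items():
        ordered = sorted(parts, key=lambda part: part[1])
        seqid, strand = ordered[0][0], ordered[0][3]
        seq = "".join(genome[seqid][start - 1:end] for _, start, end, _ in ordered)
        out[parent] = reverse_complement(seq) if strand == "-" else seq
    return out
```

Running `extract_cds(parse_fasta(fasta_text), parse_gff_cds(gff_text))` once per consensus genome, with the same reference GFF each time, gives one spliced CDS per annotated parent feature in each sample.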