I'm trying to use the program SNPGenie to determine which genes are undergoing positive selection based on SNP analysis. I already have the necessary .vcf, reference fasta, and gtf files necessary for input into the program. But the program is unable to run because some of the CDS lines in the GTF do not contain a multiple of 3. There are way too many genes to manually parse out the CDS lines that contain this problem. These CDS genes had the same nucleotide start and end point in the GFF I converted to the GTF. How would I go about parsing out these non-divisible by 3 CDS sequences.
Sorry for the long response, had a test today. Here's a little snippit of my gtf files.
CM000429 EuPathDB exon 46 743 . - . transcript_id "cgd1_5-RA"; gene_id "cgd1_5";
CM000429 EuPathDB exon 3067 4995 . - . transcript_id "cgd1_10-RA"; gene_id "cgd1_10";
CM000429 EuPathDB CDS 3682 4995 . - 0 transcript_id "cgd1_10-RA"; gene_id "cgd1_10";
CM000429 EuPathDB exon 5188 6651 . - . transcript_id "cgd1_20-RA"; gene_id "cgd1_20";
CM000429 EuPathDB CDS 5188 6504 . - 0 transcript_id "cgd1_20-RA"; gene_id "cgd1_20";
CM000429 EuPathDB exon 6974 11926 . + . transcript_id "cgd1_30-RA"; gene_id "cgd1_30";
CM000429 EuPathDB CDS 6974 11926 . + 0 transcript_id "cgd1_30-RA"; gene_id "cgd1_30";
CM000429 EuPathDB exon 12064 13215 . - . transcript_id "cgd1_40-RA"; gene_id "cgd1_40";
CM000429 EuPathDB CDS 12119 13132 . - 0 transcript_id "cgd1_40-RA"; gene_id "cgd1_40";
In every CDS line if the difference between in end point and start point ends in .333, then it crashes saying its not a multiple of 3. All good lines seem to end in .66 which means you have to add one to get the proper 3 codon length.
I think the problem you have with SNPGenie might not be related to CDS length not being a multiple of 3. What is the exact error message you get from SNPGenie?
In the SNPGenie usage instructions: https://github.com/chasewnelson/SNPGenie
It says that it reads the gene_id field to concatenate CDS lines, not the transcript_id. You probably have multiple transcripts (isoforms) of the same gene with the same gene_id, but different transcript_id. SNPGenie might be attempting to concatenate the various isoforms together and finding strange ORFs and throwing an error. SNPGenie also doesn't seem to work on genes that are on the minus strand, as the guide says you need to run minus strange genes separately.
To avoid any further discussion, can you post a small extract of the gtf file you're working with? What kind of species are we talking about? Is it transcriptomic data or genomic ? Do you also have the CDS sequences with this or only the gtf file
OP asked to parse out CDS features which are not divisible by 3. My answer accomplishes this. OP, please do give my answer a try despite lieven.sterck's unfounded criticism.
look, your answer does exactly what you say it will do, I'm not questioning that! I only want to point out that (in most cases) it does not make biological sense, given how those CDS features are present in a GFF file.
If OP wants to remove every line in the gff file that has the tag CDS and is not modulo3 then your answer will indeed do just fine. However, and once again, this makes in most cases no sense as it is the sum of all CDS features that makes the protein for a gene.
In the unlikely event that OP's program is complaining on each and indivdual line that says CDS and is not modulo3, then there is something fundamentally wrong with that software!
There is something wrong with the gtf/gff if there are multiple CDS for a single transcript. CDS should NOT be split into multiple exons. That's the whole point of having exon features. The standards for these files are such a joke.
Hi,
I think it's easy to get into personal territory when we start characterizing others' remarks without careful consideration. Let's de-escalate before a matter of intellectual debate becomes one of ad hominem attacks.
Sorry for the long response, had a test today. Here's a little snippit of my gtf files.
In every CDS line if the difference between in end point and start point ends in .333, then it crashes saying its not a multiple of 3. All good lines seem to end in .66 which means you have to add one to get the proper 3 codon length.