SNPGenie: CDS does not have length that is divisible by 3
3
0
Entering edit mode
6.2 years ago
gcooper1245 ▴ 10

I'm trying to use the program SNPGenie to determine which genes are undergoing positive selection based on SNP analysis. I already have the necessary .vcf, reference fasta, and gtf files necessary for input into the program. But the program is unable to run because some of the CDS lines in the GTF do not contain a multiple of 3. There are way too many genes to manually parse out the CDS lines that contain this problem. These CDS genes had the same nucleotide start and end point in the GFF I converted to the GTF. How would I go about parsing out these non-divisible by 3 CDS sequences.

Sorry for the long response, had a test today. Here's a little snippit of my gtf files.

CM000429 EuPathDB exon 46 743 . - . transcript_id "cgd1_5-RA"; gene_id "cgd1_5"; 
CM000429 EuPathDB exon 3067 4995 . - . transcript_id "cgd1_10-RA"; gene_id "cgd1_10"; 
CM000429 EuPathDB CDS 3682 4995 . - 0 transcript_id "cgd1_10-RA"; gene_id "cgd1_10"; 
CM000429 EuPathDB exon 5188 6651 . - . transcript_id "cgd1_20-RA"; gene_id "cgd1_20"; 
CM000429 EuPathDB CDS 5188 6504 . - 0 transcript_id "cgd1_20-RA"; gene_id "cgd1_20"; 
CM000429 EuPathDB exon 6974 11926 . + . transcript_id "cgd1_30-RA"; gene_id "cgd1_30"; 
CM000429 EuPathDB CDS 6974 11926 . + 0 transcript_id "cgd1_30-RA"; gene_id "cgd1_30"; 
CM000429 EuPathDB exon 12064 13215 . - . transcript_id "cgd1_40-RA"; gene_id "cgd1_40"; 
CM000429 EuPathDB CDS 12119 13132 . - 0 transcript_id "cgd1_40-RA"; gene_id "cgd1_40";

In every CDS line if the difference between in end point and start point ends in .333, then it crashes saying its not a multiple of 3. All good lines seem to end in .66 which means you have to add one to get the proper 3 codon length.

SNP • 6.0k views
ADD COMMENT
1
Entering edit mode

I think the problem you have with SNPGenie might not be related to CDS length not being a multiple of 3. What is the exact error message you get from SNPGenie?

In the SNPGenie usage instructions: https://github.com/chasewnelson/SNPGenie

It says that it reads the gene_id field to concatenate CDS lines, not the transcript_id. You probably have multiple transcripts (isoforms) of the same gene with the same gene_id, but different transcript_id. SNPGenie might be attempting to concatenate the various isoforms together and finding strange ORFs and throwing an error. SNPGenie also doesn't seem to work on genes that are on the minus strand, as the guide says you need to run minus strange genes separately.

ADD REPLY
0
Entering edit mode

To avoid any further discussion, can you post a small extract of the gtf file you're working with? What kind of species are we talking about? Is it transcriptomic data or genomic ? Do you also have the CDS sequences with this or only the gtf file

ADD REPLY
0
Entering edit mode

OP asked to parse out CDS features which are not divisible by 3. My answer accomplishes this. OP, please do give my answer a try despite lieven.sterck's unfounded criticism.

ADD REPLY
1
Entering edit mode

look, your answer does exactly what you say it will do, I'm not questioning that! I only want to point out that (in most cases) it does not make biological sense, given how those CDS features are present in a GFF file.

If OP wants to remove every line in the gff file that has the tag CDS and is not modulo3 then your answer will indeed do just fine. However, and once again, this makes in most cases no sense as it is the sum of all CDS features that makes the protein for a gene.

In the unlikely event that OP's program is complaining on each and indivdual line that says CDS and is not modulo3, then there is something fundamentally wrong with that software!

ADD REPLY
0
Entering edit mode

There is something wrong with the gtf/gff if there are multiple CDS for a single transcript. CDS should NOT be split into multiple exons. That's the whole point of having exon features. The standards for these files are such a joke.

ADD REPLY
1
Entering edit mode

Hi,

I think it's easy to get into personal territory when we start characterizing others' remarks without careful consideration. Let's de-escalate before a matter of intellectual debate becomes one of ad hominem attacks.

ADD REPLY
0
Entering edit mode

Sorry for the long response, had a test today. Here's a little snippit of my gtf files.

CM000429    EuPathDB    exon    46  743 .   -   .   transcript_id "cgd1_5-RA"; gene_id "cgd1_5";
CM000429    EuPathDB    exon    3067    4995    .   -   .   transcript_id "cgd1_10-RA"; gene_id "cgd1_10";
CM000429    EuPathDB    CDS 3682    4995    .   -   0   transcript_id "cgd1_10-RA"; gene_id "cgd1_10";
CM000429    EuPathDB    exon    5188    6651    .   -   .   transcript_id "cgd1_20-RA"; gene_id "cgd1_20";
CM000429    EuPathDB    CDS 5188    6504    .   -   0   transcript_id "cgd1_20-RA"; gene_id "cgd1_20";
CM000429    EuPathDB    exon    6974    11926   .   +   .   transcript_id "cgd1_30-RA"; gene_id "cgd1_30";
CM000429    EuPathDB    CDS 6974    11926   .   +   0   transcript_id "cgd1_30-RA"; gene_id "cgd1_30";
CM000429    EuPathDB    exon    12064   13215   .   -   .   transcript_id "cgd1_40-RA"; gene_id "cgd1_40";
CM000429    EuPathDB    CDS 12119   13132   .   -   0   transcript_id "cgd1_40-RA"; gene_id "cgd1_40";

In every CDS line if the difference between in end point and start point ends in .333, then it crashes saying its not a multiple of 3. All good lines seem to end in .66 which means you have to add one to get the proper 3 codon length.

ADD REPLY
1
Entering edit mode
6.2 years ago

parsing them out is one thing, but if there are many you're gonna loose out on lots of data.

Unfortunately the 'good' solution would require going in manual and correct those genes.

An intermediate alternative solution might be an option here as well though: you could for instance trim back, from the end, those genes until they are mudulo3. Doing so you will only loose very little data as we're talking about 1 or 2 nucleotides and the rest of the gene is still usable. An ever better approach (but perhaps a bit overkill for your purposes) is to get the protein sequences, check if the start is present or the stop and then trim back from the missing end.

This is however not something you will be able to tackle with a few lines of code and will thus require some scripting

ADD COMMENT
0
Entering edit mode
6.2 years ago
goodez ▴ 640

Only keep CDS with multiples of 3:

awk '($5 - $4) % 3 == 0' file.gtf > new_file.gtf
ADD COMMENT
2
Entering edit mode

Neither will make sense I'm afraid. it will only filter out the lines for which stop-start is not modulo3. If a gene would not be modulo3 you should remove the complete gene from the gtf file which might be more then only the lines that are not modulo3. On the other hand a valid gene will also be removed as exons building up a gene do not have to be modulo3 and those will also be removed by your approach.

The only correct way is to parse in the gtf file and process it on a gene by gene basis, rather then line by line.

One exception here might be the case where all genes are single exon and only have a CDS annotated (== the case where all data from a single gene is present on one line)

ADD REPLY
0
Entering edit mode

Not sure what you're getting at. My assumption is they only have CDS features in their gtf. If not, just filter the gtf for only CDS features (an even easier task for awk). Please don't talk down other people's answers just because you have a competing interest in your own answer. Speaking of your answer, you should change it to a comment. You did not really provide a solution.

ADD REPLY
1
Entering edit mode

I absolutely do not.

I just want to point out that CDS is usually not present on one single line (it is a multi-line feature). A multi exon gene has CDS features of the same gene on different lines. So your awk (and I'm an awk fan myself!) solution will not handle this correctly.

Try your approach on this 'gtf/gff file'

scaffoldTest     test   CDS     16336   16505   .       +       0       ID=gene1
scaffoldTest     test   CDS     16537   16717   .       +       1       ID=gene1
scaffoldTest     test   CDS     16762   17597   .       -       0       ID=gene2
scaffoldTest     test   CDS     17629   18001   .       -       1       ID=gene2

you will notice you remove one exon from each of the CDS but never the whole CDS

ADD REPLY
1
Entering edit mode

you will notice you remove one exon from each of the CDS but never the whole CDS

It is clear you don't see the difference between a CDS and exon feature.

ADD REPLY
2
Entering edit mode

Trust me, I know that all too well!

I've been doing this for over ten years and have seen and processed a gazillion gff files and those were the CDS is only present on a single line are the exception. And yes, there is often only a single CDS (protein) per gene but when it's split over several exons you will several lines with 'CDS' in the gff file

I suggest you read up about the structure of a gff file, here for example: https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md

This is of course all (and only!) valid if we're talking about genomic gff files and thus not for instance one from transcriptome data as there they will indeed be on a single line

ADD REPLY
0
Entering edit mode

Every CDS I have come across is not split up into exons. That makes absolutely zero sense. Every search I did for a definition of CDS claims it spans from start codon to stop codon. Regardless, my answer solves precisely what OP asked.

ADD REPLY
1
Entering edit mode

OK, I'll try one last time: did you have a look at the link I posted in my comment above? If so, do you understand what is depicted in that GFF snippet?

Do make the distinction between the biological terms "CDS" and "exon" (in which sense I agree with your point of view) and how these terms are used/encoded in a gff file.

ADD REPLY
5
Entering edit mode

Just want to apologize. I was making too many assumptions about how CDS are annotated. I have not actually had to use CDS features in my own work. Thanks for being patient with me and even understanding my perspective. It makes complete sense that CDS are broken into only the coding regions. I guess it was just hard to believe that that nearly all exon features are the same as CDS features (besides UTRs).

ADD REPLY
2
Entering edit mode

Good on you for accepting your misunderstanding and improving! Sometimes questioning the things we take for solid scientific facts are the ones that make us become better scientists!

ADD REPLY
0
Entering edit mode

Exon features are different than CDS features. Many genes have multiple isoforms, which gives multiple CDS. This is different than saying a transcript has multiple exon features. The CDS should not be broken up into chunks like exon are. To reiterate, each CDS should start with the start codon and end with the stop codon. My code will remove any CDS which is not a multiple of 3. It solves OP's problem. Please stop spreading misinformation. If OP wants to remove other features (e.g. exons, UTRs, etc.), then that is another question to answer. If they only want one CDS per genes, that is also a separate question.

ADD REPLY
1
Entering edit mode

Exon feature are indeed different from CDS features (the CDS is a subpart of the joined exons) and indeed in the biological sense there is only one (or a few) CDS per gene, however, and this is exactly my point!, in a gff/gtf file they are usually present in different lines as it is only in exceptional cases (== single exon genes or bacterial genes) that they will be on a single line. Moreover this has nothing to do with the isoforms etc, even for a gene without isofroms but with multiple exons it will have more than one line with CDS tags in the gtf/gff file .

You should make the difference between the biological meaning of CDS and how they are "listed" in a gff/gtf file

ADD REPLY
2
Entering edit mode

There are potentially multiple CDS lines for each transcript in a .gtf file. Each CDS line define an interval that correspond to multiple exons. The length of each CDS line does not have to be in a multiple of three. But the sum length of multiple CDS lines for each transcript should be in a multiple of three.

So parsing just the individual CDS line will not work in this case.

ADD REPLY
0
Entering edit mode

If that were true, then there would be no difference between exon and CDS features. I'm sure this does happen, but I've yet to come across it. To me the problem is due to inconstancy in how the features are defined.

ADD REPLY
1
Entering edit mode

Yes, it is true that CDS features in a gtf/gff file are broken up into multiple intervals when the gene is multi-exonic.

Gene/Transcript features describe the entire gene from start of 5` UTR to end of 3` UTR. Exon features describe multiple exon intervals, including 5`/3` UTRs. CDS describes multiple CDS intervals which is essentially same as exons, but excluding the 5`/3` UTRs.

If the CDS was one just one line in the gtf/gff file, how would you define all the intronic coordinates? If there was just one CDS line describing start/end coordinates, the length of that feature would include all the intronic lengths. So in that case, it would still not be in multiples of 3.

ADD REPLY
0
Entering edit mode

for most of the exons there will indeed be no difference with the CDS ones, only for the UTR ones (and the CDS start and stop exons) those will differ, so both terms are still needed to capture the full biological representation.

ADD REPLY
2
Entering edit mode

This is an example from the RCAN1 gene from the Illumina (iGenomes) GTF annotation version of the Ensembl annotation:

1       protein_coding  start_codon     327957  327959  .       +       0       exon_number "1"; gene_id "ENSBTAG00000020035"; gene_name "RCAN1_BOVIN"; p_id "P12957"; transcript_id "ENSBTAT00000037243"; transcript_name "RCAN1_BOVIN"; tss_id "TSS1379";
1       protein_coding  exon    327877  328043  .       +       .       exon_number "1"; gene_id "ENSBTAG00000020035"; gene_name "RCAN1_BOVIN"; p_id "P12957"; transcript_id "ENSBTAT00000037243"; transcript_name "RCAN1_BOVIN"; tss_id "TSS1379";
1       protein_coding  CDS     327957  328043  .       +       0       exon_number "1"; gene_id "ENSBTAG00000020035"; gene_name "RCAN1_BOVIN"; p_id "P12957"; protein_id "ENSBTAP00000037080"; transcript_id "ENSBTAT00000037243"; transcript_name "RCAN1_BOVIN"; tss_id "TSS1379";
1       protein_coding  exon    331233  331406  .       +       .       exon_number "2"; gene_id "ENSBTAG00000020035"; gene_name "RCAN1_BOVIN"; p_id "P12957"; transcript_id "ENSBTAT00000037243"; transcript_name "RCAN1_BOVIN"; tss_id "TSS1379";
1       protein_coding  CDS     331233  331406  .       +       0       exon_number "2"; gene_id "ENSBTAG00000020035"; gene_name "RCAN1_BOVIN"; p_id "P12957"; protein_id "ENSBTAP00000037080"; transcript_id "ENSBTAT00000037243"; transcript_name "RCAN1_BOVIN"; tss_id "TSS1379";
1       protein_coding  exon    333963  334122  .       +       .       exon_number "3"; gene_id "ENSBTAG00000020035"; gene_name "RCAN1_BOVIN"; p_id "P12957"; transcript_id "ENSBTAT00000037243"; transcript_name "RCAN1_BOVIN"; tss_id "TSS1379";
1       protein_coding  CDS     333963  334122  .       +       0       exon_number "3"; gene_id "ENSBTAG00000020035"; gene_name "RCAN1_BOVIN"; p_id "P12957"; protein_id "ENSBTAP00000037080"; transcript_id "ENSBTAT00000037243"; transcript_name "RCAN1_BOVIN"; tss_id "TSS1379";
1       protein_coding  exon    337306  339077  .       +       .       exon_number "4"; gene_id "ENSBTAG00000020035"; gene_name "RCAN1_BOVIN"; p_id "P12957"; transcript_id "ENSBTAT00000037243"; transcript_name "RCAN1_BOVIN"; tss_id "TSS1379";
1       protein_coding  CDS     337306  337475  .       +       2       exon_number "4"; gene_id "ENSBTAG00000020035"; gene_name "RCAN1_BOVIN"; p_id "P12957"; protein_id "ENSBTAP00000037080"; transcript_id "ENSBTAT00000037243"; transcript_name "RCAN1_BOVIN"; tss_id "TSS1379";
1       protein_coding  stop_codon      337476  337478  .       +       0       exon_number "4"; gene_id "ENSBTAG00000020035"; gene_name "RCAN1_BOVIN"; p_id "P12957"; transcript_id "ENSBTAT00000037243"; transcript_name "RCAN1_BOVIN"; tss_id "TSS1379";

For the first and last exons, which contain the UTRs, CDS and exons are not the same (CDS contains just the coding region of the exon), but the other exons, CDS and exons are equal. But CDS are encode following the respective exons on GTF / GFF files. Also, CDS need not be multiple of three, see the Are All Exon Nucleotide Sequence Counts Divisible By 3 In The Human Genome? thread and the paper The importance of being divisible by three in alternative splicing.

ADD REPLY
0
Entering edit mode
6.2 years ago
gcooper1245 ▴ 10

I believe the problem is that the CDS sequence is being broken into multiple lines, so I must find a way to simply combine those repeated lines of CDS. If anyone has a possible way I could do that it would be appreciated.

ADD COMMENT

Login before adding your answer.

Traffic: 1927 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6