Hi,
I am a PhD student with very little experience in bioinformatics (or very little experience at all; I started two months ago). I'm having some problems getting SnpEff to work with GFF coordinates obtained from TransDecoder. A group I am collaborating with gave me a genome assembly and a GTF file with transcript information derived from RNA-seq. I ran TransDecoder following the instructions, with the --single_best_orf option, and obtained the CDS file and a GFF3. I used the GFF3 to build a database for SnpEff, because I have to evaluate the effect of some SNPs on the genome. However, when I launched SnpEff eff, I received a great number of warnings:
INFO_REALIGN_3_PRIME 1
WARNING_TRANSCRIPT_NO_START_CODON 202855
WARNING_TRANSCRIPT_NO_START_CODON&INFO_REALIGN_3_PRIME 2
WARNING_TRANSCRIPT_NO_STOP_CODON 17281
Protein coding transcripts : 2426
# Length errors : 0 ( 0,00% )
# STOP codons in CDS errors : 0 ( 0,00% )
# START codon errors : 686 ( 28,28% )
# STOP codon warnings : 183 ( 7,54% )
# UTR sequences : 2409 ( 99,30% )
# Total Errors : 686 ( 28,28% )
Given the low number of transcripts, this number of warnings seems extremely high. Is this normal? Also, I checked the CDSs obtained by TransDecoder and, even though not all of them start with ATG, all of them have a start codon near the beginning of the sequence, so I really cannot explain this number of warnings. Do you have any suggestions? May the life of whoever comes to my aid be filled with cakes and pizzas.
Best regards
Edoardo
Hello edoardo.piombo!
It appears that your post has been cross-posted to another site: http://seqanswers.com/forums/showthread.php?p=201232
This is typically not recommended as it runs the risk of annoying people in both communities.
I'm very sorry, I didn't know it would be annoying. Which site do you think is the best place for the question? I will promptly remove it from the other one.
Cross-posting is generally discouraged, so that two communities don't both have to work on the same question. More guidelines can be found here: How To Ask Good Questions On Technical And Scientific Forums
I'm not sure which one would be the best. I think Biostars is more active although you could say I'm rather biased.
With regard to your question, I can imagine assembly after RNA-seq leading to incomplete transcripts, but the numbers are indeed high (not that I have any experience with your type of analysis). You could check, for example, whether a SnpEff database requires every CDS to start with an ATG. Perhaps SnpEff can't find ATGs 'near the beginning'.
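A quick way to see how many predicted CDSs actually begin with ATG is to classify each record in the CDS FASTA. The following is a minimal Python sketch, not part of TransDecoder or SnpEff: the function names (`read_fasta`, `classify_cds`, `first_inframe_atg`) and the toy sequences are made up for illustration, the category labels are borrowed from the ORF types TransDecoder reports, and the standard stop codons TAA/TAG/TGA are assumed.

```python
from typing import Iterable, Iterator, Optional, Tuple

# Assumption: standard genetic code stop codons.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (record_id, sequence) pairs from FASTA-formatted lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Keep only the first word of the header as the record ID.
            name, chunks = line[1:].split()[0], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def classify_cds(seq: str) -> str:
    """Label a CDS by whether it starts with ATG and ends with a stop codon."""
    seq = seq.upper()
    has_start = seq.startswith("ATG")
    has_stop = len(seq) >= 3 and seq[-3:] in STOP_CODONS
    if has_start and has_stop:
        return "complete"
    if has_stop:
        return "5prime_partial"  # start codon missing
    if has_start:
        return "3prime_partial"  # stop codon missing
    return "internal"            # both missing


def first_inframe_atg(seq: str) -> Optional[int]:
    """Return the 0-based offset of the first ATG in frame 0, or None."""
    seq = seq.upper()
    for i in range(0, len(seq) - 2, 3):
        if seq[i:i + 3] == "ATG":
            return i
    return None


# Toy input (hypothetical records, not real TransDecoder output).
toy_fasta = [">t1 some description", "ATGAAA", "TAA", ">t2", "CCCATGAAATAG"]
for record_id, cds in read_fasta(toy_fasta):
    print(record_id, classify_cds(cds), first_inframe_atg(cds))
```

To use it on real data, pass an open file handle for the CDS FASTA to `read_fasta` and tally the labels. An ATG that is "near the beginning" but not at offset 0 still leaves the CDS labelled `5prime_partial` here, which may be the same situation SnpEff is warning about.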
Understood. I apologize again for the cross-posting. I tried to remove the thread from SEQanswers, but it seems that I am unable to do so. Thanks for your advice on the question. I tried removing the transcripts whose CDS didn't start with a start codon and the warnings disappeared, but to do so I had to remove more than a quarter of the transcripts. I ran ORF finder on several transcripts not starting with a start codon and they seem to contain good ORFs, so I don't think it would be ideal to remove them from the analysis. Do you know of a way to predict CDSs while forcing the program to start with a start codon? I tried the tool "get_orfs_or_cdss.py" from pico_galaxy with the option "-e closed", which should ensure start/stop codons are present, but I still get a lot of CDSs not starting with ATG. Also, I have no idea why SnpEff gives me 202855 warnings when only around 700 transcripts have the start codon problem. Another way would be to use a different tool instead of SnpEff, but I didn't find any similar tools with the option of building my own database.
I think it's a warning once per variant in a problematic transcript.