I have a consensus genome created by incorporating only biallelic SNPs into the reference genome. I want to get the protein sequence of a particular gene from my consensus genome.
I tried to do this by running blastn with the reference gene's CDS (taken from NCBI) as the query. I got a single hit spanning multiple ranges. I concatenated all the aligned nucleotides from the consensus and tried to translate them, but the reference protein is not found in its entirety in any frame. Instead, it is split across several frames, which I did not expect because the consensus contains only SNPs.
Any idea why this is happening, and how I can get the protein sequence?
Thanks, gmap worked like a charm!
Frameshift mutations are not possible, as I did not include single-nucleotide insertions, just SNPs. So I might have premature stop codons instead. I will try the aligners you suggested, perhaps also look at the number of mutations, and get back. Thanks!
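For the two checks mentioned here (number of mutations and premature stop codons), a minimal Python sketch could look like the following. It assumes you already have the reference CDS and the consensus CDS as plain strings of equal length (true if only SNPs were introduced) and that the gene uses the standard genetic code. The function names `count_snps` and `premature_stops` are made up for illustration.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # stop codons of the standard genetic code


def count_snps(ref_cds, cons_cds):
    """Count mismatching positions between two equal-length CDS strings."""
    if len(ref_cds) != len(cons_cds):
        raise ValueError("sequences differ in length, so something other than SNPs is involved")
    return sum(1 for a, b in zip(ref_cds.upper(), cons_cds.upper()) if a != b)


def premature_stops(cds):
    """Return 0-based codon indices of in-frame stops before the final codon."""
    cds = cds.upper()
    n_codons = len(cds) // 3
    # The last codon is skipped because a terminal stop is expected there.
    return [i for i in range(n_codons - 1) if cds[3 * i:3 * i + 3] in STOP_CODONS]


ref = "ATGAAAGCCTAG"
cons = "ATGTAAGCCTAG"  # one SNP that turns AAA (Lys) into TAA (stop)
print(count_snps(ref, cons))
print(premature_stops(cons))
```

An empty list from `premature_stops` on the consensus CDS would suggest the SNPs did not introduce an early stop, in which case the problem is more likely in how the CDS was assembled from the alignment.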
Correct you are.
Then it's likely because BLAST does not provide you with a correct gene structure (not a surprise either; that's not its goal). Yes, give the gene mappers a try and see what that gives.
An alternative would be to transfer the annotation of your reference (given it has one) to the consensus and then extract your protein sequence based on that.
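Since the consensus contains only SNPs, its coordinates should be identical to those of the reference, so the CDS coordinates from the reference annotation can probably be applied to the consensus directly. Here is a rough Python sketch of that idea. It assumes the CDS segments are given as 1-based inclusive (start, end) pairs as in a GFF file, that the first segment starts in phase 0, and that the standard genetic code applies. The names `extract_cds` and `translate` are made up for illustration, and reading the FASTA and GFF files is left out.

```python
from itertools import product

# Standard genetic code, built from the usual TCAG-ordered amino acid string.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Reverse-complement an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def extract_cds(chrom_seq, segments, strand):
    """Join CDS segments (1-based, inclusive) and orient them by strand."""
    chrom_seq = chrom_seq.upper()
    # Join in genomic order; slicing converts 1-based inclusive to 0-based.
    cds = "".join(chrom_seq[start - 1:end] for start, end in sorted(segments))
    return reverse_complement(cds) if strand == "-" else cds


def translate(cds):
    """Translate in frame 0; codons with ambiguous bases become 'X'."""
    return "".join(
        CODON_TABLE.get(cds[i:i + 3], "X") for i in range(0, len(cds) - 2, 3)
    )


# Toy example: two CDS segments separated by a 4 bp "intron".
chrom = "GGATGGCCTTTTAAATAGCC"
cds = extract_cds(chrom, [(3, 8), (13, 18)], "+")
print(cds)
print(translate(cds))
```

Because the segment boundaries come from the annotation rather than from local alignments, the reading frame is preserved across exon junctions, which is exactly what concatenating BLAST ranges does not guarantee.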
For the latter, I only know RATT. Do you know any other tools?
I was also thinking of that one indeed.
There is also 'liftOver' from the ALLMAPS package, if I remember well.