Hello Biostars,
I have multiple gene sequences for probands, created with the bcftools consensus workflow (https://samtools.github.io/bcftools/howtos/consensus-sequence.html). I would like to analyse them at the protein level, i.e. retrieve each individual's protein sequence including all of their variants. The point is to analyse some specific mutations in the context of the common polymorphisms surrounding them.
So, I have the complete gene sequence at the nucleotide level, and I would like to extract the coding sequence from it and then translate it.
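To make the goal concrete, here is a minimal sketch of those last two steps in plain Python. It assumes I already had the CDS exon coordinates on each proband's own sequence (which is exactly the part I am missing) and that the sequence is on the coding strand. The function names `splice_cds` and `translate`, the toy sequence and the coordinates are all made up for illustration.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def splice_cds(gene_seq, exons):
    """Concatenate the coding parts of gene_seq.

    exons: list of (start, end) pairs, 1-based and inclusive, given in
    transcript order and relative to gene_seq itself (hypothetical input).
    """
    return "".join(gene_seq[start - 1:end] for start, end in exons)


def translate(cds):
    """Translate a CDS codon by codon, stopping at the first stop codon.

    Codons containing anything other than A/C/G/T become 'X'.
    """
    cds = cds.upper()
    protein = []
    for i in range(0, len(cds) - 2, 3):
        aa = CODON_TABLE.get(cds[i:i + 3], "X")
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


# Toy gene: lowercase = non-coding, uppercase = coding (invented sequence).
toy_gene = "ccATGGCCgtaagtttcagAAATGAtt"
toy_exons = [(3, 8), (20, 25)]  # invented coordinates on toy_gene

toy_cds = splice_cds(toy_gene, toy_exons)
toy_protein = translate(toy_cds)
```

With real data the coordinates would have to be recomputed per proband, since any indel applied by bcftools consensus shifts every position downstream of it.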
My approach was to get the CCDS of my gene of interest (for example https://www.ncbi.nlm.nih.gov/CCDS/CcdsBrowse.cgi?REQUEST=CCDS&DATA=CCDS10509 for CREBBP) and then align my full sequence onto it to reconstruct the proband CDS. I used blastn on NCBI's website (megablast option) and it aligned well, but some issues arose.
The end of one aligned match usually doesn't line up exactly with the beginning of the next. For example, one match ends at position 3701 and the next one begins at position 3697.
So downloading the aligned matches and writing a script to concatenate them wouldn't work. I could do it manually, but I have too many genes for that.
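To show what I mean, here is a toy sketch with invented hits. Each hit is a hypothetical `(cds_start, cds_end, seq)` tuple in CDS coordinates. Joining them as they come duplicates the overlapping bases and breaks the reading frame. Trimming the overlap from the start of the next hit restores the expected length, but it assumes ungapped hits and identical bases in the overlap, so I am not sure it is safe in general.

```python
def join_hits(hits):
    """Join aligned pieces in CDS order, dropping bases already covered.

    hits: list of (cds_start, cds_end, seq), 1-based inclusive CDS
    coordinates. Assumes ungapped hits, i.e. len(seq) == end - start + 1
    (a simplification; real BLAST hits can contain gaps).
    """
    pieces = []
    covered_to = 0  # last CDS position already written out
    for start, end, seq in sorted(hits):
        if len(seq) != end - start + 1:
            raise ValueError(f"hit {start}-{end} is gapped; not handled here")
        if start > covered_to + 1:
            raise ValueError(f"CDS positions {covered_to + 1}-{start - 1} not covered")
        overlap = covered_to - start + 1  # bases of this hit already written
        if overlap < len(seq):
            pieces.append(seq[overlap:])
            covered_to = end
    return "".join(pieces)


# Invented example: a 12-base CDS covered by two hits that overlap by 3 bases.
toy_hits = [(1, 8, "ATGGCCAA"), (6, 12, "CAAATGA")]

naive = "".join(seq for _, _, seq in toy_hits)  # 15 bases, overlap duplicated
trimmed = join_hits(toy_hits)                   # 12 bases, overlap removed
```

In my real case the overlap would be positions 3697 to 3701, i.e. five bases that appear at the end of one match and again at the start of the next.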
Is there an alternative and easier solution?
Thank you