Hello,
I'm currently doing a small bioinformatics project where I'm downloading multiple FASTA files from NCBI Virus and want to locate the spike glycoprotein-encoding locus on each of the samples.
I filtered for complete nucleotide sequences, but I noticed that the genome lengths still vary, both between samples and relative to the reference genome for this taxon.
Because of this, I'm not sure whether simply applying the start and end coordinates of the spike glycoprotein-encoding locus from the GFF file to each sample will land in the correct reading frame, or whether the region will even correspond to the target gene if it does.
Will this work, and if not, will I have to do some alignment? And if I do have to align my samples to the reference genome, is there a less computationally intensive way to do it, such as through Google Colab, or would I need to do this on a desktop?
Thank you!
Have you tried BLASTing the protein sequence directly against the Betacoronavirus database in NCBI BLAST? That would be a tblastn search, if you want to query with the amino acid sequence.
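To show why a protein-versus-translated-nucleotide search sidesteps the coordinate problem, here is a minimal toy sketch in plain Python. It is not BLAST: it only looks for exact matches of a short protein seed in the three forward reading frames, whereas tblastn scores inexact matches and also searches the reverse strand. The function names and the two tiny sequences are made up for illustration.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(nt):
    """Translate a nucleotide string codon by codon; unknown codons become 'X'."""
    nt = nt.upper()
    return "".join(CODON_TABLE.get(nt[i:i + 3], "X") for i in range(0, len(nt) - 2, 3))


def find_protein_seed(genome, seed):
    """Return (start, end, frame) for each exact hit of `seed` in forward frames.

    Coordinates are 1-based and inclusive, as in GFF; frame is 1, 2 or 3.
    """
    hits = []
    for frame in range(3):
        protein = translate(genome[frame:])
        pos = protein.find(seed)
        while pos != -1:
            start = frame + 3 * pos + 1
            end = start + 3 * len(seed) - 1
            hits.append((start, end, frame + 1))
            pos = protein.find(seed, pos + 1)
    return hits


# Hypothetical toy sequences: the "sample" has one extra base upstream of the gene.
reference = "CC" + "ATGTTTGTTTTT" + "TAA"
sample = "CCA" + "ATGTTTGTTTTT" + "TAA"

print(find_protein_seed(reference, "MFVF"))
print(find_protein_seed(sample, "MFVF"))
```

The single upstream insertion shifts both the coordinates and the reading frame in the sample, so the reference coordinates would not carry over, but searching with the protein still finds the gene. For real genomes you would use tblastn itself (or align to the reference) rather than exact matching.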