Hi good people,
I am doing a phylogenomics project using pre-existing genomes from GenBank. I have isolated the sequences of interest and combined them into FASTA files. When I aligned them with Clustal Omega, however, some of the sequences had a long run of gaps before the start codon, like this one:
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ATGGAGGAAA----------CGCAGATAACACCCTCACT--GTCGGT-----CCCCAGCT------------------------GGTCCTCAGTCTCCACCCAGGACCAGTCCTCCAC----TGCCGGCCAAGATTCGAGCACAGGCCCACAGCCC
Often, within the same file (25 versions of a gene sequence from 25 species), all sequences have such gaps, of varying lengths. I'm not sure why these gaps exist at all, or how to clean them up. Has anyone encountered similar problems before? Am I doing something wrong? Any help is appreciated.
Hi, just my two cents here. Depending on the gene and the species you used, the 5' UTR might not be as well conserved as the CDS, hence the larger gaps. I recommend visualising the alignments (AliView or Jalview are great for this), and also checking out CIAlign to clean gappy regions from the alignment if you want to keep using the full transcript sequence. I personally would use only the CDS for all the genes, since it tends to be more evolutionarily constrained and might be more informative for tree construction :)
Are these 25 species closely related, or are they expected to be distant from each other? Where are you selecting the sequences from (gene, nucleotide or genome databases)? This may have a critical impact on what you are observing.
They are all marine mammals. The sequences are isolated from full genomes using BUSCO.
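For anyone who wants to see what "cleaning gappy regions" amounts to before reaching for a dedicated tool, here is a minimal sketch in plain Python. It is not how CIAlign works internally, just a simple threshold filter: it reports how many gap characters precede the first residue of each sequence, then drops every alignment column whose gap fraction exceeds a cutoff. The function names (`parse_fasta`, `leading_gaps`, `drop_gappy_columns`), the `max_gap_fraction` parameter, and the three toy sequences are all made up for illustration.

```python
from io import StringIO


def parse_fasta(handle):
    """Read an aligned FASTA stream into a dict of {name: sequence}."""
    records = {}
    name = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first word of the header as the sequence name.
            name = line[1:].split()[0]
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}


def leading_gaps(seq):
    """Count the gap characters that precede the first residue."""
    return len(seq) - len(seq.lstrip("-"))


def drop_gappy_columns(alignment, max_gap_fraction=0.5):
    """Keep only the columns whose gap fraction is <= max_gap_fraction.

    The 0.5 default is an arbitrary illustrative cutoff, not a recommendation.
    """
    names = list(alignment)
    seqs = [alignment[n] for n in names]
    if len({len(s) for s in seqs}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    n_seqs = len(seqs)
    keep = [
        i
        for i in range(len(seqs[0]))
        if sum(s[i] == "-" for s in seqs) / n_seqs <= max_gap_fraction
    ]
    return {name: "".join(seq[i] for i in keep) for name, seq in zip(names, seqs)}


# Toy alignment (hypothetical data) with gaps before the start codon.
example = StringIO(""">sp1
----ATGGAG
>sp2
--TTATGGAG
>sp3
----ATG-AG
""")

aln = parse_fasta(example)
for name, seq in aln.items():
    print(name, "leading gaps:", leading_gaps(seq))

trimmed = drop_gappy_columns(aln, max_gap_fraction=0.5)
for name, seq in trimmed.items():
    print(name, seq)
```

On the toy input, the four leading columns are removed because more than half of the sequences have a gap there, while the single internal gap in `sp3` is kept. Note that dropping individual columns from a coding alignment can disturb codon positions, so for CDS data it may be safer to trim in a codon-aware way or to use a purpose-built tool.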