Question

Large number of gaps at the beginning of alignments

15 months ago

Pit • 0

Hi good people,

I am doing a phylogenomics project using pre-existing genomes from GenBank. I have isolated sequences of interest and combined them into fasta files. When I performed alignment through Clustal Omega, however, some of those sequences had a long streak of gaps before the start codon, like this one:

-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ATGGAGGAAA----------CGCAGATAACACCCTCACT--GTCGGT-----CCCCAGCT------------------------GGTCCTCAGTCTCCACCCAGGACCAGTCCTCCAC----TGCCGGCCAAGATTCGAGCACAGGCCCACAGCCC

Often, within the same file (25 versions of a gene sequence from 25 species), all sequences have these gaps, with varying lengths. I'm not sure why the gaps exist at all, or how to clean them up. Has anyone encountered similar problems before? Am I doing something wrong? Any help is appreciated.

clustal-omega DNA alignment • 1.7k views

ADD COMMENT • link updated 15 months ago by Ram 44k • written 15 months ago by Pit • 0

Hi, just my two cents here. Depending on the gene and the species you used, the 5' UTR might not be as well conserved as the CDS, hence the larger gaps. I recommend visualising the alignments (AliView or Jalview are great for this), and also checking out CIAlign to clean gappy regions from the alignment if you want to keep using the full transcript sequence. Personally, I would use only the CDS for all the genes, since it tends to be more evolutionarily constrained and might be more informative for tree construction :)

ADD REPLY • link 15 months ago by biofalconch ★ 1.3k

Are these 25 species closely related, or are they expected to be distant from each other? Where are you selecting the sequences from (gene, nucleotide, or genome databases)? This may have a critical impact on what you are observing.

ADD REPLY • link 15 months ago by GenoMax 148k

They are all marine mammals. The sequences are isolated from full genomes using BUSCO.

ADD REPLY • link 15 months ago by Pit • 0

score 0 · Answer 1 · 2023-09-14

15 months ago

Michael 55k

This may be expected if you have combined sequences of varying length: for example, some entries contain only the coding sequence while others also include some upstream sequence, some sequences are incomplete and therefore shorter, or some species have a longer sequence than the rest. To avoid this, you could restrict the sequence export to the coding sequence for all genes. If one sequence is still much longer than the others and thereby causes gaps, you could consider removing it as a potential outlier. On the other hand, the leading gaps will likely not affect the phylogenetic analysis much. You can also left-trim your alignment by removing everything before the first well-conserved column, but that is mostly cosmetic.
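A minimal sketch of the left-trimming idea, in plain Python. It assumes the alignment is already in aligned FASTA format, and it uses column occupancy (the fraction of non-gap characters) as a rough stand-in for "well-conserved"; the function names and the 0.8 threshold are illustrative choices, not part of any tool mentioned in this thread.

```python
def parse_fasta(text):
    """Parse FASTA text into a dict of {header: sequence}."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:]
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}


def left_trim(aligned, min_occupancy=0.8):
    """Drop leading columns until one has at least min_occupancy non-gap characters.

    Occupancy is only a proxy for conservation (an assumption of this sketch).
    """
    seqs = list(aligned.values())
    if not seqs:
        return {}
    width = len(seqs[0])
    if any(len(s) != width for s in seqs):
        raise ValueError("all aligned sequences must have the same length")
    start = width  # if no column qualifies, everything is trimmed
    for col in range(width):
        filled = sum(1 for s in seqs if s[col] != "-")
        if filled / len(seqs) >= min_occupancy:
            start = col
            break
    return {name: seq[start:] for name, seq in aligned.items()}


# Toy alignment: sp2 carries two extra upstream bases that force leading gaps.
example = """>sp1
-----ATGGAG
>sp2
---CCATGGAA
>sp3
-----ATGGAG
"""
trimmed = left_trim(parse_fasta(example))
for name, seq in trimmed.items():
    print(name, seq)
```

Lowering `min_occupancy` keeps more of the ragged left edge; raising it trims more aggressively.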

ADD COMMENT • link 15 months ago by Michael 55k

Is that so? I suppose it would be hard to expect all 25 versions to be of similar length. However, when I used all of them to construct a gene tree, it kept giving me the same wrong tree. Later, when I was converting some of the fasta files into phylip format for further analysis, I noticed that some of them had lengths not divisible by 3. I wonder if it's because of the gaps or something else?

ADD REPLY • link 15 months ago by Pit • 0

Most aligners, by default, will insert gaps anywhere in the sequence, including within codons. Therefore, even if all of your sequences are valid coding sequences, you can still get total alignment lengths that are not divisible by 3.

When I'm doing an analysis where preserving the coding sequence is crucial, I tend to use a codon-aware aligner, for example PRANK. I'm sure there are others.
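To see whether gaps have landed inside codons, something like the following rough check could help. It is only a sketch: `codon_report` is a made-up helper name, and it assumes the alignment starts in frame (column 1 is the first codon position), which will not hold if upstream sequence is still present.

```python
def codon_report(aligned):
    """Summarise, per sequence, whether an alignment looks codon-compatible.

    Assumes column 0 of the alignment is codon position 1 (an assumption
    of this sketch, not something an aligner guarantees).
    """
    report = {}
    for name, seq in aligned.items():
        ungapped_len = len(seq.replace("-", ""))
        # Cut the aligned row into consecutive column triplets.
        triplets = [seq[i:i + 3] for i in range(0, len(seq), 3)]
        # A triplet is "broken" if it is incomplete or mixes gaps and bases.
        broken = sum(1 for t in triplets if len(t) < 3 or 0 < t.count("-") < 3)
        report[name] = {
            "ungapped_len": ungapped_len,
            "ungapped_multiple_of_3": ungapped_len % 3 == 0,
            "alignment_multiple_of_3": len(seq) % 3 == 0,
            "broken_codons": broken,
        }
    return report


# Toy example: sp1 has a whole-codon gap, sp2 has a gap inside a codon.
toy = {"sp1": "ATG---GAG", "sp2": "ATGG--AAG"}
for name, info in codon_report(toy).items():
    print(name, info)
```

A sequence whose ungapped length is not a multiple of 3 was probably not a clean CDS to begin with, whereas broken codons with an otherwise valid ungapped length point at the aligner placing gaps mid-codon.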

ADD REPLY • link 15 months ago by Dave Carlson ★ 2.1k

I will have to try PRANK then. Are there any trimming tools that are codon-aware when processing alignments, or can I just assume they all are?

ADD REPLY • link 15 months ago by Pit • 0

It is hard to tell exactly what the reason is without seeing the alignment. Could you post the aligned file? Also, if you open the alignment in Jalview, you will likely spot the problem immediately. If your tree topology is off, that may indeed indicate a problem with the alignment or the orthologue selection.

ADD REPLY • link 15 months ago by Michael 55k

I'm not sure how to post the file here, but I did take a look at one of the alignments in question in AliView. It seems that one of the sequences is much longer than the others, resulting in the gaps.
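For anyone wanting to find such an over-long sequence without opening every file in a viewer, a small sketch along these lines might do. The helper name `length_outliers` and the 50% tolerance around the median are arbitrary assumptions; sensible cut-offs will depend on the gene.

```python
from statistics import median


def length_outliers(seqs, tolerance=0.5):
    """Return {name: ungapped_length} for sequences far from the median length.

    A sequence is flagged when its ungapped length differs from the median
    by more than `tolerance` (as a fraction of the median).
    """
    lengths = {name: len(seq.replace("-", "")) for name, seq in seqs.items()}
    if not lengths:
        return {}
    mid = median(lengths.values())
    if mid == 0:
        return {}
    return {
        name: n for name, n in lengths.items()
        if abs(n - mid) / mid > tolerance
    }


# Toy input: three sequences of similar length and one much longer one.
toy = {
    "sp1": "A" * 300,
    "sp2": "A" * 310,
    "sp3": "A" * 305,
    "sp4": "A" * 900,
}
print(length_outliers(toy))
```

It works on aligned or unaligned sequences, since gap characters are stripped before the lengths are compared.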

ADD REPLY • link 15 months ago by Pit • 0