Question

Sequence alignment length

7.8 years ago

biobudhan ▴ 20

I have a folder with over 1000 alignments performed using MAFFT. Are there any tools that can tell the length of the sequence alignment?

example:

seq1      ATGC-CTGA-TTTGGG-
seq2      ATGCCCTGATTTT-GGC

In this case the alignment length is: 17

multiple-sequence-alignment • 5.5k views

ADD COMMENT • link updated 18 months ago by Ram 44k • written 7.8 years ago by biobudhan ▴ 20

What format is this? What separates the id from the alignment? Do all the alignments include two stars in the 5' and 3'-ends? Are they always in a single line as in your example?

Assuming that there's a tab between the ID and the sequence and that there are four stars in every sequence, then:

cat -t file 
*seq1*^I**ATGC-CTGA-TTTGGG-**
*seq2*^I**ATGCCCTGATTTT-GGC**

awk 'BEGIN{FS="\t"}NR==1{print length($2)-4}' file
17

ADD REPLY • link 7.8 years ago by 5heikki 11k

I am sorry for the confusion. I tried to distinguish the IDs from the sequences using the bold and italic options on Biostars. To answer your questions:

What format is this? All my files are in .aln format (the Clustal alignment format).

Do all the alignments include two stars in the 5' and 3'-ends? No. The stars come from the Biostars formatting options I mentioned above.

Are they always in a single line as in your example? No. In many files each sequence spans multiple lines.

ADD REPLY • link 7.8 years ago by biobudhan ▴ 20

You'll need to parse your Clustal .aln files. Biopython can read an alignment file and either write it out as FASTA with a single line per sequence or, if you don't need the FASTA, simply report the length of each alignment. See this post. You can also run Clustal from within Biopython and get the length of each alignment in the same script.
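
For illustration, here is a minimal sketch of the parsing idea in plain Python, with no Biopython required. It assumes the usual Clustal layout: the first non-blank line is a header, each sequence line is an ID followed by whitespace and a chunk of the alignment, and conservation lines start with whitespace. The names `clustal_alignment_length`, `report_folder` and `DEMO` are made up for this example, and the demo text is just the two sequences from the question wrapped into two blocks. If Biopython is installed, `AlignIO.read(path, "clustal").get_alignment_length()` should report the same number.

```python
from pathlib import Path


def clustal_alignment_length(text):
    """Return the number of columns in a Clustal-format alignment."""
    lengths = {}  # sequence ID -> aligned length summed over all blocks
    header_seen = False
    for line in text.splitlines():
        if not line.strip():
            continue  # blank separator between blocks
        if not header_seen:
            header_seen = True  # assume the first non-blank line is the header
            continue
        if line[0].isspace():
            continue  # conservation line (*, :, .)
        fields = line.split()
        if len(fields) < 2:
            continue
        # fields[1] is the aligned chunk; a trailing residue count is ignored
        seq_id, chunk = fields[0], fields[1]
        lengths[seq_id] = lengths.get(seq_id, 0) + len(chunk)
    if not lengths:
        raise ValueError("no sequence lines found")
    unique = set(lengths.values())
    if len(unique) != 1:
        raise ValueError(f"rows differ in length: {lengths}")
    return unique.pop()


def report_folder(folder, pattern="*.aln"):
    """Print file name and alignment length for every matching file."""
    for path in sorted(Path(folder).glob(pattern)):
        print(path.name, clustal_alignment_length(path.read_text()), sep="\t")


# Made-up demo input: the question's two sequences split into two blocks.
DEMO = """CLUSTAL multiple sequence alignment

seq1      ATGC-CTGA-
seq2      ATGCCCTGAT
          **** ****

seq1      TTTGGG-
seq2      TTT-GGC
          *** **
"""

print(clustal_alignment_length(DEMO))
```

On the demo text this gives 17, matching the length stated in the question. For a whole folder, something like `report_folder("path/to/alignments")` would print one line per file; the path is a placeholder.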