Sequence alignment length
7.8 years ago
biobudhan ▴ 20

I have a folder with over 1000 alignments performed using MAFFT. Are there any tools that can tell the length of the sequence alignment?

example:

seq1      ATGC-CTGA-TTTGGG-
seq2      ATGCCCTGATTTT-GGC

In this case the alignment length is: 17

multiple-sequence-alignment • 5.5k views

What format is this? What separates the ID from the alignment? Do all the alignments include two stars at the 5' and 3' ends? Are they always on a single line as in your example?

Assuming that there's a tab between the ID and the sequence and that there are four stars in every sequence, then:

cat -t file 
*seq1*^I**ATGC-CTGA-TTTGGG-**
*seq2*^I**ATGCCCTGATTTT-GGC**

awk 'BEGIN{FS="\t"}NR==1{print length($2)-4}' file
17

I am sorry for the confusion. I tried to differentiate the IDs from the sequences using the bold and italic options on Biostars. To answer your questions:

What format is this? All my files are in .aln format (the Clustal alignment format).

Do all the alignments include two stars at the 5' and 3' ends? No. The stars come from the Biostars formatting options I mentioned above.

Are they always on a single line as in your example? No. Many files contain multiple lines.


You'll need to parse the Clustal .aln files. Biopython can read an alignment file and either write it out as FASTA with a single line per sequence or, if you don't need the FASTA, just report the length of each alignment. See this post. You can also run Clustal from within Biopython and get the length of each alignment in the same script.
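
If installing Biopython is not an option, here is a rough standard-library sketch of the "just get the length" route. It assumes the usual Clustal layout: a header line starting with CLUSTAL, sequence lines of the form "ID chunk" repeated in blocks, and conservation lines that start with whitespace. The function names and the example alignment (your two sequences, wrapped into two blocks) are made up for illustration. With Biopython, `AlignIO.read(path, "clustal").get_alignment_length()` should give the same number.

```python
from pathlib import Path


def clustal_alignment_length(text):
    """Return the number of columns in a Clustal-formatted alignment.

    Assumes sequence lines look like 'ID  CHUNK [count]' and that header
    lines start with 'CLUSTAL' (so no sequence ID may start with that word).
    """
    lengths = {}  # sequence ID -> aligned length summed over all blocks
    for line in text.splitlines():
        # Skip blank lines and conservation lines (they start with whitespace).
        if not line.strip() or line[0].isspace():
            continue
        # Skip the format header.
        if line.startswith("CLUSTAL"):
            continue
        fields = line.split()
        if len(fields) < 2:
            continue
        seq_id, chunk = fields[0], fields[1]
        lengths[seq_id] = lengths.get(seq_id, 0) + len(chunk)
    if not lengths:
        raise ValueError("no sequence lines found")
    if len(set(lengths.values())) != 1:
        raise ValueError("sequences have different aligned lengths")
    return next(iter(lengths.values()))


def alignment_lengths(folder, pattern="*.aln"):
    """Map each matching file name in `folder` to its alignment length."""
    return {
        path.name: clustal_alignment_length(path.read_text())
        for path in sorted(Path(folder).glob(pattern))
    }


# Illustrative input: the two sequences from the question, split into blocks.
example = """CLUSTAL format alignment by MAFFT

seq1            ATGC-CTGA-
seq2            ATGCCCTGAT
                **** ****

seq1            TTTGGG-
seq2            TTT-GGC
"""

print(clustal_alignment_length(example))  # 17
```

For the whole folder, `alignment_lengths("path/to/alignments")` returns a dict of file name to length. The ValueError on unequal lengths is there to flag files that are truncated or not really in Clustal format.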


Thank you. :-) I will check this option!
