7.8 years ago
biobudhan
I have a folder with over 1000 alignments performed using MAFFT. Are there any tools that can tell the length of the sequence alignment?
example:
seq1 ATGC-CTGA-TTTGGG-
seq2 ATGCCCTGATTTT-GGC
In this case the alignment length is 17.
What format is this? What separates the ID from the alignment? Do all the alignments include two stars at the 5' and 3' ends? Are they always on a single line, as in your example?
Assuming that there's a tab between the ID and the sequence and that there are four stars in every sequence, then:
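A minimal Python sketch of that approach, under exactly those two assumptions (a tab between ID and sequence, and literal `*` characters around each sequence). `aligned_length` is a hypothetical helper name, not something from the thread:

```python
def aligned_length(line):
    """Return (sequence ID, aligned length) for one 'ID<TAB>sequence' line."""
    seq_id, seq = line.rstrip("\n").split("\t", 1)
    # Drop the asterisks assumed to surround the sequence; gaps ('-') still
    # count, because they are columns of the alignment.
    return seq_id, len(seq.replace("*", ""))


if __name__ == "__main__":
    for row in ["seq1\t**ATGC-CTGA-TTTGGG-**", "seq2\t**ATGCCCTGATTTT-GGC**"]:
        print(*aligned_length(row))
```

This only works when each sequence sits on a single line; it does not handle multi-line Clustal blocks.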
I am sorry for the confusion. I tried to distinguish the IDs from the sequences using the bold and italic options on Biostars. To answer your questions:
What format is this? All my files are in .aln format (the Clustal alignment format).
Do all the alignments include two stars at the 5' and 3' ends? No. Those come from the Biostars formatting options I mentioned above.
Are they always on a single line, as in your example? No. Many files contain multiple lines.
You'll need to parse your Clustal .aln files. Biopython can read an alignment file and write it out as FASTA with a single line per sequence or, if you don't need the FASTA output, simply report the length of each alignment. See this post. You can also run Clustal from within Biopython and get the length of each alignment in the same script.
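As a rough sketch of the Biopython route: `AlignIO.read(path, "clustal")` returns an alignment object, and its `get_alignment_length()` method gives the number of columns. The folder loop, the `*.aln` pattern, and the helper names (`alignment_length`, `clustal_length_plain`, `lengths_in_folder`) are assumptions for illustration. So is the standard-library fallback used when Biopython is not installed; it expects well-formed Clustal blocks and is not a full parser.

```python
from pathlib import Path


def clustal_length_plain(path):
    """Fallback: add up the per-block chunk lengths for each sequence ID."""
    totals = {}
    with open(path) as handle:
        handle.readline()  # skip the "CLUSTAL ..." header line
        for line in handle:
            # Blank lines separate blocks; conservation lines start with a space.
            if not line.strip() or line[0].isspace():
                continue
            fields = line.split()
            if len(fields) < 2:
                continue
            totals[fields[0]] = totals.get(fields[0], 0) + len(fields[1])
    sizes = set(totals.values())
    if len(sizes) != 1:
        raise ValueError(f"no sequences or unequal lengths in {path}")
    return sizes.pop()


def alignment_length(path):
    """Number of columns in a Clustal alignment, via Biopython if available."""
    try:
        from Bio import AlignIO  # third-party: pip install biopython
    except ImportError:
        return clustal_length_plain(path)
    return AlignIO.read(str(path), "clustal").get_alignment_length()


def lengths_in_folder(folder, pattern="*.aln"):
    """Map each matching file name in `folder` to its alignment length."""
    return {p.name: alignment_length(p) for p in sorted(Path(folder).glob(pattern))}
```

Calling `lengths_in_folder("path/to/alignments")` (a placeholder path) would return a dict mapping each file name to its alignment length, which can then be printed or written to a table.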
Thank you. :-) I will check this option!