I'm trying to edit an MSA (Multiple Sequence Alignment) file generated by ClustalW, using Biopython, to trim the part of every sequence that comes before the consensus region. In the example, xxx stands for other bases that are not relevant here.
Here's an example of the input and the expected output:
INPUT
ITS_primer_fw --------------------------------CGCGTCCACTMTCCAGTT
RBL67ITS_full_sequence CCACCCCAACAAGGGCGGCCACGCGGTCCGCTCGCGTCCACTCTCCAGTTxxxxxxxxxxxxxxxx
PRL2010 ACACCCCCGAAAGGGCGTCC------CCTGCTCGCGTCCACTATCCAGTTxxxxxxxxxxxxxxxx
BBF32_3 ACACACCCACAAGGGCGAGCAGGCG----GCTCGCGTCCACTATCCAGTTxxxxxxxxxxxxxx
BBFCG32 CAACACCACACCGGGCGAGCGGG-------CTCGCGTCCACTGTCGAGTTxxxxxxxxxxxxxxxx
EXPECTED OUTPUT
ITS_primer_fw CGCGTCCACTMTCCAGTT
RBL67ITS_full_sequence CGCGTCCACTCTCCAGTTxxxxxxxxxxxxxxxxxxxx
PRL2010 CGCGTCCACTATCCAGTTxxxxxxxxxxxxxxxxxxxxx
BBF32_3 CGCGTCCACTATCCAGTTxxxxxxxxxxxxxxxxxxx
BBFCG32 CGCGTCCACTGTCGAGTTxxxxxxxxxxxxxxxxxxxx
The AlignIO documentation only describes a way to extract sequences by treating the alignment as an array. In this example:
from Bio import AlignIO
# input_file is the path to the ClustalW alignment file
align = AlignIO.read(input_file, "clustal")
sub_alignment = align[:, 20:]
I was able to extract a sub-alignment made of all the sequences (:) starting from column index 20 (the 21st nucleotide, since indexing is 0-based). I'm looking for a way to replace the 20 in the example with the position of the first nucleotide of the consensus sequence.
Any answers, including ones that use command-line software to do this trimming easily, are welcome. It would be great if the solution were written in Python or ran on UNIX.
By consensus, do you mean the first position that is occupied in all sequences? A true consensus sequence would have characters from the start of the alignment, so trimming from the "first character of the consensus" would just give you the data you already have.
Yes, the first position that is occupied in all the sequences.
Can you provide your actual alignment, please? Your example is good for explaining, but not so much for testing.
I only have the alignment done against the first two sequences.
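Given the clarification above (trim everything before the first position occupied in all sequences), one possible approach is to scan the alignment column by column and stop at the first column without a gap. The following is only a minimal sketch working on plain strings: the function names first_gap_free_column and trim_leading_gapped_columns are hypothetical helpers, not part of Biopython, the gap characters "-" and "." are an assumption, and the rows are adapted from the example in the question with the xxx tails left out.

```python
def first_gap_free_column(rows, gap_chars="-."):
    """Return the 0-based index of the first column that has no gap
    character in any row, or None if every column contains a gap.

    rows is a sequence of equal-length aligned strings; zip() stops at
    the shortest row, so ragged input is only scanned up to that length.
    """
    for index, column in enumerate(zip(*rows)):
        if not any(char in gap_chars for char in column):
            return index
    return None


def trim_leading_gapped_columns(rows, gap_chars="-."):
    """Drop every column before the first gap-free one."""
    start = first_gap_free_column(rows, gap_chars)
    if start is None:
        raise ValueError("no column is occupied in all sequences")
    return [row[start:] for row in rows]


# Rows adapted from the example in the question (xxx tails omitted).
rows = [
    "-" * 32 + "CGCGTCCACTMTCCAGTT",                          # ITS_primer_fw
    "CCACCCCAACAAGGGCGGCCACGCGGTCCGCTCGCGTCCACTCTCCAGTT",    # RBL67ITS_full_sequence
    "ACACCCCCGAAAGGGCGTCC------CCTGCTCGCGTCCACTATCCAGTT",    # PRL2010
    "ACACACCCACAAGGGCGAGCAGGCG----GCTCGCGTCCACTATCCAGTT",    # BBF32_3
    "CAACACCACACCGGGCGAGCGGG-------CTCGCGTCCACTGTCGAGTT",    # BBFCG32
]

start = first_gap_free_column(rows)
trimmed = trim_leading_gapped_columns(rows)
print(start)       # 32
print(trimmed[0])  # CGCGTCCACTMTCCAGTT
```

With the alignment read by AlignIO as in the question, the rows could presumably be built as [str(record.seq) for record in align], and the returned index used in place of the hard-coded 20, i.e. align[:, start:]. Note that this only removes the leading block; gapped columns further inside the alignment are left untouched.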