I was wondering if there is an open-source code available that allows me to group sequences from a MSA into conserved and non-conserved based on a single residue. I have an alignment and I want to isolate the sequences that do not have a glutamine at position 61 from this alignment.
The example you want to pay attention to is under 'you could sort on the GC content of each sequence' in that section:
align1.sort(key = lambda record: GC(record.seq))
That function provided can be edited so you can sort alphabetically on a specific position, such as for the position at index #5, you'd write:
align1.sort(key = lambda record: record.seq[5])
You don't say which format your sequences are in, and so I cannot provide a full example of reading it in and accessing the individual records. There's examples of that in Biopython documentation here. I have an example here and here, with more examples to be found in the scripts in my aliignment-utilities sub-repo.
Beaware of Python indexing being zero based.
If you need a place to try out Biopython stuff, the latest release will be installed and available in sessions that come up if you go here and click launch binder.
#Create a list with the SeqID and Sequence records from the alignment
l= []
for i in alignment.__dict__['_records']:
l.append([i.id,"".join(i.seq)])
df = pd.DataFrame(l)
df.columns = ['id','seq']
df
#Generating a column for analyzing the respecting aa residue
df['pos_87'] = df['sequence'].apply(lambda x:x[86])
#Checking if it is conserved across other sequences in the alignment
df['conserved'] = df['pos_87'].apply(lambda x:x=='I')
Thank you for your help. I used pandas dataframe to solve this problem. :)