Classify conserved and non-conserved sequences
2
0
Entering edit mode
2.9 years ago
Rajavee ▴ 10

I was wondering if there is an open-source code available that allows me to group sequences from a MSA into conserved and non-conserved based on a single residue. I have an alignment and I want to isolate the sequences that do not have a glutamine at position 61 from this alignment.

conserved multiple residues Biopython sequence alignment classification Python • 1.0k views
ADD COMMENT
1
Entering edit mode
2.9 years ago
Wayne ★ 2.1k

Biopython has a way (mentioned here) whereby you can specify a key function to sort on, see sort(self, key=None, reverse=False) under Bio.Align .

The example you want to pay attention to is under 'you could sort on the GC content of each sequence' in that section:

align1.sort(key = lambda record: GC(record.seq))

That function provided can be edited so you can sort alphabetically on a specific position, such as for the position at index #5, you'd write:

align1.sort(key = lambda record: record.seq[5])

You don't say which format your sequences are in, and so I cannot provide a full example of reading it in and accessing the individual records. There's examples of that in Biopython documentation here. I have an example here and here, with more examples to be found in the scripts in my aliignment-utilities sub-repo.
Beaware of Python indexing being zero based.

If you need a place to try out Biopython stuff, the latest release will be installed and available in sessions that come up if you go here and click launch binder.

ADD COMMENT
0
Entering edit mode

Thank you for your help. I used pandas dataframe to solve this problem. :)

ADD REPLY
1
Entering edit mode
2.9 years ago
Rajavee ▴ 10

Here's the code I wrote

#Create a list with the SeqID and Sequence records from the alignment
l= []
for i in alignment.__dict__['_records']:
    l.append([i.id,"".join(i.seq)])
df = pd.DataFrame(l)
df.columns = ['id','seq']
df

#Generating a column for analyzing the respecting aa residue

df['pos_87'] = df['sequence'].apply(lambda x:x[86])

#Checking if it is conserved across other sequences in the alignment

df['conserved'] = df['pos_87'].apply(lambda x:x=='I')
ADD COMMENT

Login before adding your answer.

Traffic: 2540 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6