Question

Extract sequence subset from multiple sequence alignment based on position in a specific species.

0

Entering edit mode

4.6 years ago

nad7wf • 0

I would like to generate a logo for the amino acid domain surrounding a particular mutation of interest. I am using the "msa" package in R to generate the multiple sequence alignment based on the full peptide sequence, then I would like to extract a short subset of the consensus matrix that surrounds my AA of interest in order to generate the logo. I know the position of the amino acid in my species (Glycine max). However, as expected, when aligned to other sequences, gaps in my sequence relative to the sequence in other species cause my amino acid of interest to be shifted to a new position in the final alignment. This makes it difficult to automate the process of extracting the amino acids surrounding my mutation based on position.

My question is, is there a way to extract a subset of the multiple sequence alignment based on the position in my species of interest so that I get the amino acid I'm interested in, regardless of its new position in the MSA? Ultimately, I plan to wrap this into an R function so I can run this on many sequences, so finding the new position by eye, based on the surrounding sequence, is not really an option.

Example

Before alignment, AA of interest is at position 82

After alignment, AA of interest is now at position 95

I would like to pull a few amino acids on either side of this Lysine in order to generate a logo (in reality, I have many more species in my MSA). However, once I've aligned the sequences, I can no longer use "position 82" to pull the correct amino acids. Is there a way to extract the desired amino acids based on their position in Glycine max, specifically?

Example Fasta to work with (is there a better way to format this when posting a question? I'm new here, sorry!):

>Glycine_max MSNPSDEKEQCQKKRKSTICEASNFKTSRRRFFSNKNEEDMNKGVSTTLKLYDDPWKIKKTLTDSDLGILSRLSLAADLVKKQI >Jatropha_curcas RMRFYTAEDLEEERRKGISTNLSLSYRYNTVNILDNNLDLWKIKKRLKESDLGNLSRLLVPSNLVKKDVLPLLSSDLVNG IKSKRGAPVVFWDVDTETTHNLVFKKWDSSGS

Code used for generating MSA and consensus matrix:

library(msa)
alignment <- msaClustalOmega(readAAStringSet("/path/to/fasta_file.fa"))

# This pulls the wrong amino acids.
consensus_mat <- consensusMatrix(alignment)[, 72:84]

R Biostrings msa • 2.5k views

ADD COMMENT • link updated 2.0 years ago by rohitsatyam102 ▴ 930 • written 4.6 years ago by nad7wf • 0

0

Entering edit mode

Hi were u able to figure this out?

ADD REPLY • link 2.0 years ago by rohitsatyam102 ▴ 930