Question

Lookup Table error in Sequence Alignment with R

1

Entering edit mode

6.8 years ago

pattijane ▴ 10

Hello,

I'm using the following code to read from a text file containing more than 200 fasta files and to align these proteins.

seqs = read.fasta(file = "sequencefasta.txt", seqtype = "AA")
self_aligned_score = rep(NA, length(seqs))

for (i in 1:length(seqs)){
  cat('aligning ',i,'/',length(seqs),' with itself..\n')
  seq = paste(toupper(getSequence(seqs[i][[1]])), sep="", collapse="")
  align_score = pairwiseAlignment(seq, seq, scoreOnly=TRUE, gapExtension=0.5, type="local", substitutionMatrix = "BLOSUM50")
  self_aligned_score[i] = align_score
}

But I'm having the following error, which I have no idea about why.

Error in .Call2("XStringSet_align_pairwiseAlignment", pattern, subject,  : key 32 not in lookup table

Here is a snippet of my sequencefasta.txt file:

>sp|O00141|SGK1_HUMAN Serine/threonine-protein kinase Sgk1 OS=Homo sapiens GN=SGK1 PE=1 SV=2
MTVKTEAAKGTLTYSRMRGMVAILIAFMKQRRMGLNDFIQKIANNSYACKHPEVQSILKI
SQPQEPELMNANPSPPPSPSQQINLGPSSNPHAKPSDFHFLKVIGKGSFGKVLLARHKAE
EVFYAVKVLQKKAILKKKEEKHIMSERNVLLKNVKHPFLVGLHFSFQTADKLYFVLDYIN
GGELFYHLQRERCFLEPRARFYAAEIASALGYLHSLNIVYRDLKPENILLDSQGHIVLTD
FGLCKENIEHNSTTSTFCGTPEYLAPEVLHKQPYDRTVDWWCLGAVLYEMLYGLPPFYSR
NTAEMYDNILNKPLQLKPNITNSARHLLEGLLQKDRTKRLGAKDDFMEIKSHVFFSLINW
DDLINKKITPPFNPNVSGPNDLRHFDPEFTEEPVPNSIGKSPDSVLVTASVKEAAEAFLG
FSYAPPTDSFL

>sp|O00311|CDC7_HUMAN Cell division cycle 7-related protein kinase OS=Homo sapiens GN=CDC7 PE=1 SV=1
MEASLGIQMDEPMAFSPQRDRFQAEGSLKKNEQNFKLAGVKKDIEKLYEAVPQLSNVFKI
EDKIGEGTFSSVYLATAQLQVGPEEKIALKHLIPTSHPIRIAAELQCLTVAGGQDNVMGV
KYCFRKNDHVVIAMPYLEHESFLDILNSLSFQEVREYMLNLFKALKRIHQFGIVHRDVKP
SNFLYNRRLKKYALVDFGLAQGTHDTKIELLKFVQSEAQQERCSQNKSHIITGNKIPLSG
PVPKELDQQSTTKASVKRPYTNAQIQIKQGKDGKEGSVGLSVQRSVFGERNFNIHSSISH
ESPAVKLMKQSKTVDVLSRKLATKKKAISTKVMNSAVMRKTASSCPASLTCDCYATDKVC
SICLSRRQQVAPRAGTPGFRAPEVLTKCPNQTTAIDMWSAGVIFLSLLSGRYPFYKASDD
LTALAQIMTIRGSRETIQAAKTFGKSILCSKEVPAQDLRKLCERLRGMDSSTPKLTSDIQ
GHASHQPAISEKTDHKASCLVQTPPGQYSGNSFKKGDSNSCEHCFDEYNTNLEGWNEVPD
EAYDLLDKLLDLNPASRITAEEALLHPFFKDMSL

R sequence alignment fasta protein • 4.0k views

ADD COMMENT • link updated 3.9 years ago by yehong39 • 0 • written 6.8 years ago by pattijane ▴ 10

0

Entering edit mode

Which libraries are you using to import the fasta and to align the sequences?

ADD REPLY • link 6.8 years ago by Selenocysteine ▴ 620

0

Entering edit mode

I use these:

require("Biostrings")
require("seqinr")
require("Rcpp")

ADD REPLY • link 6.8 years ago by pattijane ▴ 10

0

Entering edit mode

I think that the issue might be that you have some non-standard characters in your sequences. Can you check that you just have aminoacids, no "X"s and no other letters which do not correspond to an aa?

ADD REPLY • link 6.8 years ago by Selenocysteine ▴ 620

0

Entering edit mode

I checked for "X"s and there was none. Besides, wiki article for aa describes X as a valid letter to describe unknowns, https://www.wikiwand.com/en/Amino_acid#/Table_of_standard_amino_acid_abbreviations_and_properties

ADD REPLY • link 6.8 years ago by pattijane ▴ 10

0

Entering edit mode

Yes it is standard, but maybe for some reason this library doesn't recognise it. Did you check for all characters which are not standard amino acids or just X?

ADD REPLY • link 6.8 years ago by Selenocysteine ▴ 620

0

Entering edit mode

I checked for X, U, O, B, Z, and J. There were none of them.

ADD REPLY • link 6.8 years ago by pattijane ▴ 10

score 0 · Answer 1 · 2021-02-03

Hi, I ran your code using your example .txt file, it worked fine, no errors. The error message you presented here indicated your sequence has extra space in it, Maybe there are extra spaces at the end or beginning of another sequence? You can try add one more line, after: seq = paste(toupper(getSequence(seqs[i][[1]])), sep="", collapse="") add: seq= gsub(" ","", seq) Then run the code again, see if that works. Good luck!