Question

Looping PairwiseAlignemnt {biostrings} for 25 proteins

1

Entering edit mode

6.3 years ago

tiriyon • 0

Hey I'm new here, and pretty new to R as well. I am trying to perform a PairwiseAlignment on 25 protein sequences from Uniprot. I have 25 Fasta files in this format:

>tr|A0A287AI92|A0A287AI92_PIG Carbonic anhydrase 1 OS=Sus scrofa OX=9823 GN=CA1 PE=1 SV=1MTSPAWGYDGEYGPEHWSKVYPIANGNNQSPIDIKTSETKHDTSLKPISV.....

I loaded the Fasta files to R using :

ProtSeq_1 <- read.fasta("C:/Users/tiriy/Documents/A0A287AI92.fasta")

Now I'm trying to order the Fasta files into data frame - which I miserably failed at doing and looping the Pairwise alignment with for loop like so:

comb<- combn(10,2)
MT1 <- a_data_frame
    for (i in 1:ncol(comb)) {
      x<-pairwiseAlignment( toString( MT1 [comb [1,i],1]),
     toString(MT1 [comb[2,i],1]),
    substitutionMatrix = "BLOSUM100", 
    gapOpening = -2, 
    gapExtension = -8, 
    scoreOnly = FALSE)
      fileA<-paste0("C:/Users/uri/Desktop/", i,"-","blusom",".txt")
      writePairwiseAlignments(x,file=fileA)}

How can I data frame all the sequences?
Or otherwise any other suggestion on looping the pairwise alignment on these fasta files?

Many thanks in advance! I hope I'd be able to give my aid back in the future

R sequence alignment • 2.5k views

ADD COMMENT • link 6.3 years ago by tiriyon • 0

0

Entering edit mode

6.3 years ago

tiriyon • 0

I mannaged to solve the issue with multiple fasta files and not being able to sort them into data frame, here is how I solved it:

all_fasta<-list.files("C:/Users/tiriy/Desktop/TheProject/ProjectR/fastas") 
# Making a vector to reffer to all fasta files in certain folder

A <- readAAStringSet(all_fasta) #Reading all fasta files at once {biostrings}
A1 <- names(A) # Setting vector to hold all the names from the fasta 
# files retrievd in vector A
A2 <- paste(A) # Setting a vector A2 with pasted values from vector A
ProtDF <- data.frame(A1,A2) # Using A1 as first olumn and A2 as second column

{Thanks to Onyambu over at stackoverflow} Now I got a data frame out of the fasta files, though still can't loop the pairwise alignment, here's what I got so far:

    # Step 3 making a generation of combinations for pairwise alignment 
D <- combn(25,2)
# End of  step 3

# Step 4 a for loop pairwise alignment for BLOSUM50
# Loading Data:
data("BLOSUM50")
nrow(ProtDF)
for (i in 1:nrow(ProtDF)) {
  LeVariableDumpable<- 
  PairwiseAlignments(toString(ProtDF[D[1,i],2])
                      ,toString(ProtDF[D[1,i],2]),
                              substitutionMatrix = "BLOSUM50",
                              gapOpening = -2,
                              gapExtension = -8,)
  filenameNloc<-paste0("C:/50bussums/",i,"-","BUSSUm",".txt")
  writePairwiseAlignments(LeVariableDumpable,
                          file=filenameNloc)
}

Edit: Now the function writes the files, but what I tryed to do with combn didn't work. how can I correct this function so that it will cycle each sequences pair possible? BTW, the data frame consists of column 1: seq numbers, column 2 : sequence letters. 25 rows.

ADD COMMENT • link 6.3 years ago by tiriyon • 0

score 1 · Accepted Answer · 2019-04-06

So since I was able to solve all my problems alone, I thought it would be helpful if I posted my script here for future reference. (Note the final product is a data frame for all combinations of sequences pairwise alignments with sequence descriptions for both subject and pattern and their scores)

# Installing and loading packages pairwise alignment blosum50 and blosum100 ----

# Installing seqinR R package:
install.packages("seqinr")
# seqinR end

# Installing Bioconductor - Biostrings
chooseCRANmirror() # Getting inside CRAN mirrors
1 
# Choosing the first option
install.packages("BiocManager") # Installing BiocManager in order to install Biostrings
BiocManager::install("Biostrings") # Installing Biostrings via BiocManager
# End of Biostrings installation: Biostrings, BiocManager

# seqinr:
require(seqinr)
# seqinr end

# Biostrings:
require(Biostrings) # Loading Biostrings into library
#Biostrings end


# Step 1 - Listing all fasta files:----
all_fasta<-list.files("C:/Users/tiriy/Desktop/TheProject/ProjectR/fastas") 
# Making a vector to reffer to all fasta files in certain folder
# End of step 1

# Step 2 - Making a data frame out of the files ----
# using biostrings package - readAAStringSet
# And converting it into a dataframe:
A <- readAAStringSet(all_fasta) #Reading all fasta files at once {biostrings}
A1 <- names(A) # Setting vector to hold all the names from the fasta 
# files retrievd in vector A
A2 <- paste(A) # Setting a vector A2 with pasted values from vector A
ProtDF <- data.frame(A1,A2) # Using A1 as first olumn and A2 as second column
# End of step 2 

# Step 3 making a generation of combinations for pairwise alignment ----
D <- combn(25,2)
# End of easy peasy step 3

# Step 4 loading data:---- 
# 
# Loading Data:
data("BLOSUM50")
data("BLOSUM100")
# End of step 4

# Step 5 - assigning empty vectors for a for loop----
ScoreX <- c()
nameSeq1 <- c()
nameSeq2 <- c()
# End of step 5

# Step 6 THE LOOP ----
for (i in 1:300) { 
    dumpme<-pairwiseAlignment(toString(ProtDF[D[1,i],2]) # combining all pairwise alignment possible
                              ,toString(ProtDF[D[2,i],2]), # for 25 sequences = 300 combinations
                              substitutionMatrix = "BLOSUM50", # pairwise alignment settings
                              gapOpening = -2,
                              gapExtension = -8,
                              scoreOnly=FALSE)
   # filling vectors
  ScoreX[i]<-c(dumpme@score)   #score
  nameSeq1[i]<-c(as.character(ProtDF[D[1,i],1]))
  nameSeq2[i]<-c(as.character(ProtDF[D[2,i],1]))

}
#End of step 6 

# Step 7 creating a dataframe ----
BLOSUM50DF<- data.frame(nameSeq1,nameSeq2,ScoreX ) # DF consists of columns sequence 1 and 2 names as writen in fasta
colnames(BLOSUM50DF)<-c("Sequence 1","Sequence 2","Score") # score achieved in pairwise alignment calculation
write.csv(BLOSUM50DF,file="C:/Users/tiriy/Desktop/TheProject/BLOSUM50DF.csv") # writing to file (csv)
# End of step 7