Hello, I would like to know how I can extract multiple FASTA sequences from a file, given a list of the IDs (133 in total) that I want to extract. I have started by loading my FASTA and ID files in R:
library("seqinr")
fastafile<- read.fasta(file = "proteins.fasta",
seqtype = "AA",as.string = TRUE, set.attributes = FALSE)
head(fastafile)
$`1.1.1.m1`
[1] "MRRRGQWWFTAETSVGQTANTSANSDLLSPAFWLVRGHEFKITRSDDPQHTALLQTSDDCLGGQTFRAKITSYGRFTERESWEIKPNVDGCRGSCNVSYAGRFEETVGFKQAKCSSRIQSEKNIGFWCAIGSRGSVMMIGGGGKPCTLGDHGIGITNAKDRSFSHSPSSKRNDFGDVATSSPETSYSLNLWIQ"
$`1.1.2.m1`
[1] "MHEHTSQSVACGAQTEEVLRSITMRRKTNYQTATTCLVKLIFEHVLNVRKTNSIEKFDGLEARHRKHIKEIVALEINPNSFGISERQGPIPQPVILFPLNAEYQARDVKNRTAPGIPSGVSLAPGPNGEKDGSYEFFGNTNSFIEFPNSPRGALDVLYSITILCWVYYDEKGGPHGLIFEYNTGGKYGVHLWVVNRLFSARFIDRAFSYSRPYLRHTSLAGGWKFVGASYDNETGEIKLWADGA"
co2=read.table("trt_co.csv",header=T, sep=",")
head(co2)
1 1.1.10073.m1
2 1.1.10395.m1
3 1.1.10428.m1
4 1.1.10509.m1
5 1.1.10621.m1
6 1.1.10760.m1
I would appreciate your help with what the next step should be.
Thanks
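One possible next step, staying in R: since read.fasta returns a list named by sequence ID, the list can be subset by matching its names against the ID vector. The sketch below uses a toy list in place of fastafile and a made-up ids vector (one ID is deliberately absent); in the real session the IDs would presumably come from the first column of co2, which is an assumption about that file's layout. The output file name in the last comment is hypothetical.

```r
# Toy stand-ins for the objects in the question (the sequences are made up).
fastafile <- list(
  "1.1.1.m1" = "MRRRGQ",
  "1.1.2.m1" = "MHEHTS",
  "1.1.3.m1" = "MKLVAA"
)
# In practice this would be something like: ids <- as.character(co2[[1]])
ids <- c("1.1.2.m1", "1.1.3.m1", "9.9.9.m1")

# Keep only the sequences whose names appear in the ID list.
wanted <- fastafile[names(fastafile) %in% ids]

# IDs that were requested but are not present in the FASTA file.
missing_ids <- setdiff(ids, names(fastafile))

print(names(wanted))
print(missing_ids)

# With seqinr loaded, the subset could then be written back out, e.g.:
# write.fasta(sequences = wanted, names = names(wanted),
#             file.out = "subset.fasta", as.string = TRUE)
```

Checking missing_ids is worth doing with 133 IDs, because a mismatch in naming (trailing whitespace, a different suffix) would otherwise silently drop sequences.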
You use read.table(header = T), but head(co2) does not show column names? Or did you just omit them from this post?
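This matters because, if the ID file has no header line, header = TRUE turns the first ID into the column name and one ID is lost. A small sketch of the difference, using made-up inline text instead of the real trt_co.csv (whose actual layout is unknown here):

```r
# Hypothetical ID file contents with no header line (an assumption).
txt <- "1.1.10073.m1\n1.1.10395.m1\n1.1.10428.m1"

# header = TRUE treats the first line as the column name.
with_header <- read.table(text = txt, header = TRUE, sep = ",",
                          stringsAsFactors = FALSE)

# header = FALSE keeps every line as data, in a column named V1.
no_header <- read.table(text = txt, header = FALSE, sep = ",",
                        stringsAsFactors = FALSE)

print(nrow(with_header))  # one row fewer than the number of lines
print(nrow(no_header))    # one row per line
```

Comparing nrow(co2) with the expected 133 is a quick way to tell which case applies.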
I'm just curious, why would you do that in R?
One reason: because your downstream analysis is most easily performed in R. The seqinr package contains a lot of useful functions for statistical analysis of sequences; reading them in is just the first step.
In which program would you advise doing it?
I would extract them before loading the file into R, for example using 'grep' (if each sequence in your FASTA file is on a single line) or a short Python/Bash/Perl script.
If you are on a Linux machine, you can also use the Kent source utilities. There is a lot of useful stuff in there; in your case, look at "faSomeRecords".