Question

How To Extract Multiple Fasta Sequences At A Time From A File Containing Sequences Ids Using R

0

Entering edit mode

10.8 years ago

catagui ▴ 40

Hello I will like to know how can I extract multiple fasta sequences from a file that have a list of the IDs (133 in total) I want to extract. I have started by loading my fasta and ID file in R:

library("seqinr")
fastafile<- read.fasta(file = "proteins.fasta", 
                       seqtype = "AA",as.string = TRUE, set.attributes = FALSE)

head(fastafile)

$`1.1.1.m1`
[1] "MRRRGQWWFTAETSVGQTANTSANSDLLSPAFWLVRGHEFKITRSDDPQHTALLQTSDDCLGGQTFRAKITSYGRFTERESWEIKPNVDGCRGSCNVSYAGRFEETVGFKQAKCSSRIQSEKNIGFWCAIGSRGSVMMIGGGGKPCTLGDHGIGITNAKDRSFSHSPSSKRNDFGDVATSSPETSYSLNLWIQ"

$`1.1.2.m1`
[1] "MHEHTSQSVACGAQTEEVLRSITMRRKTNYQTATTCLVKLIFEHVLNVRKTNSIEKFDGLEARHRKHIKEIVALEINPNSFGISERQGPIPQPVILFPLNAEYQARDVKNRTAPGIPSGVSLAPGPNGEKDGSYEFFGNTNSFIEFPNSPRGALDVLYSITILCWVYYDEKGGPHGLIFEYNTGGKYGVHLWVVNRLFSARFIDRAFSYSRPYLRHTSLAGGWKFVGASYDNETGEIKLWADGA"

co2=read.table("trt_co.csv",header=T, sep=",")
head(co2)

1 1.1.10073.m1
2 1.1.10395.m1
3 1.1.10428.m1
4 1.1.10509.m1
5 1.1.10621.m1
6 1.1.10760.m1

I will appreciate your help on what would be the next step.

Thanks

r fasta • 11k views

ADD COMMENT • link updated 2.7 years ago by Ram 44k • written 10.8 years ago by catagui ▴ 40

0

Entering edit mode

You use read.table(header = T), but head(co2) does not show column names? Or did you just omit them from this post?

ADD REPLY • link 10.8 years ago by Neilfws 49k

0

Entering edit mode

I'm just curious, why would you do that in R?

ADD REPLY • link 10.8 years ago by Phil S. ▴ 700

0

Entering edit mode

One reason: because your downstream analysis is most easily performed in R. The seqinr package contains a lot of useful functions for statistical analysis of sequences; reading them in is just the first step.

ADD REPLY • link 10.8 years ago by Neilfws 49k

0

Entering edit mode

In which program would you advice doing it?

ADD REPLY • link 10.8 years ago by catagui ▴ 40

0

Entering edit mode

I would extract them before I load it into R, for example using 'grep' (if your fasta hast sequences in just one line) some script pyhton/bash/perl....

if you are on a linux machine you can also go with kent source utils: kent source. There is a lof of usefull stuff in your case look at "faSomeRecords".

ADD REPLY • link 10.8 years ago by Phil S. ▴ 700

Ram · Answer 1 · 2014-02-12

2

Entering edit mode

10.8 years ago

Neilfws 49k

You don't say what the column in co2 containing the IDs is called, but let's assume that it is named ids.

Then you can simply subset the list fastafile like this:

fastafile[names(fastafile) %in% co2$ids]

ADD COMMENT • link 10.8 years ago by Neilfws 49k

0

Entering edit mode

Sorry, the column didn't had a name when I copy from text file.

I have tried it and it works great.

Thanks

ADD REPLY • link 10.8 years ago by catagui ▴ 40

0

Entering edit mode

Hey,

Sorry I am new to R. After entering the line of code:

fastafile[names(fastafile) %in% list$ids

R returns the following:

named list()

Can anyone explain this to me?

ADD REPLY • link updated 2.7 years ago by Ram 44k • written 10.0 years ago by henrikkjeldal ▴ 10

0

Entering edit mode

This doesn't work because the fasta format is a list of the id, ali and call for that set of data. When you try to subset this way, it is trying to find the list of sequences you want in the list that is currently [id, ali, call], so it gives you an empty result (since none of your sequence names match those objects).

I did this instead, creating a new fasta object with the subsetted alignment object (the second object in the fasta list) that I took out of my original fastafile.

newfastafile=fastafile(fastafile$ali[fastafile$id %in% list$ids,])

I was encountering the same problem.

ADD REPLY • link updated 2.7 years ago by Ram 44k • written 9.7 years ago by smendero • 0

0

Entering edit mode

OK so I did this and I get an error that says fastafile is not a function so is fastafile that is outside of the () just supposed to be calling the object created earlier?

ADD REPLY • link updated 2.7 years ago by Ram 44k • written 9.2 years ago by mcintyrejosh66 • 0