Question

Splitting a large FASTA file into smaller FASTA files by Scaffolds

2

10.1 years ago

cameron.gudobba ▴ 20

I am working with a large FASTA file that is subdivided into "Scaffolds", similar to this sample:

>Scaffold1
AAGTCCTGACTNNNNNNNNNNNNCGTACGTATCGATCG
>Scaffold2
ACGTAGCTAATCGAGATCGATCNNNNNNNNNNNNNCGATTACGTACGATGTG
>Scaffold3
ACGTATCGTACGTGTANNNNNNNNNNNNCGTACGATGCTACTCACGGATTACAAA

I was wondering if there was a way in R to split the one large FASTA file into smaller FASTA files each containing one Scaffold. Thank you!

alignment FASTA R • 8.6k views

updated 2.7 years ago by Ram 45k • written 10.1 years ago by cameron.gudobba ▴ 20

1

Any particular reason you want to do this in R? You should always try and use the right tool for the job; in this case something like csplit, perhaps? Certainly not R.

10.1 years ago by Alexander Skates ▴ 370

Ram · Answer 1 · 2015-07-10

5

10.1 years ago

Matt Shirley 10k

$ (sudo) pip install pyfaidx
$ faidx --split-files scaffolds.fa

I know it's not R but I can't recommend an R solution for this with a straight face.

pyfaidx: https://github.com/mdshw5/pyfaidx/

updated 5.7 years ago by Ram 45k • written 10.1 years ago by Matt Shirley 10k

Ram · Answer 2 · 2015-07-10

5

10.1 years ago

Pierre Lindenbaum 166k

Why R?

awk '/^>/ { if (O != "") close(O); O = sprintf("%s.fa", substr($0, 2)) } O != "" { print >> O }' input.fa
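For readers who prefer a script over a one-liner, the same streaming idea can be sketched in Python with only the standard library. This is a minimal sketch, not part of any tool mentioned in this thread: the function name `split_fasta` and the `out_dir` parameter are illustrative, and naming each file after the first whitespace-delimited token of the header is an assumption about how the outputs should be named.

```python
from pathlib import Path


def split_fasta(fasta_path, out_dir="."):
    """Write each record of fasta_path to <identifier>.fa inside out_dir.

    Hypothetical helper: streams the input line by line, so only one
    output file is open at any time. Returns the output paths in the
    order their identifiers were first seen.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []   # output paths, first-seen order
    seen = set()   # paths already created during this run
    out = None
    try:
        with open(fasta_path) as handle:
            for line in handle:
                if line.startswith(">"):
                    # New record: close the previous file before opening the next.
                    if out is not None:
                        out.close()
                    fields = line[1:].split()
                    # Assumption: the first token of the header is the file name.
                    name = fields[0] if fields else "unnamed"
                    path = out_dir / f"{name}.fa"
                    # Overwrite on first sight, append if the identifier repeats.
                    mode = "a" if path in seen else "w"
                    if path not in seen:
                        seen.add(path)
                        written.append(path)
                    out = open(path, mode)
                # Lines before the first header are skipped.
                if out is not None:
                    out.write(line)
    finally:
        if out is not None:
            out.close()
    return written
```

Calling `split_fasta("input.fa", "scaffolds_out")` would write `Scaffold1.fa`, `Scaffold2.fa`, and so on into `scaffolds_out`. Headers containing characters that are not valid in file names (such as `/`) are not handled here.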

updated 5.7 years ago by Ram 45k • written 10.1 years ago by Pierre Lindenbaum 166k

Ram · Answer 3 · 2015-07-10

While I agree that using a base package is much faster and more convenient, this question asks about R. So in case someone searches for an R approach in the future, here is a working solution:

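library(iterators)

f <- ireadLines('filepath/inputfile.fasta')  ## iterate over the file one line at a time
outpath <- 'output/directory/fileprefix'     ## don't include the .fasta
count <- 1
outfile <- NULL
while (TRUE) {
        d <- try(nextElem(f), silent=TRUE)
        if (inherits(d, 'try-error')) break  ## iterator exhausted: end of file
        if (grepl("^>", d)) {
                ## header line: close the previous file and start the next numbered one
                if (!is.null(outfile)) close(outfile)
                outfile <- file(paste0(outpath, count, '.fasta'), open='w')  ## 'w' overwrites an existing file
                count <- count + 1
        }
        ## lines before the first header have no output file yet, so skip them
        if (!is.null(outfile)) writeLines(d, outfile)
}
if (!is.null(outfile)) close(outfile)
rm(f)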

Ram · Answer 4 · 2015-12-17

0

9.7 years ago

polepole40 • 0

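Here is my R solution, assuming that you want to split your FASTA file by identifier:

library(Biostrings)
# Scaffolds are nucleotide sequences, so read them as DNA rather than amino acids
scaffolds <- readDNAStringSet(filepath = "scaffolds.fasta", use.names = TRUE)
# Group records by identifier; records sharing a name end up in the same file
t2 <- split(scaffolds, names(scaffolds))
for (x in names(t2)) writeXStringSet(t2[[x]], filepath = paste0(x, ".fasta"))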
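Outside R, the same read, group by identifier, and write pattern can be sketched in plain Python. This is an illustration under stated assumptions rather than an equivalent of Biostrings: `read_fasta` and `split_by_identifier` are hypothetical helper names, the whole file is held in memory, the full header text is used as the file name (so headers containing `/` or other characters invalid in file names will fail), and sequences are re-wrapped at `width` characters per line.

```python
import os
from collections import defaultdict


def read_fasta(path):
    """Yield (header, sequence) pairs; multi-line sequences are joined."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.rstrip()
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif header is not None:
                chunks.append(line.strip())
    # Emit the final record, if the file had any.
    if header is not None:
        yield header, "".join(chunks)


def split_by_identifier(path, out_dir=".", width=80):
    """Write one <header>.fasta per distinct header; return the sorted headers."""
    groups = defaultdict(list)
    for header, seq in read_fasta(path):
        groups[header].append(seq)
    os.makedirs(out_dir, exist_ok=True)
    for header, seqs in groups.items():
        # Assumption: the full header text is safe to use as a file name.
        with open(os.path.join(out_dir, f"{header}.fasta"), "w") as out:
            for seq in seqs:
                out.write(f">{header}\n")
                # Wrap the sequence at `width` characters per line.
                for i in range(0, len(seq), width):
                    out.write(seq[i:i + width] + "\n")
    return sorted(groups)
```

For example, `split_by_identifier("scaffolds.fasta", out_dir="split")` would write one file per distinct header into `split`, with records that share a header collected in the same file.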

updated 5.7 years ago by Ram 45k • written 9.7 years ago by polepole40 • 0