Splitting a large FASTA file into smaller FASTA files by Scaffolds
9.4 years ago

I am working with a large FASTA file that is subdivided into "Scaffolds", similar to this sample:

>Scaffold1
AAGTCCTGACTNNNNNNNNNNNNCGTACGTATCGATCG
>Scaffold2
ACGTAGCTAATCGAGATCGATCNNNNNNNNNNNNNCGATTACGTACGATGTG
>Scaffold3
ACGTATCGTACGTGTANNNNNNNNNNNNCGTACGATGCTACTCACGGATTACAAA

I was wondering if there was a way in R to split the one large FASTA file into smaller FASTA files each containing one Scaffold. Thank you!

alignment FASTA R • 8.0k views

Any particular reason you want to do this in R? You should always try to use the right tool for the job; in this case, perhaps something like csplit. Certainly not R.
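
For anyone wondering what the csplit route might look like, here is a minimal sketch. It assumes GNU coreutils csplit (the `-b` suffix format and `{*}` repeat are not available in every csplit), and the file name `scaffolds.fa` and prefix `scaffold_` are only illustrative; the sample input is the one from the question.

```shell
# Sample input mirroring the question (file names here are illustrative).
cat > scaffolds.fa <<'EOF'
>Scaffold1
AAGTCCTGACTNNNNNNNNNNNNCGTACGTATCGATCG
>Scaffold2
ACGTAGCTAATCGAGATCGATCNNNNNNNNNNNNNCGATTACGTACGATGTG
>Scaffold3
ACGTATCGTACGTGTANNNNNNNNNNNNCGTACGATGCTACTCACGGATTACAAA
EOF

# Start a new output file before every line that begins with ">".
#   -s  suppress the per-file byte counts
#   -z  drop the empty piece that precedes the first header
#   -f  output file prefix
#   -b  printf-style numeric suffix (GNU extension)
csplit -s -z -f scaffold_ -b '%03d.fa' scaffolds.fa '/^>/' '{*}'
```

Note that the output files are numbered with the `scaffold_` prefix rather than named after each header, so this fits best when the scaffold names themselves are not needed in the file names.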

9.4 years ago
$ (sudo) pip install pyfaidx
$ faidx --split-files scaffolds.fa

I know it's not R but I can't recommend an R solution for this with a straight face.

pyfaidx: https://github.com/mdshw5/pyfaidx/

9.4 years ago

why R?

awk '/^>/ { if (O != "") close(O); O = substr($0, 2) ".fa" } O != "" { print >> O }' input.fa
9.4 years ago
Steven Lakin ★ 1.8k

While I agree that using a base package is much faster and more convenient, the question asks for R. So in case someone searches for an R solution in the future, here is a working one:

library(iterators)
f <- ireadLines('filepath/inputfile.fasta')
outpath <- 'output/directory/fileprefix' ## don't include the .fasta
count <- 1
outfile <- NULL
while(TRUE) {
        ## nextElem() raises an error once the input is exhausted
        d <- try(nextElem(f), silent=TRUE)
        if(inherits(d, 'try-error')) break
        if(grepl("^>", d)){
                ## a header line starts a new output file
                if(!is.null(outfile)) close(outfile)
                outfile <- file(paste0(outpath, count, '.fasta'), open='w')
                count <- count + 1
        }
        ## lines before the first header (if any) are skipped
        if(!is.null(outfile)) writeLines(d, outfile)
}
if(!is.null(outfile)) close(outfile)
rm(f)
8.9 years ago
polepole40 • 0

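Here is my R solution, assuming that you want to split your FASTA file by identifier:

library(Biostrings)
## scaffolds are nucleotide sequences, so read them as DNA rather than amino acids
scaffolds <- readDNAStringSet(filepath = "scaffolds.fasta", use.names = TRUE)
## group sequences by header; each group is written to its own file
t2 <- split(scaffolds, names(scaffolds))
invisible(lapply(names(t2), function(x) writeXStringSet(t2[[x]], filepath = paste0(x, ".fasta"))))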