Question

Split multi-fasta file into smaller files using R

0

Entering edit mode

5.2 years ago

slushverte01 • 0

Hi! I'm very new to bioinformatics and was wondering if someone could help me.

I am trying to split a multi-fasta file (1.5 millions of contigs) into smaller files, which would contain each 1000 sequences (except for the last one, probably). My goal is to BLASTn the smaller files after. The thing is that I cannot use "split" in Unix directly on the .fasta file, since each sequence has a different lenght. If I use "split", it breaks the sequence in the middle.

I have the option of putting all the sequences in one line so I can split them after. I managed to do it, however, I really want to learn more about coding (I feel like this would be the "easy option").

My supervisor has told me that I could do that in R with a for loop. I have read my data using the package ape so the .fasta file is now a DNAbin. When I do length(mydata), the result is the number of sequences. I have made a try by doing and it gave me my first two sequences. So we can assume that for now, each line = one sequence.

x <- mydata[1:2]

Basically, I want that everytime we see a multiple of 1000 (1000, 2000, 3000, 4000, ...) R writes the lines from 0:1000 into one file, the lines from 1001:2000 in another file, ... I just started recently with Python and with R, so I've been trying a few things but nothing is working.

Is that something doable in R?

Thank you in advance!

R multi-fasta fasta blastn sequences • 3.6k views

ADD COMMENT • link updated 5.2 years ago by Juke34 9.3k • written 5.2 years ago by slushverte01 • 0

score 2 · Answer 1 · 2020-06-26

2

Entering edit mode

5.2 years ago

GenoMax 153k

If using an alternate to R is an option then there are many other ways of doing this. faSplit utility by Jim Kent (UCSC) is likely the most performant option. Search for split fasta files site:biostars.org.

ADD COMMENT • link 5.2 years ago by GenoMax 153k

0

Entering edit mode

Thank you for your answer!!

ADD REPLY • link 5.2 years ago by slushverte01 • 0

score 0 · Answer 2 · 2020-06-27

hi. what you want to do is very easy with python and biopython. if ur supervisor said to use a for loop in R then they clearly aren't a Bioinformatician - avoid iterating in R.

I'm using one hand on a phone so this wont be syntacticly correct but it should get u close. there are faster ways, like seqtk (or just basic py), but this is teaching u some biopython, which comes in handy for little tasks like this.

from Bio import SeqIO

for i, fa in enumerate(SeqIO.parse('seqs.fa', 'fasta')):
    if i%1000==0:
        if i==0:
            fout = open(str(i)+'out.fa', 'w')
        else:
            fout.close()
            fout = open(str(i)+'out.fa', 'w')
    SeqIO.write(fa, fout, 'fasta')

fout.close()

I'm sure there's better ways but that's the first thing that came to mind. and sorry if hard to read. phone making it hard

score 0 · Answer 3 · 2020-06-27

0

Entering edit mode

5.2 years ago

Juke34 9.3k

Several solutions (None in R) documented in this mini-review

ADD COMMENT • link 5.2 years ago by Juke34 9.3k

0

Entering edit mode

Thank you for your answer!! I'll look into that this week!

ADD REPLY • link 5.2 years ago by slushverte01 • 0