Hi! I'm very new to bioinformatics and was wondering if someone could help me.
I am trying to split a multi-FASTA file (1.5 million contigs) into smaller files of 1000 sequences each (except for the last one, probably). My goal is to BLASTn the smaller files afterwards. The problem is that I cannot use "split" in Unix directly on the .fasta file, since each sequence has a different length. If I use "split", it breaks sequences in the middle.
I have the option of putting each sequence on a single line so I can split the file afterwards. I managed to do that, but I really want to learn more about coding (I feel like this would be the "easy option").
My supervisor told me that I could do this in R with a for loop. I read my data using the package ape, so the .fasta file is now a DNAbin object. When I do length(mydata), the result is the number of sequences. I tried x <- mydata[1:2] and it gave me my first two sequences. So we can assume that, for now, each line = one sequence.
Basically, I want R to start a new file at every multiple of 1000 (1000, 2000, 3000, 4000, ...): lines 1:1000 go into one file, lines 1001:2000 into another file, and so on. I only recently started with Python and R, so I've been trying a few things, but nothing is working.
Is that something doable in R?
Thank you in advance!
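The chunking described above could be sketched in R roughly as follows. This is only a minimal sketch, assuming the data is a DNAbin list as returned by ape's read.FASTA (one element per sequence) and that each chunk is written with ape's write.FASTA. The function name split_fasta, its arguments, and the output file names (chunk_0001.fasta, chunk_0002.fasta, ...) are made up for illustration.

```r
library(ape)

# Hypothetical helper: write a DNAbin list to several FASTA files,
# chunk_size sequences per file (the last file may hold fewer).
split_fasta <- function(seqs, chunk_size = 1000, prefix = "chunk", out_dir = ".") {
  n <- length(seqs)                      # number of sequences
  starts <- seq(1, n, by = chunk_size)   # 1, 1001, 2001, ...
  out_files <- character(length(starts))

  for (i in seq_along(starts)) {
    first <- starts[i]
    last <- min(first + chunk_size - 1, n)  # do not run past the last sequence
    # Assumed naming scheme: <prefix>_0001.fasta, <prefix>_0002.fasta, ...
    out_files[i] <- file.path(out_dir, sprintf("%s_%04d.fasta", prefix, i))
    write.FASTA(seqs[first:last], file = out_files[i])
  }

  invisible(out_files)  # return the paths so they can be passed to BLASTn later
}
```

With the data already loaded as mydata, calling split_fasta(mydata) would write the chunks into the current working directory. R indexes from 1, so the first chunk is 1:1000 rather than 0:1000.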
Thank you for your answer!!