I have a fasta file that has more than 2.7 million headers. I want to break it into chunks.
>gene1
ACTG...
>gene2
ATTT...
...
>gene2,700,000
GCAC...
The way I do it is:
grep -n "^>" my.fasta > headersofmy.fasta
This gives me the line number of each header:
1:>gene1
4:>gene2
11:>gene3
...
n:>gene2,700,000
I then use those line numbers to grab a set number of genes:
awk 'NR>=position1&&NR<=position2' my.fasta > set1.fasta
I do this a couple of times to break my initial huge fasta file into smaller files, each with a set number of headers.
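Automating those two steps with a loop over the header line numbers would look roughly like this (a sketch only; the header_lines.txt temp file is a name I made up, and the 500,000 chunk size and set*.fasta naming just follow my example above):

# line numbers of all headers
grep -n "^>" my.fasta | cut -d: -f1 > header_lines.txt

total=$(wc -l < header_lines.txt)   # number of headers
chunk=500000                        # headers per output file
i=1
part=1
while [ "$i" -le "$total" ]; do
    start=$(sed -n "${i}p" header_lines.txt)    # first line of this chunk
    next=$((i + chunk))
    if [ "$next" -le "$total" ]; then
        # last line of this chunk = line before the next chunk's first header
        end=$(( $(sed -n "${next}p" header_lines.txt) - 1 ))
        awk -v a="$start" -v b="$end" 'NR>=a && NR<=b' my.fasta > "set${part}.fasta"
    else
        # final chunk runs to the end of the file
        awk -v a="$start" 'NR>=a' my.fasta > "set${part}.fasta"
    fi
    i=$next
    part=$((part + 1))
done

This still rereads my.fasta once per chunk, exactly like doing it by hand, so it is only a tidier version of the same idea.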
I first broke it into chunks of 500,000 headers, then into chunks of 100,000.
I feel that there is a smarter way to do this if I want to break it into even smaller chunks based on the number of headers. I've seen other ways to split a fasta file, but they split based on file size or k-mer size rather than on the number of headers; what I'm after is something like the one-pass sketch below.
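A rough sketch of the kind of thing I mean: a single awk pass that starts a new output file every N headers (the chunk size of 100,000 and the chunk*.fasta names are placeholders, and it assumes the file begins with a header line):

awk -v n=100000 '
    /^>/ {                              # each header line starts a new record
        if (nseq % n == 0) {            # time to start a new chunk
            if (out) close(out)         # close the previous chunk file
            out = sprintf("chunk%d.fasta", ++part)
        }
        nseq++
    }
    { print > out }                     # every line goes to the current chunk
' my.fasta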
Any suggestion on how to approach this?
Please use the formatting bar (especially the code option) to present your post better. I've done it for you this time.

Hello sicat.paolo20,
There are multiple answers posted below. If an answer was helpful, you should upvote it; if the answer resolved your question, you should mark it as accepted. You can accept more than one answer, if they work.
Sorry, I was occupied with another issue and forgot to check my account again.