I know there are a ton of awk one-liners on here for splitting a fasta file, but here's one I have not been able to get to work or find an answer for.
A simple tool would be helpful, but I tried pyfasta and seqtk to no avail.
Please excuse me if someone has answered this one before, but I have googled and biostared for a while with no awk solution in sight.
A collaborator passed me a fasta file with the output from OrthoMCL - clustered genes in a single fasta file. A clustered group of genes in the file is listed alphabetically by the organism it was found in. Genes for some organisms are not present, so I can't split on the number of total organisms represented across all the fasta headers.
Any advice on how to split a fasta file wherever the first two characters of the header are >A, so that each clustered gene ends up in its own fasta file? There are multiple organisms with A as the first letter, so I don't want to split on every A - I want to split only before the first A in a series.
@OP: Good description of data. An example fasta and expected output would be helpful.
Absolutely...
The original file is over 3 million sequences ordered like this:
and I am looking to parse the file like this:
file 1:
file 2:
file 3:
file 4:
and so on...
Note that the sequences are similar but not identical so I can't separate by sequence. I can recluster, but with millions of sequences I am trying to avoid this and quickly parse the file.
I'm fairly certain from the gene clustering that each section of each gene begins with an A taxon, but I am not 100% certain. Thanks.
In short, you would like to break before every run of A_sp headers and send each piece to a new file. @OP
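Not awk, but here is a minimal Python sketch of that rule, in case it helps. It assumes that every cluster really does begin with a header starting with >A (which the OP says is likely but not certain) and that the runs of >A headers are contiguous. The function name `split_on_first_a`, the `cluster_N.fasta` naming, and the output directory are made up for illustration. It reads the file line by line and keeps only one output file open at a time, so memory use should not depend on the input size.

```python
import os


def split_on_first_a(fasta_path, out_dir, prefix="cluster"):
    """Split a FASTA file before the first '>A' header of each run of '>A' headers.

    Assumption: each cluster begins with at least one header starting with '>A'.
    Returns the number of output files written.
    """
    os.makedirs(out_dir, exist_ok=True)
    n_files = 0
    prev_is_a = False  # did the previous header start with '>A'?
    out = None
    try:
        with open(fasta_path) as fasta:
            for line in fasta:
                if line.startswith(">"):
                    is_a = line.startswith(">A")
                    # Open a new file at the very first header, or when an
                    # '>A' header follows a non-'>A' header.
                    if out is None or (is_a and not prev_is_a):
                        if out is not None:
                            out.close()
                        n_files += 1
                        name = f"{prefix}_{n_files}.fasta"
                        out = open(os.path.join(out_dir, name), "w")
                    prev_is_a = is_a
                # Lines before the first header (if any) are skipped.
                if out is not None:
                    out.write(line)
    finally:
        if out is not None:
            out.close()
    return n_files
```

Calling `split_on_first_a("clusters.fasta", "split_out")` (both names are placeholders) would write `split_out/cluster_1.fasta`, `split_out/cluster_2.fasta`, and so on. If a cluster has no A taxon, it will be merged into the previous file, so it is worth spot-checking a few outputs. With millions of sequences this may also create a very large number of files in one directory, which some filesystems handle poorly.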