I've found several threads on this (rather simple) topic, but none quite simple enough: I want to remove duplicate entries from a FASTA file based on their one-line >name header, which in my case is numeric (a GI number).
Based on Pierre Lindenbaum's answers in other threads, you would linearise the sequences and then sort by column 1 (the header), as opposed to column 2 if you wanted to sort by sequence. And then you'd use sort's unique option and sed?
>123456
AAAGTGTGTAGGAAGATGTGATGCCTCGAGATGC
>123456
AAAGTGTGTAGGAAGATGTGATGCCTCGAGATGC
There are no spaces within lines and no blank lines between them in my file.
So the steps would be: linearise, sort using the options -k1,1 -u, then move back to FASTA using tr.
Is this correct?
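For anyone reading along, the three steps can be sketched roughly as below. This is a minimal sketch and not necessarily the exact command from the other threads: the awk linearising step, the tab separator, and the file names input.fa and dedup.fa are my own assumptions for illustration.

```shell
# Hypothetical input: two identical records plus one distinct, multi-line record.
cat > input.fa <<'EOF'
>123456
AAAGTGTGTAGGAAGATGTGATGCCTCGAGATGC
>123456
AAAGTGTGTAGGAAGATGTGATGCCTCGAGATGC
>789012
CCCGTGTGTAGG
AAGATG
EOF

# 1. Linearise: print each record as "header<TAB>sequence" on a single line.
# 2. Sort on column 1 (the header) and keep one line per distinct header.
# 3. Turn the tab back into a newline to restore the FASTA layout.
awk '/^>/ { if (NR > 1) printf "\n"; printf "%s\t", $0; next }
     { printf "%s", $0 }
     END { if (NR > 0) printf "\n" }' input.fa |
  sort -t "$(printf '\t')" -k1,1 -u |
  tr '\t' '\n' > dedup.fa

cat dedup.fa
```

Two things to keep in mind with this approach: wrapped sequences come out joined onto one line, and because uniqueness is decided on the header alone, records that share a header but differ in sequence are collapsed to a single entry.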
Update: the above (from Pierre Lindenbaum) does the job. Very good.
The only thing is that I dropped the -t ' ' and -f flags in sort (they didn't seem necessary?). Also, the first line of the output file is a single >, which I manually deleted.
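Instead of deleting that line by hand, it could probably be filtered out. My guess (an assumption, I have not traced it) is that the linearising step emits an empty first line and a later step then prefixes it with >. A minimal sketch, with hypothetical file names dedup_raw.fa and dedup_clean.fa:

```shell
# Hypothetical output file whose first line is a bare ">" with no name.
printf '>\n>123456\nAAAGTGTGTAGGAAGATGTGATGCCTCGAGATGC\n' > dedup_raw.fa

# Drop every line that consists solely of ">"; real headers are left untouched.
grep -v '^>$' dedup_raw.fa > dedup_clean.fa

cat dedup_clean.fa
```

This only removes lines that are exactly >, so headers such as >123456 are kept.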