Is there a way to remove redundant amino acid sequences from a FASTA file and also output all the redundant accession numbers in groups, just like mothur's unique.seqs
command (which unfortunately only works on nucleic acid data)?
The accession number output should look like this (or similar):
G9SS7BA01AM9A3 G9SS7BA01AM9A3,G9SS7BA01EMTMV,G9SS7BA01CYG40,G9SS7BA01AWI8Z,G9SS7BA01AFVJC,G9SS7BA01BCZCD,G9SS7BA01DBN7B,G9SS7BA01CZ7GO,G9SS7BA01C05FB
G9SS7BA01EAKDX G9SS7BA01EAKDX,G9SS7BA01B1MNY
G9SS7BA01C2SRQ G9SS7BA01C2SRQ,G9SS7BA01AK1UJ,G9SS7BA01BLVCZ,G9SS7BA01ARMFA
G9SS7BA01BQ5UG G9SS7BA01BQ5UG,G9SS7BA01BZ9XF
G9SS7BA01BD4F9 G9SS7BA01BD4F9
Each row is a group of identical seqs, and the first column is the one kept in the 'uniques' file.
USEARCH only outputs a file with the unique seqs.
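If no existing tool fits, the grouping described above can also be done with a short script. The following is only a minimal sketch, not mothur's or USEARCH's actual method: it treats two sequences as redundant only when they are exactly identical (case-insensitive), takes the accession to be the header text up to the first whitespace, and writes the two columns tab-separated. The function names are made up for illustration.

```python
def read_fasta(handle):
    """Yield (accession, sequence) pairs from an open FASTA file."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Assumption: the accession is the header up to the first whitespace.
            parts = line[1:].split()
            name = parts[0] if parts else ""
            chunks = []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def group_identical(records):
    """Map each distinct sequence to the accessions carrying it, in input order."""
    groups = {}
    for name, seq in records:
        # Assumption: case differences do not make two sequences distinct.
        groups.setdefault(seq.upper(), []).append(name)
    return groups


def write_outputs(groups, fasta_out, names_out):
    """Write the 'uniques' FASTA and the accession groups table."""
    for seq, names in groups.items():
        # The first accession seen for a sequence is the one that is kept.
        fasta_out.write(f">{names[0]}\n{seq}\n")
        names_out.write(f"{names[0]}\t{','.join(names)}\n")
```

To use it, open the input FASTA for reading and two output files for writing, then call write_outputs(group_identical(read_fasta(infile)), uniques_file, names_file). All distinct sequences are held in memory, which should be fine for typical amplicon-sized data sets but may not suit very large files.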
You can change the maximum allowed length of the description in the cd-hit output file with the -d option.
Nice. I didn't know that. I guess I should really go through the options.
Thanks! Is there any way to influence what is written to the CLSTR file, e.g. to only include seqs with redundancy or to change the format?
Doesn't matter, USEARCH does all I need.
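For anyone who still wants the cd-hit route: as far as I know the CLSTR format itself is fixed, but it is easy to post-process. The sketch below assumes the usual layout, where each cluster starts with a ">Cluster N" line, members look like "0	120aa, >NAME... at 100.00%", and the representative ends in "*" instead. The function names are hypothetical. Names in the CLSTR file are truncated unless -d is set high enough; as I understand it, -d 0 keeps the header up to the first space. Note also that cd-hit clusters are not necessarily exact duplicates, depending on the identity threshold used.

```python
import re

# Assumed member line layout: "<idx>\t<len>aa, >NAME... *" for the representative
# and "<idx>\t<len>aa, >NAME... at <identity>%" for the other members.
MEMBER = re.compile(r">(.+?)\.\.\. (\*|at )")


def parse_clstr(handle):
    """Yield (representative, members) for each cluster in a .clstr file."""
    rep, members = None, []
    for line in handle:
        if line.startswith(">Cluster"):
            if members:
                # Fall back to the first member if no line was marked with "*".
                yield rep or members[0], members
            rep, members = None, []
            continue
        match = MEMBER.search(line)
        if match is None:
            continue  # skip lines that do not look like member lines
        name = match.group(1)
        members.append(name)
        if match.group(2) == "*":
            rep = name
    if members:
        yield rep or members[0], members


def names_lines(clusters, redundant_only=True):
    """Format clusters as 'representative<TAB>comma-separated members' lines."""
    for rep, members in clusters:
        if redundant_only and len(members) < 2:
            continue  # drop singletons when only redundant groups are wanted
        ordered = [rep] + [m for m in members if m != rep]
        yield f"{rep}\t{','.join(ordered)}"
```

Calling names_lines(parse_clstr(open_clstr_file)) gives one line per cluster with at least two members; pass redundant_only=False to keep singletons as well.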