Removing Redundant Amino Acid Sequences From Fasta - *But Also Give The Groups Of Redundant Acc Numbers*
11.6 years ago
angel.roey • 0

Is there a way to remove redundant amino acid sequences from a FASTA file but also output all the redundant accession numbers in groups, just like mothur's unique.seqs command (which unfortunately only works on nucleic acid data)? The accession number output should look like this (or similar):

G9SS7BA01AM9A3    G9SS7BA01AM9A3,G9SS7BA01EMTMV,G9SS7BA01CYG40,G9SS7BA01AWI8Z,G9SS7BA01AFVJC,G9SS7BA01BCZCD,G9SS7BA01DBN7B,G9SS7BA01CZ7GO,G9SS7BA01C05FB
G9SS7BA01EAKDX    G9SS7BA01EAKDX,G9SS7BA01B1MNY
G9SS7BA01C2SRQ    G9SS7BA01C2SRQ,G9SS7BA01AK1UJ,G9SS7BA01BLVCZ,G9SS7BA01ARMFA
G9SS7BA01BQ5UG    G9SS7BA01BQ5UG,G9SS7BA01BZ9XF
G9SS7BA01BD4F9    G9SS7BA01BD4F9

Where each row is a group of identical seqs and the first column is the one kept in the 'uniques' file.

USEARCH only outputs a file with the unique seqs.
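
For reference, the behaviour being asked for can be sketched in a few lines of plain Python. This is only an illustration, assuming exact (case-insensitive) sequence identity and that the accession is the first word of each FASTA header; the function names read_fasta, group_identical and names_lines are made up for this example and are not part of mothur or USEARCH.

```python
def read_fasta(lines):
    """Yield (accession, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Assumption: the accession is the first word of the header.
            words = line[1:].split()
            name, chunks = (words[0] if words else ""), []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def group_identical(records):
    """Map each distinct sequence to the accessions carrying it, in input order."""
    groups = {}
    for name, seq in records:
        groups.setdefault(seq.upper(), []).append(name)
    return groups


def names_lines(groups):
    """One row per group: kept accession, a tab, then all accessions in the group."""
    return [f"{names[0]}\t{','.join(names)}" for names in groups.values()]


if __name__ == "__main__":
    demo = [">A", "MKV", ">B", "MKV", ">C", "MKL"]
    for row in names_lines(group_identical(read_fasta(demo))):
        print(row)
```

The first accession seen for each sequence is the one kept, which matches the first column of the table above. Writing the unique sequences back out as FASTA would be a loop over the same dictionary.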

fasta amino-acids • 4.5k views
11.6 years ago

You can use CD-HIT (http://weizhong-lab.ucsd.edu/cd-hit/) and parse the resulting cluster file into tab delimited format.

The only problem you might face with CD-HIT is that if your sequence IDs are really long, the cluster output file will truncate them automatically. You might have to rename the sequences in your FASTA file to shorter names first and then map the original names back afterwards.
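
A rough sketch of the parsing step in Python follows. It assumes the usual .clstr layout, where each cluster starts with a ">Cluster N" line, each member line contains ">NAME..." and the representative is marked with a trailing "*". The function name parse_clstr is invented for this example. It also assumes CD-HIT was run at 100% identity (-c 1.0) if only identical sequences should be grouped.

```python
import re

# A member line looks like: "1<TAB>120aa, >seqB... at 100.00%"
# The representative ends with "*" instead of "at NN.NN%".
MEMBER = re.compile(r">(.+?)\.\.\.\s+(\*|at\b)")


def parse_clstr(lines):
    """Turn CD-HIT .clstr lines into 'rep<TAB>rep,member,...' rows."""
    clusters = []
    current = None
    for line in lines:
        line = line.strip()
        if line.startswith(">Cluster"):
            current = {"rep": None, "members": []}
            clusters.append(current)
            continue
        match = MEMBER.search(line)
        if match is None or current is None:
            continue  # skip anything that is not a member line
        current["members"].append(match.group(1))
        if match.group(2) == "*":
            current["rep"] = match.group(1)

    rows = []
    for cluster in clusters:
        rep = cluster["rep"]
        if rep is None:
            continue  # malformed cluster without a representative
        others = [name for name in cluster["members"] if name != rep]
        rows.append(rep + "\t" + ",".join([rep] + others))
    return rows


if __name__ == "__main__":
    demo = [
        ">Cluster 0",
        "0\t120aa, >seqA... *",
        "1\t120aa, >seqB... at 100.00%",
        ">Cluster 1",
        "0\t95aa, >seqC... *",
    ]
    print("\n".join(parse_clstr(demo)))
```

To report only the groups that actually contain redundancy, filter the rows to those with more than one member.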


You can change the maximum allowed length of the description in the CD-HIT cluster file with the -d option (the default is 20 characters; -d 0 keeps the header up to the first space).


Nice. I didn't know that. I guess I should really go through the options.


Thanks! Is there any way to influence what is written to the CLSTR file, e.g. only include seqs with redundancy, change the format, etc.?


Doesn't matter, USEARCH does all I need.

11.6 years ago
cts ★ 1.7k

From memory, USEARCH should give you the cluster file if you pass it the -uc <FILENAME> option.
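
As a hedged sketch, the resulting file can be folded into the requested table with a short Python script. It assumes the standard tab-separated, ten-column UC layout, in which the first column is the record type (S for a seed, H for a hit, C for a cluster summary), the ninth column is the query label and the tenth is the target label. The function name parse_uc is made up for this example.

```python
def parse_uc(lines):
    """Turn UC records into 'seed<TAB>seed,hit,...' rows."""
    groups = {}
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 10:
            continue  # not a full UC record
        record_type, query, target = fields[0], fields[8], fields[9]
        if record_type == "S":
            # A seed starts its own group.
            groups.setdefault(query, [query])
        elif record_type == "H":
            # A hit joins the group of the seed it matched.
            groups.setdefault(target, [target]).append(query)
        # C records only summarise cluster sizes, so they are ignored.
    return [seed + "\t" + ",".join(members) for seed, members in groups.items()]


if __name__ == "__main__":
    demo = [
        "S\t0\t120\t*\t*\t*\t*\t*\tseqA\t*",
        "H\t0\t120\t100.0\t+\t0\t0\t120M\tseqB\tseqA",
        "S\t1\t95\t*\t*\t*\t*\t*\tseqC\t*",
        "C\t0\t2\t*\t*\t*\t*\t*\tseqA\t*",
    ]
    print("\n".join(parse_uc(demo)))
```

If the labels carry extra annotations, trim them before joining so the accessions match those in the FASTA file.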


Thanks! It does, and in a nicer format than CD-HIT's.

7.8 years ago
Eslam Samir ▴ 110

Here is my free program on GitHub, Sequence database curator (https://github.com/Eslam-Samir-Ragab/Sequence-database-curator).

It is a very fast program and it can deal with:

  1. Nucleotide sequences
  2. Protein sequences

It runs on these operating systems:

  1. Windows
  2. Mac
  3. Linux

It also works for:

  1. Fasta format
  2. Fastq format

Best Regards
