Remove Duplicate Peptide Sequences And Merge Headers from Duplicate Sequences
23 months ago

I have a set of proteoform sequences in FASTA format that originate from many proteins of the same organism. I want to remove duplicates at 100% identity, but merge the headers of the duplicate sequences into the header of the record that is kept. What is an efficient way to do this without spending hours parsing cd-hit .clstr files?

The closest tool I have found is mothur's cluster.fragments() command, which does something similar, but it has not been implemented for protein sequences.
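For exact, full-length duplicates, one option that avoids clustering altogether is to group records by their sequence string and join the headers of each group. Below is a minimal Python sketch of that idea, not a tested pipeline. The function names `read_fasta` and `merge_duplicates` and the `;` header separator are assumptions made for illustration, and matching is done on the upper-cased sequence, so only strictly identical sequences are merged (no fragments or near-identical hits).

```python
def read_fasta(lines):
    """Yield (header, sequence) tuples from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def merge_duplicates(records, sep=";"):
    """Group records by identical sequence and join their headers.

    `sep` is a hypothetical separator; pick one that does not occur
    in your headers. Output order follows first appearance.
    """
    merged = {}
    for header, seq in records:
        # Upper-case the key so that 'mkv' and 'MKV' count as duplicates.
        merged.setdefault(seq.upper(), []).append(header)
    return [(sep.join(headers), seq) for seq, headers in merged.items()]


def write_fasta(records, handle, width=60):
    """Write (header, sequence) tuples as wrapped FASTA."""
    for header, seq in records:
        handle.write(f">{header}\n")
        for i in range(0, len(seq), width):
            handle.write(seq[i:i + width] + "\n")
```

Usage would look something like this, with the file names taken from the cd-hit command in this post:

```python
with open("peptchains.fasta") as fin, open("peptchains_derep.fasta", "w") as fout:
    write_fasta(merge_duplicates(read_fasta(fin)), fout)
```

Because every sequence is held in memory once, this should be fine for typical proteome-sized inputs, but very large files may need a hash of the sequence as the key instead.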

For example, I clustered the FASTA file with cd-hit as follows:

cd-hit -i peptchains.fasta -o peptchains_derep.fasta -c 1.00 -G 0 -aL 1 -AL 0 -aS 1 -AS 0 -n 5 -S 0 -M 600000 -T 0 -d 100

From here, parsing the resulting .clstr file and merging the headers of each cluster is not straightforward.
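If the cd-hit output has to be used, the .clstr file can be read cluster by cluster. The sketch below assumes the usual cd-hit layout: a `>Cluster N` line, followed by member lines such as `0	120aa, >name... *` for the representative and `1	120aa, >name... at 100.00%` for the other members. The regular expression, the function names `parse_clstr` and `merged_headers`, and the `;` separator are assumptions for illustration. Note that the names in the .clstr file are limited by the `-d` option (100 characters in the command above), so long headers may come back truncated.

```python
import re

# Member line: index, length in aa, '>' + name + '...', then '*' or 'at NN.NN%'.
MEMBER_RE = re.compile(r"^\d+\s+\d+aa,\s+>(.+?)\.\.\.\s+(\*|at\s.+)$")


def parse_clstr(lines):
    """Return a list of clusters, each a dict with 'rep' and 'members'."""
    clusters = []
    current = None
    for line in lines:
        line = line.rstrip()
        if line.startswith(">Cluster"):
            current = {"rep": None, "members": []}
            clusters.append(current)
            continue
        match = MEMBER_RE.match(line)
        if match and current is not None:
            name, flag = match.groups()
            current["members"].append(name)
            if flag == "*":
                # cd-hit marks the representative sequence with '*'.
                current["rep"] = name
    return clusters


def merged_headers(clusters, sep=";"):
    """Map each representative name to a merged header (representative first)."""
    result = {}
    for cluster in clusters:
        rep = cluster["rep"]
        others = [m for m in cluster["members"] if m != rep]
        result[rep] = sep.join([rep] + others)
    return result
```

The resulting dictionary can then be used to rename the records in peptchains_derep.fasta, for example by looking up each record's name (truncated the same way as in the .clstr file) and replacing it with the merged header.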

protein header duplicate peptides sequence • 530 views