Hi,
I am trying to use cd-hit-est to cluster merged contig files (containing contigs from 22 different metagenome samples) and remove any contigs that are 99% similar to others, so that I am left with a contig.fasta file containing no duplicates.
I am running cd-hit-est on 4 merged contig files:
1. reads filtered for human sequences using bbmap, then assembled with metaspades
2. reads filtered for human sequences using bbmap, then assembled with spades
3. reads filtered for human sequences using bowtie & samtools, then assembled with metaspades
4. reads filtered for human sequences using bowtie & samtools, then assembled with spades
This is the command I used:
cd-hit-est -M 100000 -i mergedcontigs.fasta -o merged_cd.fasta -c 0.99 -n 8 -A 0.90
When I run this command on files 1 & 2, everything seems to work fine. But when I run it on files 3 & 4, the total seq value and the number of sequences being compared are different, with the comparison capped at 40000, whereas these numbers were the same for files 1 & 2.
Output
----------------------------------------------------------------
total seq: 502084
longest and shortest : 898003 and 11
Total letters: 418459779
Sequences have been sorted
Approximated minimal memory consumption:
Sequence : 485M
Buffer : 1 X 358M = 358M
Table : 1 X 9M = 9M
Miscellaneous : 6M
Total : 860M
Table limit with the given memory limit:
Max number of representatives: 40000
Max number of word counting entries: 12392460836
comparing sequences from 0 to 40000
If anyone has any idea why this might be, I would be grateful!
Thanks in advance!
Have you considered the -s parameter (length difference cutoff)?
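For illustration, here is the original command with -s added. This is a sketch, not a tested recommendation: -s sets the length difference cutoff, so with -s 0.9 a sequence can only join a cluster if it is at least 90% of the length of that cluster's representative; the 0.9 value here is an example, and the input/output filenames are just those from the question above.

```shell
# Hypothetical: original command plus -s 0.9 (length difference cutoff).
# Shorter sequences must be >= 90% of the representative's length to cluster.
cd-hit-est -M 100000 -i mergedcontigs.fasta -o merged_cd.fasta \
    -c 0.99 -n 8 -A 0.90 -s 0.9
```

Whether this helps depends on whether the behaviour you are seeing is driven by the very wide length range in files 3 & 4 (shortest 11 bp, longest 898003 bp in your log).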